empty tablet directories on HDFS

7 messages

empty tablet directories on HDFS

Massimilian Mattetti
Hi all,

I created a table with 3 initial split points (I used a sharding mechanism to evenly distribute the data between them) and started ingesting data using the batch writer API. At the end of the ingestion process I had around 1.01K tablets (the split threshold was set to 1GB) for a total of 600GB on HDFS (measured by running hadoop fs -du -h on the table directory). Digging into the table directory on HDFS, I noticed that around 700 tablet directories (the ones starting with t-) are empty, another 300 tablets hold around 1GB or less of data, and 3 tablets (the default_tablet included) contain 130GB of data each. Is this normal behavior? (I am working with a cluster of 3 servers running Accumulo 1.8.1.)
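In case it helps, the ingestion setup is essentially the following sketch (the instance details, table name, and split points are placeholders here, not the real values):

import java.util.SortedSet;
import java.util.TreeSet;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.hadoop.io.Text;

public class IngestSketch {
  public static void main(String[] args) throws Exception {
    // Connect (placeholder instance name, ZooKeepers, and credentials).
    Connector conn = new ZooKeeperInstance("myInstance", "zk1:2181")
        .getConnector("user", new PasswordToken("secret"));

    // Create the table with 3 initial split points, one per shard prefix.
    conn.tableOperations().create("mytable");
    SortedSet<Text> splits = new TreeSet<>();
    splits.add(new Text("1"));
    splits.add(new Text("2"));
    splits.add(new Text("3"));
    conn.tableOperations().addSplits("mytable", splits);

    // Ingest through the batch writer; row keys carry the shard prefix.
    BatchWriter bw = conn.createBatchWriter("mytable", new BatchWriterConfig());
    Mutation m = new Mutation("1_example_row");
    m.put("cf", "cq", "example value");
    bw.addMutation(m);
    bw.close();
  }
}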

I also ran another experiment, importing the same data into a different table configured in the same way as the previous one, but this time using bulk import. This table ended up with no empty tablets, although most of them contain only a few MBs, and the final space on HDFS was around 450GB. What could be the reason for such a big difference in disk usage between the batch writer API and bulk import?
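The bulk path, for comparison, boils down to writing rfiles into an HDFS directory and importing them (paths are placeholders; conn is the same Connector as in the sketch above):

// The failures directory must already exist and be empty; the last
// argument (setTime) is false so the timestamps in the rfiles are kept.
conn.tableOperations().importDirectory("mytable2", "/tmp/bulk/files", "/tmp/bulk/failures", false);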
Thanks.

Best Regards,
Max

Re: empty tablet directories on HDFS

dlmarion

Does the data get distributed if you compact the table?


Re: empty tablet directories on HDFS

Massimilian Mattetti
Sorry Dave, but I don't get what you mean by "get distributed". Running a compaction from the shell will create one file per tablet; there is no data repartitioning involved in this process.

Re: empty tablet directories on HDFS

Michael Wall
Was your cluster with the batch writer done splitting and moving data? That is a lot of splits that got generated. When a tablet is split, the files are inspected and potentially assigned to both new tablets. Compacting that range will rewrite the data into files for each tablet, so rfiles contain only data for their range. Dave is suggesting a compaction for that reason, as it will re-"distribute" the data in the rfiles. Eventually, it should get to the same state you saw with the bulk import test. For the batch writer test and the 3 tablets with all the data, what does inspecting the rfile show you about the range of data in those?
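For reference, that compaction can be kicked off from the shell or the Java API; a minimal sketch, with an example table name and range, and assuming a Connector conn:

// Compact only the range between two split points (start row exclusive,
// end row inclusive); passing null for both rows compacts the whole table.
// flush=true flushes in-memory data first, wait=true blocks until done.
conn.tableOperations().compact("mytable", new Text("1"), new Text("2"), true, true);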

Did you create 3 splits when you bulk imported? Or did you create all 1.01K splits?


Re: empty tablet directories on HDFS

Massimilian Mattetti
I have just started a compaction; it will take a while to complete.

"When a tablet is split, the files are inspected and potentially assigned to both new tablets"
I thought about this, so in my case the 3 directories containing 130GB of data (divided among 500 files) are actually holding the data of all the other tablets whose directories are empty. Am I right?

"what does inspecting the rfile show you about the range of data in those?"
I am using accumulo rfile-info to inspect the file, but it does not tell me whether the file is shared among different tablets. Am I missing something?

"Did you create 3 splits when you bulk imported?  Or did you create the 1.01 splits?"
I created 3 splits for the bulk ingestion too. A separate file is written and then imported for each split point.

Thanks.

Max

Re: empty tablet directories on HDFS

Michael Wall
- I thought about this, so in my case the 3 directories containing 130GB of data (divided among 500 files) are actually holding the data of all the other tablets whose directories are empty. Am I right?

Could be, but you would have to inspect the rfiles and the accumulo metadata table. When a split happens, the files are not initially moved in HDFS; they are simply assigned to the new tablets.

- I am using accumulo rfile-info to inspect the file, but it does not tell me whether the file is shared among different tablets. Am I missing something?

That is correct. From there you will get the range of data in the file, start to end. If it covers 1 split, then the file should only be hosted by 1 tablet. If that range contains more than 1 split, it could be hosted by more than 1 tablet. To check for a file belonging to more than 1 tablet, scan the accumulo.metadata table looking for the 'file' CF and that file name.
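E.g., a sketch (the table id "2" and the rfile name are made-up examples; conn is a Connector as before):

// Scan all tablet rows for your table id in accumulo.metadata and print
// every tablet that references the given rfile in its 'file' column family.
Scanner s = conn.createScanner("accumulo.metadata", Authorizations.EMPTY);
s.setRange(new Range("2;", "2<")); // rows "2;<endRow>" ... "2<" (default tablet)
s.fetchColumnFamily(new Text("file"));
for (Map.Entry<Key,Value> e : s) {
  if (e.getKey().getColumnQualifier().toString().contains("F0000abc.rf")) {
    System.out.println(e.getKey().getRow() + " references the file");
  }
}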

- I created 3 splits for the bulk ingestion too. A separate file is written and then imported for each split point.

Doing it this way, each of the 3 tablets got data and then started splitting in parallel based on your 1GB split threshold. To get the highest throughput, use bulk import, presplit your table, and write rfiles for each split.
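For the rfile writing part, 1.8 has a public RFile API (org.apache.accumulo.core.client.rfile.RFile); a rough sketch, with an example path and key, writing to the local filesystem:

// Write one rfile per split range, then hand the directory to
// importDirectory. Keys must be appended in sorted order.
FileSystem fs = FileSystem.getLocal(new Configuration());
RFileWriter writer = RFile.newWriter().to("/tmp/bulk/files/shard1.rf").withFileSystem(fs).build();
writer.append(new Key("1_example_row", "cf", "cq"), new Value("example value".getBytes()));
writer.close();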

HTH

Mike


Re: empty tablet directories on HDFS

Massimilian Mattetti
The compaction completed and the data is now evenly distributed among all the tablets. There are still a few tablets that do not contain data, but all the others hold an average of 0.5GB. Something weird happened to the total number of entries shown in the monitor web UI: it looks like 2 billion entries are gone. What does that mean?

Is there a way to make sure that Accumulo evenly distributes the data among different RFiles, instead of only using the metadata table to track the split points? I have the impression that the previous data distribution could cause a performance penalty during querying.

Thanks.

Max