Metadata DataFileValue not Matching the Output of rfile-info Command

Dong Zhou
Hi all,

We have noticed that the Accumulo metadata table reports a file size but zero entries for certain RFiles.
For example, <tableId>;<tabletEndRow> file:hdfs://apps/accumulo/tables/<tableId>/<folder>/I001ahdz.rf []   48,0

From the metadata's perspective, it looks like the RFile contains zero entries, but if we run the rfile-info command against the same file, the output shows that the RFile has many entries. If we dump the RFile, we can see the actual data too.

We wonder what the reason behind this is.
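
For readers following along, the value at the end of the example entry can be split into its two numbers. This is a minimal sketch in Python, assuming the value is the comma-separated pair `size,entries` that the `48,0` above suggests. The `FileStats` type and the `parse_data_file_value` helper are hypothetical names used for illustration, not part of Accumulo.

```python
from typing import NamedTuple


class FileStats(NamedTuple):
    """File size and entry count as recorded in a metadata file entry."""
    size: int
    num_entries: int


def parse_data_file_value(value: str) -> FileStats:
    """Parse a metadata value such as '48,0' into (size, num_entries).

    Assumption: the first field is the file size and the second is the
    entry count. Any further comma-separated fields are ignored here.
    """
    fields = value.strip().split(",")
    if len(fields) < 2:
        raise ValueError(f"expected 'size,entries', got {value!r}")
    return FileStats(size=int(fields[0]), num_entries=int(fields[1]))


if __name__ == "__main__":
    stats = parse_data_file_value("48,0")
    print(stats.size, stats.num_entries)
```

With the example entry, this yields a size of 48 and an entry count of 0, which is the mismatch with rfile-info described above.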

Thanks,
-Dong Zhou

Re: Metadata DataFileValue not Matching the Output of rfile-info Command

Michael Wall
Hi Dong,

That file is the result of a bulk import.  I can tell because its name starts with a capital "I" (see http://accumulo.apache.org/1.8/accumulo_user_manual.html#_file_naming_conventions).  Bulk files are inspected on import to find all the ranges of data they contain.  They are then assigned to all the tablets hosting that data, so one "I" file can belong to more than one tablet.  When that file is included in a compaction, the data outside the range the tablet is hosting is not rewritten to the new files.

When inspecting "I" files, Accumulo does not keep track of how many keys are in each range.  So for "I" files in the metadata table, the number of keys is 0 until that file is compacted.
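
The rule above can be sketched as a small check: an entry count of 0 on an "I" file means "not counted yet", not "empty". This is a rough Python illustration only. The function names are hypothetical, and the only naming convention it relies on is the capital "I" prefix mentioned above.

```python
import posixpath
from typing import Optional


def is_bulk_import_file(file_uri: str) -> bool:
    """Return True if the file name starts with a capital 'I'."""
    return posixpath.basename(file_uri).startswith("I")


def known_entry_count(file_uri: str, num_entries: int) -> Optional[int]:
    """Return the entry count, or None when it cannot be trusted.

    A bulk-imported file that still shows 0 entries in the metadata
    has not been compacted yet, so its real count is unknown.
    """
    if is_bulk_import_file(file_uri) and num_entries == 0:
        return None
    return num_entries


if __name__ == "__main__":
    uri = "hdfs://apps/accumulo/tables/<tableId>/<folder>/I001ahdz.rf"
    print(known_entry_count(uri, 0))  # count unknown for this file
```

Returning `None` rather than 0 keeps callers from adding an uncounted file into a total by accident.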

HTH

Mike

 


Re: Metadata DataFileValue not Matching the Output of rfile-info Command

Dong Zhou
I see. Yes, the file is loaded via bulk import.
I would like to find the most precise number of entries a table contains. Would running a compaction and then scanning the metadata table for the entry counts be a sufficient method?
Also, what would happen if a merge operation runs before the compaction? Would it try to merge this tablet into other tablets, since the file size and entry count look fairly small at the time it scans the metadata table? Or would it compact the table before running the merge?

By the way, thanks for the quick reply. :)

Cheers,
-Dong




Re: Metadata DataFileValue not Matching the Output of rfile-info Command

Michael Wall
Yes, compact the table and then count the entries.  You can get close by looking at the monitor page.  The tables list has a column called Entries, which should be close to what you would get by adding up those entries in the metadata by hand.
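
Adding up the entries by hand could look roughly like the Python sketch below. It assumes the metadata scan output has already been saved as text lines in the same layout as the example entry earlier in the thread (row, `file:<path>`, `[]`, then `size,entries`). The function name and the sample lines are hypothetical, and how the lines are obtained is not shown.

```python
from typing import Iterable, Tuple


def sum_metadata_entries(scan_lines: Iterable[str]) -> Tuple[int, int]:
    """Return (total_entries, file_count) over 'file:' metadata lines.

    Assumed line layout: '<row> file:<path> []   <size>,<entries>'.
    Lines for other column families are skipped.
    """
    total_entries = 0
    file_count = 0
    for line in scan_lines:
        tokens = line.split()
        if len(tokens) < 3 or not tokens[1].startswith("file:"):
            continue
        fields = tokens[-1].split(",")
        if len(fields) < 2:
            continue
        total_entries += int(fields[1])  # second field is the entry count
        file_count += 1
    return total_entries, file_count


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from a real cluster.
    sample = [
        "2;m file:hdfs://nn/accumulo/tables/2/t-0001/A0000abc.rf []   2048,100",
        "2< file:hdfs://nn/accumulo/tables/2/default_tablet/A0000abd.rf []   1024,50",
    ]
    print(sum_metadata_entries(sample))
```

The total is only meaningful after the compaction has finished, since any remaining "I" files still contribute 0.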

