Table entry count confusion


Table entry count confusion

Jeff N
I have an interesting dilemma: my Accumulo cluster overview says that I have over 1.4 billion entries in a table, yet when I run a scan in which I keep track of unique row ids, I get back a number (a little over 30 million) that is drastically smaller than what the table claims to have. I read the legend and it says, "Entries: Key/value pairs over each instance, table or tablet." I was under the impression that Accumulo tables did away with duplicate rows, hence my curiosity as to why there are apparently 45 times more entries than there should be. Do I need to perform a compaction or some other action to rid my cluster of what I believe to be duplicate entries?

Thanks,
Jeff

Re: Table entry count confusion

Josh Elser
Yup, a compaction will flush out the deletes/duplicate keys that may be
lingering in that table and should give you an accurate entry count on
the monitor.


Re: Table entry count confusion

Adam Fuchs
In reply to this post by Jeff N
The count that displays in the monitor is the sum of all the key/value
pairs that are in the files that back Accumulo. You can also get this count
by doing a scan of the !METADATA table and looking at the values associated
with keys in the "file" column family. Inserting the same key twice could
result in one key in one file or two keys in two files. At query time,
those keys will get deduplicated by the VersioningIterator, providing a
view that only has one key.
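A rough sketch of this behavior in plain Python (not the Accumulo API; the key layout and names are illustrative only): the monitor sums raw per-file entry counts, while a scan sees the deduplicated view that the VersioningIterator produces.

```python
# Toy model: the same key inserted twice may land in two different
# files. The monitor's "entries" number is the sum of raw per-file
# counts; a scan (like the VersioningIterator with maxVersions=1)
# returns only the newest version of each key.

files = [
    [("rowA", "cf", "cq", 200, "new")],   # flushed later
    [("rowA", "cf", "cq", 100, "old")],   # flushed earlier
]

# What the monitor reports: every key/value pair in every file.
monitor_count = sum(len(f) for f in files)  # 2 entries on disk

# What a scan sees: one version per (row, cf, cq) key.
latest = {}
for f in files:
    for row, cf, cq, ts, val in f:
        key = (row, cf, cq)
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, val)

scan_count = len(latest)  # 1 entry visible to the client
```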

45x seems really high, since a tablet tends to have an average of maybe 4-8
files associated with it at the billion entry scale (rough estimate). There
could be other considerations, like cell-level security eliminating entries
from the view that the scanner gives you, or maybe major compactions are
not running properly for you? Your backing data could also include a large
number of deletes, which could throw off the stats. Deletes are implemented
as a tombstone marker, and are only eliminated when a full major compaction
happens. Forcing a major compaction by running the compact command in the
shell should give you some better evidence to diagnose the confusion.
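The tombstone behavior can be sketched in plain Python as well (illustrative only, not Accumulo's implementation): until a full major compaction rewrites the files, both the tombstone and the data it masks remain on disk and count as entries.

```python
# Toy model: a delete is written as a tombstone entry that masks
# older versions of its key. Only a full major compaction, which
# merges every file into one, can drop the tombstone and the data
# it masks.

DELETE = object()  # stand-in for a tombstone marker

def merged_view(files):
    """Merge files (newest first); a tombstone hides its key entirely."""
    deleted, live = set(), {}
    for f in files:
        for key, value in f.items():
            if value is DELETE:
                deleted.add(key)
            elif key not in deleted and key not in live:
                live[key] = value
    return live

newest = {"row1:cf:cq": DELETE, "row2:cf:cq": "v2"}    # recent flush
oldest = {"row1:cf:cq": "v1", "row2:cf:cq": "v2-old"}  # older file

on_disk = sum(len(f) for f in [newest, oldest])  # monitor sees 4 entries
visible = merged_view([newest, oldest])          # scan sees only 1

# After a full major compaction the single output file holds just the
# live data, so the monitor's count drops to match the scan.
compacted = merged_view([newest, oldest])
after = len(compacted)
```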

Cheers,
Adam




Re: Table entry count confusion

Billie Rinaldi-2
In reply to this post by Jeff N
Does your table have more than one key/value per row id?  The monitor
counts key/value pairs, not rows.
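A minimal illustration of the distinction (plain Python, with made-up column names): every column in a row is its own key/value pair, and the monitor counts those pairs, not row ids.

```python
# Toy model: one row id with several column families/qualifiers is
# a single "row" but multiple "entries" (key/value pairs) -- and
# entries are what the monitor counts.

entries = [
    ("row1", "cf1", "name", "alice"),
    ("row1", "cf1", "age", "30"),
    ("row1", "cf2", "city", "nyc"),
    ("row2", "cf1", "name", "bob"),
]

entry_count = len(entries)                     # 4: the monitor's number
row_count = len({row for row, *_ in entries})  # 2: unique row ids
```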

Billie



Re: Table entry count confusion

Jeff N
In reply to this post by Josh Elser
I know it has been a while, but if I wanted to compact a whole table, is the following the command I should run?

compact -t <tablename>

Re: Table entry count confusion

Josh Elser
You got it!


Re: Table entry count confusion

Jeff N
In reply to this post by Billie Rinaldi-2
Yes, each row id has numerous column qualifiers per column family, but I assumed that all of that was still wrapped up in a single row.

Re: Table entry count confusion

Billie Rinaldi-2
In your original email, you appeared to be using the concept of rows / row
ids and the concept of entries / key-value pairs interchangeably.  A row is
a set of key-value pairs (aka entries) with the same row id.  You said you
counted the unique row ids, and that the number of entries reported by the
monitor was about 45 times the number of row ids.  This would be expected
if you have an average of 45 key-value pairs per row.
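A quick sanity check on that arithmetic, using the rough numbers from the thread:

```python
# ~1.4 billion entries over "a little over 30 million" unique row
# ids works out to roughly 45 entries per row -- exactly the ratio
# that looked suspicious at first glance.

entries = 1_400_000_000   # monitor's entry count
unique_rows = 31_000_000  # row ids counted by the scan
per_row = entries / unique_rows  # about 45
```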

Billie

