Sorted RowId suffix retrieval using Server Side Iterators

damodaram.sundaram@harman.com
We are storing RDF statement data in Accumulo in POS (Predicate, Object, Subject) order. The table is designed to hold 100 million records.

Ex:
p1|o1|s1
p1|o1|s5
p1|o2|s3
p1|o2|s2
p2|o1|s4

The data is sorted based on the first two parts of the key (p1 and o1, etc.).

When I apply a prefix range (p1|o1 to p2|o1), I get the subjects in the order [s1, s5, s3, s2, s4].

With that order, my scan would move back and forth on the table, so I would like to get the list of subjects as [s1, s2, s3, s4, s5] while reading through the iterators.

Is there any way I can get the above result?

Also, if I apply a range filter on the same table, I get separately ordered sets such as [s2, s3, s5] and [s200, s150, s500]. Even in this case, how can I make the scanner read the data in a single sorted order?


Re: Sorted RowId suffix retrieval using Server Side Iterators

Dylan Hutchison-5
Let's see if I understand your question.  The queries are range queries on
P over the POS table.  Within each range, you would like to sort the S
values (a suffix of the Key) retrieved.

Are your range queries *small enough to fit in memory*?  If so, you could
gather all the entries in the range together, either at a client or in a
server-side iterator, and sort the S values.  The server-side iterator
approach will only work if your S values are stored in the Column portion
of the key (not the Row), because if they are stored in the Row then the
range query may hit multiple tablets which could be stored on separate
tablet servers.  Of course, you could construct a partial list of the S
values seen in each tablet.
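
For the in-memory case, the gather-and-sort step might look like the sketch below. It is a plain Python model rather than Accumulo client or iterator code: the list `rows` stands in for the entries a scan returns, and the function name `sorted_subjects` is made up for illustration.

```python
def sorted_subjects(rows, start_prefix, end_prefix):
    """Collect the S suffix of every row whose P|O prefix lies in
    [start_prefix, end_prefix], then sort the collected suffixes.

    `rows` is any iterable of 'P|O|S' strings; it stands in for a scan.
    """
    subjects = []
    for row in rows:
        predicate, obj, subject = row.split("|")
        prefix = f"{predicate}|{obj}"
        if start_prefix <= prefix <= end_prefix:
            subjects.append(subject)
    # Plain string comparison: "s150" sorts before "s2".
    return sorted(subjects)


rows = ["p1|o1|s1", "p1|o1|s5", "p1|o2|s3", "p1|o2|s2", "p2|o1|s4"]
result = sorted_subjects(rows, "p1|o1", "p2|o1")
```

The sort is lexicographic, so numeric-looking subjects such as s200 and s150 would need zero padding or a custom key to come out in numeric order.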

If your range queries exceed memory, then you might try an external sorting
method or create an index on S.
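
One form of external sorting is to spill sorted runs to disk and merge them afterwards. The sketch below shows that idea with the Python standard library only. It assumes subjects are strings without newline characters, and the names `external_sort` and `max_in_memory` are hypothetical, not part of any Accumulo API.

```python
import heapq
import tempfile


def _spill(sorted_chunk):
    """Write one sorted run to a temporary file and rewind it for reading."""
    run = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
    run.writelines(s + "\n" for s in sorted_chunk)
    run.seek(0)
    return run


def external_sort(subjects, max_in_memory):
    """Yield subjects in sorted order while buffering at most
    `max_in_memory` of them in a list at any time."""
    runs, chunk = [], []
    for s in subjects:
        chunk.append(s)
        if len(chunk) >= max_in_memory:
            runs.append(_spill(sorted(chunk)))
            chunk = []
    if chunk:
        runs.append(_spill(sorted(chunk)))
    try:
        # heapq.merge holds one pending line per run, not whole runs.
        streams = [(line.rstrip("\n") for line in run) for run in runs]
        yield from heapq.merge(*streams)
    finally:
        for run in runs:
            run.close()


ordered = list(external_sort(["s5", "s1", "s3", "s2", "s4"], max_in_memory=2))
```

Memory use is bounded by the chunk size plus one buffered line per run, at the cost of writing and rereading every subject once.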

The right choice depends on what you would like to do with the S values.

On Wed, Jul 5, 2017 at 11:39 PM, [hidden email] <
[hidden email]> wrote:

> We are storing RDF statement data in Accumulo in POS (Predicate, Object,
> Subject) order. The table is designed to hold 100 million records.
>
> Ex:
> p1|o1|s1
> p1|o1|s5
> p1|o2|s3
> p1|o2|s2
> p2|o1|s4
>
> The data is sorted based on the first two parts of the key (p1 and o1, etc.).
>
> When I apply a prefix range (p1|o1 to p2|o1), I get the subjects in the
> order [s1, s5, s3, s2, s4].
>
> With that order, my scan would move back and forth on the table, so I would
> like to get the list of subjects as [s1, s2, s3, s4, s5] while reading
> through the iterators.
>
> Is there any way I can get the above result?
>
> Also, if I apply a range filter on the same table, I get separately ordered
> sets such as [s2, s3, s5] and [s200, s150, s500]. Even in this case, how
> can I make the scanner read the data in a single sorted order?

Re: Sorted RowId suffix retrieval using Server Side Iterators

damodaram.sundaram@harman.com
Thanks for your reply Dylan.

Are your range queries *small enough to fit in memory*? Not likely. A given condition on the POS table might return a few hundred thousand results, since the table will hold 100M records. I might not be able to hold them all in memory for sorting, and I could run into memory issues.

My tables are built with the POS triple in the RowId, not in the column family, because I map each cell of my relational data to a single row in Accumulo.

The S values will be used to query the SPO table, which is stored as (Subject|Predicate|Object), with a prefix filter on S. If my subjects are already in sorted order, querying with the ordered list of subjects takes much less effort.

Re: Sorted RowId suffix retrieval using Server Side Iterators

Dylan Hutchison-5
You might be able to take a batched approach, using server-side iterators
to gather as many S's from POS rows as possible at each tablet server up to
a memory budget, and then querying the SPO table from inside those
iterators.  (You can scan another table from inside a server-side iterator,
as long as you stay mindful of tablet server thread limits.)  This likely
has the effect of querying the same SPO data multiple times, which may or
may not be acceptable.
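
The batching idea can be modelled roughly as follows. This is plain Python, not iterator code: a sorted list stands in for the SPO table, `budget` is a hypothetical cap on the number of subjects held at once, and the helper names are made up for illustration.

```python
from bisect import bisect_left


def batches(subjects, budget):
    """Group a stream of subjects into sorted batches of at most `budget` items."""
    batch = []
    for s in subjects:
        batch.append(s)
        if len(batch) == budget:
            yield sorted(batch)
            batch = []
    if batch:
        yield sorted(batch)


def lookup(spo_rows, subject):
    """Return every row of the sorted SPO list that starts with 'subject|'."""
    prefix = subject + "|"
    i = bisect_left(spo_rows, prefix)
    hits = []
    while i < len(spo_rows) and spo_rows[i].startswith(prefix):
        hits.append(spo_rows[i])
        i += 1
    return hits


spo = sorted(["s1|p1|o1", "s2|p1|o2", "s3|p1|o2", "s4|p2|o1", "s5|p1|o1"])
pos_subjects = ["s1", "s5", "s3", "s2", "s4"]  # order seen in the POS scan

results = [
    [row for s in batch for row in lookup(spo, s)]
    for batch in batches(pos_subjects, budget=2)
]
```

Each batch is sorted only within itself, so a subject that shows up in more than one batch is looked up more than once. That is the repeated SPO querying mentioned above.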

Another alternative is a MapReduce job.

By the way, you don't necessarily need to sort the S's in order to query
the SPO table.  It depends on how you do the query, such as by providing a
collection of ranges to a Scanner / BatchScanner or doing server-side
filtering.
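
To illustrate why the subjects need not be sorted, here is a small Python model of a lookup by a collection of prefix ranges. It does not use the Accumulo API: a sorted list stands in for the SPO table, and `prefix_range` and `scan_ranges` are hypothetical helper names.

```python
from bisect import bisect_left


def prefix_range(subject, sep="|"):
    """Half-open [start, end) range covering every row that begins with
    subject + sep."""
    start = subject + sep
    # Bump the last character to get the first string past the prefix.
    end = start[:-1] + chr(ord(start[-1]) + 1)
    return start, end


def scan_ranges(sorted_rows, ranges):
    """Return the rows that fall in any range; ranges may come in any order."""
    hits = []
    for start, end in ranges:
        lo = bisect_left(sorted_rows, start)
        hi = bisect_left(sorted_rows, end)
        hits.extend(sorted_rows[lo:hi])
    return hits


spo = sorted(["s1|p1|o1", "s10|p1|o1", "s3|p1|o2", "s4|p2|o1", "s5|p1|o1"])
ranges = [prefix_range(s) for s in ["s5", "s1", "s3"]]  # unsorted on purpose
hits = scan_ranges(spo, ranges)
```

Each range is resolved independently against the sorted table, so the order of the input subjects affects only the order of the results, not which rows are found. Including the separator in the prefix keeps s1 from matching s10.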

Cheers, Dylan

On Thu, Jul 6, 2017 at 3:05 AM, [hidden email] <
[hidden email]> wrote:

> Thanks for your reply Dylan.
>
> *Are your range queries small enough to fit in memory?* Not likely. A
> given condition on the POS table might return a few hundred thousand
> results, since the table will hold 100M records. I might not be able to
> hold them all in memory for sorting, and I could run into memory issues.
>
> My tables are built with the POS triple in the RowId, not in the column
> family, because I map each cell of my relational data to a single row in
> Accumulo.
>
> The S values will be used to query the SPO table, which is stored as
> (Subject|Predicate|Object), with a prefix filter on S. If my subjects are
> already in sorted order, querying with the ordered list of subjects takes
> much less effort.

Re: Sorted RowId suffix retrieval using Server Side Iterators

Josh Elser
On Mon, Jul 10, 2017 at 6:15 AM, Dylan Hutchison
<[hidden email]> wrote:

> You might be able to take a batched approach, using server-side iterators
> to gather as many S's from POS rows as possible at each tablet server up to
> a memory budget, and then querying the SPO table from inside those
> iterators.  (With some caution to be mindful of tablet server thread
> limits, you can scan another table from inside a server-side iterator.)
>  This likely has the effect of querying the same SPO data multiple times,
> which may or may not be acceptable.
>
> Another alternative is a MapReduce job.
>
> By the way, you don't necessarily need to sort the S's in order to query
> the SPO table.  It depends on how you do the query, such as by providing a
> collection of ranges to a Scanner / BatchScanner or doing server-side
> filtering.

+1 to that. Dropping the requirement to get a sorted list of subjects
for some pair P-O would make a server-side filter much easier. You can
also play tricks like doing a "limited" deduplication server-side. You
can hold up to N subjects server-side to avoid running out of memory,
and then perform a final deduplication client-side.
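
A rough Python model of that two-stage deduplication is sketched below. The cap `max_held` plays the role of N, both function names are hypothetical, and clearing the whole set when the cap is reached is just one simple policy among several.

```python
def limited_dedup(subjects, max_held):
    """Server-side stage: drop repeats while remembering at most
    `max_held` subjects; some duplicates can still get through."""
    seen = set()
    for s in subjects:
        if s in seen:
            continue
        if len(seen) >= max_held:
            seen.clear()  # budget reached: forget everything and start over
        seen.add(s)
        yield s


def final_dedup(streams):
    """Client-side stage: remove the duplicates the bounded stage let through."""
    seen = set()
    unique = []
    for stream in streams:
        for s in stream:
            if s not in seen:
                seen.add(s)
                unique.append(s)
    return unique


partial = list(limited_dedup(["s1", "s1", "s2", "s3", "s1", "s2"], max_held=2))
unique = final_dedup([partial])
```

The bounded stage only reduces the volume sent to the client. Correctness rests on the final pass, which in this sketch holds every distinct subject in client memory.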
