[jira] [Commented] (ACCUMULO-473) Support binary search within RFile blocks

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Commented] (ACCUMULO-473) Support binary search within RFile blocks

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/ACCUMULO-473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404374#comment-13404374 ]

Keith Turner commented on ACCUMULO-473:
---------------------------------------

One other thing I like about building a transient per block index is that it cleanly seperates indexing from the relative key encoding.  We can improve the relative key encoding independant of indexing as long as the index can still jump to an arbitrary point.
               

> Support binary search within RFile blocks
> -----------------------------------------
>
>                 Key: ACCUMULO-473
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-473
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>            Reporter: Keith Turner
>            Assignee: Keith Turner
>             Fix For: 1.5.0
>
>
> RFiles store blocks of key values using relative encoding.  By default these blocks are small (100k).  To find a key in an RFile the index is used to find the block, then the block is scanned for the key.  It would be nice for iterators that do alot of seeking if a binary search were done instead of a scan.  
> Relative encoding is a form of compression that serializes a key relative to the last key.  For example if the row in a key is the same as the previous key, then the row is not stored again.  This works well with sorted data.   The current on disk format does not support random access within a block, because to read entry N all previous entries must be read.   One option would be to deserialize the block into a form that supports random access and then do binary search.  However this will consume memory and CPU.  If the relative encoding format could be modified to support random access, then binary search could be supported in a memory and CPU efficient way.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
Reply | Threaded
Open this post in threaded view
|

Re: [jira] [Commented] (ACCUMULO-473) Support binary search within RFile blocks

David Medinets
Can this behavior be simulated to get a feel for expected improvements?
On Jun 29, 2012 10:45 PM, "Keith Turner (JIRA)" <[hidden email]> wrote:

>
>     [
> https://issues.apache.org/jira/browse/ACCUMULO-473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13404374#comment-13404374]
>
> Keith Turner commented on ACCUMULO-473:
> ---------------------------------------
>
> One other thing I like about building a transient per block index is that
> it cleanly seperates indexing from the relative key encoding.  We can
> improve the relative key encoding independant of indexing as long as the
> index can still jump to an arbitrary point.
>
> > Support binary search within RFile blocks
> > -----------------------------------------
> >
> >                 Key: ACCUMULO-473
> >                 URL: https://issues.apache.org/jira/browse/ACCUMULO-473
> >             Project: Accumulo
> >          Issue Type: Improvement
> >          Components: tserver
> >            Reporter: Keith Turner
> >            Assignee: Keith Turner
> >             Fix For: 1.5.0
> >
> >
> > RFiles store blocks of key values using relative encoding.  By default
> these blocks are small (100k).  To find a key in an RFile the index is used
> to find the block, then the block is scanned for the key.  It would be nice
> for iterators that do alot of seeking if a binary search were done instead
> of a scan.
> > Relative encoding is a form of compression that serializes a key
> relative to the last key.  For example if the row in a key is the same as
> the previous key, then the row is not stored again.  This works well with
> sorted data.   The current on disk format does not support random access
> within a block, because to read entry N all previous entries must be read.
>   One option would be to deserialize the block into a form that supports
> random access and then do binary search.  However this will consume memory
> and CPU.  If the relative encoding format could be modified to support
> random access, then binary search could be supported in a memory and CPU
> efficient way.
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators:
> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>
>