[jira] [Created] (ACCUMULO-652) support block-based filtering within RFile

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Created] (ACCUMULO-652) support block-based filtering within RFile

JIRA jira@apache.org
Adam Fuchs created ACCUMULO-652:
-----------------------------------

             Summary: support block-based filtering within RFile
                 Key: ACCUMULO-652
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-652
             Project: Accumulo
          Issue Type: Bug
            Reporter: Adam Fuchs
            Assignee: Adam Fuchs


If we keep some stats about what is in an RFile block, we might be able to efficiently [O(log N)], with high probability, implement filters that currently require linear table scans. Two use cases of this include timestamp range filtering (i.e. give me everything from last Tuesday) and cell-level security filtering (i.e. give me everything that I can see with my authorizations).

For the timestamp range filter, we can keep minimum and maximum timestamps across all keys used in a block within the index entry for that block. For the cell-level security filter, we can keep an aggregate label. This could be done using a simplified disjunction of all of the labels in the block. The extra block statistics information can propagate up the index hierarchy as well, giving nice performance characteristics for finding the next matching entry in a file.

In general, this is a heuristic technique that is good if data tends to naturally cluster in blocks with respect to the way it is queried. Testing its efficacy will require closely emulating real-world use cases -- tests like the continuous ingest test will not be sufficient. We will have to test for a few things:
# The cost for storing the extra stats in the index are not too expensive.
# The performance benefit for common use cases is significant.
# We shouldn't introduce any unacceptable worst-case behavior, like bloating the index to ridiculous proportions for any data set.

Eventually this will all need to be exposed through the Iterator API to be useful, which will be another ticket.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
Reply | Threaded
Open this post in threaded view
|

[jira] [Commented] (ACCUMULO-652) support block-based filtering within RFile

JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/ACCUMULO-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13401642#comment-13401642 ]

Todd Lipcon commented on ACCUMULO-652:
--------------------------------------

I tried to think through some similar techniques in HBASE-6014. Some of the comments there might be useful for working on this JIRA.
               

> support block-based filtering within RFile
> ------------------------------------------
>
>                 Key: ACCUMULO-652
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-652
>             Project: Accumulo
>          Issue Type: Bug
>            Reporter: Adam Fuchs
>            Assignee: Adam Fuchs
>
> If we keep some stats about what is in an RFile block, we might be able to efficiently [O(log N)], with high probability, implement filters that currently require linear table scans. Two use cases of this include timestamp range filtering (i.e. give me everything from last Tuesday) and cell-level security filtering (i.e. give me everything that I can see with my authorizations).
> For the timestamp range filter, we can keep minimum and maximum timestamps across all keys used in a block within the index entry for that block. For the cell-level security filter, we can keep an aggregate label. This could be done using a simplified disjunction of all of the labels in the block. The extra block statistics information can propagate up the index hierarchy as well, giving nice performance characteristics for finding the next matching entry in a file.
> In general, this is a heuristic technique that is good if data tends to naturally cluster in blocks with respect to the way it is queried. Testing its efficacy will require closely emulating real-world use cases -- tests like the continuous ingest test will not be sufficient. We will have to test for a few things:
> # The cost for storing the extra stats in the index are not too expensive.
> # The performance benefit for common use cases is significant.
> # We shouldn't introduce any unacceptable worst-case behavior, like bloating the index to ridiculous proportions for any data set.
> Eventually this will all need to be exposed through the Iterator API to be useful, which will be another ticket.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
Reply | Threaded
Open this post in threaded view
|

[jira] [Commented] (ACCUMULO-652) support block-based filtering within RFile

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/ACCUMULO-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13408156#comment-13408156 ]

Adam Fuchs commented on ACCUMULO-652:
-------------------------------------

Status:
* RFile index now includes block stats for column visibility and overall timestamp range
* System-level iterators and Filters now implement the pass-through optional filter interface (Filterer)
* ColumnVisibility has been updated to support "or" and "and" operations, and some basic simplifications to save space
* Unit tests exist and succeed for new MultiLevelIndex and RFile implementations

Remaining work:
# Test backwards compatibility of RFile running on previous index versions
# Write a functional test that only works if the block stat filtering is in place for both timestamp range and column visibility filters
# Run continuous ingest and wikisearch ingest performance comparisons between the 652 branch and trunk
# Plumb in the new aggregate column visibility size limit variable
# Use a better encoding for aggregate column visibilities to save space in the index blocks
# Need a better name for the Filterer interface so as not to confuse it with Filter
# Need to handle the case in which RFile Readers are reused between scans (or make this contingent on them not being reused)
# As Todd suggests, we should look into extending the block stats that are collected to include basic information about included column qualifiers, values, and maybe column families.
# For future extensibility and backwards compatibility, we should look into abstracting the block stats provider and making it pluggable
               

> support block-based filtering within RFile
> ------------------------------------------
>
>                 Key: ACCUMULO-652
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-652
>             Project: Accumulo
>          Issue Type: Bug
>            Reporter: Adam Fuchs
>            Assignee: Adam Fuchs
>
> If we keep some stats about what is in an RFile block, we might be able to efficiently [O(log N)], with high probability, implement filters that currently require linear table scans. Two use cases of this include timestamp range filtering (i.e. give me everything from last Tuesday) and cell-level security filtering (i.e. give me everything that I can see with my authorizations).
> For the timestamp range filter, we can keep minimum and maximum timestamps across all keys used in a block within the index entry for that block. For the cell-level security filter, we can keep an aggregate label. This could be done using a simplified disjunction of all of the labels in the block. The extra block statistics information can propagate up the index hierarchy as well, giving nice performance characteristics for finding the next matching entry in a file.
> In general, this is a heuristic technique that is good if data tends to naturally cluster in blocks with respect to the way it is queried. Testing its efficacy will require closely emulating real-world use cases -- tests like the continuous ingest test will not be sufficient. We will have to test for a few things:
> # The cost for storing the extra stats in the index are not too expensive.
> # The performance benefit for common use cases is significant.
> # We shouldn't introduce any unacceptable worst-case behavior, like bloating the index to ridiculous proportions for any data set.
> Eventually this will all need to be exposed through the Iterator API to be useful, which will be another ticket.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
Reply | Threaded
Open this post in threaded view
|

[jira] [Commented] (ACCUMULO-652) support block-based filtering within RFile

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

    [ https://issues.apache.org/jira/browse/ACCUMULO-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409808#comment-13409808 ]

Adam Fuchs commented on ACCUMULO-652:
-------------------------------------

We uncovered another tricky point today: if we use timestamp ranges to filter out blocks that contain deletes, we might re-introduce entries that have been deleted. This would break the established semantics that support guaranteeing a logical, irreversible purge of entries (almost as bad as crossing the streams). In the same family of problems, a TimestampRangeFilter iterator would not be commutative with the VersioningIterator or any Aggregator because it would lead to incomplete/inconsistent results.

In the delete case, we need to add an index block stat that keeps track of the greatest timestamp of any delete entry. Then when we do the filtering we can include any blocks that might have deletes that are greater than the minimum timestamp in the timestamp range. This is a "must do".

The second case is a bit trickier. In the general case, we need to pull back any blocks whose key range includes any of the keys that match the given timestamp range. In the VersioningIterator case, an alternative solution would be to extend the timestamp range to include anything greater than the minimum timestamp, ignoring the max timestamp. For now, I think we need to punt on the general case and just say that the TimestampRangeFilter and other versioning/aggregation iterators are simply not compatible in all caps in the javadocs. Longer term, this should be an exemplar when we rewrite the iterator framework.
               

> support block-based filtering within RFile
> ------------------------------------------
>
>                 Key: ACCUMULO-652
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-652
>             Project: Accumulo
>          Issue Type: Bug
>            Reporter: Adam Fuchs
>            Assignee: Adam Fuchs
>
> If we keep some stats about what is in an RFile block, we might be able to efficiently [O(log N)], with high probability, implement filters that currently require linear table scans. Two use cases of this include timestamp range filtering (i.e. give me everything from last Tuesday) and cell-level security filtering (i.e. give me everything that I can see with my authorizations).
> For the timestamp range filter, we can keep minimum and maximum timestamps across all keys used in a block within the index entry for that block. For the cell-level security filter, we can keep an aggregate label. This could be done using a simplified disjunction of all of the labels in the block. The extra block statistics information can propagate up the index hierarchy as well, giving nice performance characteristics for finding the next matching entry in a file.
> In general, this is a heuristic technique that is good if data tends to naturally cluster in blocks with respect to the way it is queried. Testing its efficacy will require closely emulating real-world use cases -- tests like the continuous ingest test will not be sufficient. We will have to test for a few things:
> # The cost for storing the extra stats in the index are not too expensive.
> # The performance benefit for common use cases is significant.
> # We shouldn't introduce any unacceptable worst-case behavior, like bloating the index to ridiculous proportions for any data set.
> Eventually this will all need to be exposed through the Iterator API to be useful, which will be another ticket.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
Reply | Threaded
Open this post in threaded view
|

[jira] [Updated] (ACCUMULO-652) support block-based filtering within RFile

JIRA jira@apache.org
In reply to this post by JIRA jira@apache.org

     [ https://issues.apache.org/jira/browse/ACCUMULO-652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Fuchs updated ACCUMULO-652:
--------------------------------

    Fix Version/s: 1.5.0
   

> support block-based filtering within RFile
> ------------------------------------------
>
>                 Key: ACCUMULO-652
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-652
>             Project: Accumulo
>          Issue Type: Bug
>            Reporter: Adam Fuchs
>            Assignee: Adam Fuchs
>             Fix For: 1.5.0
>
>
> If we keep some stats about what is in an RFile block, we might be able to efficiently [O(log N)], with high probability, implement filters that currently require linear table scans. Two use cases of this include timestamp range filtering (i.e. give me everything from last Tuesday) and cell-level security filtering (i.e. give me everything that I can see with my authorizations).
> For the timestamp range filter, we can keep minimum and maximum timestamps across all keys used in a block within the index entry for that block. For the cell-level security filter, we can keep an aggregate label. This could be done using a simplified disjunction of all of the labels in the block. The extra block statistics information can propagate up the index hierarchy as well, giving nice performance characteristics for finding the next matching entry in a file.
> In general, this is a heuristic technique that is good if data tends to naturally cluster in blocks with respect to the way it is queried. Testing its efficacy will require closely emulating real-world use cases -- tests like the continuous ingest test will not be sufficient. We will have to test for a few things:
> # The cost for storing the extra stats in the index are not too expensive.
> # The performance benefit for common use cases is significant.
> # We shouldn't introduce any unacceptable worst-case behavior, like bloating the index to ridiculous proportions for any data set.
> Eventually this will all need to be exposed through the Iterator API to be useful, which will be another ticket.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira