Key Refactoring


Key Refactoring

Sven Hodapp
Hi there,

I would like to select a subset of an Accumulo table and refactor the keys to create a new table.
There are about 30M records, each with a value size of about 5-20 KB.
I'm using Accumulo 1.8.0 and the Java accumulo-core client library 1.8.0.

I've written client code that does the following:

 * create a scanner that fetches a specific column in a specific range
 * transform each key into the new schema
 * use a batch writer to write the newly generated mutations into the new table

    scan = createScanner(FROM, auths)
    // range, fetchColumn
    writer = createBatchWriter(TO, configWriter)
    iter = scan.iterator()
    while (iter.hasNext()) {
        entry = iter.next()
        // create mutation with new key schema, but unaltered value
        writer.addMutation(mutation)
    }
    writer.close()
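Factored out of the loop above, the transformation step might look like the sketch below. The actual schema change was not specified in the question, so the one shown (swapping row and column qualifier) is purely hypothetical, and `SimpleKey` is only a stand-in for Accumulo's `Key`:

```java
import java.util.AbstractMap;
import java.util.Map;

// Sketch of the per-entry transformation step, independent of the Accumulo API.
// The schema change shown (swapping row and column qualifier) is hypothetical --
// substitute the real refactoring.
public class KeyRefactor {

    // A minimal stand-in for the (row, family, qualifier) part of an Accumulo Key.
    public static final class SimpleKey {
        public final String row, family, qualifier;
        public SimpleKey(String row, String family, String qualifier) {
            this.row = row; this.family = family; this.qualifier = qualifier;
        }
    }

    // Transform an old key into a new one; the value is passed through unaltered,
    // mirroring the "unaltered value" comment in the pseudocode above.
    public static Map.Entry<SimpleKey, byte[]> refactor(SimpleKey old, byte[] value) {
        SimpleKey refactored = new SimpleKey(old.qualifier, old.family, old.row);
        return new AbstractMap.SimpleEntry<>(refactored, value);
    }
}
```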

But this is slow and error-prone (hiccups, etc.).
Is it possible to use the Accumulo shell for such a task?
Are there other solutions or tricks I can use?

Thank you very much for any advice!

Regards,
Sven

--
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
[hidden email]
www.scai.fraunhofer.de

Re: Key Refactoring

Dylan Hutchison
Hi Sven,

There are other solutions that depend on what your Key schema transformation is.  

If the new schema is order-compatible with the old one, meaning that the new Keys have the same sort order as the old keys, then you could (1) clone the table and (2) attach a server-side SortedKeyValueIterator (SKVI) that performs the transformation on all iterator scopes.  This will change the schema "on the fly".  Even if your new schema is order-compatible with the old schema up to a prefix (say, up to the Row), you could use this trick inside your SKVI by (1) gathering all keys within that prefix (e.g., WholeRowIterator), (2) transforming each gathered Key, and (3) emitting the new Keys in sorted order.
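The order-compatibility condition above can be checked empirically on a sample of keys: the transformation preserves sort order exactly when sorted input keys remain sorted after transformation. A stdlib-only sketch, with plain strings standing in for keys and a hypothetical transformation in the test:

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.UnaryOperator;

public class OrderCompat {
    // Returns true if applying 'transform' to every key in the (already sorted)
    // sample leaves the keys in the same relative order -- a necessary condition
    // for rewriting the schema with a server-side iterator without re-sorting.
    public static boolean preservesOrder(List<String> sortedKeys,
                                         UnaryOperator<String> transform,
                                         Comparator<String> cmp) {
        for (int i = 1; i < sortedKeys.size(); i++) {
            String a = transform.apply(sortedKeys.get(i - 1));
            String b = transform.apply(sortedKeys.get(i));
            if (cmp.compare(a, b) > 0) return false;
        }
        return true;
    }
}
```

Note this is only evidence over the sample, not a proof for all keys; for a real migration you would want to reason about the transformation itself.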

If your Key schema transformation non-monotonically changes the Key sort order, there are fewer built-in Accumulo options.  You might look at the iterator framework provided by the Graphulo library.  Graphulo is built to do complex server-side data processing, reading in entries from some number of tables and writing them out to a new table at the server (see RemoteWriteIterator).  Disclaimer: I authored Graphulo.

If you decide to go with your original solution, you might consider running multiple such Accumulo clients in parallel.
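Running several clients in parallel amounts to splitting the scan into sub-ranges and giving each worker its own range. A stdlib-only sketch of that shape; the Scanner/BatchWriter calls are left as comments since they need a live connection, and the range labels are arbitrary:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCopy {
    // Run one copy task per sub-range on a fixed-size thread pool. In real code
    // each task would open its own Scanner (restricted to its range) and its own
    // BatchWriter; here a task just returns its range label so the example
    // stays self-contained.
    public static List<String> copyInParallel(List<String> subRanges, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String range : subRanges) {
                futures.add(pool.submit(() -> {
                    // scanner = connector.createScanner(FROM, auths);
                    // scanner.setRange(range);
                    // writer = connector.createBatchWriter(TO, config);
                    // ... transform and write, as in the original loop ...
                    return range;
                }));
            }
            List<String> done = new ArrayList<>();
            for (Future<String> f : futures) done.add(f.get());  // wait for all tasks
            return done;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Splitting at the table's existing tablet split points is one natural way to choose the sub-ranges.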

Cheers, Dylan

On Wed, Jun 21, 2017 at 1:49 AM, Sven Hodapp <[hidden email]> wrote:


Re: Key Refactoring

Adam Fuchs
Sven,

You might consider using a combination of AccumuloInputFormat and AccumuloFileOutputFormat in a map/reduce job. The job will run in parallel, speeding up your transformation; the map/reduce framework should help with hiccups; and the bulk load at the end provides an atomic, eventually consistent commit. These input/output formats can also be used with other job frameworks, like Spark. See for example:

examples/simple/src/main/java/org/apache/accumulo/examples/simple/mapreduce/TableToFile.java
examples/simple/src/main/java/org/apache/accumulo/examples/simple/mapreduce/bulk/BulkIngestExample.java
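A key constraint behind this approach is that files produced for bulk import must be written in sorted key order, which the shuffle/reduce phase of the job provides after the mapper has rewritten the keys. Outside a map/reduce framework the same shape is transform, then sort, then write. A stdlib-only sketch of that shape; the string keys and the reversal transform are stand-ins:

```java
import java.util.List;
import java.util.stream.Collectors;

public class BulkPrep {
    // Transform every key, then sort the results before they would be written
    // out for bulk import -- bulk-loaded files must be in key order, which is
    // what the shuffle/reduce phase of a map/reduce job guarantees for you.
    public static List<String> transformAndSort(List<String> oldKeys) {
        return oldKeys.stream()
                .map(k -> new StringBuilder(k).reverse().toString()) // hypothetical transform
                .sorted()
                .collect(Collectors.toList());
    }
}
```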

Cheers,
Adam



On Wed, Jun 21, 2017 at 1:49 AM, Sven Hodapp <[hidden email]> wrote:
