Can I connect an InputStream to a Mutation value?


David Medinets
Some of the XML records that I work with are over 50 MB. I was hoping to
store them inside of Accumulo instead of the text-based HDFS XML super
file currently being used. However, since they are so large I can't
create a Value object without running out of memory. Storing values
this large may simply mean I'm using the wrong tool; please let me know.

Re: Can I connect an InputStream to a Mutation value?

John Vines
There's no way to do that. But if you're simply running out of memory in your ingest, you may just have to increase your ingester's heap size and decrease your BatchWriter's memory buffer.

John

On Sun, Jun 17, 2012 at 9:54 AM, David Medinets <[hidden email]> wrote:
> Some of the XML records that I work with are over 50 MB. I was hoping to
> store them inside of Accumulo instead of the text-based HDFS XML super
> file currently being used. However, since they are so large I can't
> create a Value object without running out of memory. Storing values
> this large may simply mean I'm using the wrong tool; please let me know.


Re: Can I connect an InputStream to a Mutation value?

Billie J Rinaldi
In reply to this post by David Medinets
Look at the filedata example. It splits up files into chunks of a given size, puts each in a separate value, and reads them back as an input stream.

Billie
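
A minimal sketch of the chunking idea described above, in plain Java with no Accumulo dependency. It is not the filedata example's code: the class and method names (`ChunkedStreams`, `split`, `join`, `readAll`) are hypothetical, and the in-memory list only stands in for the table. A real ingest would presumably write each chunk as its own value as soon as it is read, so that only one chunk is on the heap at a time.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ChunkedStreams {

    /** Splits a stream into chunks of at most chunkSize bytes, buffering one chunk at a time. */
    static List<byte[]> split(InputStream in, int chunkSize) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        List<byte[]> chunks = new ArrayList<>(); // hypothetical stand-in for the table
        byte[] buf = new byte[chunkSize];
        try {
            while (true) {
                int filled = 0;
                // read() may return fewer bytes than requested, so loop until full or EOF.
                while (filled < chunkSize) {
                    int n = in.read(buf, filled, chunkSize - filled);
                    if (n < 0) break;
                    filled += n;
                }
                if (filled == 0) break;              // nothing left to store
                chunks.add(Arrays.copyOf(buf, filled));
                if (filled < chunkSize) break;       // a short chunk means EOF was reached
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chunks;
    }

    /** Presents the stored chunks, in order, as one continuous InputStream. */
    static InputStream join(List<byte[]> chunks) {
        List<InputStream> parts = new ArrayList<>();
        for (byte[] chunk : chunks) parts.add(new ByteArrayInputStream(chunk));
        return new SequenceInputStream(Collections.enumeration(parts));
    }

    /** Drains a stream into a byte array (used here only to check the round trip). */
    static byte[] readAll(InputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            int n;
            while ((n = in.read(buf)) >= 0) out.write(buf, 0, n);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] xml = "<doc><item>one</item><item>two</item></doc>".getBytes(StandardCharsets.UTF_8);
        List<byte[]> chunks = split(new ByteArrayInputStream(xml), 16);
        System.out.println(chunks.size() + " chunks");
        System.out.println(new String(readAll(join(chunks)), StandardCharsets.UTF_8));
    }
}
```

The reader side never needs the whole record at once either: `SequenceInputStream` moves on to the next chunk only when the current one is exhausted.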

Re: Can I connect an InputStream to a Mutation value?

Jim Klucar
In reply to this post by David Medinets
David,

Can you give a taste of the schema of the XML? With that we may be
able to help break the XML file up into keys and help create an index
for it. IMHO that's the power you would get from Accumulo. If you just
want it as one big lump, and don't need to search it or retrieve only
portions of the file, then putting it in Accumulo is just adding
overhead to HDFS.



On Jun 17, 2012, at 9:54 AM, David Medinets <[hidden email]> wrote:

> Some of the XML records that I work with are over 50 MB. I was hoping to
> store them inside of Accumulo instead of the text-based HDFS XML super
> file currently being used. However, since they are so large I can't
> create a Value object without running out of memory. Storing values
> this large may simply mean I'm using the wrong tool; please let me know.

Re: Can I connect an InputStream to a Mutation value?

David Medinets
Thanks for the offer. I'm thinking of a situation where I don't know the
schema ahead of time. For example, XML arriving on a JMS queue that I
simply want to store somewhere and let some other program parse. This is
a thought experiment.

On Sun, Jun 17, 2012 at 1:06 PM, Jim Klucar <[hidden email]> wrote:

> David,
>
> Can you give a taste of the schema of the XML? With that we may be
> able to help break the XML file up into keys and help create an index
> for it. IMHO that's the power you would get from Accumulo. If you just
> want it as one big lump, and don't need to search it or retrieve only
> portions of the file, then putting it in Accumulo is just adding
> overhead to HDFS.

Re: Can I connect an InputStream to a Mutation value?

Marc P.
I'm sorry, I must be missing something.

Why does the schema matter? If you were to build keys from all
attributes and elements, you could, at any point, rebuild the XML
document. You could store the hierarchy by virtue of your keys.

If you were to do that, the previous suggestions would be applicable.
Realistically, if you stored the entire XML file in a single
key/value pair, the whole value would still land on the heap when it
is received over Thrift (at the client). Streaming would therefore
only add complexity and additional memory overhead. It wouldn't give
you what you want.

Splitting the file among keys can maintain the hierarchy, allow you to
rebuild the XML doc, and still let you store large records in values.
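
One way the "keys from all attributes and elements" idea could look, as a rough sketch in plain Java using the JDK's StAX parser. Nothing here is an Accumulo API: `XmlToKeys`, `flatten`, and the key layout (a four-digit sibling position in front of each path segment, `@name` for attributes, `#text` for text nodes) are assumptions made for illustration, and a sorted map stands in for the table. Because the parser streams, no schema is needed and the document is never held in memory as a whole.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class XmlToKeys {

    /** Streams through the XML and emits one key/value entry per element, attribute, and text node. */
    static Map<String, String> flatten(Reader xml) {
        Map<String, String> entries = new TreeMap<>();  // sorted, like the table would be
        Deque<String> path = new ArrayDeque<>();        // segments of the currently open elements
        Deque<int[]> childCounts = new ArrayDeque<>();  // children seen so far, per open element
        childCounts.push(new int[] {0});                // counter for the document level
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE); // one event per text run
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);  // no DTD processing
        try {
            XMLStreamReader r = factory.createXMLStreamReader(xml);
            while (r.hasNext()) {
                switch (r.next()) {
                    case XMLStreamConstants.START_ELEMENT: {
                        // The zero-padded position keeps siblings in document order
                        // (this sketch assumes fewer than 10000 siblings).
                        int pos = childCounts.peek()[0]++;
                        path.addLast(String.format("%04d:%s", pos, r.getLocalName()));
                        childCounts.push(new int[] {0});
                        String key = "/" + String.join("/", path);
                        entries.put(key, "");           // records the element even if it is empty
                        for (int i = 0; i < r.getAttributeCount(); i++) {
                            entries.put(key + "/@" + r.getAttributeLocalName(i), r.getAttributeValue(i));
                        }
                        break;
                    }
                    case XMLStreamConstants.CHARACTERS: {
                        String text = r.getText().trim(); // note: surrounding whitespace is dropped
                        if (!text.isEmpty()) {
                            int pos = childCounts.peek()[0]++;
                            String key = "/" + String.join("/", path);
                            entries.put(key + "/" + String.format("%04d:#text", pos), text);
                        }
                        break;
                    }
                    case XMLStreamConstants.END_ELEMENT:
                        path.removeLast();
                        childCounts.pop();
                        break;
                    default:
                        break;
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("malformed XML", e);
        }
        return entries;
    }

    public static void main(String[] args) {
        String xml = "<order id=\"7\"><item>pen</item><item>ink</item></order>";
        flatten(new StringReader(xml)).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

A single very large text node would still produce a single very large value under this layout, so it would need to be combined with some form of chunking.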

On Mon, Jun 18, 2012 at 2:00 PM, David Medinets
<[hidden email]> wrote:

> Thanks for the offer. I'm thinking of a situation where I don't know the
> schema ahead of time. For example, XML arriving on a JMS queue that I
> simply want to store somewhere and let some other program parse. This is
> a thought experiment.

Re: Can I connect an InputStream to a Mutation value?

Adam Fuchs
There's also the concern of elements of the document that are too large by themselves. A general-purpose streaming solution would include support for any kind of object passed in, not just XML with small elements. I think the fact that it is an XML document is probably a red herring in this case.

In the past, what we have done is solve this on the application side by breaking up large objects into chunks and then using a key structure that groups and maintains the order of the chunks. This usually means that we append a sequence number to the column qualifier using an integer encoding. The filedata example that Billie referred to does this. Accumulo would benefit from some sort of general-purpose fragmentation solution for streaming large objects, and an InputStream/OutputStream solution might be good for that. Sounds like a fun project!

Adam
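
A small sketch of why the sequence number needs a fixed-width encoding, in plain Java with a `TreeMap` standing in for the sorted table. The names (`ChunkKeys`, `qualifier`, `reassemble`) and the particular encoding (a `:` separator followed by eight hex digits) are assumptions for illustration, not the encoding used by the filedata example. The point is only that keys are compared lexicographically, so a plain decimal suffix would put chunk 10 before chunk 2.

```java
import java.util.SortedMap;
import java.util.TreeMap;

class ChunkKeys {

    /** Builds "name:XXXXXXXX" so that lexicographic key order equals chunk order. */
    static String qualifier(String name, int seq) {
        if (seq < 0) throw new IllegalArgumentException("seq must be non-negative");
        return name + ":" + String.format("%08x", seq);
    }

    /** Concatenates every chunk stored under the given name, in key order. */
    static String reassemble(SortedMap<String, String> table, String name) {
        String from = name + ":";                       // first possible key for this object
        String to = name + ":" + Character.MAX_VALUE;   // sorts after every hex suffix
        StringBuilder out = new StringBuilder();
        for (String chunk : table.subMap(from, to).values()) {
            out.append(chunk);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        // Write twelve chunks in reverse order; the sorted map restores the sequence.
        for (int seq = 11; seq >= 0; seq--) {
            table.put(qualifier("doc", seq), "[" + seq + "]");
        }
        System.out.println(reassemble(table, "doc"));
    }
}
```

The range scan over one prefix is what "groups" the chunks; the padded suffix is what "maintains the order".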


On Mon, Jun 18, 2012 at 2:06 PM, Marc P. <[hidden email]> wrote:
> I'm sorry, I must be missing something.
>
> Why does the schema matter? If you were to build keys from all
> attributes and elements, you could, at any point, rebuild the XML
> document. You could store the hierarchy by virtue of your keys.
>
> If you were to do that, the previous suggestions would be applicable.
> Realistically, if you stored the entire XML file in a single
> key/value pair, the whole value would still land on the heap when it
> is received over Thrift (at the client). Streaming would therefore
> only add complexity and additional memory overhead. It wouldn't give
> you what you want.
>
> Splitting the file among keys can maintain the hierarchy, allow you to
> rebuild the XML doc, and still let you store large records in values.

