Status record lacked createdTime

Adam J. Shook
Hello folks,

One of our clusters has been throwing a handful of replication errors from the status maker -- see below.  The WAL files in question do not belong to an active tserver -- some investigation in the code shows that the createdTime could not be written, and these WALs will sit here until a createdTime is added.

I wanted to bring some attention to this -- I think my immediate course of action here is to manually add a createdTime so the files will be replicated, then address this within the Accumulo source code itself.  Thoughts?

Status record ([begin: 0 end: 0 infiniteEnd: true closed:true]) for hdfs://foo:8020/accumulo/wal/blah/blah in table k was written to metadata table which lacked createtime

Thank you,
--Adam

Re: Status record lacked createdTime

Josh Elser
Hey Adam,

Thanks for sharing this one.

Adam J. Shook wrote:
> Hello folks,
>
> One of our clusters has been throwing a handful of replication errors
> from the status maker -- see below.  The WAL files in question do not
> belong to an active tserver -- some investigation in the code shows that
> the createdTime could not be written and these WALs will sit here until
> a created time is added.

Does that mean you saw an exception when the mutation carrying the
createdTime was written to accumulo.metadata? Or is the reason that WAL
didn't get this 'attribute' still unknown?

I think the kind of fix to make is dependent on the cause here. E.g., if
this is just a bug, a standalone tool to fix this case would be good.
However, if there's an inherent issue where this case might happen and
we can't guarantee the record was written (server failure), it might be
best to add some process to the master/gc to eventually add one (e.g. if
we see the WAL has been hanging out in this state, add a createdTime
after ~12hrs).
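
The timeout-based fallback suggested above could be sketched roughly as follows. This is a standalone Python illustration of the decision logic only, not Accumulo code: `WalStatus` is a simplified stand-in for the replication Status record, and `maybe_backfill`, `first_seen_ms`, and the choice to use the first-seen time as the backfilled createdTime are all assumptions made for the example. The 12-hour threshold comes from the "~12hrs" figure above.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Threshold taken from the "~12hrs" suggestion in the thread (an assumption, not a real setting).
BACKFILL_AFTER_MS = 12 * 60 * 60 * 1000


@dataclass(frozen=True)
class WalStatus:
    """Simplified stand-in for a replication Status record (hypothetical, not Accumulo's class)."""
    begin: int
    end: int
    infinite_end: bool
    closed: bool
    created_time: Optional[int] = None  # epoch millis; None means the record lacks it


def maybe_backfill(status: WalStatus, first_seen_ms: int, now_ms: int,
                   threshold_ms: int = BACKFILL_AFTER_MS) -> WalStatus:
    """Return a record with created_time filled in once it has been missing too long."""
    if status.created_time is not None:
        return status  # nothing to repair
    if now_ms - first_seen_ms < threshold_ms:
        return status  # still within the grace period; the normal writer may yet add it
    # Assumption: the time the record was first observed is an acceptable createdTime.
    return replace(status, created_time=first_seen_ms)
```

A periodic task in the master or GC would call something like this for each WAL record it sees and write back any record that changed. Whether first-seen time is the right value to assign is an open design question.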


Re: Status record lacked createdTime

Adam J. Shook
No exception on write -- this is coming from the master when it goes to assign work to the accumulo.replication table.  Some of the WALs are fairly old.

Not too sure why it didn't get the attribute; my guess is server failure before it was able to append the created time.  Some of the WAL files are empty, others have data in them.  I think a tool will suffice for now when the issue crops up, but it'll need to get fixed in the Master/GC so that, after some condition, it will assign it a createdTime so replication will occur -- or whenever the first metadata entry is added, give it a createdTime.
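
A first step for such a tool could be collecting the affected WALs from the master log. The sketch below is standalone Python that matches the error message quoted in the first post. It assumes each message sits on a single log line, possibly preceded by a timestamp or logger prefix. The names `find_affected_wals` and `AffectedWal` are hypothetical, and the sketch does not touch Accumulo itself.

```python
import re
from typing import Iterable, List, NamedTuple

# Pattern built from the error message quoted in this thread; any prefix on the
# log line (timestamp, logger name) is skipped because search() is used.
LACKED_CREATED_TIME = re.compile(
    r"Status record \((?P<status>\[.*?\])\) for (?P<wal>\S+) in table (?P<table>\S+) "
    r"was written to metadata table which lacked createtime"
)


class AffectedWal(NamedTuple):
    wal: str     # WAL path as reported in the log
    table: str   # table as reported in the log
    status: str  # bracketed Status text as logged


def find_affected_wals(log_lines: Iterable[str]) -> List[AffectedWal]:
    """Return each distinct (WAL, table) pair reported as lacking a createdTime."""
    seen = set()
    affected = []
    for line in log_lines:
        match = LACKED_CREATED_TIME.search(line)
        if match is None:
            continue
        key = (match.group("wal"), match.group("table"))
        if key in seen:
            continue  # the same WAL is typically reported repeatedly
        seen.add(key)
        affected.append(AffectedWal(match.group("wal"), match.group("table"),
                                    match.group("status")))
    return affected
```

The resulting list only identifies which records need attention. Actually writing a createdTime back to the metadata table would still have to go through Accumulo.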

--Adam



Re: Status record lacked createdTime

Josh Elser
Cool. I know it's one of those hard things to do, but if we can be
somewhat sure it was just server failure, that's good. I'd hate to write
a workaround and find it was just because I wrote some bad code :P

Maybe something in the GC logic which presently does the closing of the
WALs would be easiest? Should be pretty easy to build a util from the
core logic too.

LMK if/how I can help.
