Corrupt WAL

Adam J. Shook
Hey all,

The root tablet on one of our dev systems isn't loading due to an illegal state exception -- COMPACTION_FINISH preceding COMPACTION_START.  What'd be the best way to mitigate this issue?  This was likely caused by both of our NameNodes failing.

Thank you,
--Adam

Re: Corrupt WAL

Christopher Tubbs-2
What version are you using?

On Mon, Jun 11, 2018 at 5:27 PM Adam J. Shook <[hidden email]> wrote:
Hey all,

The root tablet on one of our dev systems isn't loading due to an illegal state exception -- COMPACTION_FINISH preceding COMPACTION_START.  What'd be the best way to mitigate this issue?  This was likely caused due to both of our NameNodes failing.

Thank you,
--Adam

Re: Corrupt WAL

Adam J. Shook
Sorry, it would have been good to include that :)  It's the newest release, 1.9.1.  I think it relates to https://github.com/apache/accumulo/pull/458; I'm just not sure what the best thing to do here is.

On Mon, Jun 11, 2018 at 5:46 PM, Christopher <[hidden email]> wrote:
What version are you using?

Re: Corrupt WAL

Christopher Tubbs-2
That's what I was thinking it was related to. Do you know if the particular WAL file was created from a previous version, from before you upgraded?

On Mon, Jun 11, 2018 at 6:00 PM Adam J. Shook <[hidden email]> wrote:
Sorry would have been good to include that :)  It's the newest 1.9.1.  I think it relates to https://github.com/apache/accumulo/pull/458, just not sure what the best thing to do here is.

Re: Corrupt WAL

Adam J. Shook
The WAL is from 1.9.1.

On Mon, Jun 11, 2018 at 6:33 PM, Christopher <[hidden email]> wrote:
That's what I was thinking it was related to. Do you know if the particular WAL file was created from a previous version, from before you upgraded?

Re: Corrupt WAL

Keith Turner
In reply to this post by Adam J. Shook
Is the message you are seeing "COMPACTION_FINISH (without preceding
COMPACTION_START)"?  That message indicates that the WALs are
incomplete, probably as a result of the NN problems.  You could do the
following:

1) Run the following command to see what's in the log.  You need to see
what is there for the root tablet.

   accumulo org.apache.accumulo.tserver.logger.LogReader

2) Replace the log file with an empty file after checking whether there is
anything important in it.

I think the list of WALs for the root tablet is stored in ZK at
/accumulo/<id>/walogs
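
As a rough sketch of step 2, assuming the WAL lives in HDFS and the standard `hdfs` command-line client is available: the helper below only builds the two commands (`hdfs dfs -mv` to keep a backup copy, `hdfs dfs -touchz` to create a zero-length file at the original path) and does not run anything. The function name, the example path, and the `.corrupt` suffix are made up for illustration; this is not an official recovery procedure.

```python
import shlex


def plan_wal_replacement(wal_path, backup_suffix=".corrupt"):
    """Return the commands to back up a WAL and leave an empty file in its place.

    wal_path and backup_suffix are placeholders chosen for this sketch.
    """
    if not wal_path:
        raise ValueError("wal_path must not be empty")
    backup_path = wal_path + backup_suffix
    return [
        # Keep the original file so it can still be inspected later.
        ["hdfs", "dfs", "-mv", wal_path, backup_path],
        # Create a zero-length file where the original WAL used to be.
        ["hdfs", "dfs", "-touchz", wal_path],
    ]


if __name__ == "__main__":
    # Hypothetical path; substitute the WAL named in the tablet server log.
    for cmd in plan_wal_replacement("/accumulo/wal/example-host/example-wal"):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the commands first, rather than executing them, leaves room to review the plan and to look at the log contents with LogReader before anything is moved.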

Re: Corrupt WAL

Adam J. Shook
Yes, that is the error.  I'll inspect the logs and report back.

On Tue, Jun 12, 2018 at 10:14 AM, Keith Turner <[hidden email]> wrote:
Is the message you are seeing "COMPACTION_FINISH (without preceding
COMPACTION_START)" ?

Re: Corrupt WAL

Keith Turner
On Tue, Jun 12, 2018 at 12:10 PM, Adam J. Shook <[hidden email]> wrote:
> Yes, that is the error.  I'll inspect the logs and report back.

Ok.  The LogReader command has a mechanism to filter which tablet is
displayed.  If the walog has a lot of data in it, you may need to use
this.

Also, be aware that only 5 mutations are shown for a "many mutations"
object in the walog.  The -m option changes this.  You may want to see
more when deciding if the info in the log is important.
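
A small sketch of how such an invocation might be assembled from a script. Only the class name and the existence of the `-m` option come from this thread; that `-m` takes a number and that the WAL files are passed as trailing arguments are assumptions here, and the function name and path are placeholders. The tablet filter is left out because its option name is not given above; the LogReader usage output should list it.

```python
def logreader_command(wal_files, max_mutations=None):
    """Build an argv list for the LogReader class mentioned in the thread.

    max_mutations maps to -m; passing None keeps the tool's default.
    """
    if not wal_files:
        raise ValueError("at least one WAL file is required")
    cmd = ["accumulo", "org.apache.accumulo.tserver.logger.LogReader"]
    if max_mutations is not None:
        if max_mutations < 1:
            raise ValueError("max_mutations must be positive")
        # Assumption: -m is followed by the number of mutations to print.
        cmd += ["-m", str(max_mutations)]
    # Assumption: the WAL files to read are given as positional arguments.
    cmd += list(wal_files)
    return cmd


if __name__ == "__main__":
    # Hypothetical WAL path, shown only to illustrate the resulting command.
    print(" ".join(logreader_command(["/accumulo/wal/example-wal"], max_mutations=50)))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` on a host where the `accumulo` launcher is on the PATH.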


Re: Corrupt WAL

Adam J. Shook
Looking at the log, I see that the last two entries are a COMPACTION_START for one RFile immediately followed by a COMPACTION_START for a separate RFile, which (I believe) would lead to the error.  Would this necessarily be an issue if the compactions are for separate RFiles?

This is a dev cluster and I don't necessarily care about it, but is there a (good) means to do WAL surgery?  I imagine I can just chop off bytes until the log is parseable and missing the info about the compactions.

On Tue, Jun 12, 2018 at 2:32 PM, Keith Turner <[hidden email]> wrote:
Ok.  The LogReader command has a mechanism to filter which tablet is
displayed.

Re: Corrupt WAL

Adam J. Shook
Sorry, I had the error backwards.  There is an OPEN for the WAL and then immediately a COMPACTION_FINISH entry.  This would cause the error.
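
To make the failing sequence concrete, here is a toy model of the ordering check for a single tablet. It is not Accumulo's recovery code, which tracks more state than this; the event names are the ones mentioned in this thread and the function name is invented for the sketch.

```python
def check_compaction_order(events):
    """Return True if every COMPACTION_FINISH follows a COMPACTION_START.

    Raises ValueError at the first COMPACTION_FINISH with no open start.
    """
    started = False
    for index, event in enumerate(events):
        if event == "COMPACTION_START":
            started = True
        elif event == "COMPACTION_FINISH":
            if not started:
                raise ValueError(
                    "COMPACTION_FINISH (without preceding COMPACTION_START) "
                    f"at entry {index}"
                )
            # The finish closes the compaction opened by the last start.
            started = False
    return True


if __name__ == "__main__":
    # A complete log passes the check.
    print(check_compaction_order(["OPEN", "COMPACTION_START", "COMPACTION_FINISH"]))
    # The sequence described above: OPEN, then immediately COMPACTION_FINISH.
    try:
        check_compaction_order(["OPEN", "COMPACTION_FINISH"])
    except ValueError as err:
        print("rejected:", err)
```

In this simplified model, two consecutive COMPACTION_START entries are accepted; only a finish with no start before it is rejected.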

On Wed, Jun 13, 2018 at 11:34 AM, Adam J. Shook <[hidden email]> wrote:
Looking at the log I see that the last two entries are COMPACTION_START of one RFile immediately followed by a COMPACTION_START of a separate RFile which (I believe) would lead to the error.  Would this necessarily be an issue if the compactions are for separate RFiles?

Re: Corrupt WAL

tech.shan@gmail.com
Was there any success with this workaround strategy?  I am also experiencing this issue.

On 2018/06/13 16:30:22, "Adam J. Shook" <[hidden email]> wrote:

> Sorry, I had the error backwards.  There is an OPEN for the WAL and then
> immediately a COMPACTION_FINISH entry.  This would cause the error.

RE: Corrupt WAL

Ed Coleman
There has been work done in https://github.com/apache/accumulo/pull/574. I'm not certain of the state of the code, but the description may provide you with things that you could look at manually.


-----Original Message-----
From: [hidden email] [mailto:[hidden email]]
Sent: Tuesday, August 21, 2018 5:45 PM
To: [hidden email]
Subject: Re: Corrupt WAL

Was there any success with this workaround strategy?  I am also experiencing this issue.

Re: Corrupt WAL

Adam J. Shook
The code referenced in the PR works to detect and move a WAL, replacing it with an empty one, but isn't fully wrapped up/merged.  Some priorities were shifted and this got pushed back, though I do plan on addressing the comments in the code review Soon.

I'd suggest upgrading to 1.9.2 once you resolve the issue.  We've been running it for a while and have not had any WAL-related errors.

--Adam

On Tue, Aug 21, 2018 at 6:58 PM Ed Coleman <[hidden email]> wrote:
There has been work done in https://github.com/apache/accumulo/pull/574. I'm not certain of the state of the code, but the description may provide you with things that you could look at manually.

