recovering Accumulo instance from missing root WALs (deleted by gc)


Jonathan LASKO
Hello Accumulo wizards,

I have a large schema of test data in an Accumulo instance that is currently inaccessible, and I would like to recover it if possible. I'll explain the problem in the hope that folks who know the intricacies of the Accumulo root table, WALs, and the recovery process can tell me whether there is anything else worth trying, or whether I should treat this schema as hosed.

The problem is similar to what was reported here (https://community.hortonworks.com/questions/52718/failed-to-locate-tablet-for-table-0-row-err.html), i.e. no tablets are loaded except one from accumulo.root, and the logs are rapidly repeating these messages:

==> monitor_stti-master.bbn.com.debug.log <==
2017-04-21 07:10:55,047 [impl.ThriftScanner] DEBUG:  Failed to locate tablet for table : !0 row : ~err_

==> master_stti-master.bbn.com.debug.log <==
2017-04-21 07:10:55,430 [master.Master] DEBUG: Finished gathering information from 13 servers in 0.03 seconds
2017-04-21 07:10:55,430 [master.Master] DEBUG: not balancing because there are unhosted tablets: 2

The RecoveryManager insists that it is trying to recover five WALs:

2017-04-21 07:28:48,349 [recovery.RecoveryManager] DEBUG: Recovering hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/0d28801e-322e-44e6-97e3-a34a14b4bd1a to hdfs://stti-nn-01.bbn.com:8020/accumulo/recovery/0d28801e-322e-44e6-97e3-a34a14b4bd1a
2017-04-21 07:28:48,358 [recovery.RecoveryManager] DEBUG: Recovering hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/696d4353-0041-4397-a1f5-b8600b5cb2e9 to hdfs://stti-nn-01.bbn.com:8020/accumulo/recovery/696d4353-0041-4397-a1f5-b8600b5cb2e9
2017-04-21 07:28:48,362 [recovery.RecoveryManager] DEBUG: Recovering hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/e62f4195-c7d6-419a-a696-ff89b10cecc3 to hdfs://stti-nn-01.bbn.com:8020/accumulo/recovery/e62f4195-c7d6-419a-a696-ff89b10cecc3
2017-04-21 07:28:48,366 [recovery.RecoveryManager] DEBUG: Recovering hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/01a0887e-4ac8-4772-8f5f-b99371e1df0a to hdfs://stti-nn-01.bbn.com:8020/accumulo/recovery/01a0887e-4ac8-4772-8f5f-b99371e1df0a
2017-04-21 07:28:48,369 [recovery.RecoveryManager] DEBUG: Recovering hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/6f392ec5-821b-4fd5-83e4-baf1f47d8105 to hdfs://stti-nn-01.bbn.com:8020/accumulo/recovery/6f392ec5-821b-4fd5-83e4-baf1f47d8105

Based on the advice from the post linked above, I grepped the logs and was able to confirm that all five of those WALs were actually deleted (here's the output from my grep; note the earlier timestamps):

gc_stti-master.bbn.com.debug.log.2:2017-04-12 14:49:36,275 [gc.GarbageCollectWriteAheadLogs] DEBUG: deleted [hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/6f392ec5-821b-4fd5-83e4-baf1f47d8105] from stti-data-102.bbn.com+10011
gc_stti-master.bbn.com.debug.log.2:2017-04-12 14:49:36,280 [gc.GarbageCollectWriteAheadLogs] DEBUG: deleted [hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/e62f4195-c7d6-419a-a696-ff89b10cecc3] from stti-data-103.bbn.com+10011
gc_stti-master.bbn.com.debug.log.3:2017-04-03 20:25:26,699 [gc.GarbageCollectWriteAheadLogs] DEBUG: deleted [hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/0d28801e-322e-44e6-97e3-a34a14b4bd1a] from stti-data-103.bbn.com+10011
gc_stti-master.bbn.com.debug.log.3:2017-04-08 16:32:11,106 [gc.GarbageCollectWriteAheadLogs] DEBUG: deleted [hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/01a0887e-4ac8-4772-8f5f-b99371e1df0a] from stti-data-102.bbn.com+10011
gc_stti-master.bbn.com.debug.log.3:2017-04-08 16:37:14,875 [gc.GarbageCollectWriteAheadLogs] DEBUG: deleted [hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/696d4353-0041-4397-a1f5-b8600b5cb2e9] from stti-data-103.bbn.com+10011
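For reference, the grep was along these lines (the log directory and filename pattern are assumptions about my install; adjust for yours):

```shell
# Search rotated GC debug logs for WAL deletion records,
# then filter to the five UUIDs the RecoveryManager is asking for.
# (Log directory /var/log/accumulo is an assumption.)
grep -H 'GarbageCollectWriteAheadLogs.*deleted' /var/log/accumulo/gc_*.debug.log* \
  | grep -E '0d28801e|696d4353|e62f4195|01a0887e|6f392ec5'
```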

All five WALs are still referenced in the accumulo.root table:

!0;~ log:stti-data-103.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/0d28801e-322e-44e6-97e3-a34a14b4bd1a []    hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/0d28801e-322e-44e6-97e3-a34a14b4bd1a|1
!0;~ log:stti-data-103.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/696d4353-0041-4397-a1f5-b8600b5cb2e9 []    hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/696d4353-0041-4397-a1f5-b8600b5cb2e9|1
!0;~ log:stti-data-103.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/e62f4195-c7d6-419a-a696-ff89b10cecc3 []    hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-103.bbn.com+10011/e62f4195-c7d6-419a-a696-ff89b10cecc3|1
...
!0< log:stti-data-102.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/01a0887e-4ac8-4772-8f5f-b99371e1df0a []    hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/01a0887e-4ac8-4772-8f5f-b99371e1df0a|1
!0< log:stti-data-102.bbn.com:10011/hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/6f392ec5-821b-4fd5-83e4-baf1f47d8105 []    hdfs://stti-nn-01.bbn.com:8020/accumulo/wal/stti-data-102.bbn.com+10011/6f392ec5-821b-4fd5-83e4-baf1f47d8105|1
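(For reference, I pulled those entries with a scan over the root table's log column family; a sketch, assuming the 1.x shell and a root user:)

```shell
# List per-tablet WAL references held in accumulo.root
# (the 'log' column family holds the WAL pointers for metadata tablets)
accumulo shell -u root -p "$ROOT_PASSWORD" -e 'scan -t accumulo.root -c log'
```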

I also observe three outstanding fate transactions (at least two of which appear to me to be related to the accumulo.root table):

root@bbn-beta> fate print
txid: 6b33fa130909f05d  status: IN_PROGRESS         op: CompactRange     locked: [R:+accumulo, R:!0] locking: []              top: CompactionDriver
txid: 564d758d584af61e  status: IN_PROGRESS         op: CompactRange     locked: [R:+accumulo, R:!0] locking: []              top: CompactionDriver
txid: 4a620317a53a4a93  status: IN_PROGRESS         op: CreateTable      locked: [W:5e, R:+default] locking: []              top: PopulateMetadata

I checked ZooKeeper, and the /accumulo/$INSTANCE/root_tablet/walogs and /accumulo/$INSTANCE/recovery/[locks] nodes are all empty.
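(I verified that with zkCli; a sketch, assuming $INSTANCE holds the instance ID and ZooKeeper is reachable locally:)

```shell
# Confirm no root-tablet WAL references or recovery entries remain in ZooKeeper
# ($INSTANCE is the instance ID; the server address is an assumption)
zkCli.sh -server localhost:2181 <<EOF
ls /accumulo/$INSTANCE/root_tablet/walogs
ls /accumulo/$INSTANCE/recovery
EOF
```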

I don't know exactly what to do at this point. I could:

a) Try deleting the fate operations and see if that releases the Accumulo instance.
b) Try deleting the accumulo.root table entries pointing to the already-deleted WALs.
c) Call it quits on this instance, blow it away, and start re-generating my test data over the weekend.
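For option (a), I believe the 1.x shell's fate command can fail and then delete a stuck transaction; a sketch of what I'd run, untested, with txids taken from the fate print output above:

```shell
# Fail, then delete, one stuck FATE transaction (repeat per txid).
# Dangerous on a live instance; sketch only.
accumulo shell -u root -p "$ROOT_PASSWORD" -e 'fate fail 6b33fa130909f05d'
accumulo shell -u root -p "$ROOT_PASSWORD" -e 'fate delete 6b33fa130909f05d'
```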

Before resorting to option (c), I would most likely try options (a) and (b) first, probably in that order. But I would love to get some insight from the Accumulo experts before I do.

Thanks in advance,

Jonathan

Re: recovering Accumulo instance from missing root WALs (deleted by gc)

Michael Wall
Jonathan,

Sorry you are having problems here. What version are you using? Do you have HDFS trash turned on? If so, look for those WALs in the trash; if you find them, simply move them back.

If you can't find them, then the data is gone. You can try an "hdfs dfs -touchz <path>" on each missing WAL location, and things should recover. But again, there will be data loss on the root table, which will cascade to the metadata table and so on. Typically that means the tables come back as an older copy, so if you can determine when the problem started and replay all the ingest since that time, you can recover.
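A sketch of what that looks like for one of your WALs (the trash location is an assumption; repeat for each missing file):

```shell
# 1) Look for the deleted WAL in the HDFS trash (trash root is an assumption)
hdfs dfs -ls -R /user/accumulo/.Trash | grep 6f392ec5-821b-4fd5-83e4-baf1f47d8105

# 2) If found, move it back to its original location under /accumulo/wal
hdfs dfs -mv \
  /user/accumulo/.Trash/Current/accumulo/wal/stti-data-102.bbn.com+10011/6f392ec5-821b-4fd5-83e4-baf1f47d8105 \
  /accumulo/wal/stti-data-102.bbn.com+10011/6f392ec5-821b-4fd5-83e4-baf1f47d8105

# 3) If not found, create an empty placeholder so recovery can proceed
#    (this accepts the loss of whatever was in that WAL)
hdfs dfs -touchz /accumulo/wal/stti-data-102.bbn.com+10011/6f392ec5-821b-4fd5-83e4-baf1f47d8105
```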

Take a look at https://issues.apache.org/jira/browse/ACCUMULO-4157 and see
if this seems like what happened.

Mike


On Fri, Apr 21, 2017 at 7:43 AM Jonathan LASKO <[hidden email]>
wrote:
