Node reports assignment failed for tablet

Nick Wise

Hello,

 

I’m seeing a lot of errors such as the following across my production cluster, which has 30 nodes and is running Accumulo 1.7 on Hadoop 2.7.1. The system has been running for many months without error. I would appreciate any guidance, particularly on whether I should, or indeed should not, stop the cluster in order to resolve this.

 

There are over 60 billion elements in the table associated with this tablet, and rebuilding from scratch would be very difficult.  I can stand some data loss and re-ingest recent data, if that restores service.

 

From /usr/local/accumulo/logs/master_master02.log:

 

2018-03-17 22:02:16,527 [master.Master] ERROR: node11:9997 reports assignment failed for tablet i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404

2018-03-17 22:02:16,560 [master.Master] ERROR: node25:9997 reports assignment failed for tablet i7;^A^@^C51008b0d-1fc7-4742-bc4f-67ec280c7ebc^@80000152d;^A^@^C50fb7a13-1f0f-4943-a94f-26fcd8d15439^@8000014d1e

2018-03-17 22:02:16,574 [master.Master] ERROR: node30:9997 reports assignment failed for tablet i7;^A^@^C52d07ff3-677b-44b2-bef5-23d027946401^@80000159fc;^A^@^C52cc4173-6e57-4ef0-81a7-879e77a7d820^@80000152

2018-03-17 22:02:16,586 [master.Master] ERROR: node06:9997 reports assignment failed for tablet i6;02~1~posDataFeature~gcn~20170228;02~1~posDataFeature~gbv~201608

2018-03-17 22:02:16,589 [master.Master] ERROR: node16:9997 reports assignment failed for tablet i6;00~1~posDataFeature~tbr~2016081;00~1~posDataFeature~t9w~201604

2018-03-17 22:02:16,616 [master.Master] ERROR: node26:9997 reports assignment failed for tablet i6;02~1~posDataFeature~dxf~201607;02~1~posDataFeature~dvt~201602

2018-03-17 22:02:16,694 [master.Master] ERROR: node17:9997 reports assignment failed for tablet i6;17~0~posDataFeature~k42~201505;17~0~posDataFeature~k3u~20160818

2018-03-17 22:02:16,778 [master.Master] ERROR: node07:9997 reports assignment failed for tablet i6;10~0~posDataFeature~u33~2017122;10~0~posDataFeature~u1x~20160207

2018-03-17 22:02:16,810 [master.Master] ERROR: node05:9997 reports assignment failed for tablet i6;10~1~posDataFeature~rqe~20140805;10~1~posDataFeature~rqc~20160309

2018-03-17 22:02:16,825 [master.Master] ERROR: node18:9997 reports assignment failed for tablet i6;06~1~posDataFeature~6pn~2015082;06~1~posDataFeature~6nx~20170514

2018-03-17 22:02:16,827 [master.Master] ERROR: node33:9997 reports assignment failed for tablet i6;09~1~posDataFeature~fu4~20170624;09~1~posDataFeature~fgc~20160928

2018-03-17 22:02:16,859 [master.Master] ERROR: node31:9997 reports assignment failed for tablet i7;^A^@^Cc5e97d3d-f3d0-4c80-acfb-b4de1d2aaa0e^@8000014a62;^A^@^Cc5e44ab1-a0a1-4d9c-abce-250a27209c15^@80000156fd

2018-03-17 22:02:16,920 [master.Master] ERROR: node14:9997 reports assignment failed for tablet i6;22~0~posDataFeature~xqs~2016;22~0~posDataFeature~xn5~2016012

2018-03-17 22:02:16,938 [master.Master] ERROR: node22:9997 reports assignment failed for tablet i6;23~1~posDataFeature~kdm~2015013;23~1~posDataFeature~kdh~20160708

2018-03-17 22:02:16,981 [master.Master] ERROR: node15:9997 reports assignment failed for tablet i6;07~1~posDataFeature~w7e~2016092;07~1~posDataFeature~w49

2018-03-17 22:02:17,194 [master.Master] ERROR: node21:9997 reports assignment failed for tablet i7;^A^@^C92dba50b-219b-46d0-932a-a12a58ca830f^@800001523;^A^@^C92da7a96-7011-41d0-82ea-1a39ae7b1a6d^@8000015959

2018-03-17 22:02:17,209 [master.Master] ERROR: node13:9997 reports assignment failed for tablet i7;^A^@^Cc05;^A^@^Cc04d6645-9a4b-42b8-a8cc-58ff19e0b957^@800001465

2018-03-17 22:02:17,574 [master.Master] ERROR: node30:9997 reports assignment failed for tablet i6;17~1~posDataFeature~e7;17~1~posDataFeature~e4r~20160509

2018-03-17 22:02:17,586 [master.Master] ERROR: node06:9997 reports assignment failed for tablet i6;01~1~posDataFeature~d3v~20160222;01~1~posDataFeature~d3f~201707072

2018-03-17 22:02:17,590 [master.Master] ERROR: node16:9997 reports assignment failed for tablet i7;^A^@^C88fe01d0-243b-487c-ab0d-30b02e8ccf69^@80000152c;^A^@^C88f2eba5-9c1f-4230-8dd2-d1a9f0f659bd^@8000015d

2018-03-17 22:02:17,617 [master.Master] ERROR: node26:9997 reports assignment failed for tablet i6;05~1~posDataFeature~d1f;05~1~posDataFeature~d0t~201507

2018-03-17 22:02:17,694 [master.Master] ERROR: node17:9997 reports assignment failed for tablet i7;^A^@^C3df85419-e448-4fa3-87b3-ec95edae204b^@80000153;^A^@^C3df47e59-05af-43e9-a650-50685f66ec0e^@80000153d

2018-03-17 22:02:17,827 [master.Master] ERROR: node33:9997 reports assignment failed for tablet i7;^A^@^C50f11344-0f80-435e-a6fa-7312619e1535^@80000142b8;^A^@^C50eb8126-a75c-4d89-9767-2a6a71c6bfac^@800001552a

2018-03-17 22:02:17,859 [master.Master] ERROR: node31:9997 reports assignment failed for tablet i6;05~0~posDataFeature~pz8;05~0~posDataFeature~my0~20141

2018-03-17 22:02:17,920 [master.Master] ERROR: node14:9997 reports assignment failed for tablet i6;14~1~posDataFeature~7ej~201606;14~1~posDataFeature~7ds~201608

2018-03-17 22:02:17,938 [master.Master] ERROR: node22:9997 reports assignment failed for tablet i6;18~1~posDataFeature~u3f~2016031;18~1~posDataFeature~u3b~20150816

2018-03-17 22:02:17,981 [master.Master] ERROR: node15:9997 reports assignment failed for tablet i6;14~0~posDataFeature~wtq~20151019;14~0~posDataFeature~wt3~20170605

2018-03-17 22:02:18,194 [master.Master] ERROR: node21:9997 reports assignment failed for tablet i6;11~0~posDataFeature~dyq~201607;11~0~posDataFeature~drm~201507
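
A quick way to see how widely the failures are spread is to tally the master log lines by tablet server and by table ID. The following is only a minimal sketch in Python: the regular expression is an assumption based on the line format shown above, the `summarise` helper is a hypothetical name, and the sample lines are copied from the excerpt.

```python
import re
from collections import Counter

# Assumed format, taken from the master log lines quoted above:
#   "... ERROR: <host>:<port> reports assignment failed for tablet <tableId>;<end row>;<prev end row>"
FAILED = re.compile(
    r"ERROR: (?P<node>\S+):\d+ reports assignment failed for tablet (?P<table>[^;]+);"
)

def summarise(lines):
    """Count failed assignments per tablet server and per table ID."""
    by_node, by_table = Counter(), Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            by_node[match.group("node")] += 1
            by_table[match.group("table")] += 1
    return by_node, by_table

# Three lines copied from the excerpt above.
sample = [
    "2018-03-17 22:02:16,527 [master.Master] ERROR: node11:9997 reports assignment failed for tablet i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404",
    "2018-03-17 22:02:16,574 [master.Master] ERROR: node30:9997 reports assignment failed for tablet i7;^A^@^C52d07ff3-677b-44b2-bef5-23d027946401^@80000159fc;^A^@^C52cc4173-6e57-4ef0-81a7-879e77a7d820^@80000152",
    "2018-03-17 22:02:17,574 [master.Master] ERROR: node30:9997 reports assignment failed for tablet i6;17~1~posDataFeature~e7;17~1~posDataFeature~e4r~20160509",
]

by_node, by_table = summarise(sample)
print("per node: ", by_node.most_common())
print("per table:", by_table.most_common())
```

To run it against the real log, pass the open file object instead of `sample`, for example `summarise(open("/usr/local/accumulo/logs/master_master02.log"))`.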

 

From /usr/local/accumulo/logs/tserver_node11.log:

 

2018-03-17 22:02:16,516 [tserver.TabletServer] INFO : adding tablet i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404 back to the assignment pool (retry 129)

2018-03-17 22:02:16,517 [tserver.TabletServer] INFO : node11:9997: got assignment from master: i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404

2018-03-17 22:02:16,521 [tablet.Tablet] INFO : Starting Write-Ahead Log recovery for i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404

2018-03-17 22:02:16,521 [tserver.TabletServer] INFO : Looking for hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237/finished

2018-03-17 22:02:16,521 [log.SortedLogRecovery] INFO : Looking at mutations from hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237 for i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404

2018-03-17 22:02:16,526 [tserver.TabletServer] WARN : exception trying to assign tablet i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404 hdfs://master01:9000/user/accumulo/accumulo/tables/i6/t-011gdek

java.lang.RuntimeException: java.io.IOException: java.io.EOFException

        at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:639)

        at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:449)

        at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2157)

        at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)

        at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:61)

        at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

        at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)

        at java.lang.Thread.run(Thread.java:745)

Caused by: java.io.IOException: java.io.EOFException

        at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:456)

        at org.apache.accumulo.tserver.TabletServer.recover(TabletServer.java:3012)

        at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:589)

        ... 9 more

Caused by: java.io.EOFException

        at java.io.DataInputStream.readFully(DataInputStream.java:197)

        at java.io.DataInputStream.readFully(DataInputStream.java:169)

        at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1848)

        at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1813)

        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1762)

        at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:443)

        at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:399)

        at org.apache.accumulo.tserver.log.MultiReader.<init>(MultiReader.java:113)

        at org.apache.accumulo.tserver.log.SortedLogRecovery.recover(SortedLogRecovery.java:105)

        at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:454)

        ... 11 more

2018-03-17 22:02:16,526 [tserver.TabletServer] WARN : java.io.IOException: java.io.EOFException

2018-03-17 22:02:16,526 [tserver.TabletServer] WARN : failed to open tablet i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404 reporting failure to master

2018-03-17 22:02:16,526 [tserver.TabletServer] WARN : rescheduling tablet load in 600.00 seconds

 

 

An error with the same structure is occurring on many nodes across the cluster (possibly all of them; certainly every node I have checked so far). In the logs I have looked at, the recovery directory 8bd07d5c-710f-4072-b351-8ce09d771237 is common to every occurrence, while the other elements vary.
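
One way to confirm that a single recovery directory is behind the failures is to count the recovery IDs mentioned in each tablet server log. This is a rough Python sketch, not a definitive tool: the pattern assumes the `/recovery/<uuid>` path layout seen in the log lines above, and `recovery_ids` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumes recovery directories are named with a UUID directly under ".../recovery/",
# as in the tserver log lines quoted above.
RECOVERY = re.compile(r"/recovery/([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})")

def recovery_ids(lines):
    """Count how often each recovery directory ID appears in the given log lines."""
    counts = Counter()
    for line in lines:
        counts.update(RECOVERY.findall(line))
    return counts

# Two lines copied from the tserver_node11.log excerpt above.
sample = [
    "2018-03-17 22:02:16,521 [tserver.TabletServer] INFO : Looking for hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237/finished",
    "2018-03-17 22:02:16,521 [log.SortedLogRecovery] INFO : Looking at mutations from hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237 for i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404",
]

for recovery_id, count in recovery_ids(sample).most_common():
    print(recovery_id, count)
```

If every node's log yields the same single ID, that supports the view that one bad recovery directory is blocking all of the affected tablets.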

 

The file hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237/finished exists and is zero bytes.  The folder hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237 has two further folders within, part-r-00000 and part-r-00001.  Both have files within called data and index. 

 

The data file in part-r-00000 is 1071KB and ends abruptly, thus:

 

       +  2xœc```  


    f   Z
       "  3 Khdfs://master01:9000/user/accumulo/accumulo/tables/i7/t-010qx03/F05htsj1.rfxœc```  
       

       #  3xœc```  


    f   Z
       $  3 Khdfs://master01:9000/user/accumulo/accumulo/tables/i7/t-010qx03/F05huonw.rfxœc```  
       

       %  3xœc```  


    f   Z
       &  3 Khdfs://master01:9000/u

 

The index file in part-r-00000 is zero bytes.
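
A zero-byte file fits the stack trace above: the EOFException is thrown from SequenceFile$Reader.init while it is reading the start of the file, and a Hadoop SequenceFile begins with the bytes "SEQ" followed by a version byte, which an empty file cannot supply. The sketch below, in Python, checks that header on local copies of the part files (fetched, for example, with `hdfs dfs -get`). The function name `header_status` and the temporary demo files are illustrative assumptions, and the check says nothing about whether the rest of a file is intact.

```python
import os
import tempfile

SEQ_MAGIC = b"SEQ"  # first three bytes of a Hadoop SequenceFile; the fourth is a version byte

def header_status(path):
    """Classify a local file by its first four bytes (header only, not a full validation)."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    if len(head) == 0:
        return "empty"
    if len(head) < 4:
        return "truncated header"
    if head[:3] != SEQ_MAGIC:
        return "not a SequenceFile"
    return f"SequenceFile, version {head[3]}"

# Demo on two throwaway files standing in for local copies of 'index' and 'data'.
demo_dir = tempfile.mkdtemp()
empty_index = os.path.join(demo_dir, "index")
open(empty_index, "wb").close()              # zero bytes, like part-r-00000/index
fake_data = os.path.join(demo_dir, "data")
with open(fake_data, "wb") as handle:
    handle.write(b"SEQ\x06" + b"\x00" * 16)  # header plus padding, not real records

print(empty_index, "->", header_status(empty_index))
print(fake_data, "->", header_status(fake_data))
```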

 

The data file in part-r-00001 is 31906KB and ends thus (which looks reasonable to me):

 


          xœc```lP22ª3¨+È/vI,ItKM,)-J­+54©320´006434.b*MUI5HK1661ÒM³05Õ5ILLÓµ0KNÓM5²HL23H1624e``pc[1].D(ÞÊ´=I_´û#£ƒ _«
PŸƒçά
@ZÓˆ±÷šÛD ^í)•   Ì  


          xœc```jP22ª3¨+È/vI,ItKM,)-J­+54©320´006434.b*ÎS17HK1661ÒM³05Õ5±0KÑM2N5ÖMJ³02LI34KM5g``pc[1].D(ÞÎÈÀÀ˜¤/ºo#£
ßÔç
@}
ž«š iM#ÆÞkn“ˆs[J±J*Â:s uÉ&ºII&)ºÉf‰©‰Ææ–f¦fp·¡xÄm[1]å@·ñžÛÜ v[—
Üm“70Rv

 

The index file in part-r-00001 is 5KB and equally looks reasonable.

 

Any help or direction that you might be able to give would be most gratefully received.

 

Best regards,

 

Nick

 

 

This email (and any attachments) may contain confidential information and is intended solely for the recipient(s) to whom the email is addressed. If you received this email in error, please inform us immediately and delete the email and all attachments without further using, copying or disclosing the information. This email and any attachments are believed to be, but cannot be guaranteed to be, secure or virus-free. Satellite Applications Catapult Limited is registered in England & Wales. Company Number: 7964746. Registered office: Electron Building, Fermi Avenue, Harwell Oxford, Didcot, Oxfordshire OX11 0QR.

Re: Node reports assignment failed for tablet

Michael Wall
Hi Nick,

Looks like you are on the right track with the recovery files, which are created when WALs have to be replayed; see http://accumulo.apache.org/1.8/accumulo_user_manual.html#_recovery.  Maybe try deleting hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237/finished and give it 10 min or so.  It could be that one or both of those part files are bad, so your next step could be to remove the hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237/ directory entirely.  Again, give Accumulo 10 min or more.  I don't recall how to track 8bd07d5c-710f-4072-b351-8ce09d771237 back to the WALs; maybe look for the first occurrence of that ID in the logs to see if the WALs are still there.  If they are not, maybe move hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237/ instead of deleting it.  If you can figure out what is in those WALs you will know what to replay.
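
The "move instead of delete" option can be sketched as a dry run that only prints the HDFS shell commands, so they can be reviewed before anything is changed. This Python sketch makes several assumptions: the quarantine location `/user/accumulo/recovery-quarantine` is a made-up path, `quarantine_commands` is a hypothetical helper, and the only commands used are the standard `hdfs dfs -ls`, `-mkdir -p` and `-mv`.

```python
import shlex

def quarantine_commands(recovery_dir, quarantine_root):
    """Build (but do not run) the commands to move a recovery directory aside."""
    name = recovery_dir.rstrip("/").rsplit("/", 1)[-1]
    target = quarantine_root.rstrip("/") + "/" + name
    return [
        ["hdfs", "dfs", "-ls", "-R", recovery_dir],        # record what is there first
        ["hdfs", "dfs", "-mkdir", "-p", quarantine_root],  # make sure the destination parent exists
        ["hdfs", "dfs", "-mv", recovery_dir, target],      # move rather than delete
    ]

commands = quarantine_commands(
    "hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237/",
    "hdfs://master01:9000/user/accumulo/recovery-quarantine",  # hypothetical location
)
for command in commands:
    print(" ".join(shlex.quote(arg) for arg in command))
```

Moving keeps the sorted recovery files available for later inspection, whereas deleting them would discard the only copy if the original WALs have already gone.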

Good luck

Mike

On Sat, Mar 17, 2018 at 7:14 PM Nick Wise <[hidden email]> wrote:

RE: Node reports assignment failed for tablet

Nick Wise

Hi Mike,

 

Thank you very much indeed!  I tried the rename option and the cluster seems to have recovered, and I’m able to continue work.  I will work out what’s missing next.

 

Thank you again, it was very helpful to know what the valid options were.

 

Nick

 

 

 

From: Michael Wall [mailto:[hidden email]]
Sent: 18 March 2018 13:14
To: [hidden email]
Cc: Stephen Wotton <[hidden email]>
Subject: Re: Node reports assignment failed for tablet

 


        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1762)

        at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:443)

        at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:399)

        at org.apache.accumulo.tserver.log.MultiReader.<init>(MultiReader.java:113)

        at org.apache.accumulo.tserver.log.SortedLogRecovery.recover(SortedLogRecovery.java:105)

        at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:454)

        ... 11 more

2018-03-17 22:02:16,526 [tserver.TabletServer] WARN : java.io.IOException: java.io.EOFException

2018-03-17 22:02:16,526 [tserver.TabletServer] WARN : failed to open tablet i6;00~1~posDataFeature~dwf~2016;00~1~posDataFeature~dun~201404 reporting failure to master

2018-03-17 22:02:16,526 [tserver.TabletServer] WARN : rescheduling tablet load in 600.00 seconds

 

 

The same error structure is occurring on many nodes across the cluster: every node I have checked so far, and possibly all of them.  In each case I have looked at, the recovery directory 8bd07d5c-710f-4072-b351-8ce09d771237 is the common feature, while the other elements vary.
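
A minimal sketch of how that check could be repeated across the remaining tserver logs, assuming the `Looking at mutations from ... for ...` line format shown above. The function name `recovery_ids_by_tablet` is purely illustrative and not part of Accumulo:

```python
import re
from collections import defaultdict

# Matches the SortedLogRecovery line format shown in the tserver log above.
# Assumption: the recovery directory name is a standard 36-character UUID.
RECOVERY_LINE = re.compile(
    r"\[log\.SortedLogRecovery\] INFO : Looking at mutations from "
    r"\S+/recovery/(?P<uuid>[0-9a-f-]{36}) for (?P<tablet>.+)$"
)


def recovery_ids_by_tablet(lines):
    """Map each recovery UUID to the set of tablets that tried to use it."""
    tablets = defaultdict(set)
    for line in lines:
        match = RECOVERY_LINE.search(line.rstrip("\n"))
        if match:
            tablets[match.group("uuid")].add(match.group("tablet"))
    return dict(tablets)
```

Passing the lines of each tserver log to this function and comparing the keys of the results would show whether a single recovery directory really accounts for every failing tablet.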

 

The file hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237/finished exists and is zero bytes.  The folder hdfs://master01:9000/user/accumulo/accumulo/recovery/8bd07d5c-710f-4072-b351-8ce09d771237 has two further folders within, part-r-00000 and part-r-00001.  Both have files within called data and index. 

 

The data file in part-r-00000 is 1071KB and ends abruptly, thus:

 

       +  2xœc```  


    f   Z


       "  3 Khdfs://master01:9000/user/accumulo/accumulo/tables/i7/t-010qx03/F05htsj1.rfxœc```  


       

       #  3xœc```  


    f   Z


       $  3 Khdfs://master01:9000/user/accumulo/accumulo/tables/i7/t-010qx03/F05huonw.rfxœc```  


       

       %  3xœc```  


    f   Z


       &  3 Khdfs://master01:9000/u

 

The index file in part-r-00000 is zero bytes.

 

The data file in part-r-00001 is 31906KB and ends thus (which looks reasonable to me):

 


          xœc```lP22ª3¨+È/vI,ItKM,)-J­+54©320´006434.b*MUI5HK1661ÒM³05Õ5ILLÓµ0KNÓM5²HL23H1624e``pc[1] .D(ÞÊ ´=I_´û#£ƒ _«


PŸƒçά


@ZÓˆ±÷šÛD ^í)•   Ì  


          xœc```jP22ª3¨+È/vI,ItKM,)-J­+54©320´006434.b*ÎS17HK1661ÒM³05Õ5±0KÑM2N5ÖMJ³02LI34KM5g``pc[1] .D(ÞÎÈÀÀ˜¤/ºo#£


ßÔç
@}
ž«š iM#ÆÞkn“ˆs[J±J*Â: s uÉ&ºII&)ºÉf‰ ©‰Ææ–f¦fp· ¡xÄm[1]å @·ñžÛÜ v[—


Üm“70Rv

 

The index file in part-r-00001 is 5KB and likewise looks reasonable.
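
The manual inspection above could be scripted to check other recovery directories for the same symptom. This is only a sketch: it assumes the usual eight-column layout printed by `hdfs dfs -ls -R` (permissions, replication, owner, group, size, date, time, path), and the helper name `empty_recovery_files` is hypothetical:

```python
def empty_recovery_files(ls_output):
    """Return paths of zero-byte data/index files under part-r-* directories.

    `ls_output` is the text printed by `hdfs dfs -ls -R <recovery dir>`.
    Assumption: each entry has eight whitespace-separated columns, with the
    size in the fifth column and the path in the last.
    """
    empty = []
    for line in ls_output.splitlines():
        fields = line.split(None, 7)
        # Skip the "Found N items" header, blank lines, and directories.
        if len(fields) < 8 or fields[0].startswith("d"):
            continue
        size, path = int(fields[4]), fields[7]
        parts = path.rsplit("/", 2)
        in_part_dir = len(parts) == 3 and parts[1].startswith("part-r-")
        # Only data and index files are flagged; the `finished` file is ignored.
        if in_part_dir and parts[2] in ("data", "index") and size == 0:
            empty.append(path)
    return empty
```

Run against a listing of the directory described above, this would be expected to flag only the part-r-00000 index file.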

 

Any help or direction that you might be able to give would be most gratefully received.

 

Best regards,

 

Nick

 

 

This email (and any attachments) may contain confidential information and is intended solely for the recipient(s) to whom the email is addressed. If you received this email in error, please inform us immediately and delete the email and all attachments without further using, copying or disclosing the information. This email and any attachments are believed to be, but cannot be guaranteed to be, secure or virus-free. Satellite Applications Catapult Limited is registered in England & Wales. Company Number: 7964746. Registered office: Electron Building, Fermi Avenue, Harwell Oxford, Didcot, Oxfordshire OX11 0QR.
