Re: No Recovery Node In Zookeeper

Keith Turner
There is a bigger problem here.  This code is trying to place a
zookeeper watcher on the recovery node in zookeeper.  It's doing this
so that changes to the node's children will trigger the master to take
action related to recovery.  If the watcher is not put into place, the
recoveries may not proceed as fast as they could.  I looked for
references to Constants.ZRECOVERY in the code and do not see any one
place where the recovery node is always created.  It seems to be
created on an as-needed basis.  One solution may be to modify the
upgrade and init code to create this node in zookeeper.  That way it's
always there and can be watched.
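
As a rough illustration of the "create it up front" idea, here is a
minimal sketch.  It is not Accumulo's init or upgrade code: InMemoryNodes,
its NodeExistsException, InitSketch.ensureNode() and the example path are
all hypothetical stand-ins so the snippet runs without a zookeeper
server.  With the real client the same shape would presumably be a
create() call that tolerates KeeperException.NodeExistsException.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for ZooKeeper's KeeperException.NodeExistsException.
class NodeExistsException extends Exception {
    NodeExistsException(String path) {
        super("NodeExists for " + path);
    }
}

// Hypothetical in-memory stand-in for a zookeeper client; NOT the real API.
class InMemoryNodes {
    private final Set<String> paths = ConcurrentHashMap.newKeySet();

    // Like a real create(), fails if the node is already present.
    void create(String path) throws NodeExistsException {
        if (!paths.add(path)) {
            throw new NodeExistsException(path);
        }
    }

    boolean exists(String path) {
        return paths.contains(path);
    }
}

class InitSketch {
    // Idempotent "make sure the node is there" step, safe to run from
    // both init and upgrade. Returns true only if this call created it.
    static boolean ensureNode(InMemoryNodes zk, String path) {
        try {
            zk.create(path);
            return true;
        } catch (NodeExistsException e) {
            // Someone else (or an earlier run) created it; nothing to do.
            return false;
        }
    }

    public static void main(String[] args) {
        InMemoryNodes zk = new InMemoryNodes();
        String recoveryPath = "/accumulo/example-instance/recovery"; // hypothetical path
        System.out.println("first call created node: " + ensureNode(zk, recoveryPath));
        System.out.println("second call created node: " + ensureNode(zk, recoveryPath));
    }
}
```

Creating the node and swallowing the "already exists" case avoids a
separate exists() check, so two processes running init or upgrade at
the same time cannot trip over each other.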

I would advise waiting for Eric to chime in on this, since he just
made a huge amount of changes to the log recovery code.

A general zookeeper coding tip.  Calling exists() and then calling
getData() or getChildren() can lead to a race condition.  It is
possible that the node exists when you call exists(), but is then
deleted by another process before you call getData() or getChildren().
The best way to deal with this is the following pattern.

try {
   getChildren(); // or getData() etc.
} catch (NoNodeException nne) {
   // the node does not exist, handle that case... no race condition
}
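
To make the race concrete, here is a small sketch that can be run
without a zookeeper server.  FakeZooKeeper, its NoNodeException and
the two helper methods are hypothetical stand-ins written for this
example, not the real org.apache.zookeeper classes; the real client
reports the same condition as KeeperException.NoNodeException, as in
the stack trace quoted below.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for ZooKeeper's KeeperException.NoNodeException.
class NoNodeException extends Exception {
    NoNodeException(String path) {
        super("NoNode for " + path);
    }
}

// Hypothetical in-memory stand-in for a zookeeper client; NOT the real API.
class FakeZooKeeper {
    private final Map<String, List<String>> children = new ConcurrentHashMap<>();

    void create(String path) {
        children.putIfAbsent(path, new ArrayList<>());
    }

    void delete(String path) {
        children.remove(path);
    }

    boolean exists(String path) {
        return children.containsKey(path);
    }

    List<String> getChildren(String path) throws NoNodeException {
        List<String> kids = children.get(path);
        if (kids == null) {
            throw new NoNodeException(path);
        }
        return new ArrayList<>(kids);
    }
}

class RecoveryReadSketch {
    // Racy: the node can vanish between exists() and getChildren().
    // betweenCalls simulates another process acting in that gap.
    static List<String> checkThenRead(FakeZooKeeper zk, String path, Runnable betweenCalls)
            throws NoNodeException {
        if (!zk.exists(path)) {
            return Collections.emptyList();
        }
        betweenCalls.run();
        return zk.getChildren(path); // may still throw NoNodeException
    }

    // Preferred: a single call, with the missing node handled in the catch.
    static List<String> readOrEmpty(FakeZooKeeper zk, String path) {
        try {
            return zk.getChildren(path);
        } catch (NoNodeException nne) {
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) {
        FakeZooKeeper zk = new FakeZooKeeper();
        String path = "/accumulo/example-instance/recovery"; // hypothetical path
        zk.create(path);
        try {
            checkThenRead(zk, path, () -> zk.delete(path));
            System.out.println("check-then-read succeeded");
        } catch (NoNodeException e) {
            System.out.println("check-then-read hit the race: " + e.getMessage());
        }
        System.out.println("try/catch read returned: " + readOrEmpty(zk, path));
    }
}
```

The exists() check in the first method does not remove the need to
handle the exception, so the try/catch version is both shorter and
correct.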


On Tue, Jun 12, 2012 at 10:25 PM, David Medinets
<[hidden email]> wrote:

> I am grepping source left and right but am not sure what to make of
> this error. Here is the code from Master.java:
>
>    ZooReaderWriter.getInstance().getChildren(zroot + Constants.ZRECOVERY, new Watcher() {
>      @Override
>      public void process(WatchedEvent event) {
>        nextEvent.event("Noticed recovery changes", event.getType());
>      }
>    });
>
> I suggest replacing the above code with this:
>
>    final String recoveryPath = zroot + Constants.ZRECOVERY;
>    Stat stat = ZooReaderWriter.getInstance().getZooKeeper().exists(recoveryPath, null);
>    if (stat != null && stat.getNumChildren() > 0) {
>      ZooReaderWriter.getInstance().getChildren(recoveryPath, new Watcher() {
>        @Override
>        public void process(WatchedEvent event) {
>          nextEvent.event("Noticed recovery changes", event.getType());
>        }
>      });
>    }
>
> I have changed my local Accumulo and this change seems to be OK.
> However, since this is a change to Accumulo itself, I would like
> someone to code review before I commit this change. Does this change
> make sense?
>
> On Mon, Jun 11, 2012 at 9:54 PM, David Medinets
> <[hidden email]> wrote:
>> I am slowly working my way through whatever went wrong on my system.
>> This is the latest. I've deleted the logs and started the master by
>> hand:
>>
>> accumulo org.apache.accumulo.server.master.state.SetGoalState NORMAL
>> start-server.sh localhost master
>>
>> Then checked the log files where I saw this message:
>>
>> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /accumulo/b519799c-3a51-4c9b-af21-96d577e2c11f/recovery
>>        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>>        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>        at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1448)
>>        at org.apache.accumulo.core.zookeeper.ZooReader.getChildren(ZooReader.java:62)
>>        at org.apache.accumulo.server.master.Master.run(Master.java:2071)
>>        at org.apache.accumulo.server.master.Master.main(Master.java:2173)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>        at java.lang.reflect.Method.invoke(Method.java:601)
>>
>> I've run out of time for debugging today. I'll dig into the source
>> code more tomorrow ... until someone can point me in the right
>> direction to resolve this?