Tservers kill themselves from lost Zookeeper locks


buttercream
I'm having a consistent problem with my tservers being killed by the master. One by one they turn off over the course of a week or so. In the tserver log, I see a debug message right before the shutdown that says:
Unexpected event watching lock node WatchedEvent state:Disconnected type:None path: null

Looking at the timing of what led up to this, the tserver has the following events:
02:03:37,922 DEBUG: gc
02:03:38,341 DEBUG: UpSess <ipaddress>
02:03:38,585 DEBUG: MultiScanSess
02:03:38,596 DEBUG: MultiScanSess
02:03:38,601 DEBUG: MultiScanSess
02:04:14,007 DEBUG: Unexpected event

In the master log, I see a KeeperException$ConnectionLossException for that machine at 02:04:14,017.

So, is this indicative of just a timeout and if so, is there a setting in the master to increase the timeout?

Re: Tservers kill themselves from lost Zookeeper locks

Eric Newton
You may be running into this issue:

https://issues.apache.org/jira/browse/ACCUMULO-1572

However, it is unusual to get a disconnected event from ZooKeeper.

Could you describe your environment a bit? Please provide the versions of Hadoop, ZooKeeper, Accumulo, and the OS, and whether you are running natively or in a virtual environment.

-Eric




Re: Tservers kill themselves from lost Zookeeper locks

buttercream
I'm running a 3-node development cluster: 1 master and 2 tablet servers. I'm running Accumulo 1.5 with MapR M3, ZooKeeper version 2.1.3.2. I am in a virtual environment running RHEL 6.

It usually takes about 5 days after starting the cluster for the tservers to start dying.

Re: Tservers kill themselves from lost Zookeeper locks

Eric Newton
Hmm.  That sounds like a very old version of ZooKeeper... the earliest version we recommend is 3.3.6.
I have never even heard of ZooKeeper version 2.1.3.2.  Is it specific to the MapR environment?

Virtual environments can be tricky.  There's a second level of swap that can take place, so it's possible something is being pushed out of memory: either the tserver itself, or the guest OS running the tserver.

You may want to look into increasing your zookeeper timeouts in this environment.

-Eric






Re: Tservers kill themselves from lost Zookeeper locks

buttercream
Sorry, that was a total lie. We're using 3.3.6.

For the timeout setting, is that something within the accumulo-site.xml? I'm having trouble finding a property to set for this value.

Re: Tservers kill themselves from lost Zookeeper locks

Josh Elser
In reply to this post by Eric Newton
On 11/4/13, 6:52 PM, Eric Newton wrote:
> Hmm.  That's sounds like a very old version of zookeeper... the earliest
> version we recommend is 3.3.6.
> I have never even heard of zookeeper version 2.1.3.2.  Is it specific to
> the MapR environment?

That looks like a version of MapR, not ZooKeeper. MapR's release notes
don't list the bundled ZooKeeper version for us either:
http://doc.mapr.com/display/RelNotes/Version+2.1.3.2+Release+Notes

>
> Virtual environments can be tricky.  There's a second level of possible
> swap that can take place, so it's possible you are getting something
> pushed out of memory: either the tserver or a virtual OS running the
> tserver.
>
> You may want to look into increasing your zookeeper timeouts in this
> environment.

From the Accumulo side: `instance.zookeeper.timeout` in
accumulo-site.xml. That property defaults to 30 seconds and controls the
timeout length for ZooKeeper operations.
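
For example, raising it would look something like this in accumulo-site.xml
(the 60s value is only an illustration; the tservers need to be restarted to
pick it up):

<property>
  <name>instance.zookeeper.timeout</name>
  <value>60s</value>
</property>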

Re: Tservers kill themselves from lost Zookeeper locks

Eric Newton
Note that the maximum timeout is 40s with the default ZooKeeper configuration.

The maximum session timeout is capped at 20 "ticks", and ZooKeeper's default tick time is 2s.
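
For example, the relevant zoo.cfg settings look roughly like this (values are
illustrative, and the ZooKeeper servers need a restart to pick them up):

tickTime=2000
# maxSessionTimeout defaults to 20 * tickTime (40000 ms);
# raise it if you want instance.zookeeper.timeout above 40s
maxSessionTimeout=60000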

Still, 40s is a looong time to wait.  You may want to establish that your VMs and your system configurations are not pushing into swap.

-Eric






Re: Tservers kill themselves from lost Zookeeper locks

buttercream
Our machines currently have 8GB of memory, so I'll push that up and see if the error occurs again. Also, are there any recommended garbage collection settings changes that would make it more efficient, or at least less intrusive?

Re: Tservers kill themselves from lost Zookeeper locks

Eric Newton
Limiting the size of the New Generation can avoid long pauses during New Generation collection:

"-XX:NewSize=1G -XX:MaxNewSize=1G"

And, of course, using the Concurrent collector, tweaked so it kicks in early enough to avoid stop-the-world pauses:

"-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75"

Both of these are in the example configurations, so you may be using them already.
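
For reference, a sketch of what the tserver line in accumulo-env.sh might look
like with those flags (the heap sizes here are placeholders, not
recommendations; size them for your hardware):

export ACCUMULO_TSERVER_OPTS="-Xmx4g -Xms4g -XX:NewSize=1G -XX:MaxNewSize=1G -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75"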

-Eric




Re: Tservers kill themselves from lost Zookeeper locks

buttercream
I increased all of the servers to 32GB of memory and confirmed that I have the flags you mentioned in the env file. Unfortunately, within a day I lost one of the tservers. In the tserver logs, looking at the timestamps leading up to the event, I see:
02:00:03,835 [cache.LruBlockCache]
02:00:51,580 [tabletserver.TabletServer] DEBUG: MultiScanSess
02:01:02,267 [tabletserver.TabletServer] FATAL: Lost tablet server lock (reason = LOCK_DELETED), exiting.

What's interesting on this one is that in the master log file, there is no error message at that time. What I do see is this:
02:01:02,168 [master.Master] DEBUG: Finished gathering information from 2 servers in 0.01 seconds

That would mean the tserver killed itself within milliseconds of the master getting the information successfully. Any thoughts on this one?

Re: Tservers kill themselves from lost Zookeeper locks

Josh Elser
Was there actually an 11-second delay in the tserver's debug log
(2:00:51 to 2:01:02), or did you omit some log statements?

The log messages in your original email also showed MultiScanSession(s)
immediately before the ZK lock was lost.

Can you give us any information about the type of query workload you're
servicing here? A MultiScanSession is equivalent to a "piece" of a
BatchScanner running against a tserver. Are you doing any sort of heavy
workload in a SortedKeyValueIterator running on these tservers?

Re: Tservers kill themselves from lost Zookeeper locks

buttercream
I did not omit any log messages.

The interesting thing is that the query load really isn't that bad (or at least I perceive it not to be). I'm just doing direct lookups on individual rows by rowID, at most about 2k at a time. I would need to dig through some other logs to see whether there was actually a batch scan happening at that time. I usually don't realize there is a problem until the system stops responding and I check the master log and see that it shows no tablet servers running.

Re: Tservers kill themselves from lost Zookeeper locks

John Vines

As Eric suggested before, make sure things aren't being pushed into swap. An 11-second delay is definitely indicative of it. Be sure to check that both the tserver process itself and your VM as a whole are not swapping. Mind you, things may swap not wholly as a symptom of full physical memory, but in an attempt by the OS to be "helpful".
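
If it helps, a few quick checks to run on each tserver host to see whether
swapping is actually happening (standard Linux commands; adjust for your
RHEL 6 VMs):

free -m          # is any swap in use at all?
vmstat 5         # watch the si/so columns for ongoing swap-in/swap-out
swapon -s        # which swap devices are configured
cat /proc/sys/vm/swappiness   # a high value lets the OS swap proactively

Lowering vm.swappiness (for example, sysctl -w vm.swappiness=10) is a common
mitigation, but test it in your environment first.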

Sent from my phone, please pardon the typos and brevity.

Re: Tservers kill themselves from lost Zookeeper locks

Sean Busbey
You may also want to try decreasing your memory allocations. For example, if you are using native maps, try giving the tablet server just 1-2GB of heap in accumulo-env.sh.

Can you find the following:

accumulo-env.sh:

* Tablet Server Java max heap

accumulo-site.xml:
* tserver.memory.maps.max
* tserver.cache.data.size
* tserver.cache.index.size

In the past, the two most common misconfigurations I've seen are:

1) Using native maps (the default) while not taking into account the memory required by them (tserver.memory.maps.max).

This usually comes up when the host starts swapping under memory pressure.

2) Giving tablet servers lots of memory in accumulo-env.sh but not increasing the cache sizes.

This usually happens when people leave accumulo-site.xml at the defaults but give the tablet server 12-16GB of memory in accumulo-env.sh; the JVM just burns through the extra space, and when garbage collection finally happens it causes timeouts.
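
As a rough sketch of a consistent set of values for a small tserver using
native maps (all numbers are illustrative only; size them against the physical
memory actually available on the box):

accumulo-env.sh:

export ACCUMULO_TSERVER_OPTS="-Xmx2g -Xms2g"   # Java heap; add the GC flags Eric mentioned earlier

accumulo-site.xml:

<property><name>tserver.memory.maps.max</name><value>1G</value></property>    <!-- off-heap when native maps are in use -->
<property><name>tserver.cache.data.size</name><value>256M</value></property>  <!-- lives in the Java heap -->
<property><name>tserver.cache.index.size</name><value>256M</value></property> <!-- lives in the Java heap -->

The goal is for heap + native map + OS overhead to stay comfortably under
physical memory so nothing gets pushed into swap.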

Can you get us full logs? Preferably for both the Accumulo bits and ZooKeeper.






--
Sean