Why may "tablet read ahead" take long time? (was: Profile a (batch) scan)

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Why may "tablet read ahead" take long time? (was: Profile a (batch) scan)

Maxim Kolchin
Hi all,

I try to trace some scans with Zipkin and see that quite often the trace called "tablet read ahead" takes 10x or 100x more time than the other similar traces.

Why it may happen? What could be done to reduce the time? I found a similar discussion on the list, but it doesn't have an answer. I'd be great to have a how-to article listing some steps which could be done.

Attaching a screenshot of one of the traces having this issue.

Maxim Kolchin

E-mail: [hidden email]
Tel.: +7 (911) 199-55-73
Homepage: http://kolchinmax.ru

Below you can find a good example of what I'm struggling to understand
right now. It's a trace for a simple scan over some columns with a
BatchScanner using 75 threads. The scan takes 877 milliseconds and the main
contributor is the entry "tablet read ahead 1", which starts at 248 ms.
These are the questions that I cannot answer with this trace:

   1. why this heavy operation starts after 248ms? By summing up the delay
   before this operation you get a number which is not even close to 248ms.
   2. what does "tablet read ahead 1" means? In general, how to map the
   entries of a trace to their meaning? Is there a guide about this?
   3. why "tablet read ahead 1" takes 600ms? It's clearly not the sum of
   the entries under this one but that's the important part.
   4. I may be naive but...how much data have been read by this scan? How
   many entries? That's very important to understand what's going on.

Thanks for the help,

Mario

877+ 0 Dice@h01 counts
2+ 7 tserver@h12 startScan
6+ 10 tserver@h15 startScan
5+ 11 tserver@h15 metadata tablets read ahead 4
843+ 34 Dice@h01 batch scanner 74- 1
620+ 230 tserver@h09 startMultiScan
600+ 248 tserver@h09 tablet read ahead 1
22+ 299 tserver@h09 newDFSInputStream
22+ 299 tserver@h09 getBlockLocations
2+ 310 tserver@h09 ClientNamenodeProtocol#getBlockLocations
1+ 321 tserver@h09 getFileInfo
1+ 321 tserver@h09 ClientNamenodeProtocol#getFileInfo
2+ 322 tserver@h09 DFSInputStream#byteArrayRead
1+ 324 tserver@h09 DFSInputStream#byteArrayRead
2+ 831 tserver@h09 DFSInputStream#byteArrayRead
2+ 834 tserver@h09 DFSInputStream#byteArrayRead
1+ 835 tserver@h09 BlockReaderLocal#fillBuffer(1091850413)
1+ 874 tserver@h09 closeMultiScan
-- 
Mario Pastorelli | TERALYTICS

*software engineer*

Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
phone: +41794381682
email: [hidden email]
www.teralytics.net

Company registration number: CH-020.3.037.709-7 | Trade register Canton
Zurich
Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann
de Vries

This e-mail message contains confidential information which is for the sole
attention and use of the intended recipient. Please notify us at once if
you think that it may not be intended for you and delete it immediately.

Screenshot_2019-01-15 Zipkin - Traces.png (584K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Why may "tablet read ahead" take long time? (was: Profile a (batch) scan)

Adam Fuchs
Hi Maxim,

What you're seeing is an artifact of the threading model that Accumulo uses. When you launch a query, Accumulo tablet servers will coordinate RPCs via Thrift in one thread pool (which grows unbounded) and queue up scans (rfile lookups, decryption/decompression, iterators, etc.) in another threadpool known as the readahead pool (which has a fixed number of threads). You're seeing everything that happens in that readahead thread in one big chunk. You may need to look a bit deeper into profiling/sampling tablet server CPU to get insights into how to improve your query performance. If you want to speed up queries in general you might try (in no particular order):
1. Increase parallelism by bumping up the readahead threads (tserver.readahead.concurrent.max). This will still be bounded by the number of parallel scans clients are driving.
2. Increase parallelism driven by clients by querying more, smaller ranges, or by splitting tablets.
3. Increase scan batch sizes if the readahead thread or thrift coordination overhead is high.
4. Optimize custom iterators if that is a CPU bottleneck.
5. Increase cache sizes or otherwise modify queries to improve cache hit rates.
6. Change compression settings if that is a CPU bottleneck. Try snappy instead of gz.

Cheers,
Adam

On Tue, Jan 15, 2019, 10:45 AM Maxim Kolchin <[hidden email] wrote:
Hi all,

I try to trace some scans with Zipkin and see that quite often the trace called "tablet read ahead" takes 10x or 100x more time than the other similar traces.

Why it may happen? What could be done to reduce the time? I found a similar discussion on the list, but it doesn't have an answer. I'd be great to have a how-to article listing some steps which could be done.

Attaching a screenshot of one of the traces having this issue.

Maxim Kolchin

E-mail: [hidden email]
Tel.: +7 (911) 199-55-73
Homepage: http://kolchinmax.ru

Below you can find a good example of what I'm struggling to understand
right now. It's a trace for a simple scan over some columns with a
BatchScanner using 75 threads. The scan takes 877 milliseconds and the main
contributor is the entry "tablet read ahead 1", which starts at 248 ms.
These are the questions that I cannot answer with this trace:

   1. why this heavy operation starts after 248ms? By summing up the delay
   before this operation you get a number which is not even close to 248ms.
   2. what does "tablet read ahead 1" means? In general, how to map the
   entries of a trace to their meaning? Is there a guide about this?
   3. why "tablet read ahead 1" takes 600ms? It's clearly not the sum of
   the entries under this one but that's the important part.
   4. I may be naive but...how much data have been read by this scan? How
   many entries? That's very important to understand what's going on.

Thanks for the help,

Mario

877+ 0 Dice@h01 counts
2+ 7 tserver@h12 startScan
6+ 10 tserver@h15 startScan
5+ 11 tserver@h15 metadata tablets read ahead 4
843+ 34 Dice@h01 batch scanner 74- 1
620+ 230 tserver@h09 startMultiScan
600+ 248 tserver@h09 tablet read ahead 1
22+ 299 tserver@h09 newDFSInputStream
22+ 299 tserver@h09 getBlockLocations
2+ 310 tserver@h09 ClientNamenodeProtocol#getBlockLocations
1+ 321 tserver@h09 getFileInfo
1+ 321 tserver@h09 ClientNamenodeProtocol#getFileInfo
2+ 322 tserver@h09 DFSInputStream#byteArrayRead
1+ 324 tserver@h09 DFSInputStream#byteArrayRead
2+ 831 tserver@h09 DFSInputStream#byteArrayRead
2+ 834 tserver@h09 DFSInputStream#byteArrayRead
1+ 835 tserver@h09 BlockReaderLocal#fillBuffer(1091850413)
1+ 874 tserver@h09 closeMultiScan
-- 
Mario Pastorelli | TERALYTICS

*software engineer*

Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
phone: +41794381682
email: [hidden email]
www.teralytics.net

Company registration number: CH-020.3.037.709-7 | Trade register Canton
Zurich
Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann
de Vries

This e-mail message contains confidential information which is for the sole
attention and use of the intended recipient. Please notify us at once if
you think that it may not be intended for you and delete it immediately.
Reply | Threaded
Open this post in threaded view
|

Re: Why may "tablet read ahead" take long time? (was: Profile a (batch) scan)

Maxim Kolchin
Hi Adam,

Thank you for the list!

In my case, there is only one client which runs a single query which I'm tracing, so I hope the thread pools have free threads.

I'd like to look at the code which runs in the scope of the "table read ahead". Do you know where I should look at? I tried to search by keywords on GitHub, but it wasn't able to find a string "table read ahead" which is given to the tracer.

Cheers,
Maxim

On Tue, Jan 15, 2019 at 9:01 PM Adam Fuchs <[hidden email]> wrote:
Hi Maxim,

What you're seeing is an artifact of the threading model that Accumulo uses. When you launch a query, Accumulo tablet servers will coordinate RPCs via Thrift in one thread pool (which grows unbounded) and queue up scans (rfile lookups, decryption/decompression, iterators, etc.) in another threadpool known as the readahead pool (which has a fixed number of threads). You're seeing everything that happens in that readahead thread in one big chunk. You may need to look a bit deeper into profiling/sampling tablet server CPU to get insights into how to improve your query performance. If you want to speed up queries in general you might try (in no particular order):
1. Increase parallelism by bumping up the readahead threads (tserver.readahead.concurrent.max). This will still be bounded by the number of parallel scans clients are driving.
2. Increase parallelism driven by clients by querying more, smaller ranges, or by splitting tablets.
3. Increase scan batch sizes if the readahead thread or thrift coordination overhead is high.
4. Optimize custom iterators if that is a CPU bottleneck.
5. Increase cache sizes or otherwise modify queries to improve cache hit rates.
6. Change compression settings if that is a CPU bottleneck. Try snappy instead of gz.

Cheers,
Adam

On Tue, Jan 15, 2019, 10:45 AM Maxim Kolchin <[hidden email] wrote:
Hi all,

I try to trace some scans with Zipkin and see that quite often the trace called "tablet read ahead" takes 10x or 100x more time than the other similar traces.

Why it may happen? What could be done to reduce the time? I found a similar discussion on the list, but it doesn't have an answer. I'd be great to have a how-to article listing some steps which could be done.

Attaching a screenshot of one of the traces having this issue.

Maxim Kolchin

E-mail: [hidden email]
Tel.: +7 (911) 199-55-73
Homepage: http://kolchinmax.ru

Below you can find a good example of what I'm struggling to understand
right now. It's a trace for a simple scan over some columns with a
BatchScanner using 75 threads. The scan takes 877 milliseconds and the main
contributor is the entry "tablet read ahead 1", which starts at 248 ms.
These are the questions that I cannot answer with this trace:

   1. why this heavy operation starts after 248ms? By summing up the delay
   before this operation you get a number which is not even close to 248ms.
   2. what does "tablet read ahead 1" means? In general, how to map the
   entries of a trace to their meaning? Is there a guide about this?
   3. why "tablet read ahead 1" takes 600ms? It's clearly not the sum of
   the entries under this one but that's the important part.
   4. I may be naive but...how much data have been read by this scan? How
   many entries? That's very important to understand what's going on.

Thanks for the help,

Mario

877+ 0 Dice@h01 counts
2+ 7 tserver@h12 startScan
6+ 10 tserver@h15 startScan
5+ 11 tserver@h15 metadata tablets read ahead 4
843+ 34 Dice@h01 batch scanner 74- 1
620+ 230 tserver@h09 startMultiScan
600+ 248 tserver@h09 tablet read ahead 1
22+ 299 tserver@h09 newDFSInputStream
22+ 299 tserver@h09 getBlockLocations
2+ 310 tserver@h09 ClientNamenodeProtocol#getBlockLocations
1+ 321 tserver@h09 getFileInfo
1+ 321 tserver@h09 ClientNamenodeProtocol#getFileInfo
2+ 322 tserver@h09 DFSInputStream#byteArrayRead
1+ 324 tserver@h09 DFSInputStream#byteArrayRead
2+ 831 tserver@h09 DFSInputStream#byteArrayRead
2+ 834 tserver@h09 DFSInputStream#byteArrayRead
1+ 835 tserver@h09 BlockReaderLocal#fillBuffer(1091850413)
1+ 874 tserver@h09 closeMultiScan
-- 
Mario Pastorelli | TERALYTICS

*software engineer*

Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
phone: +41794381682
email: [hidden email]
www.teralytics.net

Company registration number: CH-020.3.037.709-7 | Trade register Canton
Zurich
Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann
de Vries

This e-mail message contains confidential information which is for the sole
attention and use of the intended recipient. Please notify us at once if
you think that it may not be intended for you and delete it immediately.
Reply | Threaded
Open this post in threaded view
|

Re: Why may "tablet read ahead" take long time? (was: Profile a (batch) scan)

Keith Turner
Maxim,

Below are some pointers into the code you asked about.  Let me know if
you have any questions about the code.

The following tablet server code puts a task on a thread pool for a batch scan.

https://github.com/apache/accumulo/blob/rel/1.9.2/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServer.java#L776

This thread pool is created here.

https://github.com/apache/accumulo/blob/rel/1.9.2/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServerResourceManager.java#L227

The function that creates the thread pool calls the following code to
wrap the thread pool

https://github.com/apache/accumulo/blob/rel/1.9.2/server/tserver/src/main/java/org/apache/accumulo/tserver/TabletServerResourceManager.java#L111

Keith

On Wed, Jan 16, 2019 at 4:09 AM Maxim Kolchin <[hidden email]> wrote:

>
> Hi Adam,
>
> Thank you for the list!
>
> In my case, there is only one client which runs a single query which I'm tracing, so I hope the thread pools have free threads.
>
> I'd like to look at the code which runs in the scope of the "table read ahead". Do you know where I should look at? I tried to search by keywords on GitHub, but it wasn't able to find a string "table read ahead" which is given to the tracer.
>
> Cheers,
> Maxim
>
> On Tue, Jan 15, 2019 at 9:01 PM Adam Fuchs <[hidden email]> wrote:
>>
>> Hi Maxim,
>>
>> What you're seeing is an artifact of the threading model that Accumulo uses. When you launch a query, Accumulo tablet servers will coordinate RPCs via Thrift in one thread pool (which grows unbounded) and queue up scans (rfile lookups, decryption/decompression, iterators, etc.) in another threadpool known as the readahead pool (which has a fixed number of threads). You're seeing everything that happens in that readahead thread in one big chunk. You may need to look a bit deeper into profiling/sampling tablet server CPU to get insights into how to improve your query performance. If you want to speed up queries in general you might try (in no particular order):
>> 1. Increase parallelism by bumping up the readahead threads (tserver.readahead.concurrent.max). This will still be bounded by the number of parallel scans clients are driving.
>> 2. Increase parallelism driven by clients by querying more, smaller ranges, or by splitting tablets.
>> 3. Increase scan batch sizes if the readahead thread or thrift coordination overhead is high.
>> 4. Optimize custom iterators if that is a CPU bottleneck.
>> 5. Increase cache sizes or otherwise modify queries to improve cache hit rates.
>> 6. Change compression settings if that is a CPU bottleneck. Try snappy instead of gz.
>>
>> Cheers,
>> Adam
>>
>> On Tue, Jan 15, 2019, 10:45 AM Maxim Kolchin <[hidden email] wrote:
>>>
>>> Hi all,
>>>
>>> I try to trace some scans with Zipkin and see that quite often the trace called "tablet read ahead" takes 10x or 100x more time than the other similar traces.
>>>
>>> Why it may happen? What could be done to reduce the time? I found a similar discussion on the list, but it doesn't have an answer. I'd be great to have a how-to article listing some steps which could be done.
>>>
>>> Attaching a screenshot of one of the traces having this issue.
>>>
>>> Maxim Kolchin
>>>
>>> E-mail: [hidden email]
>>> Tel.: +7 (911) 199-55-73
>>> Homepage: http://kolchinmax.ru
>>>
>>>> Below you can find a good example of what I'm struggling to understand
>>>> right now. It's a trace for a simple scan over some columns with a
>>>> BatchScanner using 75 threads. The scan takes 877 milliseconds and the main
>>>> contributor is the entry "tablet read ahead 1", which starts at 248 ms.
>>>> These are the questions that I cannot answer with this trace:
>>>>
>>>>    1. why this heavy operation starts after 248ms? By summing up the delay
>>>>    before this operation you get a number which is not even close to 248ms.
>>>>    2. what does "tablet read ahead 1" means? In general, how to map the
>>>>    entries of a trace to their meaning? Is there a guide about this?
>>>>    3. why "tablet read ahead 1" takes 600ms? It's clearly not the sum of
>>>>    the entries under this one but that's the important part.
>>>>    4. I may be naive but...how much data have been read by this scan? How
>>>>    many entries? That's very important to understand what's going on.
>>>>
>>>> Thanks for the help,
>>>>
>>>> Mario
>>>>
>>>> 877+ 0 Dice@h01 counts
>>>> 2+ 7 tserver@h12 startScan
>>>> 6+ 10 tserver@h15 startScan
>>>> 5+ 11 tserver@h15 metadata tablets read ahead 4
>>>> 843+ 34 Dice@h01 batch scanner 74- 1
>>>> 620+ 230 tserver@h09 startMultiScan
>>>> 600+ 248 tserver@h09 tablet read ahead 1
>>>> 22+ 299 tserver@h09 newDFSInputStream
>>>> 22+ 299 tserver@h09 getBlockLocations
>>>> 2+ 310 tserver@h09 ClientNamenodeProtocol#getBlockLocations
>>>> 1+ 321 tserver@h09 getFileInfo
>>>> 1+ 321 tserver@h09 ClientNamenodeProtocol#getFileInfo
>>>> 2+ 322 tserver@h09 DFSInputStream#byteArrayRead
>>>> 1+ 324 tserver@h09 DFSInputStream#byteArrayRead
>>>> 2+ 831 tserver@h09 DFSInputStream#byteArrayRead
>>>> 2+ 834 tserver@h09 DFSInputStream#byteArrayRead
>>>> 1+ 835 tserver@h09 BlockReaderLocal#fillBuffer(1091850413)
>>>> 1+ 874 tserver@h09 closeMultiScan
>>>> --
>>>> Mario Pastorelli | TERALYTICS
>>>>
>>>> *software engineer*
>>>>
>>>> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
>>>> phone: +41794381682
>>>> email: [hidden email]
>>>> www.teralytics.net
>>>>
>>>> Company registration number: CH-020.3.037.709-7 | Trade register Canton
>>>> Zurich
>>>> Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann
>>>> de Vries
>>>>
>>>> This e-mail message contains confidential information which is for the sole
>>>> attention and use of the intended recipient. Please notify us at once if
>>>> you think that it may not be intended for you and delete it immediately.