Fwd: Large numbers of authorizations


Fwd: Large numbers of authorizations

mdladakos

---------- Forwarded message ----------
From: Michael Ladakos <[hidden email]>
Date: Fri, Mar 23, 2018 at 12:32 PM
Subject: Large numbers of authorizations
To: [hidden email]


I am somewhat new to Accumulo and was doing some experimentation on the consequences of using large numbers of authorizations.

I found that a user with a large set of authorizations would take a great deal of time to perform a scan. I tested at various increments up to 100,000 authorizations. At that point, it would take at least 25 seconds to perform the scan, even if the table was newly created with no rows.

Scanning with a small subset of those authorizations is as fast as scanning as a user who only has a small number of authorizations.

I tried to find the place in the code where this work is done, because I wanted to understand what causes the slowdown, but I wasn't able to track down the exact class. Any chance I could get an explanation or be pointed in the right direction?

Thanks!


Re: Large numbers of authorizations

Keith Turner
This is the code that scans use to filter based on column visibility
and authorizations.  It has a cache of previously seen column
visibilities and the decision that was made for those.

  https://github.com/apache/accumulo/blob/2e171cdb8420f817ff9ebeb23f9d8a70b0878ca5/core/src/main/java/org/apache/accumulo/core/iterators/system/VisibilityFilter.java

The following code does the evaluation.

  https://github.com/apache/accumulo/blob/f81a8ec7410e789d11941351d5899b8894c6a322/core/src/main/java/org/apache/accumulo/core/security/VisibilityEvaluator.java
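To make the caching idea concrete, here is a minimal, self-contained sketch. It is not the real VisibilityFilter or VisibilityEvaluator code: the class name CachedVisibilityCheck and its methods are hypothetical, and the evaluator is simplified to handle only "A&B" conjunctions, whereas real visibility expressions also allow "|" and parentheses.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not Accumulo's implementation.
class CachedVisibilityCheck {
    private final Set<String> auths;
    // Cache: visibility expression -> decision made the first time it was seen.
    private final Map<String, Boolean> decisions = new HashMap<>();
    private int evaluations = 0;

    CachedVisibilityCheck(Set<String> auths) {
        this.auths = new HashSet<>(auths);
    }

    // Simplified evaluator: every "&"-separated term must be in the auth set.
    // An empty expression is treated as visible to everyone.
    private boolean evaluate(String expression) {
        evaluations++;
        if (expression.isEmpty()) {
            return true;
        }
        for (String term : expression.split("&")) {
            if (!auths.contains(term)) {
                return false;
            }
        }
        return true;
    }

    // Returns the cached decision if this expression was seen before,
    // otherwise evaluates it once and remembers the result.
    boolean accept(String expression) {
        Boolean cached = decisions.get(expression);
        if (cached == null) {
            cached = evaluate(expression);
            decisions.put(expression, cached);
        }
        return cached;
    }

    // Number of times an expression actually had to be evaluated.
    int evaluationCount() {
        return evaluations;
    }
}
```

With a cache like this, a table that repeats a handful of visibility expressions pays the evaluation cost once per distinct expression rather than once per key.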


Re: Large numbers of authorizations

mdladakos
Keith, thanks for your quick response!

Maybe I wasn't clear enough or I am not understanding your explanation.

What I was exploring was performing a scan with a large number of
authorizations. While I did use tables with thousands of rows, I also ran
scans against empty tables, and those still took ~25 seconds. So
VisibilityEvaluator shouldn't be involved, should it?

I don't think the actual filtering is the problem. Is there some work done
by the tablet servers when receiving the scan request, specifically in
regard to user authorizations?

Again, if I used -s to pass a subset of the authorizations for the user with
100000 authorizations, the return time was equivalent to that of a user who
has only that many authorizations (i.e., if I scanned with 100 authorizations
out of the 100000, the scan ran at the normal, fast speed).



--
Sent from: http://apache-accumulo.1065345.n5.nabble.com/Users-f2.html

Re: Large numbers of authorizations

Keith Turner
On Fri, Mar 23, 2018 at 4:06 PM, mdladakos <[hidden email]> wrote:
> Keith, thanks for your quick response!
>
> Maybe I wasn't clear enough or I am not understanding your explanation.
>
> What I was exploring was performing a scan with a large number of
> authorizations. While I did use tables with thousands of rows, I also ran
> scans against empty tables, and those still took ~25 seconds. So
> VisibilityEvaluator shouldn't be involved, should it?

Gotcha.  So one possibility is it's just taking a while to send the
auths from the client to the tserver.  The following code is the
thrift RPC to start a scan.  client.startScan() is passed
scanState.authorizations.getAuthorizationsBB(), which is the auths.
The getAuthorizationsBB() method does a copy.  So there is a copy,
then thrift has to serialize the auths, send them, and then deserialize
them on the server side, and this is done for each startScan RPC.  The
startScan call happens once per tablet; subsequent batches of key/vals
from a tablet are fetched using the continueScan RPC, which does not pass
the auths again.

https://github.com/apache/accumulo/blob/1e4d4827096bd0047c7de3e0b672263defe66634/core/src/main/java/org/apache/accumulo/core/client/impl/ThriftScanner.java#L429

It would be interesting to see how long the call to startScan takes
for your case.  Enabling trace logging for ThriftScanner will give
some insight into this.
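As a rough way to gauge the copy step on its own, the sketch below copies a list of auths into fresh ByteBuffers and times it. This is only an approximation under stated assumptions: AuthCopySketch, copyAuths and makeAuths are hypothetical names, the auth strings are made up, and the real getAuthorizationsBB() and the thrift serialization that follows it may behave differently.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a per-RPC copy of the auths; not Accumulo code.
class AuthCopySketch {
    // Copies every authorization into its own ByteBuffer so that later
    // changes to the source arrays do not affect the copy.
    static List<ByteBuffer> copyAuths(List<byte[]> auths) {
        List<ByteBuffer> copy = new ArrayList<>(auths.size());
        for (byte[] auth : auths) {
            copy.add(ByteBuffer.wrap(auth.clone()));
        }
        return copy;
    }

    // Builds synthetic auths named auth0, auth1, ... for the experiment.
    static List<byte[]> makeAuths(int count) {
        List<byte[]> auths = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            auths.add(("auth" + i).getBytes(StandardCharsets.UTF_8));
        }
        return auths;
    }

    public static void main(String[] args) {
        List<byte[]> auths = makeAuths(100_000);
        long start = System.nanoTime();
        List<ByteBuffer> copy = copyAuths(auths);
        long micros = (System.nanoTime() - start) / 1_000;
        System.out.println("copied " + copy.size() + " auths in " + micros + " us");
    }
}
```

The copy is linear in the number of auths, so timing it separately from the RPC can help narrow down whether the copy, the serialization, or something on the server side dominates.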


Re: Large numbers of authorizations

Keith Turner
In reply to this post by mdladakos
On Fri, Mar 23, 2018 at 4:06 PM, mdladakos <[hidden email]> wrote:

> Keith, thanks for your quick response!
>
> Maybe I wasn't clear enough or I am not understanding your explanation.
>
> What I was exploring was performing a scan with a large number of
> authorizations. While I did use tables with thousands of rows, I also ran
> scans against empty tables, and those still took ~25 seconds. So
> VisibilityEvaluator shouldn't be involved, should it?
>
> I don't think the actual filtering is the problem. Is there some work done
> by the tablet servers when receiving the scan request, specifically in
> regard to user authorizations?
>
> Again, if I used -s to pass a subset of the authorizations for the user with
> 100000 authorizations, the return time was equivalent to that of a user who
> has only that many authorizations (i.e., if I scanned with 100 authorizations
> out of the 100000, the scan ran at the normal, fast speed).

I think the following code may be the problem. The collection
userauths is a list, so performance will be O(M*N).  If M and N are both
100K, then it's not good.  If userauths were a set, this would be much
faster for the case you are testing.

https://github.com/apache/accumulo/blob/17bc708dcabd17824a8378597e0542002470ed18/server/base/src/main/java/org/apache/accumulo/server/security/handler/ZKAuthorizor.java#L166
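The difference between the two data structures can be sketched as follows. This is an illustration of the list-versus-set point only, not the ZKAuthorizor code: AuthValidationSketch and its methods are hypothetical names, and the auths are synthetic. ByteBuffer is used as the element type because it compares by content, whereas a raw byte[] does not.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of list vs. set membership checks.
class AuthValidationSketch {
    // Builds synthetic auths named auth0, auth1, ...
    static List<ByteBuffer> makeAuths(int count) {
        List<ByteBuffer> auths = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            auths.add(ByteBuffer.wrap(("auth" + i).getBytes(StandardCharsets.UTF_8)));
        }
        return auths;
    }

    // O(M*N): every requested auth triggers a linear search of the user's list.
    static boolean isValidWithList(List<ByteBuffer> userAuths, List<ByteBuffer> requested) {
        for (ByteBuffer auth : requested) {
            if (!userAuths.contains(auth)) {
                return false;
            }
        }
        return true;
    }

    // Roughly O(M+N): build a hash set once, then do constant-time lookups.
    static boolean isValidWithSet(List<ByteBuffer> userAuths, List<ByteBuffer> requested) {
        Set<ByteBuffer> userSet = new HashSet<>(userAuths);
        for (ByteBuffer auth : requested) {
            if (!userSet.contains(auth)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 20,000 keeps the quadratic case short; raise it to see the gap widen.
        List<ByteBuffer> userAuths = makeAuths(20_000);
        List<ByteBuffer> requested = makeAuths(20_000);

        long start = System.nanoTime();
        isValidWithList(userAuths, requested);
        long listMillis = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        isValidWithSet(userAuths, requested);
        long setMillis = (System.nanoTime() - start) / 1_000_000;

        System.out.println("list: " + listMillis + " ms, set: " + setMillis + " ms");
    }
}
```

Scanning with only 100 of the 100K auths keeps the list version at about 100 linear searches, which fits the observation that a small subset scans at normal speed.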

Re: Large numbers of authorizations

Keith Turner
On Fri, Mar 23, 2018 at 11:55 PM, Keith Turner <[hidden email]> wrote:

> I think the following code may be the problem. The collection
> userauths is a list, so performance will be O(M*N).  If M and N are both
> 100K, then it's not good.  If userauths were a set, this would be much
> faster for the case you are testing.
>
> https://github.com/apache/accumulo/blob/17bc708dcabd17824a8378597e0542002470ed18/server/base/src/main/java/org/apache/accumulo/server/security/handler/ZKAuthorizor.java#L166

This code is called on the server side to check whether the auths passed
by a scan are valid.


Re: Large numbers of authorizations

Dong Zhou-2
Depending on how long each authorization string is, you might run into the ZooKeeper znode storage limit.

jute.maxbuffer:

(Java system property: jute.maxbuffer)

This option can only be set as a Java system property. There is no zookeeper prefix on it. It specifies the maximum size of the data that can be stored in a znode. The default is 0xfffff, or just under 1M. If this option is changed, the system property must be set on all servers and clients otherwise problems will arise. This is really a sanity check. ZooKeeper is designed to store data on the order of kilobytes in size.
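A back-of-the-envelope check against that default can be sketched like this. The storage layout is an assumption made for illustration (all of a user's auths in a single znode as one comma-separated string); ZnodeSizeEstimate and its methods are hypothetical names, and the real serialized form may add overhead.

```java
// Hypothetical size estimate; the comma-separated layout is an assumption.
class ZnodeSizeEstimate {
    // Default jute.maxbuffer from the ZooKeeper docs: 0xfffff bytes.
    static final long DEFAULT_JUTE_MAXBUFFER = 0xfffffL;

    // Bytes needed for `count` auths of `bytesPerAuth` bytes each,
    // joined by single-byte commas.
    static long estimateBytes(int count, int bytesPerAuth) {
        if (count == 0) {
            return 0;
        }
        return (long) count * bytesPerAuth + (count - 1);
    }

    // True if the estimate stays within the default limit.
    static boolean fitsInDefaultZnode(int count, int bytesPerAuth) {
        return estimateBytes(count, bytesPerAuth) <= DEFAULT_JUTE_MAXBUFFER;
    }
}
```

Under this assumption, 100,000 auths of 8 bytes each come to 899,999 bytes and fit under the 1,048,575-byte default, while 100,000 auths of 10 bytes each come to 1,099,999 bytes and do not.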




Re: Large numbers of authorizations

Keith Turner
In reply to this post by mdladakos
I found and fixed this issue.

https://github.com/apache/accumulo/pull/410
