[GitHub] accumulo pull request #293: ACCUMULO-4669 Use windowed statistics in RFile

ctubbsii
GitHub user keith-turner opened a pull request:

    https://github.com/apache/accumulo/pull/293

    ACCUMULO-4669 Use windowed statistics in RFile

   

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/keith-turner/accumulo ACCUMULO-4669

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/accumulo/pull/293.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #293
   
----
commit 9c893f606f5cb44fc2fd2e6a4a3721e2c29edf95
Author: Keith Turner <[hidden email]>
Date:   2017-08-15T14:20:42Z

    ACCUMULO-4669 Use windowed statistics in RFile

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---

[GitHub] accumulo pull request #293: ACCUMULO-4669 Use windowed statistics in RFile

ctubbsii
GitHub user mikewalch commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/293#discussion_r134035002
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/file/rfile/RFile.java ---
    @@ -403,7 +402,8 @@ public void flushIfNeeded() throws IOException {
     
         private SampleLocalityGroupWriter sample;
     
    -    private SummaryStatistics keyLenStats = new SummaryStatistics();
    +    // Use windowed stats to fix ACCUMULO-4669
    +    private RollingStats keyLenStats = new RollingStats(2017);
    --- End diff --
   
    Why 2017 for the window size?


---

[GitHub] accumulo pull request #293: ACCUMULO-4669 Use windowed statistics in RFile

ctubbsii
GitHub user mikewalch commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/293#discussion_r134035312
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/file/rfile/RollingStats.java ---
    @@ -0,0 +1,114 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license
    + * agreements. See the NOTICE file distributed with this work for additional information regarding
    + * copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance with the License. You may obtain a
    + * copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software distributed under the License
    + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
    + * or implied. See the License for the specific language governing permissions and limitations under
    + * the License.
    + */
    +package org.apache.accumulo.core.file.rfile;
    +
    +import org.apache.commons.math3.stat.StatUtils;
    +import org.apache.commons.math3.util.FastMath;
    +
    +/**
    + * This class supports efficient window statistics. Apache commons math3 has a class called DescriptiveStatistics that supports windows. DescriptiveStatistics
    + * recomputes the statistics over the entire window each time its requested. In a test over 1,000,000 entries with a window size of 1019 that requested stats
    + * for each entry this class took ~50ms and DescriptiveStatistics took ~6,000ms.
    + *
    + * <p>
    + * This class may not be as accurate as DescriptiveStatistics. In unit test its within 1/1000 of DescriptiveStatistics.
    + */
    +class RollingStats {
    +  private int position;
    +  private double window[];
    +
    +  private double average;
    +  private double variance;
    +  private double stddev;
    +
    +  // indicates if the window is full
    +  private boolean windowFull;
    +
    +  private int recomputeCounter = 0;
    +
    +  RollingStats(int windowSize) {
    +    this.windowFull = false;
    +    this.position = 0;
    +    this.window = new double[windowSize];
    +  }
    +
    +  /**
    +   * @see <a href= "http://jonisalonen.com/2014/efficient-and-accurate-rolling-standard-deviation/">Efficient and accurate rolling standard deviation</a>
    +   */
    +  private void update(double n, double o, int w) {
    --- End diff --
   
    I guess `n` & `o` are for new & old.  Could use `newValue` & `oldValue` instead to make things clear.  What is `w`?
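
For illustration, here is a minimal sketch of a rolling mean and variance update written with descriptive names, based on the replace-one-value formula from the blog post linked in the Javadoc above. The class `RollingWindowSketch` and all of its members are hypothetical; this is not the `RollingStats` code from the pull request. In particular, it assumes Welford's update while the window is still filling, and it leaves out any periodic recomputation (the diff shows a `recomputeCounter` field, but its use is not visible here).

```java
/** Hypothetical sketch of windowed mean/variance; not the RollingStats class from the PR. */
final class RollingWindowSketch {
  private final double[] window;   // circular buffer holding the most recent values
  private int position = 0;        // slot that the next value will overwrite
  private int count = 0;           // number of values currently in the window
  private double mean = 0.0;
  private double sumSqDev = 0.0;   // sum of squared deviations from the mean

  RollingWindowSketch(int windowSize) {
    if (windowSize < 2) {
      throw new IllegalArgumentException("windowSize must be at least 2");
    }
    this.window = new double[windowSize];
  }

  void add(double newValue) {
    if (count < window.length) {
      // Window not yet full: grow the statistics with Welford's online update.
      count++;
      double delta = newValue - mean;
      mean += delta / count;
      sumSqDev += delta * (newValue - mean);
    } else {
      // Window full: swap the oldest value for the new one in constant time.
      double oldValue = window[position];
      double oldMean = mean;
      mean = oldMean + (newValue - oldValue) / window.length;
      sumSqDev += (newValue - oldValue) * (newValue - mean + oldValue - oldMean);
      // Floating-point rounding can push the sum slightly below zero.
      sumSqDev = Math.max(sumSqDev, 0.0);
    }
    window[position] = newValue;
    position = (position + 1) % window.length;
  }

  double getMean() {
    return mean;
  }

  /** Sample variance (divides by count - 1); zero until two values have been seen. */
  double getVariance() {
    return count > 1 ? sumSqDev / (count - 1) : 0.0;
  }

  double getStandardDeviation() {
    return Math.sqrt(getVariance());
  }
}
```

With a window of 3, adding 1, 2, 3, 4, 5 leaves 3, 4, 5 in the window, so the sketch reports a mean of 4 and a sample variance of 1. Because each update is incremental, rounding error can accumulate over long runs, which is one reason an implementation might recompute from the window occasionally.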


---

[GitHub] accumulo pull request #293: ACCUMULO-4669 Use windowed statistics in RFile

ctubbsii
GitHub user mikewalch commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/293#discussion_r134034922
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/file/rfile/RFile.java ---
    @@ -416,8 +416,9 @@ public void flushIfNeeded() throws IOException {
         }
     
         private boolean isGiantKey(Key k) {
    -      // consider a key thats more than 3 standard deviations from previously seen key sizes as giant
    -      return k.getSize() > keyLenStats.getMean() + keyLenStats.getStandardDeviation() * 3;
    +      double mean = keyLenStats.getMean();
    +      double stddev = keyLenStats.getStandardDeviation();
    +      return k.getSize() > mean + 3 * Math.max(mean, stddev);
    --- End diff --
   
    Could add some comments here explaining why this logic is used to determine whether a key is giant.
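
As a rough illustration of what the change does, the sketch below puts the removed and added rules side by side as standalone methods. The class `GiantKeySketch` and its method names are hypothetical, and the comments give only one plausible reading of the change, not the author's stated rationale.

```java
/** Hypothetical side-by-side of the two thresholds in the diff above. */
final class GiantKeySketch {

  /** Rule removed by the diff: more than 3 standard deviations above the mean. */
  static boolean isGiantOld(long keySize, double mean, double stddev) {
    return keySize > mean + stddev * 3;
  }

  /**
   * Rule added by the diff. Math.max(mean, stddev) is never smaller than the mean,
   * so the threshold is at least 4 times the mean even when the standard deviation
   * is tiny (for example, when nearly all keys have the same size).
   */
  static boolean isGiantNew(long keySize, double mean, double stddev) {
    return keySize > mean + 3 * Math.max(mean, stddev);
  }
}
```

For example, with a mean of 10 and a standard deviation of 0.5, the old rule flags a 12-byte key (12 > 11.5), while the new rule does not (12 is not greater than 40) and still flags a 100-byte key.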

