[GitHub] accumulo pull request #224: ACCUMULO-4500 ACCUMULO-96 Added summarization

classic Classic list List threaded Threaded
117 messages Options
1234 ... 6
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #224: ACCUMULO-4500 ACCUMULO-96 Added summarization

ctubbsii
GitHub user keith-turner opened a pull request:

    https://github.com/apache/accumulo/pull/224

    ACCUMULO-4500 ACCUMULO-96 Added summarization

   

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/keith-turner/accumulo ACCUMULO-4501-5

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/accumulo/pull/224.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #224
   
----
commit e2c56510ed502fadaf583aaf6c08476ed6721f70
Author: Keith Turner <[hidden email]>
Date:   2017-03-01T02:56:19Z

    ACCUMULO-4500 ACCUMULO-96 Added summarization

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #224: ACCUMULO-4500 ACCUMULO-96 Added summarization

ctubbsii
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/224#discussion_r103731088
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/summary/summarizers/VisibilitySummarizer.java ---
    @@ -0,0 +1,37 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.accumulo.core.client.summary.summarizers;
    +
    +import java.util.function.UnaryOperator;
    +
    +import org.apache.accumulo.core.client.summary.CountingSummarizer;
    +import org.apache.accumulo.core.data.ArrayByteSequence;
    +import org.apache.accumulo.core.data.ByteSequence;
    +
    +public class VisibilitySummarizer extends CountingSummarizer<ByteSequence> {
    --- End diff --
   
    needs javadoc & since tag


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #224: ACCUMULO-4500 ACCUMULO-96 Added summarization

ctubbsii
In reply to this post by ctubbsii
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/224#discussion_r103731048
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/summary/summarizers/FamilySummarizer.java ---
    @@ -0,0 +1,36 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.accumulo.core.client.summary.summarizers;
    +
    +import java.util.function.UnaryOperator;
    +
    +import org.apache.accumulo.core.client.summary.CountingSummarizer;
    +import org.apache.accumulo.core.data.ArrayByteSequence;
    +import org.apache.accumulo.core.data.ByteSequence;
    +
    +public class FamilySummarizer extends CountingSummarizer<ByteSequence> {
    --- End diff --
   
    needs javadoc & since tag


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #224: ACCUMULO-4500 ACCUMULO-96 Added summarization

ctubbsii
In reply to this post by ctubbsii
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/224#discussion_r103730257
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/summary/Summary.java ---
    @@ -0,0 +1,126 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.accumulo.core.client.summary;
    +
    +import java.util.Map;
    +
    +import com.google.common.collect.ImmutableMap;
    +
    +/**
    + * This class encapsulates summary statistics, information about how those statistics were generated, and information about files the statistics were obtained
    + * from.
    + *
    + * @see Summarizer
    + */
    --- End diff --
   
    needs since tag


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #224: ACCUMULO-4500 ACCUMULO-96 Added summarization

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/224#discussion_r103724796
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/impl/TableOperationsImpl.java ---
    @@ -1661,4 +1676,138 @@ public Locations locate(String tableName, Collection<Range> ranges) throws Accum
     
         return new LoctionsImpl(binnedRanges);
       }
    +
    +  @Override
    +  public SummaryRetriever getSummaries(String tableName) {
    +
    +    return new SummaryRetriever() {
    +
    +      private Text startRow = null;
    +      private Text endRow = null;
    +      private List<TSummarizerConfiguration> summariesToFetch = Collections.emptyList();
    +      private String summarizerClassRegex;
    +      private boolean flush = false;
    +
    +      @Override
    +      public SummaryRetriever startRow(Text startRow) {
    +        Objects.requireNonNull(startRow);
    +        if (endRow != null) {
    +          Preconditions.checkArgument(startRow.compareTo(endRow) < 0, "Start row must be less than end row : %s >= %s", startRow, endRow);
    +        }
    +        this.startRow = startRow;
    +        return this;
    +      }
    +
    +      @Override
    +      public SummaryRetriever startRow(CharSequence startRow) {
    +        return startRow(new Text(startRow.toString()));
    +      }
    +
    +      @Override
    +      public List<Summary> retrieve() throws AccumuloException, AccumuloSecurityException, TableNotFoundException {
    +        String tableId = Tables.getTableId(context.getInstance(), tableName);
    +        if (Tables.getTableState(context.getInstance(), tableId) == TableState.OFFLINE)
    +          throw new TableOfflineException(context.getInstance(), tableId);
    +
    +        TRowRange range = new TRowRange(TextUtil.getByteBuffer(startRow), TextUtil.getByteBuffer(endRow));
    +        TSummaryRequest request = new TSummaryRequest(tableId, range, summariesToFetch, summarizerClassRegex);
    +        if (flush) {
    +          _flush(tableId, startRow, endRow, true);
    +        }
    +
    +        ClientContext cct = new ClientContext(context.getInstance(), context.getCredentials(), context.getConfiguration()) {
    +          @Override
    +          public long getClientTimeoutInMillis() {
    +            return Math.max(super.getClientTimeoutInMillis(), 60 * 60 * 1000);
    --- End diff --
   
    This is forcing a minimum client timeout of 60mins which could be irritating for someone who set a 10minute timeout. Is this what you intended?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #224: ACCUMULO-4500 ACCUMULO-96 Added summarization

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/224#discussion_r103721888
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/admin/NewTableConfiguration.java ---
    @@ -121,8 +128,22 @@ public NewTableConfiguration setProperties(Map<String,String> prop) {
        */
       public NewTableConfiguration enableSampling(SamplerConfiguration samplerConfiguration) {
         requireNonNull(samplerConfiguration);
    -    SamplerConfigurationImpl.checkDisjoint(properties, samplerConfiguration);
    -    this.samplerConfiguration = samplerConfiguration;
    +    Map<String,String> tmp = new SamplerConfigurationImpl(samplerConfiguration).toTablePropertiesMap();
    --- End diff --
   
    Maybe a static utility to see if we can do an (unsafe) cast to `SamplerConfigurationImpl` to avoid an unnecessary object creation?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #224: ACCUMULO-4500 ACCUMULO-96 Added summarization

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/224#discussion_r103721572
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/admin/NewTableConfiguration.java ---
    @@ -121,8 +128,22 @@ public NewTableConfiguration setProperties(Map<String,String> prop) {
        */
       public NewTableConfiguration enableSampling(SamplerConfiguration samplerConfiguration) {
         requireNonNull(samplerConfiguration);
    -    SamplerConfigurationImpl.checkDisjoint(properties, samplerConfiguration);
    -    this.samplerConfiguration = samplerConfiguration;
    +    Map<String,String> tmp = new SamplerConfigurationImpl(samplerConfiguration).toTablePropertiesMap();
    +    checkDisjoint(properties, tmp, summarizerProps);
    +    this.samplerProps = tmp;
    --- End diff --
   
    Seems odd to be checking disjoint with `summarizerProps` but then updating `samplerProps` instead. Maybe javadoc on those member variables would help describe what they each hold.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #224: ACCUMULO-4500 ACCUMULO-96 Added summarization

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/224#discussion_r103725852
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/impl/TableOperationsImpl.java ---
    @@ -1661,4 +1676,138 @@ public Locations locate(String tableName, Collection<Range> ranges) throws Accum
     
         return new LoctionsImpl(binnedRanges);
       }
    +
    +  @Override
    +  public SummaryRetriever getSummaries(String tableName) {
    +
    +    return new SummaryRetriever() {
    +
    +      private Text startRow = null;
    +      private Text endRow = null;
    +      private List<TSummarizerConfiguration> summariesToFetch = Collections.emptyList();
    +      private String summarizerClassRegex;
    +      private boolean flush = false;
    +
    +      @Override
    +      public SummaryRetriever startRow(Text startRow) {
    +        Objects.requireNonNull(startRow);
    +        if (endRow != null) {
    +          Preconditions.checkArgument(startRow.compareTo(endRow) < 0, "Start row must be less than end row : %s >= %s", startRow, endRow);
    +        }
    +        this.startRow = startRow;
    +        return this;
    +      }
    +
    +      @Override
    +      public SummaryRetriever startRow(CharSequence startRow) {
    +        return startRow(new Text(startRow.toString()));
    +      }
    +
    +      @Override
    +      public List<Summary> retrieve() throws AccumuloException, AccumuloSecurityException, TableNotFoundException {
    +        String tableId = Tables.getTableId(context.getInstance(), tableName);
    +        if (Tables.getTableState(context.getInstance(), tableId) == TableState.OFFLINE)
    +          throw new TableOfflineException(context.getInstance(), tableId);
    +
    +        TRowRange range = new TRowRange(TextUtil.getByteBuffer(startRow), TextUtil.getByteBuffer(endRow));
    +        TSummaryRequest request = new TSummaryRequest(tableId, range, summariesToFetch, summarizerClassRegex);
    +        if (flush) {
    +          _flush(tableId, startRow, endRow, true);
    +        }
    +
    +        ClientContext cct = new ClientContext(context.getInstance(), context.getCredentials(), context.getConfiguration()) {
    +          @Override
    +          public long getClientTimeoutInMillis() {
    +            return Math.max(super.getClientTimeoutInMillis(), 60 * 60 * 1000);
    +          }
    +        };
    +        TSummaries ret = ServerClient.execute(cct, c -> c.getSummaries(Tracer.traceInfo(), context.rpcCreds(), request));
    +        return new SummaryCollection(ret).getSummaries();
    +      }
    +
    +      @Override
    +      public SummaryRetriever endRow(Text endRow) {
    +        Objects.requireNonNull(endRow);
    +        if (startRow != null) {
    +          Preconditions.checkArgument(startRow.compareTo(endRow) < 0, "Start row must be less than end row : %s >= %s", startRow, endRow);
    +        }
    +        this.endRow = endRow;
    +        return this;
    +      }
    +
    +      @Override
    +      public SummaryRetriever endRow(CharSequence endRow) {
    +        return endRow(new Text(endRow.toString()));
    +      }
    +
    +      @Override
    +      public SummaryRetriever withConfiguration(Collection<SummarizerConfiguration> configs) {
    +        Objects.requireNonNull(configs);
    +        summariesToFetch = configs.stream().map(SummarizerConfigurationUtil::toThrift).collect(Collectors.toList());
    +        return this;
    +      }
    +
    +      @Override
    +      public SummaryRetriever withConfiguration(SummarizerConfiguration... config) {
    +        Objects.requireNonNull(config);
    +        return withConfiguration(Arrays.asList(config));
    +      }
    +
    +      @Override
    +      public SummaryRetriever withMatchingConfiguration(String regex) {
    +        Objects.requireNonNull(regex);
    +        // Do a sanity check here to make sure that regex compiles, instead of having it fail on a tserver.
    +        Pattern.compile(regex);
    +        this.summarizerClassRegex = regex;
    +        return this;
    +      }
    +
    +      @Override
    +      public SummaryRetriever flush(boolean b) {
    +        this.flush = b;
    +        return this;
    +      }
    +    };
    +  }
    +
    +  @Override
    +  public void addSummarizers(String tableName, SummarizerConfiguration... newConfigs) throws AccumuloException, AccumuloSecurityException,
    +      TableNotFoundException {
    +    HashSet<SummarizerConfiguration> currentConfigs = new HashSet<>(SummarizerConfiguration.fromTableProperties(getProperties(tableName)));
    +    HashSet<SummarizerConfiguration> newConfigSet = new HashSet<>(Arrays.asList(newConfigs));
    +
    +    newConfigSet.removeIf(sc -> currentConfigs.contains(sc));
    +
    +    Set<String> newIds = newConfigSet.stream().map(sc -> sc.getPropertyId()).collect(toSet());
    --- End diff --
   
    Such wizardy.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #224: ACCUMULO-4500 ACCUMULO-96 Added summarization

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/224#discussion_r103724151
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/impl/TableOperationsImpl.java ---
    @@ -1661,4 +1676,138 @@ public Locations locate(String tableName, Collection<Range> ranges) throws Accum
     
         return new LoctionsImpl(binnedRanges);
       }
    +
    +  @Override
    +  public SummaryRetriever getSummaries(String tableName) {
    +
    +    return new SummaryRetriever() {
    --- End diff --
   
    Pull this out to a private static class instead of doing it inline? Would help testing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #224: ACCUMULO-4500 ACCUMULO-96 Added summarization

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/224#discussion_r103720966
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/admin/NewTableConfiguration.java ---
    @@ -42,7 +46,13 @@
       private boolean limitVersion = true;
     
       private Map<String,String> properties = new HashMap<>();
    -  private SamplerConfiguration samplerConfiguration;
    +  private Map<String,String> samplerProps = Collections.emptyMap();
    +  private Map<String,String> summarizerProps = Collections.emptyMap();
    +
    +  private void checkDisjoint(Map<String,String> props, Map<String,String> sampleProps, Map<String,String> summrizerProps) {
    --- End diff --
   
    s/summrizerProps/summarizerProps/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #224: ACCUMULO-4500 ACCUMULO-96 Added summarization

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/224#discussion_r103723011
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/admin/SummaryRetriever.java ---
    @@ -0,0 +1,107 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.accumulo.core.client.admin;
    +
    +import java.util.Collection;
    +import java.util.List;
    +
    +import org.apache.accumulo.core.client.AccumuloException;
    +import org.apache.accumulo.core.client.AccumuloSecurityException;
    +import org.apache.accumulo.core.client.TableNotFoundException;
    +import org.apache.accumulo.core.client.summary.SummarizerConfiguration;
    +import org.apache.accumulo.core.client.summary.Summary;
    +import org.apache.hadoop.io.Text;
    +
    +/**
    + * This interface allows configuring where and which summary data to retrieve before retrieving it.
    + *
    + * @since 2.0.0
    + */
    +public interface SummaryRetriever {
    +
    +  /**
    +   * Summary data is only retrieved from data that has been written to files. Data recently written to Accumulo may be in memory and there will not show up in
    +   * summary data. Setting this option to true force tablets in the range to minor compact before summary data is retrieved. By default the table will not be
    +   * flushed before retrieving summary data.
    +   *
    +   * @return this
    +   */
    +  SummaryRetriever flush(boolean shouldFlush);
    +
    +  /**
    +   * Allows optionally setting start row before retrieving data. The start row is not inclusive.
    --- End diff --
   
    Nit: Move "optionally" to it's own sentence (e.g. "The start row is not inclusive. Calling this method is optional.") or lead with it (e.g. "Optional, allows setting a start row").
   
    Same applies to the other methods.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #224: ACCUMULO-4500 ACCUMULO-96 Added summarization

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/224#discussion_r103729447
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/summary/CountingSummarizer.java ---
    @@ -0,0 +1,302 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.accumulo.core.client.summary;
    +
    +import java.util.HashMap;
    +import java.util.List;
    +import java.util.Map;
    +import java.util.Map.Entry;
    +import java.util.function.Consumer;
    +import java.util.function.Function;
    +import java.util.function.UnaryOperator;
    +import java.util.stream.Collectors;
    +
    +import org.apache.accumulo.core.client.summary.summarizers.VisibilitySummarizer;
    +import org.apache.accumulo.core.data.Key;
    +import org.apache.accumulo.core.data.Value;
    +import org.apache.commons.lang.mutable.MutableLong;
    +
    +//checkstyle and formatter are in conflict
    +//@formatter:off
    +/**
    + * This class counts arbitrary keys while defending against too many keys and keys that are too long.
    + *
    + * <p>
    + * During collection and summarization this class will use the functions from {@link #converter()} and {@link #encoder()}. For each key/value the function from
    + * {@link #converter()} will be called to create zero or more counter objects. A counter associated with each counter object will be incremented, as long as
    + * there are not too many counters and the counter object is not too long.
    + *
    + * <p>
    + * When {@link Summarizer.Collector#summarize(Summarizer.StatisticConsumer)} is called, the function from {@link #encoder()} will be used to convert counter
    + * objects to strings. These strings will be used to emit statistics. Overriding {@link #encoder()} is optional. One reason to override is if the counter object
    + * contains binary or special data. For example, a function that base64 encodes counter objects could be created.
    + *
    + * <p>
    + * If the counter key type is mutable, then consider overriding {@link #copier()}.
    + *
    + * <p>
    + * The function returned by {@link #converter()} will be called frequently and should be very efficient. The function returned by {@link #encoder()} will be
    + * called less frequently and can be more expensive. The reason these two functions exists is to avoid the conversion to string for each key value, if that
    + * conversion is unnecessary.
    + *
    + * <p>
    + * Below is an example implementation that counts column visibilities. This example avoids converting column visibility to string for each key/value. This
    + * example shows the source code for {@link VisibilitySummarizer}.
    + *
    + * <pre>
    + * <code>
    + *   public class VisibilitySummarizer extends CountingSummarizer&lt;ByteSequence&gt; {
    + *     &#064;Override
    + *     protected UnaryOperator&lt;ByteSequence&gt; copier() {
    + *       // ByteSequences are mutable, so override and provide a copy function
    + *       return ArrayByteSequence::new;
    + *     }
    + *
    + *     &#064;Override
    + *     protected Converter&lt;ByteSequence&gt; converter() {
    + *       return (key, val, consumer) -&gt; consumer.accept(key.getColumnVisibilityData());
    + *     }
    + *   }
    + * </code>
    + * </pre>
    + *
    + * @param <K>
    + *          The counter key type. This type must have good implementations of {@link Object#hashCode()} and {@link Object#equals(Object)}.
    + * @see CounterSummary
    + * @since 2.0.0
    + */
    +//@formatter:on
    +public abstract class CountingSummarizer<K> implements Summarizer {
    +
    +  /**
    +   * A configuration option for specifying the maximum number of unique counters an instance of this summarizer should track. If not specified, a default of
    +   * {@value #MAX_COUNTER_DEFAULT} will be used.
    +   */
    +  public static final String MAX_COUNTERS_OPT = "maxCounters";
    +
    +  /**
    +   * A configuration option for specifying the maximum length of an individual counter key. If not specified, a default of {@value #MAX_CKL_DEFAULT} will be
    +   * used.
    +   */
    +  public static final String MAX_COUNTER_LEN_OPT = "maxCounterLen";
    +
    +  /**
    +   * A configuration option to determine if delete keys should be counted. If set to true then delete keys will not be passed to the {@link Converter} and the
    +   * statistic {@value #DELETES_IGNORED_STAT} will track the number of deleted ignored. This options defaults to {@value #INGNORE_DELETES_DEFAULT}.
    +   */
    +  public static final String INGNORE_DELETES_OPT = "ignoreDeletes";
    +
    +  /**
    +   * This prefixes all counters when emitting statistics in {@link Summarizer.Collector#summarize(Summarizer.StatisticConsumer)}.
    +   */
    +  public static final String COUNTER_STAT_PREFIX = "c:";
    +
    +  /**
    +   * This is the name of the statistic that tracks how many counters objects were ignored because the number of unique counters was exceeded. The max number of
    +   * unique counters is specified by {@link #MAX_COUNTERS_OPT}.
    +   */
    +  public static final String TOO_MANY_STAT = "tooMany";
    +
    +  /**
    +   * This is the name of the statistic that tracks how many counter objects were ignored because they were too long. The maximum lenght is specified by
    +   * {@link #MAX_COUNTER_LEN_OPT}.
    +   */
    +  public static final String TOO_LONG_STAT = "tooLong";
    +
    +  /**
    +   * This is the name of the statistic that tracks the total number of counter objects emitted by the {@link Converter}. This includes emitted Counter objects
    +   * that were ignored.
    +   */
    +  public static final String EMITTED_STAT = "emitted";
    +
    +  /**
    +   * This is the name of the statistic that tracks the total number of deleted keys seen. This statistic is only incremented when the
    +   * {@value #INGNORE_DELETES_OPT} option is set to true.
    +   */
    +  public static final String DELETES_IGNORED_STAT = "deletesIgnored";
    +
    +  /**
    +   * This tracks the total number of key/values seen by the {@link Summarizer.Collector}
    +   */
    +  public static final String SEEN_STAT = "seen";
    +
    +  // this default can not be changed as persisted summary data depends on it
    --- End diff --
   
    Can you expand on this comment?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #224: ACCUMULO-4500 ACCUMULO-96 Added summarization

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/224#discussion_r103726746
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/mapreduce/lib/impl/FileOutputConfigurator.java ---
    @@ -209,4 +210,12 @@ public static void setSampler(Class<?> implementingClass, Configuration conf, Sa
         }
       }
     
    +  public static void setSummarizers(Class<?> implementingClass, Configuration conf, SummarizerConfiguration[] sumarizerConfigs) {
    +    Map<String,String> props = SummarizerConfiguration.toTableProperties(sumarizerConfigs);
    +
    +    for (Entry<String,String> entry : props.entrySet()) {
    +      conf.set(enumToConfKey(implementingClass, Opts.ACCUMULO_PROPERTIES) + "." + entry.getKey(), entry.getValue());
    --- End diff --
   
    Use a `StringBuilder` to cut down on transient `String`s you're generating.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #224: ACCUMULO-4500 ACCUMULO-96 Added summarization

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/224#discussion_r103726395
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/mapred/AccumuloFileOutputFormat.java ---
    @@ -136,6 +138,20 @@ public static void setSampler(JobConf job, SamplerConfiguration samplerConfig) {
         FileOutputConfigurator.setSampler(CLASS, job, samplerConfig);
       }
     
    +  /**
    +   * Specify a list of summarizer configurations to create summary data in the output file. Each Key Value written will be passed to the configured
    --- End diff --
   
    s/Specify/Specifies/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #224: ACCUMULO-4500 ACCUMULO-96 Added summarization

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/224#discussion_r103731310
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/summary/SummarizerConfiguration.java ---
    @@ -0,0 +1,238 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.accumulo.core.client.summary;
    +
    +import java.util.ArrayList;
    +import java.util.Arrays;
    +import java.util.Collection;
    +import java.util.Collections;
    +import java.util.Map;
    +import java.util.Map.Entry;
    +
    +import org.apache.accumulo.core.summary.SummarizerConfigurationUtil;
    +
    +import com.google.common.base.Preconditions;
    +import com.google.common.collect.ImmutableMap;
    +import com.google.common.hash.Hasher;
    +import com.google.common.hash.Hashing;
    +
    +/**
    + * This class encapsulates the configuration needed to instantiate a {@link Summarizer}. It also provides methods and documentation for setting the table
    + * properties that configure a Summarizer.
    + *
    + * @since 2.0.0
    + */
    +public class SummarizerConfiguration {
    +
    +  private final String className;
    +  private final Map<String,String> options;
    +  private int hashCode = 0;
    +  private final String configId;
    +
    +  private SummarizerConfiguration(String className, String configId, Map<String,String> options) {
    +    this.className = className;
    +    this.options = ImmutableMap.copyOf(options);
    +
    +    if (configId == null) {
    +      ArrayList<String> keys = new ArrayList<>(this.options.keySet());
    +      Collections.sort(keys);
    +      Hasher hasher = Hashing.murmur3_32().newHasher();
    +      hasher.putString(className);
    +      for (String key : keys) {
    +        hasher.putString(key);
    +        hasher.putString(options.get(key));
    +      }
    +
    +      this.configId = hasher.hash().toString();
    +    } else {
    +      this.configId = configId;
    +    }
    +  }
    +
    +  /**
    +   * @return the name of a class that implements @link {@link Summarizer}.
    +   */
    +  public String getClassName() {
    +    return className;
    +  }
    +
    +  /**
    +   * @return custom options for a {link @Summarizer}
    +   */
    +  public Map<String,String> getOptions() {
    +    return options;
    +  }
    +
    +  /**
    +   * The propertyId is used to when creating table properties for a summarizer. Its not used for equality or hashCode for this class.
    +   */
    +  public String getPropertyId() {
    +    return configId;
    +  }
    +
    +  @Override
    +  public String toString() {
    +    return className + " " + configId + " " + options;
    +  }
    +
    +  /**
    +   * Compares the classname and options to determine equality.
    +   */
    +  @Override
    +  public boolean equals(Object o) {
    +    if (o instanceof SummarizerConfiguration) {
    +      SummarizerConfiguration osc = (SummarizerConfiguration) o;
    +      return className.equals(osc.className) && options.equals(osc.options);
    +    }
    +
    +    return false;
    +  }
    +
    +  /**
    +   * Hashes the classname and options to create a hashcode.
    +   */
    +  @Override
    +  public int hashCode() {
    +    if (hashCode == 0) {
    +      hashCode = 31 * options.hashCode() + className.hashCode();
    +    }
    +    return hashCode;
    +  }
    +
    +  /**
    +   * Converts this configuration to Accumulo per table properties. The returned map has the following key values. The {@code <configId>} below is from
    +   * {@link #getPropertyId()}. The {@code <optionKey>} and {@code <optionValue>} below are derived from the key values of {@link #getOptions()}.
    +   *
    +   * <pre>
    +   * {@code
    +   *   table.summarizer.<configId>=<classname>
    +   *   table.summarizer.<configId>.opt.<optionKey1>=<optionValue1>
    +   *   table.summarizer.<configId>.opt.<optionKey2>=<optionValue2>
    +   *      .
    +   *      .
    +   *      .
    +   *   table.summarizer.<configId>.opt.<optionKeyN>=<optionValueN>
    +   * }
    +   * </pre>
    +   */
    +  public Map<String,String> toTableProperties() {
    +    return SummarizerConfigurationUtil.toTablePropertiesMap(Collections.singletonList(this));
    +  }
    +
    +  /**
    +   * Encodes each configuration in the same way as {@link #toTableProperties()}.
    +   *
    +   * @throws IllegalArgumentException
    +   *           when there are duplicate values for {@link #getPropertyId()}
    +   */
    +  public static Map<String,String> toTableProperties(SummarizerConfiguration... configurations) {
    +    return SummarizerConfigurationUtil.toTablePropertiesMap(Arrays.asList(configurations));
    +  }
    +
    +  /**
    +   * Encodes each configuration in the same way as {@link #toTableProperties()}.
    +   *
    +   * @throws IllegalArgumentException
    +   *           when there are duplicate values for {@link #getPropertyId()}
    +   */
    +  public static Map<String,String> toTableProperties(Collection<SummarizerConfiguration> configurations) {
    +    return SummarizerConfigurationUtil.toTablePropertiesMap(new ArrayList<SummarizerConfiguration>(configurations));
    +  }
    +
    +  /**
    +   * Decodes table properties with the prefix {@code table.summarizer} into {@link SummarizerConfiguration} objects. Table properties with prefixes other than
    +   * {@code table.summarizer} are ignored.
    +   */
    +  public static Collection<SummarizerConfiguration> fromTableProperties(Map<String,String> props) {
    +    return fromTableProperties(props.entrySet());
    +  }
    +
    +  /**
    +   * @see #fromTableProperties(Map)
    +   */
    +  public static Collection<SummarizerConfiguration> fromTableProperties(Iterable<Entry<String,String>> props) {
    +    return SummarizerConfigurationUtil.getSummarizerConfigs(props);
    +  }
    +
    +  public static class Builder {
    +    private String className;
    +    private ImmutableMap.Builder<String,String> imBuilder;
    +    private String configId = null;
    +
    +    private Builder(String className) {
    +      this.className = className;
    +      this.imBuilder = ImmutableMap.builder();
    +    }
    +
    +    /**
    +     * Setting this is optional. If not set, an id is generated using hashing that will likely be unique.
    --- End diff --
   
    3rd-person declarative, please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #224: ACCUMULO-4500 ACCUMULO-96 Added summarization

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/224#discussion_r103722621
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/admin/SummaryRetriever.java ---
    @@ -0,0 +1,107 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.accumulo.core.client.admin;
    +
    +import java.util.Collection;
    +import java.util.List;
    +
    +import org.apache.accumulo.core.client.AccumuloException;
    +import org.apache.accumulo.core.client.AccumuloSecurityException;
    +import org.apache.accumulo.core.client.TableNotFoundException;
    +import org.apache.accumulo.core.client.summary.SummarizerConfiguration;
    +import org.apache.accumulo.core.client.summary.Summary;
    +import org.apache.hadoop.io.Text;
    +
    +/**
    + * This interface allows configuring where and which summary data to retrieve before retrieving it.
    + *
    + * @since 2.0.0
    + */
    +public interface SummaryRetriever {
    +
    +  /**
    +   * Summary data is only retrieved from data that has been written to files. Data recently written to Accumulo may be in memory and there will not show up in
    --- End diff --
   
    Using 3-person declarative (the style used on the other methods) is good for Javadoc, IMO.
   
    e.g. `Forces a flush of data in tablets before summary data is retrieved. ....:


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #224: ACCUMULO-4500 ACCUMULO-96 Added summarization

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/224#discussion_r103731270
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/summary/SummarizerConfiguration.java ---
    @@ -0,0 +1,238 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.accumulo.core.client.summary;
    +
    +import java.util.ArrayList;
    +import java.util.Arrays;
    +import java.util.Collection;
    +import java.util.Collections;
    +import java.util.Map;
    +import java.util.Map.Entry;
    +
    +import org.apache.accumulo.core.summary.SummarizerConfigurationUtil;
    +
    +import com.google.common.base.Preconditions;
    +import com.google.common.collect.ImmutableMap;
    +import com.google.common.hash.Hasher;
    +import com.google.common.hash.Hashing;
    +
    +/**
    + * This class encapsulates the configuration needed to instantiate a {@link Summarizer}. It also provides methods and documentation for setting the table
    + * properties that configure a Summarizer.
    + *
    + * @since 2.0.0
    + */
    +public class SummarizerConfiguration {
    +
    +  private final String className;
    +  private final Map<String,String> options;
    +  private int hashCode = 0;
    +  private final String configId;
    +
    +  private SummarizerConfiguration(String className, String configId, Map<String,String> options) {
    +    this.className = className;
    +    this.options = ImmutableMap.copyOf(options);
    +
    +    if (configId == null) {
    +      ArrayList<String> keys = new ArrayList<>(this.options.keySet());
    +      Collections.sort(keys);
    +      Hasher hasher = Hashing.murmur3_32().newHasher();
    +      hasher.putString(className);
    +      for (String key : keys) {
    +        hasher.putString(key);
    +        hasher.putString(options.get(key));
    +      }
    +
    +      this.configId = hasher.hash().toString();
    +    } else {
    +      this.configId = configId;
    +    }
    +  }
    +
    +  /**
    +   * @return the name of a class that implements @link {@link Summarizer}.
    +   */
    +  public String getClassName() {
    +    return className;
    +  }
    +
    +  /**
    +   * @return custom options for a {link @Summarizer}
    +   */
    +  public Map<String,String> getOptions() {
    +    return options;
    +  }
    +
    +  /**
    +   * The propertyId is used to when creating table properties for a summarizer. Its not used for equality or hashCode for this class.
    +   */
    +  public String getPropertyId() {
    +    return configId;
    +  }
    +
    +  @Override
    +  public String toString() {
    +    return className + " " + configId + " " + options;
    +  }
    +
    +  /**
    +   * Compares the classname and options to determine equality.
    +   */
    +  @Override
    +  public boolean equals(Object o) {
    +    if (o instanceof SummarizerConfiguration) {
    +      SummarizerConfiguration osc = (SummarizerConfiguration) o;
    +      return className.equals(osc.className) && options.equals(osc.options);
    +    }
    +
    +    return false;
    +  }
    +
    +  /**
    +   * Hashes the classname and options to create a hashcode.
    +   */
    +  @Override
    +  public int hashCode() {
    +    if (hashCode == 0) {
    +      hashCode = 31 * options.hashCode() + className.hashCode();
    +    }
    +    return hashCode;
    +  }
    +
    +  /**
    +   * Converts this configuration to Accumulo per table properties. The returned map has the following key values. The {@code <configId>} below is from
    +   * {@link #getPropertyId()}. The {@code <optionKey>} and {@code <optionValue>} below are derived from the key values of {@link #getOptions()}.
    +   *
    +   * <pre>
    +   * {@code
    +   *   table.summarizer.<configId>=<classname>
    +   *   table.summarizer.<configId>.opt.<optionKey1>=<optionValue1>
    +   *   table.summarizer.<configId>.opt.<optionKey2>=<optionValue2>
    +   *      .
    +   *      .
    +   *      .
    +   *   table.summarizer.<configId>.opt.<optionKeyN>=<optionValueN>
    +   * }
    +   * </pre>
    +   */
    +  public Map<String,String> toTableProperties() {
    +    return SummarizerConfigurationUtil.toTablePropertiesMap(Collections.singletonList(this));
    +  }
    +
    +  /**
    +   * Encodes each configuration in the same way as {@link #toTableProperties()}.
    +   *
    +   * @throws IllegalArgumentException
    +   *           when there are duplicate values for {@link #getPropertyId()}
    +   */
    +  public static Map<String,String> toTableProperties(SummarizerConfiguration... configurations) {
    +    return SummarizerConfigurationUtil.toTablePropertiesMap(Arrays.asList(configurations));
    +  }
    +
    +  /**
    +   * Encodes each configuration in the same way as {@link #toTableProperties()}.
    +   *
    +   * @throws IllegalArgumentException
    +   *           when there are duplicate values for {@link #getPropertyId()}
    +   */
    +  public static Map<String,String> toTableProperties(Collection<SummarizerConfiguration> configurations) {
    +    return SummarizerConfigurationUtil.toTablePropertiesMap(new ArrayList<SummarizerConfiguration>(configurations));
    +  }
    +
    +  /**
    +   * Decodes table properties with the prefix {@code table.summarizer} into {@link SummarizerConfiguration} objects. Table properties with prefixes other than
    +   * {@code table.summarizer} are ignored.
    +   */
    +  public static Collection<SummarizerConfiguration> fromTableProperties(Map<String,String> props) {
    +    return fromTableProperties(props.entrySet());
    +  }
    +
    +  /**
    +   * @see #fromTableProperties(Map)
    +   */
    +  public static Collection<SummarizerConfiguration> fromTableProperties(Iterable<Entry<String,String>> props) {
    +    return SummarizerConfigurationUtil.getSummarizerConfigs(props);
    +  }
    +
    +  public static class Builder {
    +    private String className;
    +    private ImmutableMap.Builder<String,String> imBuilder;
    +    private String configId = null;
    +
    +    private Builder(String className) {
    +      this.className = className;
    +      this.imBuilder = ImmutableMap.builder();
    +    }
    +
    +    /**
    +     * Setting this is optional. If not set, an id is generated using hashing that will likely be unique.
    +     *
    +     * @param propId
    +     *          This is id is used when converting a {@link SummarizerConfiguration} to table properties. Since tables can have multiple summarizers, make sure
    +     *          its unique.
    +     */
    +    public Builder setPropertyId(String propId) {
    +      Preconditions.checkArgument(propId.matches("\\w+"), "Config Id %s is not alphanum", propId);
    +      this.configId = propId;
    +      return this;
    +    }
    +
    +    public Builder addOption(String key, String value) {
    --- End diff --
   
    These methods need javadoc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #224: ACCUMULO-4500 ACCUMULO-96 Added summarization

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/224#discussion_r103726513
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/mapreduce/AccumuloFileOutputFormat.java ---
    @@ -134,6 +136,20 @@ public static void setSampler(Job job, SamplerConfiguration samplerConfig) {
         FileOutputConfigurator.setSampler(CLASS, job.getConfiguration(), samplerConfig);
       }
     
    +  /**
    +   * Specify a list of summarizer configurations to create summary data in the output file. Each Key Value written will be passed to the configured
    --- End diff --
   
    s/Specify/Specifies/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #224: ACCUMULO-4500 ACCUMULO-96 Added summarization

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/224#discussion_r103723297
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/admin/SummaryRetriever.java ---
    @@ -0,0 +1,107 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.accumulo.core.client.admin;
    +
    +import java.util.Collection;
    +import java.util.List;
    +
    +import org.apache.accumulo.core.client.AccumuloException;
    +import org.apache.accumulo.core.client.AccumuloSecurityException;
    +import org.apache.accumulo.core.client.TableNotFoundException;
    +import org.apache.accumulo.core.client.summary.SummarizerConfiguration;
    +import org.apache.accumulo.core.client.summary.Summary;
    +import org.apache.hadoop.io.Text;
    +
    +/**
    + * This interface allows configuring where and which summary data to retrieve before retrieving it.
    + *
    + * @since 2.0.0
    + */
    +public interface SummaryRetriever {
    +
    +  /**
    +   * Summary data is only retrieved from data that has been written to files. Data recently written to Accumulo may be in memory and there will not show up in
    +   * summary data. Setting this option to true force tablets in the range to minor compact before summary data is retrieved. By default the table will not be
    +   * flushed before retrieving summary data.
    +   *
    +   * @return this
    +   */
    +  SummaryRetriever flush(boolean shouldFlush);
    +
    +  /**
    +   * Allows optionally setting start row before retrieving data. The start row is not inclusive.
    +   */
    +  SummaryRetriever startRow(Text startRow);
    +
    +  /**
    +   * Allows optionally setting start row before retrieving data. The start row is not inclusive.
    +   */
    +  SummaryRetriever startRow(CharSequence startRow);
    +
    +  /**
    +   * Allows optionally setting end row before retrieving data. The end row is inclusive.
    +   */
    +  SummaryRetriever endRow(Text endRow);
    +
    +  /**
    +   * Allows optionally setting end row before retrieving data. The end row is inclusive.
    +   */
    +  SummaryRetriever endRow(CharSequence endRow);
    +
    +  /**
    +   * Gets summaries generated with a configuration that matches the given regex. For a given SummarizationConfiguration it is matched in exactly the following
    +   * way. This allows the regex to match on classname and options.
    --- End diff --
   
    It's not obvious to me what this method is doing under the hood and I don't think the code snippet helps.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #224: ACCUMULO-4500 ACCUMULO-96 Added summarization

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/224#discussion_r103723706
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/admin/TableOperations.java ---
    @@ -808,4 +812,64 @@ void setSamplerConfiguration(String tableName, SamplerConfiguration samplerConfi
        * @since 1.8.0
        */
       SamplerConfiguration getSamplerConfiguration(String tableName) throws TableNotFoundException, AccumuloException, AccumuloSecurityException;
    +
    +  /**
    +   * This is a entry point for retrieving summaries with optional restrictions.
    +   *
    +   * <p>
    +   * Inorder to retrieve Summaries, the Accumulo user making the request will need the {@link TablePermission#GET_SUMMARIES} table permission.
    +   *
    +   * <p>
    +   * Accumulo stores summary data with each file in each tablet. In order to make retrieving it faster there is a per tablet server cache of summary data. The
    +   * size of this cache is determined by the property {code tserver.cache.summary.size}. When summary data for a file is not present, it will be retrieved using
    +   * threads on the tserver. The property {@code tserver.summary.retrieval.threads} determines the max number of threads the tserver will use for this.
    +   *
    +   * <p>
    +   * Since summary data is cached, its important to use the summary selection options to only read the needed data into the cache.
    +   *
    +   * <p>
    +   * Summary data will be merged on the tablet servers and then in this client process. Therefore its important that the required summarizers are on the clients
    --- End diff --
   
    s/its/it's/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
1234 ... 6