[GitHub] accumulo pull request #180: Taking a crack at new summarization API

classic Classic list List threaded Threaded
24 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #180: Taking a crack at new summarization API

ctubbsii
GitHub user keith-turner opened a pull request:

    https://github.com/apache/accumulo/pull/180

    Taking a crack at new summarization API

    I finally came up with an API for ACCUMULO-4501 that I like.  I think it offers the following three things :
   
     * The ability to write super efficient summarizers
     * The ability to write very arbitrary summarizations
     * Gives the user explicit control over how summary info is merged.
   
    The pull request just contains a possible API.   I am creating the PR just to review the APIs. It also contains two example summarizes.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/keith-turner/accumulo ACCUMULO-4501-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/accumulo/pull/180.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #180
   
----
commit 654d20ce488e981748b980c5d57247b4da9296d2
Author: Keith Turner <[hidden email]>
Date:   2016-11-07T20:44:41Z

    Taking a crack at new summarization API

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #180: Taking a crack at new summarization API

ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/180#discussion_r87075263
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/admin/TableSummary.java ---
    @@ -0,0 +1,59 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.accumulo.core.client.admin;
    +
    +import java.util.Collections;
    +import java.util.Map;
    +
    +public class TableSummary {
    +  public enum Status {
    +    /**
    +     * Requested summary was computed for all data
    +     */
    +    PRESENT,
    +
    +    /**
    +     * Requested summary were computed for some data.
    +     */
    +    PARTIALLY_PRESENT,
    --- End diff --
   
    This would be for data in memory not summarized in a file?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo issue #180: Taking a crack at new summarization API

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on the issue:

    https://github.com/apache/accumulo/pull/180
 
    Took a quick glance and this looks very concise. The concrete examples are also super-duper helpful.
   
    One thing that was in the back of my mind is how to support "bounded summaries". For example, we would want to avoid storing 1M CVs if a user had that many in a table (for some reason). Could/should this be a primitive on summaries (how to limit them)? Or, would it be some implementation detail which we could provide some common-implementation logic?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #180: Taking a crack at new summarization API

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/180#discussion_r87076122
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/summary/ColumnVisibilitySummarizer.java ---
    @@ -0,0 +1,87 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.accumulo.core.client.summary;
    +
    +import java.util.HashMap;
    +import java.util.Map;
    +
    +import org.apache.accumulo.core.data.ByteSequence;
    +import org.apache.accumulo.core.data.Key;
    +import org.apache.accumulo.core.data.Value;
    +import org.apache.commons.lang.mutable.MutableLong;
    +
    +public class ColumnVisibilitySummarizer implements KeyValueSummarizer {
    +
    +  private static final int MAX = 5000;
    +  private static final String PREFIX = "cv:";
    +  private static final String IGNORE_KEY = "ignored";
    +
    +  // map for computing summary incrementally stores information in efficient form
    +  private Map<ByteSequence,MutableLong> summary = new HashMap<>();
    +  private long ignored = 0;
    +
    +  @Override
    +  public String getId() {
    +    return "accumulo.cvsummary";
    +  }
    +
    +  @Override
    +  public void collect(Key k, Value v) {
    +    ByteSequence cv = k.getColumnVisibilityData();
    +
    +    MutableLong ml = summary.get(cv);
    +    if (ml == null) {
    +      if (summary.size() >= MAX) {
    +        ignored++;
    +      } else {
    +        summary.put(cv, new MutableLong(1));
    --- End diff --
   
    Yeah, not sure perf wise. I know that LongAdder is supposed to be better than AtomicLong WRT perf, but not sure how big of an impact it has over single-threaded cases :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo issue #180: Taking a crack at new summarization API

ctubbsii
In reply to this post by ctubbsii
Github user keith-turner commented on the issue:

    https://github.com/apache/accumulo/pull/180
 
    @rweeks this is an attempt to generalize what we discussed at the accumulo summit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #180: Taking a crack at new summarization API

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/180#discussion_r87076330
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/admin/TableSummary.java ---
    @@ -0,0 +1,59 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.accumulo.core.client.admin;
    +
    +import java.util.Collections;
    +import java.util.Map;
    +
    +public class TableSummary {
    +  public enum Status {
    +    /**
    +     * Requested summary was computed for all data
    +     */
    +    PRESENT,
    +
    +    /**
    +     * Requested summary were computed for some data.
    +     */
    +    PARTIALLY_PRESENT,
    --- End diff --
   
    Gotcha -- thanks.
   
    Would data resident only in memory be omitted by design? In other words, summaries are defined to only apply to RFiles and thus we don't need special API constructs to represent that condition. I think that makes sense.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo issue #180: Taking a crack at new summarization API

ctubbsii
In reply to this post by ctubbsii
Github user keith-turner commented on the issue:

    https://github.com/apache/accumulo/pull/180
 
    > For example, we would want to avoid storing 1M CVs if a user had that many in a table (for some reason).
   
    I think we should address this issue in some way while considering the following.
   
     * Fetching summaries should be relatively fast.  Gigantic summaries will stymie this goal.
     * When a users summarizer does produce a gigantic summary, it would be nice if we helped them debug it.
   
    I am thinking one way to accomplish these goals is to store gigantic summaries, but only read summaries under a certain size.  The size of a serialized summary could be written first.  When a summary is read this size will be the first bit of info.  If the summary is over a certain size an error could be logged and that file would be treated like it had no summary.  We could also add a enum that indicates gigantic summaries were present.  Since the summary is stored, it would give the user a chance to use rfile-info to look at whats in the summary for debugging.
   
    We also need to stress in the javadoc that summaries are intended to be small.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #180: Taking a crack at new summarization API

ctubbsii
In reply to this post by ctubbsii
Github user mjwall commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/180#discussion_r87471051
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/admin/TableOperations.java ---
    @@ -808,4 +808,11 @@ void setSamplerConfiguration(String tableName, SamplerConfiguration samplerConfi
        * @since 1.8.0
        */
       SamplerConfiguration getSamplerConfiguration(String tableName) throws TableNotFoundException, AccumuloException, AccumuloSecurityException;
    +
    +  /**
    +   *
    +   */
    +  default SummaryRetriever summaries() {
    --- End diff --
   
    Is this way you envision getting getting the summary info?  So summaries are per table, that seems right to me


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #180: Taking a crack at new summarization API

ctubbsii
In reply to this post by ctubbsii
Github user mjwall commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/180#discussion_r87470048
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/admin/SummaryRetriever.java ---
    @@ -0,0 +1,46 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.accumulo.core.client.admin;
    +
    +import java.util.Map;
    +
    +import org.apache.hadoop.io.Text;
    +
    +public interface SummaryRetriever {
    +
    +  /**
    +   * Allows optionally setting start row before retrieving data.
    +   */
    +  SummaryRetriever start(Text startRow);
    --- End diff --
   
    What are you thinking about here?  What if your start row is in the middle of a split?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #180: Taking a crack at new summarization API

ctubbsii
In reply to this post by ctubbsii
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/180#discussion_r87477044
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/admin/TableOperations.java ---
    @@ -808,4 +808,11 @@ void setSamplerConfiguration(String tableName, SamplerConfiguration samplerConfi
        * @since 1.8.0
        */
       SamplerConfiguration getSamplerConfiguration(String tableName) throws TableNotFoundException, AccumuloException, AccumuloSecurityException;
    +
    +  /**
    +   *
    +   */
    +  default SummaryRetriever summaries() {
    --- End diff --
   
    yeah.. I forgot to include the table name as a parameter, I intended to do that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #180: Taking a crack at new summarization API

ctubbsii
In reply to this post by ctubbsii
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/180#discussion_r87477228
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/admin/SummaryRetriever.java ---
    @@ -0,0 +1,46 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.accumulo.core.client.admin;
    +
    +import java.util.Map;
    +
    +import org.apache.hadoop.io.Text;
    +
    +public interface SummaryRetriever {
    +
    +  /**
    +   * Allows optionally setting start row before retrieving data.
    +   */
    +  SummaryRetriever start(Text startRow);
    --- End diff --
   
    I was thinking the first table it overlapped, summaries would be retrieved for that.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #180: Taking a crack at new summarization API

ctubbsii
In reply to this post by ctubbsii
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/180#discussion_r87478881
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/admin/SummaryRetriever.java ---
    @@ -0,0 +1,46 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.accumulo.core.client.admin;
    +
    +import java.util.Map;
    +
    +import org.apache.hadoop.io.Text;
    +
    +public interface SummaryRetriever {
    +
    +  /**
    +   * Allows optionally setting start row before retrieving data.
    +   */
    +  SummaryRetriever start(Text startRow);
    --- End diff --
   
    I was also thinking that multiple summaries could be kept per rfile. That could work something like the following.
   
     * After 1,000 key values are written create a Tuple\<end row, summary\>
     * Add this tuple to an in memory list of the form List\<Tuble\<endrow, summary\>> called summaryList.  
     * If summaryList exceeds 10, then merge as merge(summaryList[0],  summaryList[1]),  merge(summaryList[2],  summaryList[3]), etc... halving the size of the list.
   
    So basically we could easily build more granular summary info within a file that, does not exceed a fixed size, by leveraging the merge functionality of a summarizer.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #180: Taking a crack at new summarization API

ctubbsii
In reply to this post by ctubbsii
Github user mjwall commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/180#discussion_r87478497
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/admin/SummaryRetriever.java ---
    @@ -0,0 +1,46 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.accumulo.core.client.admin;
    +
    +import java.util.Map;
    +
    +import org.apache.hadoop.io.Text;
    +
    +public interface SummaryRetriever {
    +
    +  /**
    +   * Allows optionally setting start row before retrieving data.
    +   */
    +  SummaryRetriever start(Text startRow);
    --- End diff --
   
    Makes sense to me


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #180: Taking a crack at new summarization API

ctubbsii
In reply to this post by ctubbsii
Github user mjwall commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/180#discussion_r87478852
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/admin/TableSummary.java ---
    @@ -0,0 +1,59 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.accumulo.core.client.admin;
    +
    +import java.util.Collections;
    +import java.util.Map;
    +
    +public class TableSummary {
    +  public enum Status {
    +    /**
    +     * Requested summary was computed for all data
    +     */
    +    PRESENT,
    +
    +    /**
    +     * Requested summary were computed for some data.
    +     */
    +    PARTIALLY_PRESENT,
    --- End diff --
   
    We talked a little about duplicate counts being returned, when an rfile is referenced by more than one tablet in the summarized range.  Is there an easy way to note a result included duplicate counts, or is that something we just document.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #180: Taking a crack at new summarization API

ctubbsii
In reply to this post by ctubbsii
Github user mjwall commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/180#discussion_r87479084
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/summary/ColumnVisibilitySummarizer.java ---
    @@ -0,0 +1,91 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.accumulo.core.client.summary;
    +
    +import java.util.HashMap;
    +import java.util.Map;
    +
    +import org.apache.accumulo.core.data.ByteSequence;
    +import org.apache.accumulo.core.data.Key;
    +import org.apache.accumulo.core.data.Value;
    +import org.apache.commons.lang.mutable.MutableLong;
    +
    +public class ColumnVisibilitySummarizer implements KeyValueSummarizer {
    +
    +  private static final int MAX = 5000;
    +  private static final String PREFIX = "cv:";
    +  private static final String IGNORE_KEY = "ignored";
    +
    +  // Map used for computing summary incrementally uses ByteSequence for key which is more efficient than converting colvis to String for each Key. The
    +  // conversion to String is deferred until the summary is requested.  This shows how the interface enables users to write efficient summarizers.
    +  private Map<ByteSequence,MutableLong> summary = new HashMap<>();
    +  private long ignored = 0;
    +
    +  @Override
    +  public String getId() {
    +    return "accumulo.cvsummary";
    +  }
    +
    +  @Override
    +  public void collect(Key k, Value v) {
    --- End diff --
   
    Is this the function that would be used when an rfile is compacted to aggregate the counts.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #180: Taking a crack at new summarization API

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/180#discussion_r87481569
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/admin/SummaryRetriever.java ---
    @@ -0,0 +1,46 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.accumulo.core.client.admin;
    +
    +import java.util.Map;
    +
    +import org.apache.hadoop.io.Text;
    +
    +public interface SummaryRetriever {
    +
    +  /**
    +   * Allows optionally setting start row before retrieving data.
    +   */
    +  SummaryRetriever start(Text startRow);
    --- End diff --
   
    That's a neat idea, Keith. Sounds like it would be prudent to define an interface so that we could easily stub out different implementations on how summaries are collected/merged.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #180: Taking a crack at new summarization API

ctubbsii
In reply to this post by ctubbsii
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/180#discussion_r90980177
 
    --- Diff: core/src/main/java/org/apache/accumulo/core/client/admin/SummaryRetriever.java ---
    @@ -0,0 +1,46 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.accumulo.core.client.admin;
    +
    +import java.util.Map;
    +
    +import org.apache.hadoop.io.Text;
    +
    +public interface SummaryRetriever {
    +
    +  /**
    +   * Allows optionally setting start row before retrieving data.
    +   */
    +  SummaryRetriever start(Text startRow);
    --- End diff --
   
    @joshelser I created something called SummaryStore in ab8deb7.  It generates ~10 summaries for input data without knowing how many keys will be added beforhand.  Uses the approach I mentioned above of merging summaries.  One additional step I discovered I needed to do was double the number of key/vals per summary each time I merged.
   
    So if I have 10 summaries each covering 1000 keys, when I merge to 5 then each of those represents 2000 keys.  So now I want start adding summaries every 2000 keys.  Next time I merge, I will start adding summaries every 4000 keys... etc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo issue #180: Taking a crack at new summarization API

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on the issue:

    https://github.com/apache/accumulo/pull/180
 
    (breaking out discussion to top-level instead of rooted on a code change)
   
    >I created something called SummaryStore in ab8deb7. It generates ~10 summaries for input data without knowing how many keys will be added beforhand. Uses the approach I mentioned above of merging summaries. One additional step I discovered I needed to do was double the number of key/vals per summary each time I merged.
   
    > So if I have 10 summaries each covering 1000 keys, when I merge to 5 then each of those represents 2000 keys. So now I want start adding summaries every 2000 keys. Next time I merge, I will start adding summaries every 4000 keys... etc.
   
    The [comment I left here](https://github.com/apache/accumulo/commit/ab8deb76ef96602d27072ce913ee90434bb9c246#commitcomment-20080483) has really got my brain thinking about how this should look. First, can we design a good API without defining what a summary is. Or, reworded, can we restrict what a summary is and come up with a more elegant/efficient implementation.
   
    When it comes to `merge(Map, Map)`, I'm specifically wondering if we force all `KeyValueSummaries` to be commutative, can we manage the merging of summaries better? For example, instead of delegating to the implementation on how these two `Map`s are merged, we then get to merge them ourselves (because we know how to combine these maps).
   
    Should a summary ever not be commutative? Does it make sense for us to allow one to not be?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo issue #180: Taking a crack at new summarization API

ctubbsii
In reply to this post by ctubbsii
Github user keith-turner commented on the issue:

    https://github.com/apache/accumulo/pull/180
 
    > For example, instead of delegating to the implementation on how these two Maps are merged, we then get to merge them ourselves (because we know how to combine these maps).
   
    How would we merge them?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo issue #180: Taking a crack at new summarization API

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on the issue:

    https://github.com/apache/accumulo/pull/180
 
    > How would we merge them?
   
    Thinking back to the visibility example, we combine the values for common keys between the maps and retain `N` entries with the largest values. Perhaps I didn't do a good job of posing the question explicitly: is it possible to implement this merge generically? Are there cases which we could/should support which would require some specialized knowledge or understanding to merge?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
12