[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

classic Classic list List threaded Threaded
24 messages Options
12
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

ctubbsii
GitHub user noedetore opened a pull request:

    https://github.com/apache/accumulo/pull/220

    ACCUMULO-4591 replication latency metrics added

   

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/noedetore/accumulo ACCUMULO-4591

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/accumulo/pull/220.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #220
   
----

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

ctubbsii
Github user ctubbsii commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102622494
 
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/Master.java ---
    @@ -192,6 +193,7 @@
       private ReplicationDriver replicationWorkDriver;
       private WorkDriver replicationWorkAssigner;
       RecoveryManager recoveryManager = null;
    +  private AtomicLong replicationLatency = new AtomicLong(0l);
    --- End diff --
   
    Personal preference: use upper-case `L` for Long literals. It's much easier to read (this looks like `01`).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

ctubbsii
In reply to this post by ctubbsii
Github user ctubbsii commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102623116
 
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/replication/RemoveCompleteReplicationRecords.java ---
    @@ -216,4 +220,13 @@ protected long removeRowIfNecessary(BatchWriter bw, SortedMap<Key,Value> columns
     
         return recordsRemoved;
       }
    +
    +  /*
    +   * Metrics calls snapshot. Largest time stamp will be found for each snapshot then reset.
    +   */
    +  private void collectLatency(Long createdTime) {
    +    long latency = System.currentTimeMillis() - createdTime;
    --- End diff --
   
    `System.currentTimeMillis()` is unreliable for tracking time differences, and this subtraction could even be negative. Use `System.nanoTime()` for time differences/durations. The difference is guaranteed to be positive.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

ctubbsii
In reply to this post by ctubbsii
Github user ctubbsii commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102622870
 
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/replication/RemoveCompleteReplicationRecords.java ---
    @@ -57,9 +58,11 @@
       private static final Logger log = LoggerFactory.getLogger(RemoveCompleteReplicationRecords.class);
     
       private Connector conn;
    +  private Master master;
     
    -  public RemoveCompleteReplicationRecords(Connector conn) {
    +  public RemoveCompleteReplicationRecords(Connector conn, Master master) {
    --- End diff --
   
    Is this connector for the remote peer? If not, the master actually contains a client context and the connector is redundant if the master is being passed in.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

ctubbsii
In reply to this post by ctubbsii
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102770533
 
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/replication/RemoveCompleteReplicationRecords.java ---
    @@ -216,4 +220,13 @@ protected long removeRowIfNecessary(BatchWriter bw, SortedMap<Key,Value> columns
     
         return recordsRemoved;
       }
    +
    +  /*
    +   * Metrics calls snapshot. Largest time stamp will be found for each snapshot then reset.
    +   */
    +  private void collectLatency(Long createdTime) {
    +    long latency = System.currentTimeMillis() - createdTime;
    --- End diff --
   
    I am not sure, but I think this createdTime may come from a persisted source?  @joshelser  is that right?  If its coming from a persisted source, then it could have come from another process.  In that case using `System.currentTimeMillis()` is reasonable.   `System.nanoTime()` can reliably compute time deltas within a process.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

ctubbsii
In reply to this post by ctubbsii
Github user noedetore commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102771300
 
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/replication/RemoveCompleteReplicationRecords.java ---
    @@ -216,4 +220,13 @@ protected long removeRowIfNecessary(BatchWriter bw, SortedMap<Key,Value> columns
     
         return recordsRemoved;
       }
    +
    +  /*
    +   * Metrics calls snapshot. Largest time stamp will be found for each snapshot then reset.
    +   */
    +  private void collectLatency(Long createdTime) {
    +    long latency = System.currentTimeMillis() - createdTime;
    --- End diff --
   
    Using nanoTime() is new to me. In making sure there is no conflict, I did find in org.apache.accumulo.tserver.log.TabletServerLogger setting createdTime
    Status status = StatusUtil.fileCreated(System.currentTimeMillis());


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo issue #220: ACCUMULO-4591 replication latency metrics added

ctubbsii
In reply to this post by ctubbsii
Github user keith-turner commented on the issue:

    https://github.com/apache/accumulo/pull/220
 
    @noedetore is this metric showing the current maximum time for all active replication across all tables with replication enabled?
   
    Does anyone know of a place where individual metrics are documented? I looked in the user manual and did not see anything.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

ctubbsii
In reply to this post by ctubbsii
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102774271
 
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/replication/RemoveCompleteReplicationRecords.java ---
    @@ -216,4 +220,13 @@ protected long removeRowIfNecessary(BatchWriter bw, SortedMap<Key,Value> columns
     
         return recordsRemoved;
       }
    +
    +  /*
    +   * Metrics calls snapshot. Largest time stamp will be found for each snapshot then reset.
    +   */
    +  private void collectLatency(Long createdTime) {
    +    long latency = System.currentTimeMillis() - createdTime;
    --- End diff --
   
    I looked into this a bit more to see if using nanoTime would be reasonable across processes.  The following is from the nanoTime javaodc
   
    ```
    Returns the current value of the running Java Virtual Machine's high-resolution time source, in nanoseconds. This method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time.
    ```
   
    Given this documentation, if the times are persisted and used across multiple processes then I think using `nanoTime()` would be inappropriate.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo issue #220: ACCUMULO-4591 replication latency metrics added

ctubbsii
In reply to this post by ctubbsii
Github user noedetore commented on the issue:

    https://github.com/apache/accumulo/pull/220
 
    @keith-turner yes showing the current maximum for given time(10sec, configurable) when snapshot is called to collect metrics


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

ctubbsii
In reply to this post by ctubbsii
Github user noedetore commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102776327
 
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/Master.java ---
    @@ -192,6 +193,7 @@
       private ReplicationDriver replicationWorkDriver;
       private WorkDriver replicationWorkAssigner;
       RecoveryManager recoveryManager = null;
    +  private AtomicLong replicationLatency = new AtomicLong(0l);
    --- End diff --
   
    will fix


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102777287
 
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/replication/RemoveCompleteReplicationRecords.java ---
    @@ -57,9 +58,11 @@
       private static final Logger log = LoggerFactory.getLogger(RemoveCompleteReplicationRecords.class);
     
       private Connector conn;
    +  private Master master;
     
    -  public RemoveCompleteReplicationRecords(Connector conn) {
    +  public RemoveCompleteReplicationRecords(Connector conn, Master master) {
    --- End diff --
   
    bq. Is this connector for the remote peer?
   
    No, it's for the "local" instance. This is just cleaning up stuff in the accumulo.replication table on the "source" instance.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102778301
 
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/metrics/Metrics2ReplicationMetrics.java ---
    @@ -58,6 +59,7 @@ protected void snapshot() {
         registry.add(PENDING_FILES, getNumFilesPendingReplication());
         registry.add(NUM_PEERS, getNumConfiguredPeers());
         registry.add(MAX_REPLICATION_THREADS, getMaxReplicationThreads());
    +    registry.add(LATENCY, getLatencyInSeconds());
    --- End diff --
   
    You're collecting the "maximum" latency in the system which isn't ideal as it paints a very skewed picture of the actual system.
   
    For example, if you have a single file which cannot be replicated (say, it's corrupted) it would have a very long latency. The rest of the files may be replicating in a (hypothetically) 5seconds. The maximum latency would continue to increase without bound, indicating to an operator that the system is not healthy, when the system is actually 99% healthy.
   
    Hadoop metrics2 exposes a histogram class for metrics which automatically aggregates min, median, avg, max as well as percentiles (e.g. p75, p85, p95, p98, p99).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102778787
 
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/replication/RemoveCompleteReplicationRecords.java ---
    @@ -197,6 +200,7 @@ protected long removeRowIfNecessary(BatchWriter bw, SortedMap<Key,Value> columns
         mutations.add(m);
         for (Entry<String,Long> entry : tableToTimeCreated.entrySet()) {
           log.info("Removing order mutation for table {} at {} for {}", entry.getKey(), entry.getValue(), row.toString());
    +      collectLatency(entry.getValue());
    --- End diff --
   
    Building on the point above: you would want to collect the latencies of each outstanding file to be replicated, likely not within the execution path of the Replication internals (metrics should not affect the latency of the system).
   
    It would be better to pull this logic out of `RCRR`, and add it into the Metrics2ReplicationMetrics, enumerating the files and computing the histogram of metrics during the metrics collection cycle.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102779143
 
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/replication/RemoveCompleteReplicationRecords.java ---
    @@ -216,4 +220,13 @@ protected long removeRowIfNecessary(BatchWriter bw, SortedMap<Key,Value> columns
     
         return recordsRemoved;
       }
    +
    +  /*
    +   * Metrics calls snapshot. Largest time stamp will be found for each snapshot then reset.
    +   */
    +  private void collectLatency(Long createdTime) {
    +    long latency = System.currentTimeMillis() - createdTime;
    --- End diff --
   
    > I am not sure, but I think this createdTime may come from a persisted source
   
    I believe createdTime comes from the TabletServer, likely calling currentTimeMillis(). As long as we're reporting non-negative latencies, I think it will be fine (e.g. take the max of 0 and the computed latency).
   
    The createdTime is coming from a remote JVM (the tserver), so I dont' think there's a need to use nanoTime here for the monotonically increasing time. We can just fudge it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

ctubbsii
In reply to this post by ctubbsii
Github user keith-turner commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102783229
 
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/replication/RemoveCompleteReplicationRecords.java ---
    @@ -216,4 +220,13 @@ protected long removeRowIfNecessary(BatchWriter bw, SortedMap<Key,Value> columns
     
         return recordsRemoved;
       }
    +
    +  /*
    +   * Metrics calls snapshot. Largest time stamp will be found for each snapshot then reset.
    +   */
    +  private void collectLatency(Long createdTime) {
    +    long latency = System.currentTimeMillis() - createdTime;
    --- End diff --
   
    I like the idea of taking the max of 0 and computed latency.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo issue #220: ACCUMULO-4591 replication latency metrics added

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on the issue:

    https://github.com/apache/accumulo/pull/220
 
    > Does anyone know of a place where individual metrics are documented? I looked in the user manual and did not see anything.
   
    I don't think we have this broken out into the user manual. It's very ad-hoc. We could experiment with lifting all of the metrics keys into an interface, creating a description for each, and then automate the addition of these into the user manual like the configuration properties work presently.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

ctubbsii
In reply to this post by ctubbsii
Github user noedetore commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102800222
 
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/metrics/Metrics2ReplicationMetrics.java ---
    @@ -58,6 +59,7 @@ protected void snapshot() {
         registry.add(PENDING_FILES, getNumFilesPendingReplication());
         registry.add(NUM_PEERS, getNumConfiguredPeers());
         registry.add(MAX_REPLICATION_THREADS, getMaxReplicationThreads());
    +    registry.add(LATENCY, getLatencyInSeconds());
    --- End diff --
   
    Screen Shot 2017-02-23 at 9.31.06 AM shows the spike but still helpful.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

ctubbsii
In reply to this post by ctubbsii
Github user ctubbsii commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102801086
 
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/replication/RemoveCompleteReplicationRecords.java ---
    @@ -216,4 +220,13 @@ protected long removeRowIfNecessary(BatchWriter bw, SortedMap<Key,Value> columns
     
         return recordsRemoved;
       }
    +
    +  /*
    +   * Metrics calls snapshot. Largest time stamp will be found for each snapshot then reset.
    +   */
    +  private void collectLatency(Long createdTime) {
    +    long latency = System.currentTimeMillis() - createdTime;
    --- End diff --
   
    I guess... but there's really no reason to fudge when there's a built-in API for this in the JVM. I realize we're still using currentTimeMillis elsewhere, but I'm trying to encourage us to move away from that, for correctness.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

ctubbsii
In reply to this post by ctubbsii
Github user ctubbsii commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102801369
 
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/replication/RemoveCompleteReplicationRecords.java ---
    @@ -216,4 +220,13 @@ protected long removeRowIfNecessary(BatchWriter bw, SortedMap<Key,Value> columns
     
         return recordsRemoved;
       }
    +
    +  /*
    +   * Metrics calls snapshot. Largest time stamp will be found for each snapshot then reset.
    +   */
    +  private void collectLatency(Long createdTime) {
    +    long latency = System.currentTimeMillis() - createdTime;
    --- End diff --
   
    Definitely don't use nanoTime across processes. It's only guaranteed that `(newtime - oldtime) > 0` within the same process.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
Reply | Threaded
Open this post in threaded view
|

[GitHub] accumulo pull request #220: ACCUMULO-4591 replication latency metrics added

ctubbsii
In reply to this post by ctubbsii
Github user joshelser commented on a diff in the pull request:

    https://github.com/apache/accumulo/pull/220#discussion_r102804050
 
    --- Diff: server/master/src/main/java/org/apache/accumulo/master/metrics/Metrics2ReplicationMetrics.java ---
    @@ -58,6 +59,7 @@ protected void snapshot() {
         registry.add(PENDING_FILES, getNumFilesPendingReplication());
         registry.add(NUM_PEERS, getNumConfiguredPeers());
         registry.add(MAX_REPLICATION_THREADS, getMaxReplicationThreads());
    +    registry.add(LATENCY, getLatencyInSeconds());
    --- End diff --
   
    Yes, not saying that having the maximum isn't helpful, but it paints a very skewed picture.
   
    We have the primitives to collect and display metrics which paint a complete picture. I really think that we should use them instead of only displaying the maximum.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [hidden email] or file a JIRA ticket
with INFRA.
---
12