adding HDFS erasure coding awareness to Accumulo

Seidl, Ed
Hi All.

Following up on a conversation with Christopher on the Slack channel, what follows is a modest proposal to make hosting Accumulo tables on erasure-coded HDFS directories easier. This post turned out to be pretty long…if you already know what erasure coding in HDFS is about, skip down a page to the paragraph that starts with “Sorry”.

First things first, a brief intro to erasure coding (EC).  As we all know, by default HDFS achieves durability via block replication.  Usually the replication level is set to 3, so the resulting disk overhead for reliability is 200%.  (Yes, there are also performance benefits to replication when disk locality can be exploited, but I'm going to ignore that for now).  Hadoop 3.0 introduced EC as a better way to achieve durability.  More info can be found here (https://hadoop.apache.org/docs/r3.2.0/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html).  EC behaves much like RAID...for M blocks of data, N blocks of parity are generated, from which the original data can be recovered in the event of failures.  A typical EC scheme is Reed-Solomon 6-3 (RS-6-3), where 6 blocks of data produce 3 blocks of parity, an overhead of only 50%.  In addition to the factor-of-2 increase in usable disk space, RS-6-3 is also more fault tolerant...the loss of any 3 blocks can be tolerated, whereas triple replication can only lose 2 copies of a block.
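
To make the arithmetic concrete, here is a small illustrative helper (my own sketch, not from any codebase):

    // Storage overhead = redundant blocks / data blocks
    static double overheadPercent(int dataBlocks, int redundantBlocks) {
      return 100.0 * redundantBlocks / dataBlocks;
    }

    // overheadPercent(1, 2) == 200.0  -> 3x replication: 1 copy plus 2 extras
    // overheadPercent(6, 3) == 50.0   -> RS-6-3: raw-to-usable ratio drops from
    //                                    3.0 to 1.5, i.e. twice the usable space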

More storage, better resiliency, so what's the catch?  One concern with using EC is the time spent calculating the parity blocks.  Unlike the default replication write path, where a client writes a block and the datanodes then take care of replicating it, an EC HDFS client is responsible for computing the parity and sending it to the datanodes.  This increases the CPU and network load on the client.  The CPU hit can be mitigated through the use of the Intel ISA-L library, but only on CPUs that support AVX.  (See https://www.slideshare.net/HadoopSummit/debunking-the-myths-of-hdfs-erasure-coding-performance and https://blog.cloudera.com/introduction-to-hdfs-erasure-coding-in-apache-hadoop/ for some interesting claims).  In our testing, we've found that sequential writes to an EC-encoded directory can be as much as 3 times faster than to a replicated directory, and reads are up to 2 times faster.

Another side effect of EC is the loss of data locality.  For performance reasons, EC data blocks are striped, so multiple datanodes must be contacted to read a single block of data.  For large sequential reads this doesn't appear to be much of a problem, but it can be an issue for small random lookups.  For the latter case, we've found that using 64K stripes (rather than the default 1M) can mitigate some of the random lookup pain without compromising sequential read/write performance.

In terms of Accumulo performance, we don't see a dramatic difference in scan performance between EC and replication.  This is due to the fact that our tables are gzip compressed, so the time to decompress and then do the deserialization and other key management tasks far outweighs the actual disk I/O times.  The same seems to be the case for batch writing of RFiles.  Here are the results for a test I did using 40 Spark executors, each writing 10M rows to RFiles in RS-6-3 and replicated directories:

        EC63 (min)  REPL (min)  size (GB)
gz          2.7         2.7        21.3
none        2.0         3.0       158.5
snappy      1.6         1.6        38.4

The only time EC makes a difference here is when compression is turned off...then the write speed is 50% faster with EC.  (It's interesting to note one nice trade-off that EC makes possible...using faster Snappy compression uses 2X the disk space, but if EC is used, then you get that 2x back).
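
For anyone curious about the shape of the write test, here's a minimal sketch using Accumulo 2.0's public RFile API (the output path, row count, and value size are placeholders; if the destination directory carries an EC policy, the file is erasure coded automatically):

    import java.util.Collections;
    import org.apache.accumulo.core.client.rfile.RFile;
    import org.apache.accumulo.core.client.rfile.RFileWriter;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.Text;

    public class EcRFileWriteSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (RFileWriter writer = RFile.newWriter()
            .to("/ectest/data.rf")                       // hypothetical EC directory
            .withFileSystem(fs)
            .withTableProperties(Collections.singletonMap(
                "table.file.compress.type", "snappy"))   // gz | snappy | none
            .build()) {
          for (int i = 0; i < 10_000_000; i++) {
            // RFiles require sorted keys; zero-padded rows stay sorted
            writer.append(new Key(new Text(String.format("row%08d", i))),
                new Value(new byte[100]));
          }
        }
      }
    }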

As noted above, random access times can be a problem with EC.  This impacts Accumulo's ability to randomly fetch single rows of data.  In a test of 16 Spark executors doing random seeks in batches of 10 into a table with 10B rows, we found the following latencies per row:

            latency per row (ms)
              min  max  avg
RS-10-4-1M      4  214   11
RS-6-3-1M       3  130    7
RS-6-3-64K      2  106    4
replication     1   73    2
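
For reference, the lookups were roughly this shape (a sketch that assumes an existing AccumuloClient and a table named "testtable"; timing code omitted):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;
    import org.apache.accumulo.core.client.AccumuloClient;
    import org.apache.accumulo.core.client.BatchScanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;

    static void randomBatch(AccumuloClient client, Random rand) throws Exception {
      // A batch of 10 random single-row lookups into a 10B-row table
      List<Range> ranges = new ArrayList<>();
      for (int i = 0; i < 10; i++) {
        long row = Math.floorMod(rand.nextLong(), 10_000_000_000L);
        ranges.add(Range.exact(String.format("row%010d", row)));
      }
      try (BatchScanner scanner =
          client.createBatchScanner("testtable", Authorizations.EMPTY, 10)) {
        scanner.setRanges(ranges);
        for (Map.Entry<Key,Value> entry : scanner) {
          // drain results; a real test would record per-row latency here
        }
      }
    }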

One big gotcha that was not immediately evident is this:  the current HDFS EC implementation does not support hflush() or hsync(); both are silent no-ops. I discovered this the hard way when we had an unexpected power outage in our data center.  I had been using EC for the write-ahead log directories, and Accumulo was in the midst of updating the metadata table when the power went out.  Because hflush() returned without actually writing anything to disk, we lost some writes to the metadata table, which left all of our tables unreadable.  Thankfully it was a dev system, so the loss wasn't a big deal (plus the data was still there, and I recovered what I needed by re-importing the RFiles).  The good news is that EC is not all-or-nothing; it is applied per directory, with children inheriting their parent's policy at creation time.  So moving forward, we keep our WALs and the accumulo namespace in replicated directories, but use EC for the rest of our table data.
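
Given that gotcha, a defensive startup check along these lines might be worthwhile (a sketch against the Hadoop 3.x API, not existing Accumulo code):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.ErasureCodingPolicy;

    static void requireReplicated(DistributedFileSystem dfs, Path walDir) throws Exception {
      // getErasureCodingPolicy() returns null for directories using plain replication
      ErasureCodingPolicy policy = dfs.getErasureCodingPolicy(walDir);
      if (policy != null) {
        throw new IllegalStateException(walDir + " has EC policy " + policy.getName()
            + "; WAL directories need replication so hsync() actually persists data");
      }
    }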

Sorry if that was a TL;DR...now to my proposal.  As it stands now, EC is controlled at the HDFS directory level, so the only way to turn it on for Accumulo is to use the "hdfs ec" command to set the policy on individual directories under /accumulo/tables. So, to create a table with EC, the steps are 1) create the table, 2) look up its table ID, 3) use the command line to set the policy for /accumulo/tables/<ID>.  What I propose is to add a per-table/namespace property “table.encoding.policy”.  Setting the policy on a namespace would ensure all tables subsequently created in that namespace inherit it, but it could then be overridden on individual tables if need be.  One caveat here is that changing the policy on an existing directory does not change the policy on the files within it...the data needs to be rewritten to pick up the new policy.  Thus, changing the encoding policy for an existing table would require a major compaction to rewrite all of its RFiles.
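
For reference, step 3 boils down to a single call via the Hadoop 3.x Java API (the table ID "2a" is made up; RS-6-3-1024k is one of the built-in policy names):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    static void applyEcPolicy() throws Exception {
      // Equivalent to: hdfs ec -setPolicy -path /accumulo/tables/2a -policy RS-6-3-1024k
      DistributedFileSystem dfs = (DistributedFileSystem)
          new Path("hdfs:///").getFileSystem(new Configuration());
      dfs.setErasureCodingPolicy(new Path("/accumulo/tables/2a"), "RS-6-3-1024k");
    }

This is what a table.encoding.policy property would automate whenever a table directory is created.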

And while we're adding encoding policy, I thought it would also be good to be able to specify the HDFS storage policy (HOT, COLD, WARM, etc.), which is also set on a per-directory basis. So, a second property “table.storage.policy” is proposed (but I should note that my humble cluster won’t allow me to test this beyond setting the policy on the directories…I don’t have tiered storage to see if it actually makes a difference).
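
At the directory level this is one more line in the applyEcPolicy() sketch above (the equivalent of "hdfs storagepolicies -setStoragePolicy"):

    // Any HDFS storage policy name (HOT, WARM, COLD, ALL_SSD, ...) works here
    dfs.setStoragePolicy(new Path("/accumulo/tables/2a"), "COLD");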

Both the encoding and storage policies can be enforced by adding another mkdirs() method to org.apache.accumulo.server.fs.VolumeManager that takes the path, storage policy, and encoding policy as arguments, along with a helper function checkDirPolicies().  This would require changes to many of the operations under master/tableOps that call mkdirs(). Some changes to org.apache.accumulo.tserver.tablet.Tablet will also be needed to account for directory creation during splits, as well as for detecting property changes.
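
Roughly, the additions would be signatures like these (a sketch of the proposal, not committed API):

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;

    // Proposed overload alongside the existing VolumeManager.mkdirs(Path);
    // either policy may be null to mean "inherit from the parent directory"
    boolean mkdirs(Path dir, String storagePolicy, String ecPolicyName) throws IOException;

    // Verify an existing directory carries the expected policies, e.g. after a
    // property change or for directories created by tablet splits
    void checkDirPolicies(Path dir, String storagePolicy, String ecPolicyName) throws IOException;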

I have this implemented in Accumulo 2.0, and can share patches if there is interest.  I'm worried that the approach I took is a little too HDFS-specific, so some thought would have to go into structuring the HDFS implementation so it doesn't become a burden down the road should another filesystem implementation be added.

I welcome any thoughts, suggestions or not-too-barbed criticisms :)  Thanks for reading.

Ed
Re: adding HDFS erasure coding awareness to Accumulo

Keith Turner
On Thu, Sep 5, 2019 at 7:25 PM Seidl, Ed <[hidden email]> wrote:

> [...]
> I have this implemented in Accumulo 2.0, and can share patches if there is interest.  I'm worried that the approach I took is a little too HDFS specific, so for sure some thought would have to go into how to modify the HDFS implementation without adding burden down the road should another filesystem implementation be added.

One possible way to handle this is to prefix the properties with
table.hdfs, like table.hdfs.storage.policy and
table.hdfs.encoding.policy.  This implies the settings are specific to
HDFS and are ignored when not using HDFS.  We have some existing
per-table HDFS properties that should be considered as a whole with
these new properties.  We would also need to think through what impact,
if any, the properties may have on the volume chooser.
