Accumulo on Azure / WebHDFS

James Hughes
Hi all,

I know folks have asked about Accumulo on S3 before (1).  

Has anyone tried running Accumulo on Azure's blob storage or data lake solutions (2)?  (Or perhaps more generally, has anyone tried Accumulo on WebHDFS?)

As more background, I have deployed Accumulo on HDP clouds in Azure, and that works great.  I'm interested in using the blob / data lake storage for benefits with scaling, etc.

Thanks in advance,

Jim


Re: Accumulo on Azure / WebHDFS

Josh Elser
Hi Jim,

I can say that Accumulo will work on Azure's blob store and their data
lake store. This is based on testing I'm involved with at
Hortonworks (day job). I know that these filesystems are tested to an
appropriate degree, showing that they do provide the things that
Accumulo needs.

As a refresher, the things we need from a filesystem are: performance
(Accumulo's write performance is pretty dominated by I/O) and
durability guarantees (when we call sync() on a file, the data we just
wrote better be there).
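
As a rough illustration of that sync() contract, here is a minimal Python sketch against a local file. It is only an analogy: Hadoop output streams expose `hflush()`/`hsync()` rather than `os.fsync`, and the helper name `append_durably` and the file name `wal.log` are made up for this example, not taken from Accumulo.

```python
import os
import tempfile


def append_durably(path: str, record: bytes) -> int:
    """Append one record; return the file size once the OS reports it persisted."""
    with open(path, "ab") as log:
        log.write(record)
        log.flush()              # push Python's userspace buffer down to the OS
        os.fsync(log.fileno())   # ask the OS to persist the bytes to the device
        return os.fstat(log.fileno()).st_size


def read_all(path: str) -> bytes:
    """Re-open the file independently, as a recovering process would."""
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    wal = os.path.join(tmp, "wal.log")   # hypothetical write-ahead log file
    end1 = append_durably(wal, b"mutation-1\n")
    end2 = append_durably(wal, b"mutation-2\n")
    contents = read_all(wal)
```

The point of the sketch is the ordering: the writer does not acknowledge a record until the sync call has returned, so a reader that opens the file afterwards must see every acknowledged record. A store that returns from sync without actually persisting the data breaks that assumption.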

For WebHDFS, I think performance would suffer, and I would
be surprised if it actually provided the durability guarantees. My
understanding is that WebHDFS is meant more to give non-Java clients
easy access to HDFS (as a one-off) than to act as a fully-fledged
access layer.
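
To make the WebHDFS point concrete: in the WebHDFS REST API every file operation is an HTTP request of the form `/webhdfs/v1/<path>?op=...` (for example `OPEN` via GET, `CREATE` via PUT, `APPEND` via POST). The sketch below only builds such URLs and sends nothing; the host name, port, and file path are hypothetical, and `webhdfs_url` is a helper invented for this example.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, port: int, path: str, op: str, **params: str) -> str:
    """Build the REST URL for a single WebHDFS operation on an absolute path."""
    query = urlencode({"op": op, **params})
    # quote() leaves "/" unescaped by default, so the HDFS path stays readable.
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# Hypothetical NameNode address and log path, for illustration only.
append_url = webhdfs_url("namenode.example.com", 50070, "/accumulo/wal/log1", "APPEND")
create_url = webhdfs_url(
    "namenode.example.com", 50070, "/accumulo/wal/log1", "CREATE", overwrite="false"
)
```

Per the WebHDFS documentation, `CREATE` and `APPEND` are two-step calls: the first request is answered with a redirect to a DataNode, and the data goes in a second request. Paying that round trip for every small append is one plausible reason it would be a poor fit for a write-heavy log.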

- Josh

On Fri, Apr 14, 2017 at 10:16 AM, James Hughes <[hidden email]> wrote:

> Hi all,
>
> I know folks have asked about Accumulo on S3 before (1).
>
> Has anyone tried running Accumulo on Azure's blob storage or data lake
> solutions (2)?  (Or perhaps more generally, has anyone tried Accumulo on
> WebHDFS?)
>
> As more background, I have deployed Accumulo on HDP clouds in Azure, and
> that works great.  I'm interested in using the blob / data lake storage for
> benefits with scaling, etc.
>
> Thanks in advance,
>
> Jim
>
> 1. http://apache-accumulo.1065345.n5.nabble.com/Accumulo-on-s3-td16737.html
> 2. https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-integrate-with-other-services

Re: Accumulo on Azure / WebHDFS

James Hughes
Hi Josh,

Thanks!  Sounds like Azure's offerings provide better performance and sync() behavior than S3?  (I.e., is S3 still a no-go for Accumulo?)

Your description of WebHDFS makes total sense.  I figured there was an outside chance that WebHDFS handled or worked around the limitations of S3, etc.

Cheers,

Jim


Re: Accumulo on Azure / WebHDFS

Josh Elser
As I understand it, S3 is still a non-starter.

Long term, Amazon may provide more features to fix the sync issue. Or someone could modify Accumulo to support putting RFiles on S3 exclusively.

Happy to expand on this further if you're curious.



Re: Accumulo on Azure / WebHDFS

James Hughes
Hi Josh,

Thanks again!  

As a follow-up, is any of the information about Accumulo on WASB or ADL public?  I suppose I'm curious about configuration (is it just plug-and-play?) and performance.

Thanks in advance,

Jim


Re: Accumulo on Azure / WebHDFS

Josh Elser
I don't have any performance numbers handy. I'm not sure if
Microsoft or the Azure team publishes them.

In general, my understanding is that each of them is intended to be
a "drop-in replacement". There might be some implementation-specific
configuration (e.g. account/billing), but that's it.


Re: Accumulo on Azure / WebHDFS

James Hughes
Thanks.  Can you say if the performance is on par with a cloud you might otherwise spin up?

In terms of the drop-in bits, is it as easy as setting 'instance.volumes' to point at the new URL?

Thanks!
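
For readers wondering what such a setting might look like, here is a hedged Python sketch that renders an `instance.volumes` property in the Hadoop-style XML used by `accumulo-site.xml`. The thread does not confirm that this alone is sufficient; the container, account, and path names are hypothetical, and the set of accepted URI schemes is an assumption made for this example rather than a list from Accumulo.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Assumption for this sketch only: schemes we choose to accept as volume URIs.
ACCEPTED_SCHEMES = {"hdfs", "wasb", "wasbs", "adl"}


def volumes_property(uris: list[str]) -> str:
    """Render an instance.volumes <property> element from a list of volume URIs."""
    for uri in uris:
        parts = urlsplit(uri)
        if parts.scheme not in ACCEPTED_SCHEMES or not parts.netloc:
            raise ValueError(f"not an accepted volume URI: {uri!r}")
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "instance.volumes"
    # Multiple volumes are given as a comma-separated list.
    ET.SubElement(prop, "value").text = ",".join(uris)
    return ET.tostring(prop, encoding="unicode")


# Hypothetical WASB container and storage account names.
snippet = volumes_property(
    ["wasb://mycontainer@myaccount.blob.core.windows.net/accumulo"]
)
```

Any storage-account credentials or other store-specific settings would still have to be supplied through the Hadoop configuration; this sketch does not cover them.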
