Processing Large Data Sets

We are processing large data sets on commodity hardware and are looking at how we would do the same on the NVIDIA platform. The journey starts today, so any pointers on getting started would be appreciated.

Currently, we process 500M events per day, aggregating data for reports and insights using AWS and technologies such as Athena, S3, etc. This works well for us, but we need to start investigating alternative solutions to handle 10x the data.

Any pointers would be appreciated.

Hey @jima! Can you explain a bit more about what tools you use for your ETL pipeline? There may be a few options that will scale really well within the RAPIDS ecosystem. Happy to dive in a bit more once I better understand your workflow. For example, are you using pandas or scikit-learn, or are you using Kafka? What are your source data formats (JSON, CSV, Parquet)? Looking forward to your reply!

Tyder,

Thanks for your reply.

So we have hourly gzipped JSON files on S3. We need to process a day's worth of data to provide a bunch of insights and reports to our stakeholders. We first process all the hours in parallel and then run a reduce step to produce the final aggregated result set. Everything is done using streaming-style processing in Node.js. The process currently takes around 15-20 minutes to complete.

Reach out if you need more information.

Please do check out RAPIDS and BlazingSQL (built on RAPIDS) for out-of-core processing, given your data sizes. This can be helpful in a few ways beyond just speeding up your ETL as you scale out. It would be good to clarify what your reduce step entails. From what I have read so far, it seems like you are performing aggregations. In that case, our groupby aggregations and join operations are extremely performant, especially at scale.
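To make the groupby/join point concrete, here is a minimal cuDF sketch of the kind of reduce step you describe; the column names (customer_id, event_type, value) are hypothetical placeholders, not from your data:

# Minimal sketch: GPU groupby aggregation plus a join with cuDF.
# Column names here are hypothetical placeholders.
import cudf

events = cudf.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "event_type": ["click", "view", "click", "click", "view"],
    "value": [1.0, 2.5, 0.5, 3.0, 1.5],
})

# Several aggregations computed on the GPU in one pass.
daily_agg = (
    events.groupby(["customer_id", "event_type"])["value"]
          .agg(["sum", "count", "mean"])
          .reset_index()
)

# Joins are GPU-accelerated too, e.g. enriching the aggregate with metadata.
customers = cudf.DataFrame({"customer_id": [1, 2, 3],
                            "region": ["us-east", "us-west", "eu"]})
report = daily_agg.merge(customers, on="customer_id", how="left")
print(report)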

ETL- I think this is a similar use case (video from Walmart). You may get exactly what you seek. It may take a little rework on your side, as RAPIDS is in Python, but the results may be worth it. We have both JSON readers as well as streaming capability with cuStreamz.
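If it helps, here is a rough sketch of what reading your hourly gzipped JSON could look like, assuming the files are line-delimited JSON; the S3 paths are hypothetical, reading straight from S3 needs s3fs installed, and the exact reader options can vary a bit between RAPIDS versions:

# Rough sketch: reading gzipped, line-delimited JSON into cuDF / dask-cuDF.
# Paths below are hypothetical; s3fs is assumed for s3:// URLs.
import cudf
import dask_cudf

# One hourly file straight onto the GPU.
hour_df = cudf.read_json(
    "s3://your-bucket/events/2021-01-01/hour=00.json.gz",
    lines=True,
    compression="gzip",
)

# A full day of hourly files, partitioned and processed in parallel;
# this plays the role of your per-hour map step.
day_df = dask_cudf.read_json(
    "s3://your-bucket/events/2021-01-01/hour=*.json.gz",
    lines=True,
    compression="gzip",
)

# The reduce step then becomes a distributed groupby.
daily_counts = day_df.groupby("event_type").size().compute()
print(daily_counts)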

Interactive visual EDA and specialist tools- For insights, RAPIDS allows some interesting “in process, interactive visualizations”. Check out this notebook I made analyzing taxi data, this video on how we used it to build out a specialist tool for genomics, and this web app dashboard tutorial with census data for COVID.
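As a taste of the EDA side, here is a tiny cuxfilter sketch; the DataFrame and column names are made up purely for illustration:

# Tiny sketch: an interactive cuxfilter dashboard over a cuDF DataFrame.
# The data and column names are made up for illustration.
import cudf
import cuxfilter
from cuxfilter import charts

gdf = cudf.DataFrame({
    "hour": list(range(24)) * 4,
    "event_count": [i * 10 for i in range(96)],
})

cux_df = cuxfilter.DataFrame.from_dataframe(gdf)
dashboard = cux_df.dashboard([charts.bar("hour"), charts.bar("event_count")])
dashboard.show()  # serves a cross-filtered, interactive dashboard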

Thanks for the help. We are definitely looking at changing the complete stack, from adopting Arrow and Parquet to improving processing times around aggregations and filters. I’ll definitely follow up on all the links you provided. In terms of language, we are looking at both Python and Rust, so we are open to any suggestions. We are looking at running all of this in the AWS cloud, but I also believe that NVIDIA is looking at launching a GPU cloud platform.

Does RAPIDS integrate with Apache Arrow?

I had a look at the videos, and the suite of tools looks compelling. I wonder if there is anyone I could reach out to, to talk through the best approach for retrieving data from S3, writing aggregators using RAPIDS, and then pushing that data back into S3. Is this a supported use case?

Yes, RAPIDS uses Arrow as its memory format. To launch RAPIDS on AWS, you can check out these tutorials. To read files from S3, you can use df = cudf.read_csv("<link to your s3 bucket files>"), similar to what we did in this notebook, where we did:

taxi2016 = cudf.read_csv("https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2016-01.csv")
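On the Arrow question specifically, moving between cuDF and Arrow tables is straightforward; a small sketch (the data and file name are just examples):

# Small sketch: cuDF <-> Apache Arrow interop.
import cudf
import pyarrow.parquet as pq

gdf = cudf.DataFrame({"customer_id": [1, 2, 3], "value": [10.0, 20.5, 7.25]})

table = gdf.to_arrow()                   # cuDF DataFrame -> Arrow Table (host memory)
gdf2 = cudf.DataFrame.from_arrow(table)  # Arrow Table -> cuDF DataFrame (back on the GPU)

# The Arrow table can be handed to other Arrow-aware tools, e.g. written as Parquet:
pq.write_table(table, "events_arrow.parquet")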

You can then use Python packages like s3-client or boto3 to send the post-processed data back (there are several blogs on this, such as this one).
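For writing results back to S3, here is a rough sketch with boto3; the bucket, key, and data are hypothetical, and AWS credentials are assumed to be configured in the usual way (environment variables, instance role, etc.):

# Rough sketch: write aggregated results to Parquet and upload to S3 with boto3.
# Bucket and key names are hypothetical; credentials come from the environment.
import boto3
import cudf

report = cudf.DataFrame({"event_type": ["click", "view"], "total": [1000, 2500]})
report.to_parquet("daily_report.parquet")

s3 = boto3.client("s3")
s3.upload_file(
    "daily_report.parquet",
    "your-results-bucket",
    "reports/2021-01-01/daily_report.parquet",
)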

Please check out our site at https://rapids.ai/. We have a wealth of great info and resources there to help you quickly get up to speed.