Processing Large Data Sets

We are processing large data sets on commodity hardware and are looking at how we would do the same on the NVIDIA platform. The journey starts today, so any pointers on getting started would be appreciated.

Currently, we process 500M events per day, aggregating data for reports and insights using AWS technologies such as Athena and S3. This works well for us, but we need to start investigating alternative solutions that can handle 10x the data.

Any pointers would be appreciated.

Hey @jima! Can you explain a bit more about what tools you use for your ETL pipeline? There may be a few options that will scale really well within the RAPIDS ecosystem. Happy to dive in a bit more once I better understand your workflow. For example, are you using Pandas, scikit-learn, or Kafka? What are your source data formats (JSON, CSV, Parquet)? Looking forward to your reply!


Thanks for your reply.

So we have hourly gzipped JSON files on S3. We need to process a day's worth of data to provide a bunch of insights and reports to our stakeholders. We first process all the hours in parallel and then run a reduce step to produce the final aggregated result set. Everything is done using streaming-style processing in Node.js. The process currently takes around 15-20 minutes to complete.
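
For readers following along in Python, the shape of that pipeline (a parallel map over the hourly files, then a reduce) can be sketched roughly as below. This is only an illustration, not the poster's Node.js code: it assumes newline-delimited JSON inside each gzip file and a hypothetical "type" field on every event.

```python
import gzip
import json
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def aggregate_hour(path):
    """Map step: count events per type in one gzipped JSON-lines file."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:  # stream line by line instead of loading the file
            if line.strip():
                event = json.loads(line)
                counts[event["type"]] += 1  # "type" is a hypothetical field name
    return counts


def aggregate_day(paths, max_workers=8):
    """Run the map step for every hourly file in parallel, then reduce."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(aggregate_hour, paths))
    total = Counter()
    for partial in partials:  # reduce step: merge the hourly counters
        total.update(partial)
    return total
```

A thread pool is used here for brevity; a process pool may be a better fit if JSON parsing turns out to be the bottleneck.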

Reach out if you need more information.

Please do check out RAPIDS and BlazingSQL (built on RAPIDS) for out-of-core processing, given your data sizes. This can help in a few ways beyond just speeding up your ETL as you scale out. It would be good to clarify what your reduce step entails. From what I have read so far, it seems like you are performing aggregations. In that case, our groupby aggregations and join operations are extremely performant, especially at scale.

ETL: I think this is a similar use case (video from Walmart). You may get exactly what you seek. It may take a little rework on your side, as RAPIDS is in Python, but the results may be worth it. We have both JSON readers and streaming capability with cuStreamz.

Interactive visual EDA and specialist tools: For insights, RAPIDS allows some interesting “in process, interactive visualizations”. Check out this notebook I made analyzing taxi data, this video on how we used it to build out a specialist tool for genomics, and this webapp dashboard instructional with census data for COVID.

Thanks for the hand up. We are definitely looking at changing the complete stack, from adopting Arrow and Parquet to improving processing times for aggregations and filters. I’ll definitely follow up on all the links you provided. In terms of language, we are looking at both Python and Rust, so we are open to any suggestions. We are looking at running all this in the AWS cloud, but I also believe that NVIDIA is looking at launching a GPU cloud platform.

Does RAPIDS integrate with Apache Arrow?

I had a look at a few videos and the suite of tools looks compelling. I wonder if there is anyone I could reach out to in order to talk through the best approach for retrieving data from S3, writing aggregators using RAPIDS, and then pushing that data back into S3. Is this a supported use case?

Yes, RAPIDS uses Arrow as its memory format. To launch RAPIDS on AWS, you can check out these tutorials. To read S3 files, the best approach is to use df = cudf.read_csv("<link to your s3 bucket files>"), similar to what we did in this notebook, where we did:

taxi2016 = cudf.read_csv("")

You can then use Python packages like s3-client or boto3 to send the post-processed data back (there are several blogs on this, such as this one).
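
Putting the two steps together, a round trip might look roughly like the sketch below. Treat it as an assumption-laden outline rather than a tested recipe: the column names "event_type" and "event_id", the local file name, and both S3 URIs are hypothetical, and it assumes a GPU machine with cudf and boto3 installed and AWS credentials configured. The two libraries are imported inside the function so the small URI helper can be used on its own.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def aggregate_and_upload(source_uri, dest_uri, local_path="daily_counts.parquet"):
    """Read a CSV from S3 with cuDF, aggregate on the GPU, upload the result."""
    import boto3  # lazy imports: only needed when this function is called
    import cudf

    events = cudf.read_csv(source_uri)
    # Hypothetical columns: count event_id values per event_type.
    counts = events.groupby("event_type").agg({"event_id": "count"}).reset_index()
    counts.to_parquet(local_path)  # write the aggregate to local disk first

    bucket, key = split_s3_uri(dest_uri)
    boto3.client("s3").upload_file(local_path, bucket, key)
```

Writing the aggregate to a local Parquet file before uploading keeps the upload step independent of how the data was produced.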

Please check out our site. We have a wealth of great info and resources there to help you quickly get up to speed.

Lacking CPU, your program runs slower; lacking memory, your program crashes. But you can process larger-than-RAM datasets in Python, as you’ll learn in the following series of articles.

Code structure
Copying data is wasteful, mutating data is dangerous
Copying data wastes memory, and modifying or mutating data can lead to bugs. Learn how to implement a compromise between the two in Python: hidden mutability.
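
As a rough illustration of that compromise (my own minimal sketch using plain lists, not code from the linked article): the function makes exactly one copy of its input and then mutates only that private copy, so callers never see their data change.

```python
def normalize(values):
    """Return values scaled to the 0..1 range without touching the input."""
    result = list(values)            # the only copy that is made
    low = min(result)
    span = max(result) - low
    for i, value in enumerate(result):
        result[i] = value - low      # in-place update of the private copy
    if span:
        for i, value in enumerate(result):
            result[i] = value / span  # second in-place step, no new list
    return result
```

From the outside the function behaves as if it were purely functional, while internally it avoids creating a new list for each intermediate step.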

Clinging to memory: how Python function calls can increase memory use
Python will automatically free objects that aren’t being used. Sometimes function calls can unexpectedly keep objects in memory; learn why, and how to fix it.
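
A small sketch of the effect, assuming CPython's reference counting (other interpreters may free objects at different times). Buffer is a hypothetical stand-in for a large object; a weak reference is used only to observe whether it is still alive.

```python
import weakref


class Buffer:
    """Hypothetical stand-in for a large in-memory object."""

    def __init__(self, size):
        self.data = bytearray(size)


def holds_reference():
    raw = Buffer(1_000_000)
    probe = weakref.ref(raw)
    doubled = Buffer(len(raw.data) * 2)  # raw is no longer needed here...
    still_alive = probe() is not None    # ...but the local name keeps it alive
    return still_alive, len(doubled.data)


def releases_reference():
    raw = Buffer(1_000_000)
    probe = weakref.ref(raw)
    size = len(raw.data)
    del raw                              # drop the last reference early
    doubled = Buffer(size * 2)
    still_alive = probe() is not None    # the first buffer is already freed
    return still_alive, len(doubled.data)
```

In the first function both buffers exist at the same time, so peak memory is higher; deleting (or reassigning) the local name lets the first buffer be freed before the second is built.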

Massive memory overhead: Numbers in Python and how NumPy helps
Storing integers or floats in Python has a huge overhead in memory. Learn why, and how NumPy makes things better.
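
To get a feel for the overhead without installing anything, the sketch below compares one Python int object with one item of a packed standard-library array. The array module is used here as a stand-in; NumPy arrays likewise store fixed-width values contiguously. Exact object sizes depend on the interpreter and platform.

```python
import sys
from array import array

values = list(range(1000, 2000))

# Size of a single Python int object, in bytes. A list additionally
# stores a pointer to each of these objects.
per_int_object = sys.getsizeof(values[0])

# The same numbers packed as signed 64-bit integers, stored contiguously.
packed = array("q", values)
per_packed_item = packed.itemsize
```

On a typical 64-bit CPython build the int object is several times larger than the 8 bytes used per packed item.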

Too many objects: Reducing memory overhead from Python instances
Objects in Python have large memory overhead. Learn why, and what to do about it: avoiding dicts, fewer objects, and more.
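
One common way to avoid the per-instance dict is `__slots__`. The sketch below is illustrative only (the class names are made up) and compares shallow sizes, which vary between Python versions.

```python
import sys


class PointDict:
    """Regular class: attributes live in a per-instance __dict__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


class PointSlots:
    """Slotted class: fixed attribute layout, no per-instance __dict__."""

    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


def instance_size(obj):
    """Shallow size of the instance plus its attribute dict, if it has one."""
    size = sys.getsizeof(obj)
    if hasattr(obj, "__dict__"):
        size += sys.getsizeof(obj.__dict__)
    return size
```

The saving per object is modest, but it adds up when millions of small instances are kept in memory.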

Data management techniques
Estimating and modeling memory requirements for data processing
Learn how to measure and model memory usage for Python data processing batch jobs based on input size.
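
One simple way to model this, assuming peak memory grows roughly linearly with input size (an assumption you would need to check for your own job), is to measure a few runs and fit a line. The function names and the safety margin below are illustrative.

```python
def fit_linear(sizes_mb, peaks_mb):
    """Least-squares fit of: peak memory = slope * input size + intercept."""
    n = len(sizes_mb)
    mean_x = sum(sizes_mb) / n
    mean_y = sum(peaks_mb) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes_mb)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes_mb, peaks_mb))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_peak_mb(size_mb, slope, intercept, safety_margin=1.1):
    """Predicted peak memory for a new input size, padded by a margin."""
    return (slope * size_mb + intercept) * safety_margin
```

For example, measuring peak memory for a few small sample inputs and extrapolating gives a first estimate of how much RAM a full-size batch job needs before you run it.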

When your data doesn’t fit in memory: the basic techniques
You can process data that doesn’t fit in memory by using four basic techniques: spending money, compression, chunking, and indexing.

Measuring the memory usage of a Pandas DataFrame
Learn how to accurately measure memory usage of your Pandas DataFrame or Series.

Reducing Pandas memory usage #1: lossless compression
Load a large CSV or other data into Pandas using less memory with techniques like dropping columns, smaller numeric dtypes, categoricals, and sparse columns.

Reducing Pandas memory usage #2: lossy compression
Reduce Pandas memory usage by dropping details or data that aren’t as important.

Reducing Pandas memory usage #3: Reading in chunks
Reduce Pandas memory usage by loading and then processing a file in chunks rather than all at once.
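
The chunking pattern itself can be sketched with only the standard library, as below; pandas.read_csv offers a chunksize parameter that yields DataFrames in the same chunk-at-a-time fashion. The function names and the choice of a per-column count are illustrative.

```python
import csv
from collections import Counter
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any row iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def count_by_column(csv_file, column, chunk_size=10_000):
    """Aggregate one column while holding only one chunk in memory."""
    totals = Counter()
    reader = csv.DictReader(csv_file)
    for chunk in iter_chunks(reader, chunk_size):
        totals.update(row[column] for row in chunk)  # fold chunk into totals
    return totals
```

Only the running totals and the current chunk are in memory at any time, so the file can be much larger than RAM.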

Fast subsets of large datasets with Pandas and SQLite
You have a large amount of data, and you want to load only part of it into memory as a Pandas DataFrame. One easy way to do it: indexing via a SQLite database.
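
A minimal sketch of the indexing idea using the standard-library sqlite3 module, with a hypothetical events table and column names. The same connection and query could be handed to pandas.read_sql_query to get a DataFrame instead of tuples.

```python
import sqlite3


def build_indexed_db(rows, path=":memory:"):
    """Load (user_id, event_type) rows into SQLite and index by user_id."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE events (user_id INTEGER, event_type TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
    conn.commit()
    return conn


def events_for_user(conn, user_id):
    """Fetch only the rows for one user; the index avoids a full table scan."""
    cursor = conn.execute(
        "SELECT user_id, event_type FROM events WHERE user_id = ?", (user_id,)
    )
    return cursor.fetchall()
```

With a file-backed database, the full dataset stays on disk and only the requested subset is loaded.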

Loading SQL data into Pandas without running out of memory
Pandas can load data from a SQL query, but the result may use too much memory. Learn how to process data in batches, and reduce memory usage even further.
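
The batching idea can be sketched with sqlite3 and cursor.fetchmany, as below; pandas.read_sql accepts a chunksize parameter for a similar batch-at-a-time pattern. The helper names are illustrative.

```python
import sqlite3


def iter_batches(conn, query, batch_size=1000):
    """Yield lists of rows, at most batch_size rows at a time."""
    cursor = conn.execute(query)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


def sum_first_column(conn, query, batch_size=1000):
    """Sum the first column of a query result without loading every row."""
    total = 0
    for batch in iter_batches(conn, query, batch_size):
        total += sum(row[0] for row in batch)
    return total
```

Each batch can be processed and discarded before the next one is fetched, so memory use is bounded by the batch size rather than by the result size.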

Saving memory with Pandas 1.3’s new string dtype
Storing strings in Pandas can use a lot of memory, but with Pandas 1.3 you have access to a newer, more efficient option.

From chunking to parallelism: faster Pandas with Dask
Learn how Dask can both speed up your Pandas data processing with parallelization, and reduce memory usage with transparent chunking.

Reducing NumPy memory usage with lossless compression
Reduce NumPy memory usage by choosing smaller dtypes, and using sparse arrays.

NumPy views: saving memory, leaking memory, and subtle bugs
NumPy uses memory views transparently, as a way to save memory. But you need to understand how they work, so you don’t leak memory or modify data by mistake.

Loading NumPy arrays from disk: mmap() vs. Zarr/HDF5
If your NumPy array is larger than memory, you can load it transparently from disk using either mmap() or the very similar Zarr and HDF5 file formats.

The mmap() copy-on-write trick: reducing memory usage of array copies
Copying a NumPy array and modifying it doubles the memory usage. But by using the operating system’s mmap() call, you pay only for what you modify.
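
The underlying copy-on-write behavior can be seen with the standard-library mmap module and its ACCESS_COPY mode, sketched below on a plain file of bytes rather than a NumPy array. Writes land in a private in-memory copy, typically one page at a time, and the file on disk is left unchanged.

```python
import mmap


def modify_private_copy(path, offset, new_byte):
    """Change one byte in a copy-on-write mapping; the file stays untouched."""
    with open(path, "rb") as handle:
        # ACCESS_COPY: writes affect memory only and are never written back.
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_COPY) as view:
            view[offset] = new_byte   # new_byte is an int in range(256)
            return bytes(view[:16])   # first bytes as seen through the mapping
```

Unmodified regions of the mapping continue to be backed by the file, so only the parts you change cost additional memory.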

Measuring memory usage
Measuring memory usage in Python: it’s tricky!
Measuring your Python program’s memory usage is not as straightforward as you might think. Learn two techniques, and the tradeoffs between them.
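
One way to measure Python-level allocations is the standard-library tracemalloc module, sketched below. Note that this is not necessarily one of the two techniques the linked article covers, and it reports memory allocated through Python's allocators, which can differ from the resident memory reported by the operating system.

```python
import tracemalloc


def peak_allocated_bytes(func, *args, **kwargs):
    """Run func and return (result, peak bytes Python allocated while it ran)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    finally:
        tracemalloc.stop()
    return result, peak


def build_buffer(size):
    """Example workload: allocate a buffer of the given size."""
    return bytearray(size)
```

Tracing adds overhead, so it is best suited to experiments on sample inputs rather than production runs.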

Fil: a new Python memory profiler for data scientists and scientists
Fil is a Python memory profiler designed specifically for the needs of data scientists and scientists running data processing pipelines.

Debugging Python out-of-memory crashes with the Fil profiler
Debugging Python out-of-memory crashes can be tricky. Learn how the Fil memory profiler can help you find where your memory use is happening.

Dying, fast and slow: out-of-memory crashes in Python
There are many ways Python out-of-memory problems can manifest: slowness due to swapping, crashes, MemoryError, segfaults, kill -9.

Debugging Python server memory leaks with the Fil profiler
When your Python server is leaking memory, the Fil memory profiler can help you spot the buggy code.

We can reduce the error in our model, and if any categorical data is missing, we should replace the missing values with the most frequent value in the column.
Source: Data Preprocessing with Python
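
As a small illustration of that most-frequent-value rule (a plain-Python sketch that treats None as the missing marker, which is an assumption about how the data encodes gaps):

```python
from collections import Counter


def impute_most_frequent(values):
    """Replace None entries with the most frequent non-missing value."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to learn from; leave the gaps as-is
    most_common = Counter(observed).most_common(1)[0][0]
    return [most_common if v is None else v for v in values]
```

The input list is left unchanged and a new list is returned.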

