Best I/O practices when dealing with data-heavy models

Hi Forum,

I’m considering rewriting some of my ecological modelling code into a CUDA-accelerated format.

However, I’m wondering whether any speed gains would be significant given my setup; some advice would be welcome.

To sketch my setup:

Ecological model drivers usually come in some gridded format (.nc / .netcdf / .hdf) with the z-axis as time (x, y as space). These data are fed into a function (the model) and the output is either one or more vectors of length z, or some summary statistics (fewer than z values). This is a rather embarrassingly parallel problem.

However, past experience has led me to believe that if the model is rather simple, the bottleneck of this setup sits at reading/writing the data from/to disk. For example, when doing an uncertainty analysis I hardly see a speed penalty for running my model 30 times with different parameters (when not reading in the same data 30 times).

My current workflow looks like this:

  1. read a chunk of data (a, b, c, d) into memory -> this is slow!
  2. execute the function to get results ( results = f(a,b,c,d) ) -> this could be GPU based
  3. write results and free memory (writing this down I realize I could do this asynchronously; see the sketch after this list)
  4. move to the next chunk
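To make step 3 and the overlap idea concrete, here is a minimal sketch of what I imagine the chunked loop could look like with a CUDA stream and double buffering. To be clear, this is not working code from my project: read_chunk(), write_results(), model_kernel and all the sizes are placeholders/assumptions standing in for my actual netCDF I/O and model code.

// Double-buffering sketch (CUDA C++): read the next chunk on the CPU while
// the GPU processes the current one. read_chunk(), write_results() and the
// model_kernel below are simplified placeholders; sizes are illustrative.
#include <cuda_runtime.h>

constexpr int CELLS_PER_CHUNK = 1024;   // assumed number of grid cells per chunk
constexpr int Z_LEN           = 12400;  // time steps per cell (from my data)

// Placeholder: fill a pinned host buffer with one chunk of driver data (netCDF read).
void read_chunk(int /*chunk_id*/, float* /*host_buf*/) {}

// Placeholder: write one chunk of results back to disk.
void write_results(int /*chunk_id*/, const float* /*host_results*/) {}

// Placeholder model: one thread per cell, reduces the z-axis to one value.
__global__ void model_kernel(const float* in, float* out)
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell >= CELLS_PER_CHUNK) return;
    float acc = 0.0f;
    for (int z = 0; z < Z_LEN; ++z)
        acc += in[cell * Z_LEN + z];
    out[cell] = acc / Z_LEN;            // e.g. a long-term mean as summary statistic
}

int main()
{
    const int    n_chunks  = 30;                                     // illustrative
    const size_t in_bytes  = (size_t)CELLS_PER_CHUNK * Z_LEN * sizeof(float);
    const size_t out_bytes = (size_t)CELLS_PER_CHUNK * sizeof(float);

    float *h_in[2], *h_out, *d_in, *d_out;
    cudaHostAlloc((void**)&h_in[0], in_bytes, cudaHostAllocDefault); // pinned => async copies
    cudaHostAlloc((void**)&h_in[1], in_bytes, cudaHostAllocDefault);
    cudaHostAlloc((void**)&h_out,   out_bytes, cudaHostAllocDefault);
    cudaMalloc((void**)&d_in,  in_bytes);
    cudaMalloc((void**)&d_out, out_bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    read_chunk(0, h_in[0]);                                          // step 1, first chunk
    for (int i = 0; i < n_chunks; ++i) {
        cudaMemcpyAsync(d_in, h_in[i % 2], in_bytes, cudaMemcpyHostToDevice, stream);
        model_kernel<<<(CELLS_PER_CHUNK + 255) / 256, 256, 0, stream>>>(d_in, d_out); // step 2
        cudaMemcpyAsync(h_out, d_out, out_bytes, cudaMemcpyDeviceToHost, stream);
        if (i + 1 < n_chunks)
            read_chunk(i + 1, h_in[(i + 1) % 2]);                    // step 1 for the next chunk, overlapped with the GPU
        cudaStreamSynchronize(stream);
        write_results(i, h_out);                                     // step 3 (could also be made async)
    }                                                                // step 4: loop moves on

    cudaStreamDestroy(stream);
    cudaFreeHost(h_in[0]); cudaFreeHost(h_in[1]); cudaFreeHost(h_out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}

The idea would be that the CPU reads chunk i+1 from disk while the GPU works on chunk i, so the disk and the GPU are kept busy at the same time.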

So, before I take the CUDA plunge I would want to know how I can resolve the disk read/write I/O problem, and optimize it in such a way that I can keep feeding the GPU enough data.

Any help would be appreciated.

Cheers,
Koen

It is very hard to give advice without knowledge of the dimensions and thus the volume of the data, and the amount and kind of processing per data element. GPUs are used with great success in any number of models that advance physical (and possibly chemical) state in a 3-D volume over time, such as atmospheric (for weather forecasting), molecular dynamics, combustion, or astronomical models. If necessary, the data is partitioned across GPUs such that all model state can be held in the on-board memory of the GPUs (typically around 12 GB per GPU today) throughout the simulation, and these simulations typically advance for thousands or even millions of time steps.

The first task would be to profile your existing modelling flow to determine precisely where the bottlenecks are, then address the tightest bottleneck first. If you have already conclusively determined that the current processing pipeline is I/O bound, you would want to focus on speeding up I/O. This could involve an algorithmic re-design to minimize data movement and increase arithmetic density, or simply investing in faster storage (for example, change from spinning storage to SSDs, or from consumer grade SSDs to server-class SSDs). You would also want to investigate the suitability of data compression methods for the data shuffled from/to mass storage. This could involve the use of generic compression techniques, or a data/application specific compression scheme, or both.
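To make the profiling step concrete, a minimal sketch of per-stage timing of the current chunk loop is shown below; read_chunk(), run_model() and write_results() are placeholders for your actual code, and the chunk count is made up.

// Rough per-stage timing of the existing (CPU-only) chunk loop.
// read_chunk(), run_model() and write_results() stand in for the real code.
#include <chrono>
#include <cstdio>

using clk = std::chrono::steady_clock;

void read_chunk(int)    {}  // placeholder: netCDF read
void run_model(int)     {}  // placeholder: model evaluation
void write_results(int) {}  // placeholder: output write

int main()
{
    const int n_chunks = 30;                       // illustrative
    double t_read = 0.0, t_model = 0.0, t_write = 0.0;

    for (int i = 0; i < n_chunks; ++i) {
        auto t0 = clk::now(); read_chunk(i);
        auto t1 = clk::now(); run_model(i);
        auto t2 = clk::now(); write_results(i);
        auto t3 = clk::now();
        t_read  += std::chrono::duration<double>(t1 - t0).count();
        t_model += std::chrono::duration<double>(t2 - t1).count();
        t_write += std::chrono::duration<double>(t3 - t2).count();
    }
    std::printf("read %.1f s   model %.1f s   write %.1f s\n", t_read, t_model, t_write);
    return 0;
}

If the read column dominates, you are I/O bound and faster storage or compression will help; if the model column dominates, GPU acceleration is the more promising route.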

I’ve got 10 models, and the selected data (a subset of the whole) currently amounts to 66 GB (out of a total of 107 GB). So for the whole bunch this is heading towards 1 TB of input data.

Each data cell consists of 100 years of daily values of temperature, precipitation and other environmental variables (~12,400 steps along the z-axis). There are roughly 30,000 cells to be calculated.
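Back of the envelope, taking these numbers at face value: one driver variable in single precision is about 30,000 cells × 12,400 steps × 4 bytes ≈ 1.5 GB, or roughly 50 KB per cell per variable.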

The processing time per data element is rather short, but I should figure out the timing exactly (on CPU and GPU). As you mentioned, I’ll see where the true bottlenecks are, but experience has taught me that it’s mainly I/O.

I rewrote the original code to eliminate any additional reading (passing multiple parameter sets for cross-validation in one run, where I generate output for 30 parameter sets for the same data element). This basically kept my computing time acceptable (it did increase, but not by orders of magnitude), while increasing the output 30-fold. I’ll look into the I/O in more detail, and consider SSDs.

I don’t know what kind of hardware you are currently using, but individual HDDs typically offer no more than 200 MB/sec throughput, consumer-grade SSDs up to 600 MB/sec, and enterprise-level SSDs can get you up to 3-5 GB/sec. Larger capacity SSDs frequently offer higher throughput due to the highly banked nature of this kind of storage (where larger capacity often implies a higher number of banks).
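To put those numbers in perspective with the 66 GB subset you mentioned, a single sequential pass over the data works out to roughly:

66 GB / 0.2 GB/sec ≈ 330 sec (single HDD)
66 GB / 0.6 GB/sec ≈ 110 sec (consumer SSD)
66 GB / 4 GB/sec   ≈  17 sec (enterprise SSD)

Real access patterns (many small or non-contiguous reads) will be slower than these sequential best cases.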

As for data compression, as a first step you might want to investigate the possibility of storing the data elements in half precision (FP16, about 3.3 decimal digits of accuracy) as this would seem to be generally suitable for the kind of data you are handling. Examples of domain-specific compression:

Xie, Xing, and Qianqing Qin. “Fast lossless compression of seismic floating-point data.” In Proceedings of the 2009 International Forum on Information Technology and Applications (IFITA ’09), vol. 1, pp. 235-238. IEEE, 2009.

Pence, William D., R. L. White, and R. Seaman. “Optimal compression of floating-point astronomical images without significant loss of information.” Publications of the Astronomical Society of the Pacific 122, no. 895 (2010): 1065.
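Going back to the FP16 suggestion for a moment, here is a minimal sketch of that idea (CUDA C++): keep the driver data in half precision (2 bytes per value) for storage and transfers, and widen to float only for the arithmetic on the GPU. The kernel and parameter names are illustrative, and I am assuming your model can compute in single precision.

// FP16 storage sketch: store/transfer drivers as half, compute in float.
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// One-off conversion of existing float data to half, done on the GPU.
__global__ void pack_to_half(const float* in, __half* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __float2half(in[i]);             // round to nearest
}

// Model kernel that reads half-precision drivers but accumulates in float.
__global__ void model_from_half(const __half* drivers, float* result,
                                int n_cells, int z_len)
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell >= n_cells) return;
    float acc = 0.0f;
    for (int z = 0; z < z_len; ++z)
        acc += __half2float(drivers[cell * z_len + z]);  // widen per element
    result[cell] = acc / z_len;                          // e.g. a long-term mean
}

If the data are currently stored as 4-byte floats, this halves both the on-disk footprint and the transfer volume (the 66 GB subset would shrink to roughly 33 GB), at the cost of the reduced precision mentioned above.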

Thanks for the references,

It’s this kind of rather basic optimization information I need. I’m pretty adept at figuring stuff out but I lack a true CS background so I don’t know the literature well - or where to start.

Please note that I misstated the units of throughput for HDDs and consumer-grade SSDs. They are in MB/sec, not GB/sec. I have now fixed my previous post in that regard. These throughput numbers are for single devices. Combinations of devices, such as in a RAID system, may result in small multiples of the stated performance.

I have a degree in CS, but from working with many different third parties over about 25 years I can state that for optimization projects, having extensive domain knowledge and picking up the necessary bits of CS knowledge on-the-fly usually leads to much better results than having lots of CS knowledge and having to acquire domain knowledge on the fly. The best results are achieved by pairing domain specialists with CS experts for extended periods of time, leading to “cross pollination” and “synergy effects”.

As for getting started with exploring GPU acceleration, I would suggest starting by checking whether your current tools of choice already include GPU support. For example, your website seems to indicate that you use R in some of your projects, and there are CUDA bindings available for that (this is outside my area of expertise, but a quick Google search indicates that there are multiple such packages).

As for a starter GPU, assuming that you already have access to a powerful workstation, you would want to avoid going too low-end. I regularly encounter “complaints” from people who compare the performance of a very high-end CPU with a low-end GPU and do not observe much of a speedup. If you can invest around $500 for an initial investigation, you could buy a pretty fast consumer GPU (such as a GTX 980 with 4 GB of memory) that should allow you to make a good initial assessment of GPU-accelerated computing. Be aware that consumer-grade GPUs offer great throughput for single-precision operations, but are usually quite limited in their double-precision throughput. I am assuming that, based on your data (mostly environmental data from various sensors?), single-precision processing is sufficient.

I figured out the unit mismatch :). I’m dabbling in GPU stuff for a DNN proof of concept. Currently I’m running a GTX 960 4 GB (fewer cores, but still reasonable memory-wise). I was figuring out if I could do some deep learning work, or at least re-purpose some existing models.

The latter was rather successful, as I ran the MIT scene detection model to detect snow in image time series (http://phenocam.sr.unh.edu/webcam/gallery/). In other applications I’m already hitting memory limitations when running the SegNet DNN model to detect vegetation types (it works, but I need to use the GPU as a compute-only unit to access all of its memory, and then the cooling doesn’t run when the display driver isn’t loaded - workarounds exist on Linux).

So given the success of both proofs of concept (and provided I don’t fry the GPU in the process), there is probably room to upgrade to something like a GTX 980 Ti 6 GB / TITAN X.

I’m not particularly bound to R; it’s convenient for fast development but very slow, and the hooks to CUDA are still based upon functions written in C anyway. So I guess I would just migrate to C. Models in ecology (and probably climatology and physics as well) are often written in Fortran. My latest model was written in Fortran, but there doesn’t seem to be default Fortran support (only through PGI). Research has to be reproducible, so I like to keep things as ‘free’ as possible.

If I can hit a status quo on a low-end card (GPU performance == CPU performance) I would have a good starting point and be pretty happy. GPU scales better than CPU, so there would be headroom. In the meantime I’ll probably rewrite some of the memory handling, which would speed up CPU performance even if things don’t work out on the GPU.

So I think the first thing to figure out would be this more general issue of memory management and I/O speed up. Thanks for all the tips, and helping me think out loud!

Not sure what you mean by “default” Fortran support? If you are referring to a “free” (as in beer) Fortran compiler with CUDA support, I am not aware of any. NVIDIA gives out many tools for free that developers have to pay for with other processor companies, and NVIDIA’s development costs are presumably 100% subsidized by hardware sales. In that payment model it is understandable that not every niche can be covered by free tools, only the most common use cases.

I understand that Fortran still has traction in certain markets, and have some minimal working knowledge of Fortran myself, but for new, forward-looking development of numerical software I would suggest looking at C++ as it offers plenty of abstractions and from what I can see, more precise control over numerical aspects (such as the use of fused multiply-add, FMA).
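As a small illustration of the FMA point (the values below are made up just to show the mechanism):

// std::fma rounds a*b+c exactly once; the plain expression may be evaluated
// with two roundings, or contracted into an FMA, depending on the compiler
// and its flags (e.g. -ffp-contract with gcc/clang).
#include <cmath>
#include <cstdio>

int main()
{
    double a = 1.0 + 1e-8, b = 1.0 - 1e-8, c = -1.0;
    double plain = a * b + c;          // rounding behaviour is compiler-dependent
    double fused = std::fma(a, b, c);  // guaranteed single rounding
    std::printf("plain = %.17g\nfused = %.17g\n", plain, fused);
    return 0;
}

With std::fma the product a*b is not rounded before the addition, so the result of the cancellation is more accurate; C++ lets you request this explicitly rather than leaving it to the compiler.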

I’m not complaining. I meant that the compiler which comes with the CUDA framework is free, while the Fortran version is third-party and has a license cost associated with it. You always pay for the hardware (or you don’t, if you can hitch a ride on a cluster setup, as is often the case in academia). This makes C++ more accessible and reproducible, as you don’t require anything but the hardware and the code.

In most cases I use the tools that make the most sense at the time, so I’ll look at C++.