Techniques for Very Large Datasets?

I’ve got a new little project I want to work on. I will probably use CUDA to accelerate the calculations, since the project will involve some VERY large matrices (by my estimates, the matrix will take up ~250GB).

Since I’m working with an 8800GT with 512MB memory and 4GB system memory, what techniques do you guys use to break up such projects into smaller pieces for CUDA to work on, without running into the memory limitations of the card or computer? Also, there is a third constraint here, which is disk access time (since the data will have to be paged back and forth to disk, from the main memory).
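For a sense of scale: if one step needs several square tiles resident at the same time (say two inputs and one output, as in a blocked matrix multiply), the largest usable tile edge follows directly from the memory budget. Below is a minimal sketch in C. The helper name `max_tile_edge` and the three-tile assumption are made up for illustration and are not from the posts here, and in practice not all of the card's 512MB is available to your buffers.

```c
#include <assert.h>
#include <stdint.h>

/* Largest integer r with r*r <= x (binary search, so no libm is needed). */
static uint64_t isqrt_u64(uint64_t x)
{
    uint64_t lo = 0, hi = UINT32_MAX;   /* the root always fits in 32 bits */
    while (lo < hi) {
        uint64_t mid = lo + (hi - lo + 1) / 2;
        if (mid * mid <= x)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}

/* Hypothetical helper: edge length (in elements) of the largest square tile
 * such that `tiles_resident` tiles of `elem_size`-byte elements fit into
 * `budget_bytes`. */
uint64_t max_tile_edge(uint64_t budget_bytes, uint64_t elem_size,
                       uint64_t tiles_resident)
{
    assert(elem_size > 0 && tiles_resident > 0);
    return isqrt_u64(budget_bytes / (elem_size * tiles_resident));
}
```

With a 512MB budget, 4-byte floats and three resident tiles, `max_tile_edge(512ULL << 20, 4, 3)` gives an edge of 6688 elements, so a ~250GB matrix would be walked as well over a thousand such tiles.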

I thought about getting a new Core i7 box with 12GB of RAM, but obviously even that won’t make much of a difference with this much data (other than saving a few page reads/writes to disk). One partial solution for speedup, I think, would be to get an SSD or two to page the data to (since they are much faster than mechanical hard disks).

I studied numerical analysis in undergrad, but I’ve never done anything even close to this magnitude before, so any pointers (CUDA, or otherwise) would be quite helpful.

My reply is almost off-topic, but it might help with your problem.

Our company developed solvers for dense compressed matrices that can solve a dense 160K x 160K matrix on one GTX 260 and we expect to be able to solve 500K x 500K matrices on one Tesla C1060.

Please, have a look at:…af-0800200c9a66

and, in case of interest, contact me directly at

Best regards


The first thing you should do is write a pure C implementation (or C++, for that matter). Then get an idea of whether going CUDA will help at all.
Chances are, it will take you longer to fetch and store a block of data than it will take to process it. In that case, focus on your data representation on HDD and on raw HDD speed. Consider on-the-fly compression. If you ever get to the point of being compute-bound, have a look at CUDA. But don’t spend energy on it unless it will actually give you a benefit.
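One way to check which side you are on is to time the read and the computation of a single block separately. The following is only a sketch: `time_block` and `block_timing` are hypothetical names, the sum of squares is a placeholder for the real per-block work, and the timer assumes a C11 library that provides `timespec_get`.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef struct {
    double io_s;       /* seconds spent in fread */
    double compute_s;  /* seconds spent in the placeholder computation */
} block_timing;

/* Wall-clock time in seconds (C11 timespec_get). */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Reads n_elems floats from fp, then runs a placeholder computation
 * (sum of squares) over them, timing each phase separately.
 * Returns 0 on success, -1 on allocation or read failure. */
int time_block(FILE *fp, size_t n_elems, block_timing *out, double *checksum)
{
    assert(fp != NULL && out != NULL && checksum != NULL);

    float *buf = malloc(n_elems * sizeof *buf);
    if (buf == NULL)
        return -1;

    double t0 = now_seconds();
    size_t got = fread(buf, sizeof *buf, n_elems, fp);
    double t1 = now_seconds();
    if (got != n_elems) {
        free(buf);
        return -1;
    }

    double sum = 0.0;
    for (size_t i = 0; i < n_elems; i++)
        sum += (double)buf[i] * buf[i];
    double t2 = now_seconds();

    out->io_s = t1 - t0;
    out->compute_s = t2 - t1;
    *checksum = sum;   /* returned so the loop cannot be optimized away */
    free(buf);
    return 0;
}
```

Keep in mind that a file you have just written or recently read may be served from the OS page cache, which makes `io_s` look far better than a cold read from disk. Measure on a file much larger than RAM, or right after a reboot, to get a realistic number.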

It totally depends on the nature of computation. But anyway, here are my 2 cents:

Say you can break the matrix calculation into many sub-matrices that can be computed independently. Then you can overlap the disk time for bringing in the next input sub-matrix with the CUDA computation on the current one.

You need to measure both delays and pick the sub-matrix size at which they overlap as closely as possible.

This can yield a significant speedup over a CPU-only implementation.
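The overlap can be sketched on the host side with two buffers: while one block is being processed, a helper thread reads the next block into the other buffer. This is a simplified illustration, not a tuned implementation. It assumes POSIX threads, `sum_file_overlapped` and `process_block` are hypothetical names, and `process_block` is just a sum standing in for the real work (on the GPU path it would be the copy to the card plus the kernel launch).

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    FILE  *fp;
    float *dst;
    size_t want;   /* elements requested */
    size_t got;    /* elements actually read */
} read_job;

/* Thread entry point: read one block into job->dst. */
static void *read_block(void *arg)
{
    read_job *job = arg;
    job->got = fread(job->dst, sizeof(float), job->want, job->fp);
    return NULL;
}

/* Placeholder for the real per-block computation (hypothetical). */
static double process_block(const float *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Processes the whole file block by block, prefetching block i+1 on a
 * helper thread while block i is processed. Returns 0 on success. */
int sum_file_overlapped(FILE *fp, size_t block_elems, double *total)
{
    assert(fp != NULL && total != NULL && block_elems > 0);

    float *buf[2];
    buf[0] = malloc(block_elems * sizeof(float));
    buf[1] = malloc(block_elems * sizeof(float));
    if (buf[0] == NULL || buf[1] == NULL) {
        free(buf[0]);
        free(buf[1]);
        return -1;
    }

    read_job job = { fp, buf[0], block_elems, 0 };
    read_block(&job);                 /* first block: nothing to overlap yet */
    size_t n = job.got;
    int cur = 0, rc = 0;
    *total = 0.0;

    while (n > 0) {
        pthread_t tid;
        job.dst = buf[1 - cur];       /* prefetch into the idle buffer */
        job.got = 0;
        if (pthread_create(&tid, NULL, read_block, &job) != 0) {
            rc = -1;
            break;
        }
        *total += process_block(buf[cur], n);   /* runs while the read is in flight */
        pthread_join(tid, NULL);
        n = job.got;                  /* 0 at end of file ends the loop */
        cur = 1 - cur;
    }

    free(buf[0]);
    free(buf[1]);
    return rc;
}
```

Spawning one thread per block keeps the sketch short; with blocks of hundreds of megabytes the thread start-up cost is negligible next to the read. On the CUDA side the same idea is usually expressed with streams and `cudaMemcpyAsync`, which needs page-locked host buffers to actually run asynchronously.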

Also consider mmap-ing the matrix file into your application’s virtual address space to avoid fread, fwrite and the like. This way, a cudaMemcpy from the mapped matrix to the card will page-fault and bring the data in from disk. You would also need a 64-bit application, a 64-bit OS and a file system that supports files of this size.
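A minimal sketch of the mapping itself, assuming a POSIX system and a matrix stored as raw row-major floats. The names `mapped_matrix`, `map_matrix`, `unmap_matrix` and `matrix_row` are invented for this example. On Windows the corresponding calls are `CreateFileMapping` and `MapViewOfFile`, which are not shown here.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct {
    const float *data;   /* start of the mapped file */
    size_t       bytes;  /* length of the mapping */
} mapped_matrix;

/* Maps the whole file read-only. Pages are only read from disk when they
 * are first touched. Returns 0 on success, -1 on failure. */
int map_matrix(const char *path, mapped_matrix *m)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);           /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED) {
        perror("mmap");
        return -1;
    }

    m->data = p;
    m->bytes = (size_t)st.st_size;
    return 0;
}

void unmap_matrix(mapped_matrix *m)
{
    munmap((void *)m->data, m->bytes);
    m->data = NULL;
    m->bytes = 0;
}

/* Pointer to the first element of row r in a row-major matrix with
 * `cols` columns. Touching the returned memory triggers the page-in. */
const float *matrix_row(const mapped_matrix *m, size_t r, size_t cols)
{
    assert((r + 1) * cols * sizeof(float) <= m->bytes);
    return m->data + r * cols;
}
```

A pointer obtained from `matrix_row` can be handed to any routine that expects an ordinary host pointer. The OS decides which pages to keep resident, so access patterns that sweep the file sequentially will generally behave much better than random ones.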

Thanks for the responses! Yes, I’m using Vista 64-bit, so I can allocate all the memory I need (as Sarnath mentioned). I think I’m going to do as T.B. said and write the pure C implementation so I can tune the algorithm on smaller matrices that will fit into memory (say, 2GB in size). Then I’ll upscale and add code to do a blocked implementation where ~2GB chunks of the matrix are written to and read from the disk as needed. If the disk accesses make it ridiculously slow (which I imagine they will), I might try getting an 8GB or 16GB flash drive and using that as a buffer to read/write the matrix from, since I could handle the copy/overwrite of the ‘used’ chunks in the background.
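For the blocked version, the basic building block is pulling one rectangular chunk out of the on-disk matrix. A rough sketch, assuming the file holds raw row-major floats and a POSIX `fseeko` with 64-bit offsets; `read_tile` is a hypothetical helper. On Windows, `_fseeki64` would take the place of `fseeko`, since a plain `fseek` with a 32-bit `long` cannot address a file this large.

```c
#define _POSIX_C_SOURCE 200809L
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>

/* Reads a tile_rows x tile_cols tile whose top-left corner is at
 * (row0, col0) from a row-major float matrix with n_cols columns.
 * The tile is stored densely (row-major) in dst.
 * Returns 0 on success, -1 on a bad request or an I/O error. */
int read_tile(FILE *fp, size_t n_cols, size_t row0, size_t col0,
              size_t tile_rows, size_t tile_cols, float *dst)
{
    assert(fp != NULL && dst != NULL);
    if (col0 + tile_cols > n_cols)
        return -1;

    for (size_t r = 0; r < tile_rows; r++) {
        /* Byte offset of element (row0 + r, col0) in the file. */
        off_t off = (off_t)(((row0 + r) * n_cols + col0) * sizeof(float));
        if (fseeko(fp, off, SEEK_SET) != 0)
            return -1;
        if (fread(dst + r * tile_cols, sizeof(float), tile_cols, fp) != tile_cols)
            return -1;
    }
    return 0;
}
```

Note that this does one seek per tile row, which is expensive on a mechanical disk. If the access pattern is known up front, storing the matrix on disk tile by tile (so each chunk is a single contiguous read) is likely to help more than a faster drive.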