Is CUDA right for me? (FDTD) FDTD user needs fast computations while handling massive 3-D arrays

Hi,

I have been coding FDTD acoustic simulations using MATLAB. I have run into two problems in my work: the simulations run slowly, and they require lots of memory.

I can solve my memory problem by moving to a 64-bit operating system, but now that I can make my arrays as big as I want, I quickly find that large systems take months to run at the sizes I really want.

So, in searching for a faster way to run my simulations without breaking the bank on the latest processors, I am looking into GPU computing. But I am unfamiliar with how GPU computing works and how it uses memory. In my program I allocate 5 arrays, each over 100,000,000 elements in size, with hopes of going a couple of orders of magnitude larger in the future. Can CUDA handle this much data at one time and still be fast enough, and if so, do I need a card with a certain amount of memory on it? For instance, the 8500 GT comes in 256 MB, 512 MB and 1 GB versions. Would I need a 1 GB card to handle the largest arrays, or can the computer pass data from RAM or the hard drive as it goes along?

I haven’t done FDTD processing for audio, but it feels like 1D signal processing in the time domain, so it may fit the CUDA model quite nicely. (3D electrodynamics FDTD processing is quite complex but still possible in CUDA. 1D would be much easier!) A lot depends on whether your signal interactions (linear or nonlinear) are still constrained to time brackets… if you can fit all your computation into sliding windows of, say, 1000 samples or so, it’s in fact QUITE nice to process by streaming data through the GPU cores in parallel.
But that answer depends very much on the exact processing you’re doing… perhaps you’re doing just simple filtering, in which case there are definitely nice solutions.

But as for your main question: yes, you can buy CUDA devices with up to 4GB of RAM. Again, though, your algorithm may make a big difference. Do you really need the whole dataset loaded at once, or just a window you can slide?

If your processing is very minor but your data is very large, it is quite common to end up PCIe-bandwidth limited. You can only pass about 3-6 GB/s over the PCIe bus. Often this is negligible, but for processing where only trivial operations are applied, it becomes the defining bottleneck.
In your case, if you’re sending 5 arrays of 100,000,000 floats (about 2 GB) over to the card, that’s going to take a good fraction of a second just to send the data over.
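To put a number on that "good fraction of a second", here is a rough back-of-the-envelope sketch. It assumes 4-byte floats and the 3-6 GB/s range quoted above, and it ignores latency and driver overhead; the function name is just for illustration.

```python
# Assumptions: 4-byte floats, and the 3-6 GB/s PCIe range quoted in the post.
BYTES_PER_FLOAT = 4

def transfer_seconds(num_arrays, elements_per_array, gb_per_second):
    """Time for one one-way copy over the bus, ignoring latency and overhead."""
    total_bytes = num_arrays * elements_per_array * BYTES_PER_FLOAT
    return total_bytes / (gb_per_second * 1e9)

total_gb = 5 * 100_000_000 * BYTES_PER_FLOAT / 1e9
slow = transfer_seconds(5, 100_000_000, 3.0)  # pessimistic end of the range
fast = transfer_seconds(5, 100_000_000, 6.0)  # optimistic end of the range
print(f"{total_gb:.1f} GB, {fast:.2f} s to {slow:.2f} s per one-way copy")
# prints "2.0 GB, 0.33 s to 0.67 s per one-way copy"
```

If a time step does only a handful of arithmetic operations per element, a copy of this size every step would likely dominate the run time, which is why keeping the data resident on the card matters so much.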

There are many commercially available FDTD codes running on GPUs (a Google search will turn them up). For 3D problems, the speed improvement is typically 20-30 times, provided that the entire problem can fit on the GPU. A 1GB card will handle 3D problems with 20,000,000 cells easily. For 100,000,000 cells, you might be able to fit it on a 4GB card if you program carefully. Otherwise, you will have to swap data between GPU and CPU, which will result in a huge performance penalty.

Well, I am doing 3-D modeling of the sound field in a rectangular listening room. So it is exactly like 3D E&M FDTD, except that in acoustics you are dealing with a scalar pressure field and a velocity vector field. Thus you need one array for the pressure points and three arrays for the vector components of the particle velocity.

Suffice it to say I am very new to a lot of the terminology for GPU processing, but I took from your reply that it is possible but that I have to send the data in chunks. Also, the 4GB CUDA devices are way out of my price range, seeing as this is a side interest to go along with my research, so I am wondering if this can be done using some of the more cost-effective devices, hopefully ones under $150.

The wonderful thing about CUDA and data-parallel applications is that they scale nicely when you give them faster hardware. So you can do your initial research on a slower CUDA card (e.g. the GTX 260 or 9800) with smaller runs. When your code is tuned and you have your 20x performance gains on this hardware, then you’ll know it will be worth the extra cost to scale up to larger problems on the 4 GiB Teslas.

I imagine that the difference between sound and E&M is that sound is slower. You may be able to break up your room into pieces, and recombine them periodically (handling the boundaries cleverly). But since I don’t know much about FDTD, that’s for you to figure out. You need to figure out how to handle the constant swapping of data to and from CPU RAM in a way that lets the GPU still do a good amount of work in between swaps.

Depending on the parameters of your FDTD, you could split the data set across multiple GPUs and hide the inter-GPU communication cost with asynchronous memcopies. For details on how to implement 3D FDTD for regular grids efficiently in CUDA (as well as single and multi-GPU perf numbers), check out the CUDA optimization talk given during the Supercomputing 08 tutorial:
http://www.gpgpu.org/sc2008/
Paulius

I am impressed by the nearly perfect scaling across multiple GPUs. But can you explain what the following statement in your code does?

MPI_Sendrecv( h_ghost_own, num_ghost_elmnts, MPI_REAL, partner, i, h_ghost_partner, num_ghost_elmnts, MPI_REAL, partner, i, MPI_COMM_WORLD, &status);

Do you mean MPI in general? MPI is used for copying data between compute nodes over a network.

I mean, in the following code snippet, the function MPI_Sendrecv() is one which I could not find in the CUDA Programming Guide (or maybe it is a custom function that handles the data copying you mentioned):

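for (int i=0; i<num_time_steps; i++)
{
  // launch both kernels asynchronously, each in its own stream
  launch_kernel( d_output+offset1, d_input+offset1, dimx,dimy,12,stream1);
  launch_kernel( d_output+offset2, d_input+offset2, dimx,dimy,dimz,stream2);
  // copy this node's ghost data to the host once the stream1 kernel is done
  cudaMemcpyAsync( h_ghost_own, d_ghost_own, num_ghost_bytes, cudaMemcpyDeviceToHost, stream1);
  cudaStreamSynchronize( stream1);
  // swap ghost data with the partner node over MPI while stream2 keeps computing
  MPI_Sendrecv( h_ghost_own, num_ghost_elmnts, MPI_REAL, partner, i, h_ghost_partner, num_ghost_elmnts, MPI_REAL, partner, i, MPI_COMM_WORLD, &status);
  // upload the partner's ghost data
  cudaMemcpyAsync( d_ghost_partner, h_ghost_partner, num_ghost_bytes, cudaMemcpyHostToDevice, stream1);
  // wait for all outstanding GPU work before the next time step
  // (cudaThreadSynchronize is deprecated in current CUDA; cudaDeviceSynchronize replaces it)
  cudaThreadSynchronize();
}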

MPI_Sendrecv is part of MPI (Message Passing Interface), the standard library for message passing.
If you are not familiar with it, this is a good link:

http://www-unix.mcs.anl.gov/mpi/

Thanks, that’s helpful.

Thank you everyone for the info thus far. It’s been very helpful. I just have one final question. I have the following:

%This sets the size of the room
%assume a P matrix with MxNxO
M=451; %width in cm
N=801; %depth in cm
O=251; %height in cm

Pold=zeros(M,N,O); % each array is about 700 MB
Pnew=zeros(M,N,O);
Uoldx=zeros(M+1,N,O);
Unewx=zeros(M+1,N,O);
Uoldy=zeros(M,N+1,O);
Unewy=zeros(M,N+1,O);
Uoldz=zeros(M,N,O+1);
Unewz=zeros(M,N,O+1);

% c is a constant

for I=1:M
for J=1:N
for K=1:O
% ...
Pnew(I,J,K) = Pold(I,J,K) + c*(Uoldx(I,J,K) + Uoldy(I,J,K) + Uoldz(I,J,K)); % calculation of Pnew requires data from 4 other arrays
% ...
end
end
end

So my question is this: because each array that I need to read inside the ‘for’ loop is about 700 MB, I imagine I would need to store over 2.8 GB of data in the GPU’s RAM. A 4 GB Tesla card would work, but I only have a 1 GB 9500 GT card available. Is there any way to simply read the data from another location, such as the hard disk, while performing the computations on the GPU?

As you are using the same coordinate on every array, you could load them slice by slice (or several slices at once), compute, get the result, load the next slices, etc.
Or did I miss something?

In float they are half the size, but the four arrays still come to 1.4 GB.
I would indeed advise doing some slicing or something like that. If you are going to call CUDA from MATLAB, you will have all the data available in your MEX file, so you can split things up there.
That will also be necessary if you are going to model larger rooms, since after a while the data will no longer fit on a Tesla either.
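For reference, the sizes being discussed can be checked with a few lines of arithmetic. This is only a sketch: it uses the grid dimensions from the MATLAB listing, counts decimal gigabytes (1 GB = 1e9 bytes), and the helper name `gigabytes` is made up for the example.

```python
M, N, O = 451, 801, 251  # grid dimensions from the MATLAB listing

# Element counts of the four arrays read by the pressure update.
read_set = {
    "Pold": M * N * O,
    "Uoldx": (M + 1) * N * O,
    "Uoldy": M * (N + 1) * O,
    "Uoldz": M * N * (O + 1),
}

def gigabytes(num_elements, bytes_per_element):
    """Storage in decimal GB (1 GB = 1e9 bytes)."""
    return num_elements * bytes_per_element / 1e9

for label, size in (("double", 8), ("float", 4)):
    one_array = gigabytes(read_set["Pold"], size)
    four_arrays = gigabytes(sum(read_set.values()), size)
    # The listing also allocates a "new" twin of every array, doubling the total.
    all_eight = 2 * four_arrays
    print(f"{label}: {one_array:.2f} GB per array, "
          f"{four_arrays:.2f} GB read set, {all_eight:.2f} GB for all eight")
```

The four-array read set works out to about 2.9 GB in double and about 1.45 GB in float, so even single precision does not fit on a 1 GB card without splitting the grid.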

This is obviously not the full code. I’m sure at least Unew takes neighbor elements into account.

But if the full code is not much more complex than this, then forget any sort of device-host transfers. Either all the data will stay in GPU RAM, or you might as well just do it on the CPU. (An i7 should handle such an algorithm well.)

To go beyond the 1GB limit, you can install 2 cards and transfer just the boundary conditions between them each timestep. (I.e., each card would have its own piece of the data.) But if you use a 9500, you won’t get much speedup because its DDR isn’t all that much faster than your CPU’s. DDR bandwidth is the overwhelming bottleneck of your algorithm.
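To make the two-card idea concrete, here is a minimal CPU-only sketch of the boundary exchange. It is not CUDA and not anyone's actual code: it assumes a 1D staggered pressure/velocity grid with rigid walls, the coefficients `cu` and `cp` and all function names are hypothetical, and the two "cards" are just two Python lists. Each half keeps one ghost pressure cell copied from its neighbour, and that single value per side is all that crosses between the halves each timestep.

```python
def step(p, u, cu, cp):
    """One leapfrog step on a local 1D grid; u[0] and u[-1] are left untouched."""
    for i in range(1, len(u) - 1):
        u[i] += cu * (p[i - 1] - p[i])
    for i in range(len(p)):
        p[i] += cp * (u[i] - u[i + 1])

def run_single(p0, steps, cu, cp):
    """Reference run: the whole grid in one piece, rigid walls at both ends."""
    p, u = list(p0), [0.0] * (len(p0) + 1)
    for _ in range(steps):
        step(p, u, cu, cp)
    return p

def run_split(p0, steps, cu, cp):
    """Same run with the grid cut in two halves that swap ghost cells."""
    h = len(p0) // 2
    # Each half owns its cells plus one ghost pressure cell from the other half.
    left, right = list(p0[:h + 1]), list(p0[h - 1:])
    u_left, u_right = [0.0] * (len(left) + 1), [0.0] * (len(right) + 1)
    for _ in range(steps):
        # Ghost exchange: refresh each ghost cell from the neighbour's owned cell.
        left[-1], right[0] = right[1], left[-2]
        step(left, u_left, cu, cp)
        step(right, u_right, cu, cp)
    # Drop the ghost cells when stitching the halves back together.
    return left[:-1] + right[1:]

p0 = [0.0] * 16
p0[5] = 1.0  # initial pressure pulse in the left half
single = run_single(p0, 20, 0.5, 0.5)
split = run_split(p0, 20, 0.5, 0.5)
print(max(abs(a - b) for a, b in zip(single, split)))
```

The ghost cells themselves go stale during each step (they sit next to a cut edge), which is harmless because they are overwritten before the next step. In 3D the exchanged data would be a plane of cells rather than a single value.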

No, that was not the full code, but it is not much more complicated at all. I can provide it, though. The line in question looks like:

Pnew(:,:,:)=Pold(:,:,:)+c*c*rho*K/h*(Uoldx(1:M,:,:)-Uoldx(1+1:M+1,:,:)+Uoldy(:,1:N,:)-Uoldy(:,1+1:N+1,:)+Uoldz(:,:,1:O)-Uoldz(:,:,1+1:O+1));
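Written out cell by cell, the vectorized line above amounts to something like the following sketch. It uses 0-based indices and plain nested lists, and `coeff` is a hypothetical name standing for the scalar factor in front of the bracket. The point is that each pressure cell reads only its own six velocity faces, which is roughly the work one GPU thread per cell would do.

```python
def zeros(nx, ny, nz):
    """Nested-list stand-in for MATLAB's zeros(nx, ny, nz)."""
    return [[[0.0] * nz for _ in range(ny)] for _ in range(nx)]

def update_pressure(p_old, ux, uy, uz, coeff):
    """Return p_new for an M x N x O grid; ux is (M+1) x N x O, and so on."""
    M, N, O = len(p_old), len(p_old[0]), len(p_old[0][0])
    p_new = zeros(M, N, O)
    for i in range(M):
        for j in range(N):
            for k in range(O):
                # Difference of the velocity on the two faces in each direction.
                net_inflow = (ux[i][j][k] - ux[i + 1][j][k]
                              + uy[i][j][k] - uy[i][j + 1][k]
                              + uz[i][j][k] - uz[i][j][k + 1])
                p_new[i][j][k] = p_old[i][j][k] + coeff * net_inflow
    return p_new

# Tiny example: one non-zero x-velocity on the face between cells 0 and 1.
M, N, O = 3, 4, 2
ux, uy, uz = zeros(M + 1, N, O), zeros(M, N + 1, O), zeros(M, N, O + 1)
ux[1][2][1] = 1.0
p_new = update_pressure(zeros(M, N, O), ux, uy, uz, 0.5)
print(p_new[0][2][1], p_new[1][2][1])
```

Flow through that one face lowers the pressure in the cell it leaves and raises it in the cell it enters by the same amount, so the total over the grid stays at zero.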

You are saying that I will not see much speedup with CUDA because DDR bandwidth is a bottleneck for me, and that I should go to an i7 processor. I find this odd, because this paper, http://smadasam.googlepages.com/Hpcmp-ugc07.pdf, indicates a significant speedup using an old GeForce Go 7400 card. I’m not all that familiar with the i7, but its cost is out of the question at this time: this is a side project and my professor won’t pay for such an upgrade, so I’m stuck with my Pentium D 940.

So what would you suggest would give me the most benefit: more memory on a card, or a card with DDR3?

A GTX260 will have over 10x the DDR bandwidth of your Pentium D. (A 9500… 2-3x) So you can still get a big boost. But then you also want a huge problem set size. On a CPU you can get 8GB of RAM for $80, which is much more than 10x cheaper than amassing the equivalent in GPUs.

But, actually, I think it may still be possible to block the computation, but to overcome the host-device bottleneck you’ll need to carry out dozens of timesteps on each slice before copying to/from CPU RAM. With each timestep you have to shrink the slice at its edges (since you don’t have enough data to compute them), and afterwards do another pass that calculates the missing part. If you can figure this part out, you may have something on your hands. If not, then you won’t.
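A minimal CPU-only sketch of that blocking idea follows, under several assumptions: a 1D staggered pressure/velocity grid with rigid walls, hypothetical coefficients `cu` and `cp`, and made-up function names. Note that it swaps in the overlapped-halo variant rather than a separate clean-up pass: each slice is loaded with `T` extra cells per side, advanced `T` timesteps, and only its owned cells are kept, so neighbouring slices redundantly recompute the edge cells that would otherwise be missing.

```python
def step(p, u, cu, cp):
    """One leapfrog step on a local 1D grid; u[0] and u[-1] are left untouched."""
    for i in range(1, len(u) - 1):
        u[i] += cu * (p[i - 1] - p[i])
    for i in range(len(p)):
        p[i] += cp * (u[i] - u[i + 1])

def advance_block(p, u, start, stop, T, cu, cp):
    """Advance owned cells [start, stop) by T steps using a halo of T cells per side."""
    lo, hi = max(0, start - T), min(len(p), stop + T)
    p_loc, u_loc = p[lo:hi], u[lo:hi + 1]  # stands in for the copy to the device
    for _ in range(T):
        step(p_loc, u_loc, cu, cp)
    # The halo goes stale by one cell per step from each cut edge; keep owned cells only.
    return p_loc[start - lo:stop - lo], u_loc[start - lo:stop - lo + 1]

def run_blocked(p0, steps, T, block, cu, cp):
    """Whole-grid run done slice by slice, T steps at a time (steps must be a multiple of T)."""
    p, u = list(p0), [0.0] * (len(p0) + 1)
    for _ in range(steps // T):
        p_next, u_next = list(p), list(u)  # every slice reads the same old state
        for start in range(0, len(p), block):
            stop = min(start + block, len(p))
            pb, ub = advance_block(p, u, start, stop, T, cu, cp)
            p_next[start:stop], u_next[start:stop + 1] = pb, ub
        p, u = p_next, u_next
    return p

def run_single(p0, steps, cu, cp):
    """Reference run on the whole grid at once."""
    p, u = list(p0), [0.0] * (len(p0) + 1)
    for _ in range(steps):
        step(p, u, cu, cp)
    return p

p0 = [0.0] * 40
p0[7] = 1.0  # initial pressure pulse
blocked = run_blocked(p0, 24, 4, 10, 0.5, 0.5)
reference = run_single(p0, 24, 0.5, 0.5)
print(max(abs(a - b) for a, b in zip(blocked, reference)))
```

For simplicity the sketch still updates the stale halo cells and then discards them; a real kernel would shrink the update range each step instead. The trade-off is `2*T` extra cells of copying and computation per slice in exchange for one host-device round trip every `T` steps instead of every step.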