execution configuration & streams too large for device memory: what is the best way to do this?

Please correct me if I am wrong in any of my assumptions.

  1. execution configuration

I think I am starting to get the hang of this: I am now executing kernels with streams that fully utilize all of the multiprocessors on the GPU. For best results, it looks like the grid size should be a multiple of the number of multiprocessors. In my case I have an 8800 Ultra, which has 16 multiprocessors, so my grid sizes should be multiples of 16. For the block size, the occupancy calculator suggests 256 is the optimal size, though some smaller sizes such as 192 look like they would work too. So, for best results, the streams that I compute on need to be multiples of 4096 (16 * 256) elements.
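Here is a minimal sketch of what I mean by that configuration (the kernel body and the element count are just placeholders):

```c
// Minimal sketch: block size 256, grid size a multiple of the
// multiprocessor count. Kernel body and element count are placeholders.
#include <cuda_runtime.h>

__global__ void process(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;                  /* placeholder work */
}

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);    /* 16 multiprocessors on an 8800 Ultra */

    const int blockSize = 256;            /* from the occupancy calculator */
    const int n = prop.multiProcessorCount * blockSize * 4;  /* multiple of 4096 */

    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));

    process<<<n / blockSize, blockSize>>>(d_data, n);  /* grid = multiple of 16 */
    cudaThreadSynchronize();

    cudaFree(d_data);
    return 0;
}
```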

So what I am wondering is: is there any advantage to not using 1D sizes for the grids and blocks? My speculation is that it would run basically the same as long as the aggregate size is the same as mentioned above (i.e. the block size could be 16x16 instead of 256). I am also speculating that the hardware treats it as a 1D arrangement anyway, in which case I might as well keep everything 1D. The other thought is that perhaps a 2D arrangement would run better, since it maps more closely to the graphics and texture workloads that GPUs are obviously designed for.
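To make the comparison concrete, here is a sketch of what I mean: a 16x16 block should touch exactly the same 256 elements as a 1D block of 256, with only the index arithmetic differing:

```c
// Sketch: a 16x16 block addressing the same 256 elements as a 1D block
// of 256. Only the indexing math differs.
__global__ void kernel1D(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;      /* blockDim.x == 256 */
    data[i] += 1.0f;
}

__global__ void kernel2D(float *data)
{
    int tid = threadIdx.y * blockDim.x + threadIdx.x;   /* flatten 16x16 to 0..255 */
    int i = blockIdx.x * (blockDim.x * blockDim.y) + tid;
    data[i] += 1.0f;
}

/* Launched as, e.g.:
 *   kernel1D<<<64, 256>>>(d_data);
 *   kernel2D<<<64, dim3(16, 16)>>>(d_data);
 */
```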

  2. streams that are too large for the device memory

I am writing a finite element code that typically runs model spaces requiring more memory than a single GPU has. Models will generally be 512^3 cells or more, and each cell requires 4 * float3 and 1 * byte/char, so that would obviously require far too much memory. For each cell, at least 70 floating point operations occur per kernel execution.
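(For reference, that works out to 4 * 12 + 1 = 49 bytes per cell, so 512^3 cells * 49 bytes is roughly 6.6 GB.)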

Can anyone recommend the best way to efficiently marshal the memory so that I can keep the GPU busy all of the time?

This is the best solution that I can think of, but if you have a better idea, all the better. Basically, I would make the biggest stream possible, essentially filling up the device's memory, and execute the kernel on that large stream. Then I would have two buffers of the same size on the host (perhaps in page-locked memory?): one to save the result and one to stage the next stream to compute. The rest of the huge stream I would keep in host memory and transfer in and out as needed.

Another variation could be to break the device memory up into something like fourths and do basically the same thing, but only on a part of the device memory at a time. Either way, the inner loop would look something like the sketch below.
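This is roughly the chunked loop I have in mind, with a page-locked staging buffer and plain synchronous copies (CHUNK_ELEMS, computeCells, and processModel are made-up names, and the kernel body is a placeholder):

```c
// Sketch of the chunked scheme: the model lives in host memory and is
// processed one device-sized chunk at a time through a pinned staging
// buffer. All names and sizes here are illustrative.
#include <cuda_runtime.h>
#include <string.h>

#define CHUNK_ELEMS (8 * 1024 * 1024)    /* sized to a fraction of device memory */

__global__ void computeCells(float *cells, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        cells[i] *= 2.0f;                /* placeholder for the real FE kernel */
}

void processModel(float *model, size_t totalElems)
{
    float *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, CHUNK_ELEMS * sizeof(float)); /* page-locked */
    cudaMalloc((void **)&d_buf, CHUNK_ELEMS * sizeof(float));

    for (size_t off = 0; off < totalElems; off += CHUNK_ELEMS) {
        size_t n = totalElems - off;
        if (n > CHUNK_ELEMS)
            n = CHUNK_ELEMS;

        memcpy(h_buf, model + off, n * sizeof(float));            /* stage   */
        cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
        computeCells<<<((int)n + 255) / 256, 256>>>(d_buf, (int)n);
        cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
        memcpy(model + off, h_buf, n * sizeof(float));            /* unstage */
    }

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
}
```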

Thanks for your suggestions.

So for 512^3 cells at 49 bytes per cell, that's about 6.6 GB of memory. The Tesla cards have 1.5 GB of memory each, so if you can partition the problem across a handful of them, that will do the trick.

Of course, that works until you have to make your cell size 20% smaller in each dimension, and then you’re back to having to stream data to the card again.

How long does each iteration of the computation take? Newer CUDA cards (sadly not the GTX or Tesla) can overlap memory transfers and kernel execution. That could also help to mitigate the time spent moving data on and off the card.
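For reference, the usual pattern on cards that report deviceOverlap looks something like this untested sketch (the kernel and the chunk sizes are placeholders):

```c
// Untested sketch of copy/compute overlap on cards that report
// deviceOverlap. Two streams ping-pong so one chunk's transfer is in
// flight while the other chunk computes. Kernel and sizes are placeholders.
#include <cuda_runtime.h>

__global__ void computeCells(float *cells, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        cells[i] *= 2.0f;                 /* placeholder work */
}

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (!prop.deviceOverlap)
        return 1;                         /* G80 parts (8800 GTX/Ultra) can't overlap */

    const int nChunks = 8;
    const int chunk = 1 << 20;            /* elements per chunk */

    float *h_buf, *d_buf[2];
    cudaStream_t stream[2];
    cudaMallocHost((void **)&h_buf, (size_t)nChunks * chunk * sizeof(float)); /* pinned, required for async copies */
    for (int b = 0; b < 2; b++) {
        cudaMalloc((void **)&d_buf[b], chunk * sizeof(float));
        cudaStreamCreate(&stream[b]);
    }

    for (int c = 0; c < nChunks; c++) {
        int b = c & 1;                    /* alternate buffers and streams */
        float *h_chunk = h_buf + (size_t)c * chunk;
        cudaMemcpyAsync(d_buf[b], h_chunk, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, stream[b]);
        computeCells<<<chunk / 256, 256, 0, stream[b]>>>(d_buf[b], chunk);
        cudaMemcpyAsync(h_chunk, d_buf[b], chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[b]);
    }
    cudaThreadSynchronize();              /* drain both streams */

    for (int b = 0; b < 2; b++) {
        cudaFree(d_buf[b]);
        cudaStreamDestroy(stream[b]);
    }
    cudaFreeHost(h_buf);
    return 0;
}
```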