where to loop

I am just wondering where the best place to do the looping is. I am used to packing my data into textures, rendering a quad, and having my kernel in a fragment program. My paradigm back then was to think of everything in the quad as being computed at the same time. I was able to get over a 100x speedup vs. running on the CPU. I am trying to figure out which approach is similar to my old shader paradigm that performed well. Basically, I am storing the data to be worked on in 3 kB streams as 1D arrays. Is it best to do the internal computation on the 1D array in:

  1. the host code launching a kernel for each element.

  2. a kernel for each array and loop through with a for loop in the device code

  3. do it some completely different way

elements | iterations | looping on host | looping on device
(float3) |            | (time in s)     | (time in s)
---------|------------|-----------------|------------------
3x3x3    | 1,000      | 6.80            | 6.89
3x3x3    | 10,000     | 67.25           | 68.30
10x10x10 | 1,000      | 67.78           | 821.57
20x20x20 | 1,000      | 397.88          | 1760.48

I am basically doing 4 floating point operations on the three elements of each cell. The final code will do about 40. These tests are really basic and I am not sure if they really mean anything, but it looks like you want to loop on the host and not the device. Still, the speedup doesn't seem as great as when I used a fragment shader program, though I haven't tested it to be sure.

CUDA performance should be similar to, if not better than, shader performance, since both use similar hardware (with caveats for 3D textures, etc.). You may not be using CUDA correctly.

You can still think of CUDA in the same way that you did for fragment shaders. A CUDA kernel is similar to a fragment shader program. You then set up the grid and blocks so that the CUDA runtime "loops" for you, and you can continue to think of your entire array as being processed at the same time.

What is the correct way to set up the grid and blocks to make it work that way?

This simple kernel multiplies every element by 2, and demonstrates how to set up the grid. Note also that memory reads in this example are coalesced as long as the block size is a multiple of 32 (which it should be anyways…)

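    __global__ void mul2_kernel(const float *d_in, float *d_out)
        {
        // One thread per element: compute this thread's global index.
        int idx = blockDim.x * blockIdx.x + threadIdx.x;
        // No bounds check: the arrays are assumed to be padded to a whole number of blocks.
        d_out[idx] = d_in[idx] * 2.0f;
        }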

You would call this kernel with a block size of your choosing (vary the block size to find the best performance) and a number of blocks equal to ceil(N/block_size). Of course, this assumes that your arrays have been padded so that there are no writes past the end of an array. To avoid needing to pad, you can add an if (idx < N) check to the kernel without seeing any performance loss.
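As a rough sketch of the launch described above (not tested on the original poster's setup), the host side might look like the following. The names mul2_checked, blocks_for and run_mul2, the array length of 1000, and the block size of 128 are all assumptions made for this example; the kernel is mul2_kernel with the if (idx < N) check added so that no padding is needed.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// mul2_kernel with the bounds check added; the parameter n is an assumption
// of this sketch and is not in the kernel posted above.
__global__ void mul2_checked(const float *d_in, float *d_out, int n)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n)                      // threads past the end do nothing
        d_out[idx] = d_in[idx] * 2.0f;
}

// Integer form of ceil(n / block_size).
int blocks_for(int n, int block_size)
{
    return (n + block_size - 1) / block_size;
}

// Runs the kernel on n floats and returns the number of wrong results
// (or -1 if the launch itself failed).
int run_mul2(int n, int block_size)
{
    size_t bytes = n * sizeof(float);
    float *h_in = (float *)malloc(bytes);
    float *h_out = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i)
        h_in[i] = (float)i;

    float *d_in = NULL, *d_out = NULL;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    // One launch covers the whole array; there is no loop over elements
    // on the host or inside the kernel.
    mul2_checked<<<blocks_for(n, block_size), block_size>>>(d_in, d_out, n);
    bool launch_ok = (cudaGetLastError() == cudaSuccess);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != 2.0f * h_in[i])
            ++mismatches;

    cudaFree(d_in);
    cudaFree(d_out);
    free(h_in);
    free(h_out);
    return launch_ok ? mismatches : -1;
}

int main()
{
    const int n = 1000;          // deliberately not a multiple of the block size
    const int block_size = 128;  // assumed starting point; tune by timing
    printf("%d blocks of %d threads, %d mismatches\n",
           blocks_for(n, block_size), block_size, run_mul2(n, block_size));
    return 0;
}
```

With n = 1000 and 128 threads per block, ceil(N/block_size) comes to 8 blocks, or 1024 threads, so the last 24 threads fail the bounds check and simply return.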

So it looks like you are just doing your operation on the entire array at once. Is that correct? That makes it feel like the old shader concept.

It is just like the old shader concept, but for "simple" kernels what more do you need? Even in this simple form, it is more powerful than shaders. 1) You get to control memory access patterns, which means good memory performance. 2) Perhaps you need to calculate 9 different output values (e.g. the position, velocity, and acceleration of particles). You can do this easily, even in this "simple shader approach", in CUDA by writing to many output variables. I've never written a shader before, but can't they only write one float4 output?
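To illustrate the second point, here is a minimal sketch of a kernel in which each thread writes nine floats across three output arrays. Everything in it is hypothetical: the names step_kernel and run_step, the float3 layout (chosen only to match the float3 elements in the question), and the placeholder spring force stand in for whatever the real computation is. It also uses cudaMallocManaged to keep the host code short, which assumes a GPU and CUDA version with unified memory support.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-particle update: one thread reads one particle and
// writes three float3 results (9 floats) to three separate arrays.
__global__ void step_kernel(const float3 *pos_in, const float3 *vel_in,
                            float3 *pos_out, float3 *vel_out, float3 *acc_out,
                            float dt, int n)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx >= n)
        return;

    float3 p = pos_in[idx];
    float3 v = vel_in[idx];

    // Placeholder physics: a unit spring pulling toward the origin.
    float3 a = make_float3(-p.x, -p.y, -p.z);
    v.x += a.x * dt;  v.y += a.y * dt;  v.z += a.z * dt;
    p.x += v.x * dt;  p.y += v.y * dt;  p.z += v.z * dt;

    // Three independent outputs from the same thread.
    pos_out[idx] = p;
    vel_out[idx] = v;
    acc_out[idx] = a;
}

// Runs one step on n identical particles and returns how many results
// differ from the values worked out by hand for this input.
int run_step(int n)
{
    float3 *pos_in, *vel_in, *pos_out, *vel_out, *acc_out;
    size_t bytes = n * sizeof(float3);
    cudaMallocManaged((void **)&pos_in, bytes);
    cudaMallocManaged((void **)&vel_in, bytes);
    cudaMallocManaged((void **)&pos_out, bytes);
    cudaMallocManaged((void **)&vel_out, bytes);
    cudaMallocManaged((void **)&acc_out, bytes);

    for (int i = 0; i < n; ++i) {
        pos_in[i] = make_float3(1.0f, 2.0f, 3.0f);
        vel_in[i] = make_float3(0.0f, 0.0f, 0.0f);
    }

    const int block_size = 128;  // assumed; tune by timing
    const int n_blocks = (n + block_size - 1) / block_size;
    step_kernel<<<n_blocks, block_size>>>(pos_in, vel_in,
                                          pos_out, vel_out, acc_out, 0.5f, n);
    cudaDeviceSynchronize();     // required before the host reads managed memory

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        bool ok = acc_out[i].x == -1.0f && vel_out[i].x == -0.5f &&
                  pos_out[i].x == 0.75f && pos_out[i].y == 1.5f &&
                  pos_out[i].z == 2.25f;
        if (!ok)
            ++mismatches;
    }

    cudaFree(pos_in);  cudaFree(vel_in);
    cudaFree(pos_out); cudaFree(vel_out); cudaFree(acc_out);
    return mismatches;
}

int main()
{
    printf("mismatches: %d\n", run_step(1000));
    return 0;
}
```

The input and output arrays are kept separate here so that no thread reads a value another thread may already have overwritten; whether that is needed depends on the real algorithm.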

It also should feel like the old shader concept. After all, shaders are data-parallel and the GPU is built around a data-parallel model. The more threads you can fire up doing independent things, the better. Shared memory, block synchronization and other non-shader CUDA concepts are useful, but only for algorithms where they are really needed. If a problem can be broken down into a full grid of independent threads with no intercommunication, that is likely to be the fastest implementation.