streams in kernels

I am still more used to writing code with Cg than CUDA, so I am still trying to figure out the best way to write and run code using CUDA. This is what I understand of CUDA so far. Please correct me if I am wrong at any point, and please provide the correct way of thinking about it.

  1. Seems like the most efficient way to use compute streams is still similar to the old FBO approach. Let's say we have a stream (a 1D array) and want to add 2 to each element. The most effective way to do this is to create two arrays, one for the input and one for the output. Like in the old days, you can just think of all of the elements as being computed at the same time. Something like this:
__global__ void k(float *in, float *out)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    out[idx] = in[idx] + 2.0f;
}
  2. The particular index of the 1D array that a given thread works on is determined by the execution configuration specified in the kernel invocation. The ideal execution configuration is GPU specific.

  3. This part I am less sure about: how do you know which address in the array a given thread is working on? This is what I have gathered from reading other posts, but I don’t think I really understand how it works.

blockDim.x * blockIdx.x + threadIdx.x

How does that really work?
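To make the whole flow concrete, here is a minimal host-side sketch of how such a kernel might be launched. The array size and block size are made-up illustrative values, and the kernel name `k` matches the example above; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

__global__ void k(float *in, float *out)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    out[idx] = in[idx] + 2.0f;
}

int main()
{
    const int n = 1024;         // number of elements (assumed value)
    const int blockSize = 256;  // threads per block (tune per GPU)
    const int numBlocks = n / blockSize;  // assumes n is a multiple of blockSize

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    // <<<numBlocks, blockSize>>> is the execution configuration:
    // numBlocks blocks of blockSize threads each.
    k<<<numBlocks, blockSize>>>(d_in, d_out);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```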


For your kind of stream computation that seems perfectly OK.

blockDim.x * blockIdx.x + threadIdx.x

You just use the first coordinate because it’s 1-dimensional. When executing a kernel, you have to specify the execution configuration, which describes a two-level grid of blocks and threads. At the coarse level you have blocks of threads, which are composed of many individual threads at the fine level. To find a particular thread you have to specify the block it is in (blockIdx) and its location within that block (threadIdx). So when blocks are blockDim threads in size, this leads to the formula above.
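A worked example of the formula may help; the block size of 256 here is purely illustrative. Each thread can compute its own global index like this:

```cuda
// With blockDim.x = 256 threads per block:
//   block 0 covers idx 0..255   (256*0 + threadIdx.x)
//   block 1 covers idx 256..511 (256*1 + threadIdx.x)
//   block 3, thread 10 -> idx = 256*3 + 10 = 778
__global__ void show_index(int *out)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    out[idx] = idx;  // each thread records its own global index
}
```

The formula simply flattens the (block, thread) pair into a single linear position, exactly like `row * width + column` for a 2D array stored in row-major order.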


You could also write back to the in array, if you don’t need the original data after running this kernel.

can you really write back to the same array within a kernel ???

Yes, as long as the reads and writes don’t conflict. In this case the thread writes the same location that it read, so it is possible.
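A minimal sketch of that in-place variant (the kernel name is made up). It is safe because each thread reads and writes only its own element, so no thread ever touches a location another thread uses:

```cuda
// In-place version: safe because thread idx reads and writes only in[idx],
// which no other thread accesses during this kernel invocation.
__global__ void k_inplace(float *in)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    in[idx] = in[idx] + 2.0f;  // read and write the same location
}
```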

Ohhh, thanks a lot. So a read/write conflict means a thread does not write into the same location where it read the data.

Are the execution points of the read and the write important? Must the write immediately follow the read instruction, or can other instructions come between those two commands?

By a conflict I mean that a thread reads a memory position that this (or another) thread wrote earlier in the same kernel invocation. This would give unpredictable results.

Apart from that, there are really no restrictions in the ordering of memory accesses.
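For contrast, here is a hypothetical kernel that does have such a conflict: each thread reads its neighbor's element, which the neighboring thread may already have overwritten in the same invocation. The kernel name and shift operation are made up for illustration:

```cuda
// UNSAFE sketch: thread idx reads in[idx + 1], but the thread handling
// index idx + 1 may already have written that location during this same
// kernel invocation, so the value read is unpredictable.
__global__ void shift_bad(float *in, int n)
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx + 1 < n)
        in[idx] = in[idx + 1];  // race: may see the old or the new value
}
```

The two-array input/output pattern from the first post avoids this problem entirely, which is why it is the recommended default.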