Ok…I read NVIDIA’s CUDA guide and I know how threads are grouped and things like that. What confuses me are the thread and block indices (threadIdx.x etc…) and how many threads I have to use?! Give me an example, please: how can I add 2 arrays or matrices? I have experience in GPGPU using OpenGL + Cg…
Thank you!
The guide has an example, which is good;
the SDK samples are all worth reading.
Take your time to read them all :)
Please, please, please write me a kernel (pseudocode will be fine) for adding 2 arrays. In OpenGL/Cg I had texture indices and I don’t know how to do the same thing with CUDA. I don’t understand things like: C[threadIdx.x * blockIdx.x + blockDim.x] Why like this? I didn’t find an explanation of what the values of the block/thread IDs are…
Thank you
Take a look at this thread:
http://forums.nvidia.com/index.php?showtopic=34309
Ok…This is what bugs me:
__global__ void add_arrays_gpu(float *in1, float *in2, float *out, int Ntot)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < Ntot)
        out[idx] = in1[idx] + in2[idx];
}
int idx=blockIdx.x*blockDim.x+threadIdx.x;
idx covers all elements of the arrays. Let’s assume that we have arrays of 16 members, so the values of idx will be 1 2 3 4 5…16? Am I right? Can you explain why you are doing this blockIdx.x*blockDim.x+threadIdx.x?
I am so stuck on this…:(
You need to map from a local index to a global index. You know how many blocks you have and how big each block is.
Let’s assume you have an array of 8 elements, and you are using 2 blocks with 5 threads each.
Block 0:
blockIdx.x=0
blockDim.x=5
threadIdx.x= 0,1,2,3,4
idx will span: 0,1,2,3,4
Block 1:
blockIdx.x=1
blockDim.x=5
threadIdx.x= 0,1,2,3,4
idx will span: 5,6,7,8,9
So idx covers the initial range plus some extra indices (and there is a check in the kernel to see if idx is outside the initial range).
Oh…Thanks a lot, man! It is so much clearer now! 1 more question: What are the values of threadIdx.y and blockIdx.y?
Edit: What are the local index and global index? The thread index and the block’s/grid’s index?
This example is using a 1D decomposition, so threadIdx.y and blockIdx.y are both zero.
I am using global to refer to the index of the original problem, and local to refer to the index in the decomposed problem.
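The same numbers show up when you pick the launch configuration. A host-side sketch (the block size of 256 and the pointer names are my own choices here, assuming the arrays are already allocated on the device):

```cuda
// Host-side launch of the add_arrays_gpu kernel quoted above.
// The rounded-up division guarantees a thread for every element
// even when Ntot is not a multiple of the block size.
int Ntot = 1000;
int block_size = 256;                                  // threads per block (tuning choice)
int n_blocks = (Ntot + block_size - 1) / block_size;   // ceil(Ntot / block_size)

add_arrays_gpu<<<n_blocks, block_size>>>(d_in1, d_in2, d_out, Ntot);
```

Because n_blocks is rounded up, the last block may have threads with idx >= Ntot, which is exactly what the `if (idx < Ntot)` guard in the kernel is for.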
Tell me this:
If I have 64 blocks and 256 threads, I have 4 threads in each block, and I have allocated arrays like this:
CUDA_SAFE_CALL(cudaMalloc((void**)&dinput, sizeof(float) * NUM_THREADS * 2));
CUDA_SAFE_CALL(cudaMalloc((void**)&doutput, sizeof(float) * NUM_BLOCKS));
CUDA_SAFE_CALL(cudaMalloc((void**)&dtimer, sizeof(clock_t) * NUM_BLOCKS * 2));
blockIdx.x = 0
threadIdx.x = 0 1 2 3
blockIdx.x = 1
threadIdx.x = 0 1 2 3
blockIdx.x = 2
threadIdx.x = 0 1 2 3
blockIdx.x = 3
threadIdx.x = 0 1 2 3
…
blockIdx.x = 63
threadIdx.x = 0 1 2 3
Is this correct???
What is a good example of using the threadIdx.y and blockIdx.y values in a kernel? (an SDK sample or something else)
Edit: …a sample with 2D decomposition
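For what it’s worth, a minimal 2D sketch of my own (not from an SDK sample) adding two matrices stored row-major — threadIdx.y and blockIdx.y work exactly like their .x counterparts, just along the second grid dimension:

```cuda
// 2D matrix add: each thread handles one (row, col) element.
// width/height and the row-major layout are assumptions here.
__global__ void add_matrices_gpu(const float *A, const float *B,
                                 float *C, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // x -> column
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // y -> row

    if (row < height && col < width)
        C[row * width + col] = A[row * width + col] + B[row * width + col];
}

// Launched with a 2D grid of 2D blocks, e.g.:
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   add_matrices_gpu<<<grid, block>>>(dA, dB, dC, width, height);
```

The matrixMul sample in the SDK also uses a 2D decomposition, if you want a complete worked example.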