Array Indexing in CUDA

Greetings all,

I am trying to understand how an array is indexed in CUDA. To that end, I took the vectorAdd program from the NVIDIA CUDA Samples. Below is the kernel as I edited it:

__global__ void
vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        printf("%d\n", i);  // i is a signed int, so %d rather than %u
        C[i] = A[i] + B[i];
    }
}

The main function calls vectorAdd as follows:

vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);

In main, blocksPerGrid is set to 196, threadsPerBlock is 256, and numElements is 50000. When I run the above code, it prints the indices (variable i) of only 4096 invocations. From what I understand, I would expect a total of 196*256 = 50176 invocations. Am I missing something?

Furthermore, when I examine the indices (variable i) printed by the above code, they range from about 45000 to 48000; no index lower than that is present. Interestingly, when I set blocksPerGrid to 1, the index values range from 0 to 255. Note that these indices are not present when I run with the larger blocksPerGrid (196, the default). I would expect the indices touched with a smaller blocksPerGrid (1) to be a subset of those touched with a larger blocksPerGrid (196). Why is there such a large difference between the indices when a different blocksPerGrid value is used?

Also, in other programs I see array indices going above 1109846 while the size of the array is 16. How are such array indices accessed?

The reason is probably just the limited size of the CUDA printf buffer. Try this:

if (i%100==0) printf("%u\n", i);

You should then see values of i spread across the whole range, from 0 up to about 50000.
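
If you do want every index printed, another option is to enlarge the printf buffer before launching the kernel. The following is only a minimal sketch of that idea, not the sample's actual code: the kernel name printIndex and the 16 MiB buffer size are my own assumptions, chosen generously for roughly 50000 short lines. As far as I know the buffer is circular, so when a kernel produces more output than fits, older output is overwritten, which would explain why only a band of high indices survives.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not from the CUDA sample): each in-range thread
// prints its global index.
__global__ void printIndex(int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    // 196 * 256 = 50176 threads are launched, but only the 50000 threads
    // that pass this guard print anything.
    if (i < numElements)
    {
        printf("%d\n", i);
    }
}

int main()
{
    const int numElements = 50000;
    const int threadsPerBlock = 256;
    // Round up so every element gets a thread: (50000 + 255) / 256 = 196.
    const int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;

    // Assumed size: 16 MiB. The limit has to be set before the kernel launch.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 16 * 1024 * 1024);
    if (err != cudaSuccess)
    {
        fprintf(stderr, "cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printIndex<<<blocksPerGrid, threadsPerBlock>>>(numElements);

    // Synchronizing with the device flushes the printf buffer to stdout.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
    {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    return 0;
}
```

The lines come out in no guaranteed order, because blocks are scheduled independently, so pipe the output through sort -n | uniq | wc -l if you want to check how many distinct indices were actually printed.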

You might want to read the documentation concerning in-kernel printf.
It’s in the programming guide, at