Just starting and with a question on a excercise

Hello, I’m just starting with CUDA, and altough I did made some stuff with OpenMP and Beowolf clusters, that was a very long time ago.

I’m reading the tutorials at Dr.Dobbs and on part 2 they have a very simple array increment code:

// incrementArray.cu

#include <stdio.h>

#include <assert.h>

#include <cuda.h>

void incrementArrayOnHost(float *a, int N)


  int i;

  for (i=0; i < N; i++) a[i] = a[i]+1.f;


__global__ void incrementArrayOnDevice(float *a, int N)


  int idx = blockIdx.x*blockDim.x + threadIdx.x;

  if (idx<N) a[idx] = a[idx]+1.f;


int main(void)


  float *a_h, *b_h;		   // pointers to host memory

  float *a_d;				 // pointer to device memory

  int i, N = 10;

  size_t size = N*sizeof(float);

  // allocate arrays on host

  a_h = (float *)malloc(size);

  b_h = (float *)malloc(size);

  // allocate array on device 

  cudaMalloc((void **) &a_d, size);

  // initialization of host data

  for (i=0; i<N; i++) a_h[i] = (float)i;

  // copy data from host to device

  cudaMemcpy(a_d, a_h, sizeof(float)*N, cudaMemcpyHostToDevice);

  // do calculation on host

  incrementArrayOnHost(a_h, N);

  // do calculation on device:

  // Part 1 of 2. Compute execution configuration

  int blockSize = 4;

  int nBlocks = N/blockSize + (N%blockSize == 0?0:1);

  // Part 2 of 2. Call incrementArrayOnDevice kernel 

  incrementArrayOnDevice <<< nBlocks, blockSize >>> (a_d, N);

  // Retrieve result from device and store in b_h

  cudaMemcpy(b_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

  // check results

  for (i=0; i<N; i++) assert(a_h[i] == b_h[i]);

  // cleanup

  free(a_h); free(b_h); cudaFree(a_d);

One of the questions to enhance it, is this one:

I don’t know if I’m just being too naive, but I can’t seem to find a good answer for this, could someone help me out on it?

Thank you very much.

Perhaps the key to working it out is to understand why it can’t work correctly for an arbitrary sized array. As it is written the code launches 4 threads per block (let’s put aside that 4 threads per block is a very poor choice) and as many blocks as required in a 1D grid, rounded up so that the whole array is processed. The upper limit of grid size in any dimension on all current hardware is 65535. So in its present form, the code can only correctly process arrays up to 65535*4 in size.

The question is inviting you to think how to remove that limit by adding a loop (I am assuming inside the kernel, because the limitation could equally be removed by running the kernel multiple times inside a host loop). So the design changes from each thread processing only one entry, to each thread processing multiple entries in a loop, with the number of iterations per thread calculated from the array size and the number of blocks and threads per block. So the host code might do something like choose values for the number threads per block and number of blocks, and then divide the array size through by both to give the number of iterations per thread. You then also need to think about what to do when they are not evenly divisible (padding the input array to an even multiple of block and threads per block is probably the easiest way).

Does that help?

Hum, that sure helps, the only problem then would be to find out how much each thread should loop, and how to get this amount on each thread correctly (this would be the hardest part for me I think).

If I understood your reply correctly then, I should do something like this to find the amount of loop on each thread:

int blockSize = 4;

  int nBlocks = N/blockSize + (N%blockSize == 0?0:1);

  int amount = N/(blockSize*nBlocks);

is this correct?

How would I fetch an N amount of the array in the Kernel?

Thanks for the reply ;)

No. If you work through the mathematics of that code, you should find amount is either 0 or 1 for all N, which is obviously not what you want. Set both BlockSize and nBlocks to fixed values, then calculate amount from them and N.

In this case you wouldn’t. Imagine the following situation with BlockSize=4 and nBlocks=2 (again these are poor choices, but easy to illustrate), with N=32;

Block:				0		|		1

Thread:		 0   1   2   3  |  0   1   2   3


Iteration 1:	0   1   2   3  |  4   5   6   7

Iteration 2:	8   9  10   11 | 12  13  14  15

Iteration 3:   16  17  18   19 | 20  21  22  23

Iteration 4:   24  25  26   27 | 28  29  30  31

The first iteration is exactly equal to the way the code works now with N=8. The second is also equivalent to the way the code currently works with N=8, except now work starts at a new place in the array. The pattern continues until the whole array is processed. So thread 0 processes elements 0,8,16,24, thread 1 processes elements 1,9,17,25 and so on. So to do this, the new kernel code should wrap the current calculation in a loop where each iteration adds an offset to the thread index.

Short of writing out the code I can’t make it any clearer than that, I am afraid.

I see, could I say that something like this would work then?

The kernel:

__global__ void incrementArrayOnDevice(float *a, int N, int amount, int slack)


  int i,idx;



	idx=blockDim.x*blockIdx.x + threadIdx.x+(i*slack);

	if(idx < N) a[idx]=a[idx]+1.f;



And for the calling of the kernel:

int blockSize = 4;

  int nBlocks = 2;

  int slack = blockSize*nBlocks;

  int amount = N/slack + (blockSize%N==0 ? 0:1);


I tested this with N=655352, 655353, 655362 and 655363 and it worked, is it correct?


The code certainly isn’t how I would have written it, but the idea is on the right track. Correctness is way beyond the scope of this discussion, I think.

hehehe, can you share how would you have written it?