CUDA fastest with one block, one thread?

Hello,

I’m working on a simple CUDA example with two one-dimensional arrays, A and B. I fill A with 1s and B with 2s, then set A[i] = A[i] + B[i]. The code works, but the problem is that it runs fastest when I set:

dim3 blocks(1,1,1); dim3 threads(1,1,1);

It slows down when I add more blocks, and slows down even more when I add more threads. For example, with 2 million elements, 1 block with 1 thread per block takes 0.125 seconds, but 1 block with 32 threads per block takes 0.296 seconds. If I use 3907 blocks and 512 threads per block (so each element has its own thread), the time shoots up to 20 seconds. The CPU time (with the CUDA memory operations left in for the sake of comparison) is 0.109 seconds.
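For reference, the 3907 figure is what the usual ceiling division gives for 2,000,000 elements and 512 threads per block. A minimal sketch of that arithmetic, where the helper name `blocks_for` is made up for illustration and is not part of CUDA:

```cpp
// Hypothetical helper: smallest number of blocks such that
// blocks * threads_per_block >= n (integer ceiling division).
int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}
```

With n = 2000000 and 512 threads per block this returns 3907, which launches 2,000,384 threads. The 384 surplus threads are harmless because the kernel only touches elements with offset < N.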

I have a GeForce 460 SE card with 6 multiprocessors, maximum block dimensions 1024 (x) 1024 (y) and 64 (z), maximum grid dimensions 65535 (x) 65535 (y) and 1 (z), and warp size 32.

Any help would be very much appreciated.

Joy

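#include <stdlib.h>  // malloc, free

__global__ void add1darray(float *A, float *B, int N){
    // global thread coordinates in each dimension
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    int offset = y*blockDim.z + z + gridDim.y*blockDim.y*blockDim.z*x;

    // each thread handles several elements, stepping by the total thread count
    while (offset < N) {
        A[offset] += B[offset];
        offset += blockDim.x*blockDim.y*blockDim.z*gridDim.x*gridDim.y;
    }
}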

void addarrays(){
    int i;
    int N = 2000000;
    float *ad, *bd;                       // device arrays
    int size = N*sizeof(float);
    float *ah = (float*)malloc(size);     // host arrays
    float *bh = (float*)malloc(size);
    for (i=0; i<N; i++) {ah[i]=1.0f; }
    for (i=0; i<N; i++) {bh[i]=2.0f; }
    cudaMalloc((void **) &ad, size);
    cudaMemcpy(ad, ah, size, cudaMemcpyHostToDevice); //copy to device
    cudaMalloc((void **) &bd, size);
    cudaMemcpy(bd, bh, size, cudaMemcpyHostToDevice); //copy to device
    dim3 blocks(1,1,1); dim3 threads(1,1,1);
    add1darray<<<blocks,threads>>>(ad,bd,N);
    cudaMemcpy(ah, ad, size, cudaMemcpyDeviceToHost); //copy result back (waits for the kernel)
    cudaFree(ad); cudaFree(bd);           // release device memory
    free(ah); free(bh);                   // release host memory
}

Your offset calculation is a bit weird and gives you a bad memory access pattern.

int offset=y*blockDim.z+z+gridDim.y*blockDim.y*blockDim.z*x;

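With this formula, adjacent threads in a warp (consecutive values of x) access locations that are gridDim.y*blockDim.y*blockDim.z elements apart in memory. For any launch with more than one thread in y or z, those locations are far away from each other, the accesses cannot be coalesced, and you rely entirely on the cache to sort it out.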

Change that to

int offset = (z * gridDim.y * blockDim.y + y) * gridDim.x * blockDim.x + x;

and your code should run much faster.
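To see the difference between the two formulas without a GPU, here is a small host-only sketch that evaluates both for two threads that are neighbours in x. The `Dim3` struct and the function names are made up for illustration (they stand in for CUDA's built-in gridDim and blockDim), and gridDim.z is assumed to be 1, as on the card described above.

```cpp
// Hypothetical host-side stand-in for CUDA's dim3 built-ins.
struct Dim3 { int x, y, z; };

// Offset formula from the original kernel; x, y, z are global thread
// coordinates (blockIdx * blockDim + threadIdx in each dimension).
int offset_original(Dim3 grid, Dim3 block, int x, int y, int z) {
    return y * block.z + z + grid.y * block.y * block.z * x;
}

// Offset formula suggested in the reply.
int offset_fixed(Dim3 grid, Dim3 block, int x, int y, int z) {
    return (z * grid.y * block.y + y) * grid.x * block.x + x;
}

// Distance, in elements, between the locations touched by two threads
// that are adjacent in x (same y and z), for each formula.
int stride_original(Dim3 grid, Dim3 block) {
    return offset_original(grid, block, 1, 0, 0)
         - offset_original(grid, block, 0, 0, 0);
}

int stride_fixed(Dim3 grid, Dim3 block) {
    return offset_fixed(grid, block, 1, 0, 0)
         - offset_fixed(grid, block, 0, 0, 0);
}
```

For an example launch with a grid of (4,2,1) and blocks of (8,4,2), stride_original gives 16 while stride_fixed gives 1, so only the second formula lets neighbouring threads read neighbouring floats. For a purely one-dimensional launch such as grid (3907,1,1) and block (512,1,1), both formulas reduce to offset = x.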

Thank you! :) This is indeed much faster.