Hello,
I’m working on a simple CUDA example with two 1-dimensional arrays, A and B. I fill A with 1s, fill B with 2s, and compute A[i] = A[i] + B[i]. The code works, but the problem is that it is fastest when I set:
dim3 blocks(1,1,1); dim3 threads(1,1,1);
It slows down when I add more blocks, and slows down dramatically when I add more threads. For example, with 2 million elements, 1 block and 1 thread per block takes 0.125 seconds, but 1 block and 32 threads per block takes 0.296 seconds. If I launch 3907 blocks of 512 threads per block (so each element gets its own thread), the time shoots up to 20 seconds. The CPU time (with the CUDA memory operations left in for the sake of comparison) is 0.109 seconds.
I have a GeForce 460 SE card with 6 multiprocessors, maximum block dimensions 1024 (x) × 1024 (y) × 64 (z), maximum grid dimensions 65535 (x) × 65535 (y) × 1 (z), and a warp size of 32.
Any help would be very much appreciated.
Joy
–
__global__ void add1darray(float *A, float *B, int N) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    int offset = y * blockDim.z + z + gridDim.y * blockDim.y * blockDim.z * x;
    while (offset < N) {
        A[offset] += B[offset];
        offset += blockDim.x * blockDim.y * blockDim.z * gridDim.x * gridDim.y;
    }
}
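For comparison, here is what the same kernel looks like flattened to the canonical 1-D grid-stride pattern (same A/B/N signature as above; this is a sketch of the standard idiom, not my original code). With this indexing, consecutive threads in a warp touch consecutive array elements, so the loads and stores can coalesce, and the stride between iterations of one thread is the total thread count:

```cuda
#include <cstdio>

// 1-D grid-stride loop: thread i handles elements i, i + stride, i + 2*stride, ...
// where stride = total number of threads in the launch.
__global__ void add1darray_1d(float *A, float *B, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's first element
    int stride = gridDim.x * blockDim.x;             // total threads in the grid
    for (; i < N; i += stride)
        A[i] += B[i];                                // adjacent threads -> adjacent elements
}
```

A launch such as `add1darray_1d<<<3907, 512>>>(ad, bd, N);` then gives every warp a contiguous 32-element slice of the arrays per iteration, which is the access pattern the memory system is built for.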
void addarrays() {
    int i;
    int N = 2000000;
    float *ad, *bd;
    int size = N * sizeof(float);
    float *ah = (float *)malloc(size);
    float *bh = (float *)malloc(size);
    for (i = 0; i < N; i++) { ah[i] = 1.0f; }
    for (i = 0; i < N; i++) { bh[i] = 2.0f; }
    cudaMalloc((void **)&ad, size);
    cudaMemcpy(ad, ah, size, cudaMemcpyHostToDevice); // copy to device
    cudaMalloc((void **)&bd, size);
    cudaMemcpy(bd, bh, size, cudaMemcpyHostToDevice); // copy to device
    dim3 blocks(1, 1, 1);
    dim3 threads(1, 1, 1);
    add1darray<<<blocks, threads>>>(ad, bd, N);
    cudaMemcpy(ah, ad, size, cudaMemcpyDeviceToHost); // copy result back
    cudaFree(ad); cudaFree(bd); // release device memory
    free(ah); free(bh);         // release host memory
}
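In case the measurement method matters: here is a minimal sketch of how just the kernel can be timed with CUDA events (excluding the memcpys), assuming the same ad/bd/N setup as in addarrays() above. Since kernel launches are asynchronous, timing them with a host clock without a synchronize can mis-measure:

```cuda
// Fragment intended to replace the bare kernel launch inside addarrays().
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                 // mark start on the default stream
add1darray<<<blocks, threads>>>(ad, bd, N);
cudaEventRecord(stop, 0);                  // mark stop after the kernel
cudaEventSynchronize(stop);                // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

This isolates the kernel from the host-device transfers, which makes the block/thread comparisons easier to interpret.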