I'm writing an array-addition program in CUDA that makes use of shared memory. It works fine for arrays of 60,000 elements each, but beyond that (e.g. 61,000) it no longer gives a correct result.
Is there a limit on the size of the arrays we can compute? If yes, what is the limit?
Can anybody help me out in this regard?
__global__ void add_in_gpu(int *A, int *B, int *C)
And I don’t think the shared memory is needed here. I think you can just say C[index] = A[index] + B[index]
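For example, the whole kernel can be as small as this (a minimal sketch; passing the element count n as a parameter is my addition, your version hard-codes the size at compile time):

__global__ void add_in_gpu(int *A, int *B, int *C, int n)
{
    // one element per thread: global index built from block and thread indices
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n)                               // guard the last, partially filled block
        C[index] = A[index] + B[index];
}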
Also take a look at the .cubin file of your .cu to see whether you are exceeding some resource limit.
And while you are at it, take a look at cudaGetLastError / cudaGetErrorString to see whether your kernel is returning an error.
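For example (a sketch; cudaGetLastError catches launch errors, and with the current runtime cudaThreadSynchronize returns errors raised during kernel execution):

add_in_gpu<<<grids, threads>>>(device_a, device_b, device_c);
cudaError_t err = cudaGetLastError();            // a bad launch configuration shows up here
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));
err = cudaThreadSynchronize();                   // wait for the kernel to finish
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));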
But besides that: if you divide 61,000 by 256 you get about 238.3, so what are you doing with the leftover elements? Try your code with multiples of 256 first and see if it still goes wrong, e.g. with an array of 65,536 elements.
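Once the multiples of 256 work, the general case is just a matter of rounding the block count up on the host and letting the kernel skip the overshoot (a sketch; n and the device pointers dev_a/dev_b/dev_c are assumed to be set up elsewhere, and the kernel takes n as in the sketch above):

int n = 61000;                                             // any size, not a multiple of 256
int threadsPerBlock = 256;
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceiling division: 239 blocks here
add_in_gpu<<<blocks, threadsPerBlock>>>(dev_a, dev_b, dev_c, n);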
Thanks for the reply, jordy. Yes, you are right, I could use *A; it's just that I'm using a pointer to an array, so it doesn't really make a difference. And I'm making use of shared memory to increase the program's efficiency: if I use a plain A[index], the program takes more time to execute as the dimension of the array increases.
I checked the code with multiples of 256. It works fine for any value below 60,000; in fact it works for 60,928, but for 61,184 it gives the following pop-up message:
The instruction at “0x1000d6db” referenced memory at “0x000000004”. The memory could not be “read”.
Click on OK to terminate the program.
And I'm new to CUDA; I really don't know how to look at a .cubin file.
You can use “nvcc -cubin” to compile your code into a .cubin file, or you can ask nvcc to report resource usage by adding the “--ptxas-options=-v” command-line option.
That should also be faster (though you will probably not notice, since your kernel is memory-bound anyway).
Also, your kernel cannot work correctly as written: all threads of a block overwrite the same shared-memory locations at the same time, so the values that end up in C can come from the A and B elements belonging to another thread of the same block.
You can get this by compiling your .cu file with “nvcc -cubin [filename].cu”. There is something strange going on here, because you only use one shared array of 2 ints, so that is 8 bytes :S
And if you do it in one line (the correct way), you should remove the __shared__ int shm[2]; definition.
But judging from the error message you get, I think you are actually calling the kernel wrongly. Can you post the line where you call the kernel? My guess is that you pass a variable as the third argument of the launch configuration (which specifies the amount of shared memory that is not statically defined in the kernel); when that gets too large, you can expect exactly that error message.
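To illustrate: shared memory that is not statically defined is declared with extern __shared__ and sized by that third launch argument (a sketch; the names are made up, and the staging is pointless for a plain add, it only shows the mechanism):

__global__ void add_dyn(int *A, int *B, int *C, int n)
{
    extern __shared__ int buf[];                 // size set by the 3rd launch argument
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        buf[threadIdx.x] = A[i] + B[i];          // stage the sum in shared memory
        C[i] = buf[threadIdx.x];
    }
}
// launch with 256 ints of dynamic shared memory per block; if this byte count gets
// bigger than the 16 KB of shared memory available per block, the launch fails:
// add_dyn<<<blocks, 256, 256 * sizeof(int)>>>(dev_a, dev_b, dev_c, n);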
I think it is best if you post the code you are currently using (including the code where you call the kernel)
Oh, and by the way: the way you wrote your code means you have to recompile every time your input size changes, which is not necessary at all (you can pass the size as a kernel argument instead of a #define).
__global__ void add_in_gpu(int(*A)[DIM],int(*B1)[DIM],int(*C)[DIM])
{
    int threadx,thready,blockx,blocky;
    // block index
    blockx=blockIdx.x;
    blocky=blockIdx.y;
    // thread index
    threadx=threadIdx.x;
    thready=threadIdx.y;
    // variables in shared memory
    __shared__ int shm1[blockx*THX+threadx];
    __shared__ int shm2[blockx*THX+threadx];
    shm1[blockx*THX+threadx]=(*A)[blockx*THX+threadx];
    shm2[blockx*THX+threadx]=(*B1)[blockx*THX+threadx];
    (*C)[blockx*THX+threadx]=shm1[blockx*THX+threadx]+shm2[blockx*THX+threadx];
}
/************************************** Main Program ********************************************/
int main()
{
    // define the grids and threads
    dim3 threads(THX,THY);
    dim3 grids(DIM/threads.x+1,1);
    // device pointers and host arrays
    int (*device_b)[DIM];
    int (*device_a)[DIM];
    int (*device_c)[DIM];
    int A[DIM];
    int B[DIM],C[DIM],P[DIM];
    int i,iter=50;
    // create the timer
    unsigned int timer=0;
    CUT_SAFE_CALL(cutCreateTimer(&timer));
    // initialize the arrays A & B
    for(i=0;i<DIM;i++)
    {
        A[i]=i+1;
        B[i]=i+2;
    }
    // print the arrays A & B
    printf("\n Array A\n\n");
    for(i=0;i<DIM;i++)
        printf("\t%d",A[i]);
    printf("\n");
    printf("\n Array B\n\n");
    for(i=0;i<DIM;i++)
        printf("\t%d",B[i]);
    // allocate memory on the GPU
    int size=sizeof(int)*DIM;
    cudaMalloc((void**)&device_a,size);
    cudaMalloc((void**)&device_b,size);
    cudaMalloc((void**)&device_c,size);
    // copy memory from host to device
    cudaMemcpy(device_a,A,size,cudaMemcpyHostToDevice);
    cudaMemcpy(device_b,B,size,cudaMemcpyHostToDevice);
    // start the timer and specify the number of iterations
    CUT_SAFE_CALL(cutStartTimer(timer));
    for(int i=0;i<iter;i++)
    {
        // invoke the kernel
        add_in_gpu<<<grids,threads>>>(device_a,device_b,device_c);
    }
    // stop the timer and fetch the timer value
    CUT_SAFE_CALL(cutStopTimer(timer));
    // copy the result back to the host
    cudaMemcpy(C,device_c,size,cudaMemcpyDeviceToHost);
    // print the resulting array
    printf("\n");
    printf("\n The sum of two arrays in GPU\n\n");
    for(i=0;i<DIM;i++)
    {
        printf("\t%d",C[i]);
    }
    printf("\n\nGPU Processing time: %f (ms)\n",(cutGetTimerValue(timer)));
    printf("\n");
    // free device memory
    cudaFree(device_a);
    cudaFree(device_b);
    cudaFree(device_c);
    return 0;
}
Again, you are using shared memory. You should not, because you do not need it, and you are using it the wrong way: threadIdx.x and blockIdx.x are only known at runtime, while the size of a statically allocated __shared__ array has to be known at compile time.
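In other words, a statically allocated __shared__ array needs a compile-time constant size; the runtime indices only select a slot inside it. If you insist on shared memory, the legal form would look like this (a sketch, assuming THX is your block-width #define and blockDim.x == THX; note it still buys you nothing for a plain add):

__global__ void add_with_shm(int(*A)[DIM], int(*B1)[DIM], int(*C)[DIM])
{
    __shared__ int shm1[THX];                    // size is a compile-time constant
    __shared__ int shm2[THX];
    int idx = blockIdx.x * THX + threadIdx.x;    // runtime index just picks the slot
    shm1[threadIdx.x] = (*A)[idx];               // each thread fills its own slot
    shm2[threadIdx.x] = (*B1)[idx];
    (*C)[idx] = shm1[threadIdx.x] + shm2[threadIdx.x];
}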
If you change your kernel code to:
__global__ void add_in_gpu(int(*A)[DIM],int(*B1)[DIM],int(*C)[DIM])
{
    int threadx,thready,blockx,blocky;
    // block index
    blockx=blockIdx.x;
    blocky=blockIdx.y;
    // thread index
    threadx=threadIdx.x;
    thready=threadIdx.y;
    (*C)[blockx*THX+threadx]=(*A)[blockx*THX+threadx]+(*B1)[blockx*THX+threadx];
    __syncthreads();
}
everything will work like it should.
You use shared memory when:
You need to communicate between the threads of a block.
You need to access the same value from global memory from several threads in a block (see the sketch below for an example of this second case).
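A 3-point stencil is an example of that second case: every input element is read by up to three different threads, so staging one tile per block in shared memory replaces three global-memory reads with one (my own sketch, assuming a block size of 256):

__global__ void stencil3(const int *in, int *out, int n)
{
    __shared__ int tile[256 + 2];                      // one block of data plus a halo cell on each side
    int g = blockIdx.x * blockDim.x + threadIdx.x;     // global index
    int l = threadIdx.x + 1;                           // local index, shifted past the left halo
    tile[l] = (g < n) ? in[g] : 0;                     // zero-pad past the end of the array
    if (threadIdx.x == 0)
        tile[0] = (g > 0) ? in[g - 1] : 0;             // left halo
    if (threadIdx.x == blockDim.x - 1)
        tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0;     // right halo
    __syncthreads();                                   // all loads must finish before neighbours are read
    if (g < n)
        out[g] = tile[l - 1] + tile[l] + tile[l + 1];  // each element now comes from shared memory
}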
You are right, Denis. I tried the code you gave me and it works fine, but I have two issues here:
Again the program is limited, this time to 500,000 elements in the array.
And my main concern here is to improve the efficiency of the program by making use of shared memory. The matrix multiplication program in the sample projects is written in a similar way; I'm just trying to implement the same concept. Isn't that possible?
You're most likely exceeding the maximum size of a block and/or grid. Blocks cannot be larger than 512 threads, and the maximum grid size is 65535 in each dimension (the grid is 2D).
Your code won't benefit from using shared memory because each value is read from global memory only once. In matrix multiplication, by contrast, every input element is reused by many threads of a block, which is why that sample stages tiles in shared memory first.
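As for the size limit: if the grid would need more than 65535 blocks in one dimension, you can spread them over both grid dimensions and rebuild a linear index in the kernel (a sketch; the index scheme and names are my own):

__global__ void add_2dgrid(int *A, int *B, int *C, int n)
{
    // rebuild a linear block number from the 2D grid
    int block = blockIdx.y * gridDim.x + blockIdx.x;
    int i = block * blockDim.x + threadIdx.x;
    if (i < n)                        // surplus blocks in the last grid row fall through here
        C[i] = A[i] + B[i];
}
// host side:
// int blocks = (n + 255) / 256;
// dim3 grids(blocks < 65535 ? blocks : 65535, (blocks + 65534) / 65535);
// add_2dgrid<<<grids, 256>>>(dev_a, dev_b, dev_c, n);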
The program works fine for 80,000 elements in the array, but beyond 80,000 it gives unpredictable results, and for 90,000 it doesn't execute at all. For 90,000 elements that is only 352 blocks, so why doesn't it execute?
And when is it most preferable to make use of shared memory?
The cubin content for array dimension = 80,000 is: