__global__ void add_in_gpu(int *A, int *B, int *C)
And I don’t think shared memory is needed here. I think you can just say C[index] = A[index] + B[index];
Also take a look at the .cubin file generated from your .cu to see if you are going over some resource boundary.
And while you are at it, take a look at cudaGetErrorString (or something like that) to check whether your kernel gives you an error.
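For example, a minimal check after the launch could look like this (a sketch; the kernel name, launch configuration and device pointers are placeholders for whatever your code actually uses):

```cuda
// Sketch: error checking around a kernel launch.
add_in_gpu<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C);

cudaError_t err = cudaGetLastError();      // error from the launch itself
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

err = cudaThreadSynchronize();             // wait for the kernel, catch execution errors
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
```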
But besides that: if you divide 61000 by 257 you get 237.35, so what are you doing with the fractional part? Try your code with multiples of 256 first and see if it still goes wrong, e.g. an array of 65536 elements.
Thanks for the reply, jordy. Yes, you’re right, I can use *A; it’s just that I’m using a pointer to an array, so it really doesn’t make any difference. And I’m making use of shared memory to improve the program’s efficiency: if I use a plain A[index], the program takes more time to execute as the dimension of the array increases.
I checked the code with multiples of 256 and it works fine for any value below 60000. In fact it works for 60928, but for 61184 it gives the following pop-up message:
The instruction at “0x1000d6db” referenced memory at “0x000000004”.The memory could not be “read”.
click on OK to terminate the program
And I’m new to CUDA; I really don’t know how to view the .cubin file.
Using the plain one-line version should also be faster (you will probably not notice, since your kernel is memory-bound anyway).
Also, your kernel cannot work correctly: all threads of a block are overwriting the same shared-memory location at the same time, so what ends up in C can come from the A and B values that another thread of the same block was supposed to handle.
And if you do it in one line (the correct way), you should remove the __shared__ int shm; definition.
But judging from the error message you get, I think you are actually calling the kernel wrongly. Can you post the line where you call the kernel? My guess is that you pass a variable as the third launch-configuration argument, which specifies the amount of shared memory that is not statically defined in the kernel. When that value gets too large, you can expect exactly your error message.
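For reference, a launch with that optional third argument looks like this (a sketch; the variable names are illustrative):

```cuda
// The third <<<>>> argument is the number of BYTES of dynamically
// allocated shared memory per block; it backs any `extern __shared__`
// array in the kernel. An oversized value here makes the launch fail.
size_t shmBytes = threadsPerBlock * sizeof(int);
add_in_gpu<<<numBlocks, threadsPerBlock, shmBytes>>>(d_A, d_B, d_C);
```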
I think it is best if you post the code you are currently using (including the line where you call the kernel).
Oh, and by the way: the way you wrote your code means you have to recompile every time your input size changes, which is not necessary at all.
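A sketch of what that could look like: pass the element count as a kernel argument instead of baking it in with a #define, so the same binary works for any size (the names here are illustrative):

```cuda
__global__ void add_in_gpu(int *A, int *B, int *C, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n)                 // guard: the last block may run past the end
        C[index] = A[index] + B[index];
}

// Host side: n can now come from argv, a file, etc. -- no recompile needed.
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
add_in_gpu<<<numBlocks, threadsPerBlock>>>(d_A, d_B, d_C, n);
```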
/* This program implements the addition of two arrays using threads in the GPU. */
// includes, system
// define the dimensions
#define DIM 2000 // Array Dimension
#define THX 8
#define THY 1
// Device Code
__global__ void add_in_gpu(int(*A)[DIM],int(*B1)[DIM],int(*C)[DIM])
// block index
//variables in shared memory
__shared__ int shm1[blockx*THX+threadx];
__shared__ int shm2[blockx*THX+threadx];
/************************************** Main Program ********************************************/
//Define the Grids and Threads
// create the timer
unsigned int timer=0;
// initialize the arrays A & B
// print the arrays A & B
printf("\n Array A\n\n");
printf("\n Array B\n\n");
//ALLOCATE MEMORY IN GPU
//COPY MEMORY FROM HOST TO DEVICE
// start the timer and specify the no of iterations
// INVOKING KERNEL
// stop the timer and fetch the timer value
// Result is copied to Host
// printing the resultant array
printf("\n The sum of two arrays in GPU\n\n");
printf("\n\nGPU Processing time: %f (ms)\n",(cutGetTimerValue(timer)));
//Free Device and Host Memory
Again: you are using shared memory when you should not, because you do not need it, and you are using it the wrong way (threadIdx.x and blockIdx.x are only known at runtime, while with a static allocation like this the size has to be known at compile time).
If you change your kernel code to:
__global__ void add_in_gpu(int(*A)[DIM],int(*B1)[DIM],int(*C)[DIM])
// block index
(*C)[blockx*THX+threadx] = (*A)[blockx*THX+threadx] + (*B1)[blockx*THX+threadx];
everything will work like it should.
You use shared memory when:
You need to communicate between the threads of a block
You need to access the same value from global memory from several threads in a block.
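As an illustration of the first case (threads of a block communicating), a block-wide sum is a typical use. This is only a sketch, not related to the addition kernel above; note every thread writes its own slot, so there are no races:

```cuda
#define BLOCK 256

__global__ void block_sum(const int *in, int *blockResults)
{
    __shared__ int buf[BLOCK];             // one slot per thread
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * BLOCK + tid];
    __syncthreads();

    // Tree reduction: each step halves the number of active threads.
    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        blockResults[blockIdx.x] = buf[0];  // one partial sum per block
}
```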
You’re right, Denis. I tried the code you gave me and it works fine, but I have two issues here:
Again, the program is limited to 500000 elements in the array.
And my main concern here is to improve the efficiency of the program by making use of shared memory. In the sample projects, the matrix multiplication program is written in a similar way; I’m just trying to apply the same concept. Isn’t that possible?
The program works fine for 80,000 elements in the array, but beyond 80,000 it gives unpredictable results, and for 90,000 it doesn’t execute at all. For 90,000 elements that is only 352 blocks, so why doesn’t it execute?
And when is it most preferable to make use of shared memory?
The cubin content for array dimension = 80,000 is: