limit of computation

I'm writing an array addition program in CUDA that makes use of shared memory. It works fine for arrays of 60,000 elements each, but beyond that (e.g. 61,000) it doesn't give a proper result.

Is there any limit on the size of the arrays that we can compute? If so, what is the limit?
Can anybody help me out in this regard?

You really need to post your code so we can see what is wrong; such a general question is unanswerable.

It could be that you are running up against the maximum number of blocks, or something like that. Just post your code and we can have a look.

OK… here is the device function to add two arrays:

__global__ void add_in_gpu(int (*A)[DIM], int (*B1)[DIM], int (*C)[DIM])
{
    int threadx, thready, blockx, blocky;

    // block index
    blockx = blockIdx.x;
    blocky = blockIdx.y;

    // thread index
    threadx = threadIdx.x;
    thready = threadIdx.y;

    // shared memory variable
    __shared__ int shm[2];

    // addition of arrays A & B1
    shm[0] = (*A)[blockx*THX + threadx];
    shm[1] = (*B1)[blockx*THX + threadx];
    (*C)[blockx*THX + threadx] = shm[0] + shm[1];

    __syncthreads();
}

Here A and B1 are the two arrays to be added, with the result put in C. THX is the size of the block. Execution parameters are:

    dim3 threads(THX, THY);
    dim3 blocks(DIM/threads.x + 1, 1);

THX = 256, THY = 1, DIM = number of elements in the array

Maybe a stupid question, but what does the [DIM] do in this code?

__global__ void add_in_gpu(int(*A)[DIM],int(*B1)[DIM],int(*C)[DIM])

why not:

__global__ void add_in_gpu(int *A, int *B, int *C)

And I don't think the shared memory is needed here. I think you can just say C[index] = A[index] + B[index].

Also take a look at the .cubin file of your .cu to see if you go over some sort of boundary.

And while you are at it, take a look at cudaGetErrorString or something like that, to see whether your kernel gives you an error.

But besides that: if you divide 61,000 by 256 you get 238.28, so what are you doing with the remaining part? Try your code with multiples of 256 first and see if it still goes wrong, e.g. with an array of 65,536 elements.
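For illustration, here is a minimal sketch of the flat-pointer kernel described above. The n parameter and the bounds check are additions, not part of the original code: with DIM/THX + 1 blocks, the last block launches threads past the end of the array unless they are masked off.

__global__ void add_simple(const int *A, const int *B, int *C, int n)
{
    // one element per thread, flat 1D indexing
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // mask off the extra threads of the last block
    if (idx < n)
        C[idx] = A[idx] + B[idx];
}

A common way to round the grid size up is (DIM + THX - 1) / THX, which also avoids launching a fully idle extra block when DIM happens to be an exact multiple of THX.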

Thanks for the reply, jordy. Yes, you are right, I can use *A; it's just that I'm using a pointer to an array, so it really doesn't make any difference. And I'm making use of shared memory to increase the program's efficiency; if I use a plain A[index], the program takes more time to execute as the dimension of the array increases.

I checked the code with multiples of 256. It works fine for any value below 60,000, in fact it works for 60,928, but for 61,184 it gives the following pop-up message:

The instruction at “0x1000d6db” referenced memory at “0x000000004”.The memory could not be “read”.

click on OK to terminate the program

And I'm new to CUDA; I really don't know how to view the .cubin file.

After the kernel invocation, add the following:

    cudaError rc;
    rc = cudaThreadSynchronize();
    if (cudaSuccess != rc) {
        printf("Kernel launch failure, error: %s", cudaGetErrorString(rc));
    }
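A small addition on top of that (the calls are standard CUDA runtime API; rc is just reused from above): cudaThreadSynchronize() reports errors that occur while the kernel runs, while cudaGetLastError() also catches invalid launch configurations:

    // also check for launch-configuration errors (bad grid/block size, etc.)
    rc = cudaGetLastError();
    if (cudaSuccess != rc) {
        printf("Kernel launch error: %s\n", cudaGetErrorString(rc));
    }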

You can use "nvcc -cubin" to compile your code into a .cubin file, or you can ask nvcc to report resource usage by adding the "--ptxas-options=-v" command-line option.

    shm[0] = (*A)[blockx*THX + threadx];
    shm[1] = (*B1)[blockx*THX + threadx];
    (*C)[blockx*THX + threadx] = shm[0] + shm[1];

can be substituted by

    (*C)[blockx*THX + threadx] = (*A)[blockx*THX + threadx] + (*B1)[blockx*THX + threadx];

which should be faster (you will probably not notice, since your kernel is memory-bound anyway).

Also, your kernel cannot work correctly as written: all threads of a block overwrite the same two shared-memory slots at the same time, so what gets put in C can be the A and B values that belong to another thread of the same block.
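For what it's worth, a sketch of what race-free staging would look like, with each thread writing its own slot (hypothetical, and it still buys nothing here, since every value is read exactly once):

    // each thread stages into its own slot, so there is no race
    __shared__ int sA[THX];
    __shared__ int sB[THX];

    sA[threadx] = (*A)[blockx*THX + threadx];
    sB[threadx] = (*B1)[blockx*THX + threadx];
    __syncthreads();

    (*C)[blockx*THX + threadx] = sA[threadx] + sB[threadx];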

I had tried that; again the same problem, but this time it works only for 2,000 elements in the array. Beyond that it gives the following error:

ptxas error : Entry function '_Z10add_in_gpuPA2100_iS0_S0_S0_' uses too much shared data (0x20d0 bytes + 0x20f0 bytes system, 0x4000 max)

Please post your .cubin information.

This can be obtained by compiling your .cu file with "nvcc -cubin [filename].cu". There is something strange happening here, because you only use one shared array of two ints, so that should be 8 bytes :S

And if you do it in one line (the correct way), you should remove the __shared__ int shm[2]; definition.

But according to the error message you get, I think you are actually calling the kernel wrong. Can you post the line where you call the kernel? My guess is that you pass a variable as the third launch-configuration argument (which specifies the amount of shared memory used by the kernel that is not statically defined); when that gets too large, you can expect this error message.

I think it is best if you post the code you are currently using (including the code where you call the kernel).
Oh, and by the way: the way you wrote your code means you have to recompile every time your input size changes, which is not necessary at all.
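For reference, a sketch of the launch-time shared-memory mechanism described above (the kernel name and the staging through shared memory are illustrative only):

    // The optional third launch parameter gives the size in bytes of
    // dynamically allocated shared memory, declared as an unsized extern array.
    __global__ void add_dyn(const int *A, const int *B, int *C, int n)
    {
        extern __shared__ int buf[];          // sized by the launch configuration
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            buf[threadIdx.x] = A[i] + B[i];   // stage the sum in shared memory
            C[i] = buf[threadIdx.x];
        }
    }

    // launch with THX * sizeof(int) bytes of dynamic shared memory per block:
    // add_dyn<<<grids, threads, THX * sizeof(int)>>>(device_a, device_b, device_c, DIM);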

But I have only two arguments in my launch configuration; I'm not dynamically allocating the shared memory:

add_in_gpu<<<grids,threads>>>(device_a,device_b,device_c);

Thanks, everybody.

Well, then you still have a shared-memory definition in your kernel, since you get the error at compile time.

Here goes my complete code for your reference:

/* This program implements the addition of two arrays using threads in the GPU. */

// includes, system
#include <stdio.h>
#include <cutil.h>

// define the dimensions
#define DIM 2000 // array dimension
#define THX 8
#define THY 1

// Device code
__global__ void add_in_gpu(int (*A)[DIM], int (*B1)[DIM], int (*C)[DIM])
{
    int threadx, thready, blockx, blocky;

    // block index
    blockx = blockIdx.x;
    blocky = blockIdx.y;

    // thread index
    threadx = threadIdx.x;
    thready = threadIdx.y;

    // variables in shared memory
    __shared__ int shm1[blockx*THX + threadx];
    __shared__ int shm2[blockx*THX + threadx];

    shm1[blockx*THX + threadx] = (*A)[blockx*THX + threadx];
    shm2[blockx*THX + threadx] = (*B1)[blockx*THX + threadx];
    __syncthreads();

    (*C)[blockx*THX + threadx] = shm1[blockx*THX + threadx] + shm2[blockx*THX + threadx];
    __syncthreads();
}

/************************************** Main Program ********************************************/

int main()
{
    // define the grid and threads
    dim3 threads(THX, THY);
    dim3 grids(DIM/threads.x + 1, 1);

    // define dimensions
    int (*device_a)[DIM];
    int (*device_b)[DIM];
    int (*device_c)[DIM];
    int A[DIM];
    int B[DIM], C[DIM], P[DIM];
    int i, iter = 50;

    // create the timer
    unsigned int timer = 0;
    CUT_SAFE_CALL(cutCreateTimer(&timer));

    // initialize the arrays A & B
    for (i = 0; i < DIM; i++)
    {
        A[i] = i + 1;
        B[i] = i + 2;
    }

    // print the arrays A & B
    printf("\n Array A\n\n");
    for (i = 0; i < DIM; i++)
        printf("\t%d", A[i]);
    printf("\n");

    printf("\n Array B\n\n");
    for (i = 0; i < DIM; i++)
        printf("\t%d", B[i]);

    // allocate memory on the GPU
    int size = sizeof(int) * DIM;
    cudaMalloc((void**)&device_a, size);
    cudaMalloc((void**)&device_b, size);
    cudaMalloc((void**)&device_c, size);

    // copy memory from host to device
    cudaMemcpy(device_a, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(device_b, B, size, cudaMemcpyHostToDevice);

    // start the timer and specify the number of iterations
    CUT_SAFE_CALL(cutStartTimer(timer));
    for (int i = 0; i < iter; i++)
    {
        // invoke the kernel
        add_in_gpu<<<grids,threads>>>(device_a, device_b, device_c);
    }

    // stop the timer and fetch the timer value
    CUT_SAFE_CALL(cutStopTimer(timer));

    // copy the result back to the host
    cudaMemcpy(C, device_c, size, cudaMemcpyDeviceToHost);

    // print the resultant array
    printf("\n");
    printf("\n The sum of two arrays in GPU\n\n");
    for (i = 0; i < DIM; i++)
    {
        printf("\t%d", C[i]);
    }
    printf("\n\nGPU Processing time: %f (ms)\n", cutGetTimerValue(timer));
    printf("\n");

    // free device memory
    cudaFree(device_a);
    cudaFree(device_b);
    cudaFree(device_c);
}

thank you all once again.

Again, you are using shared memory. You should not, because you do not need it, and you are using it the wrong way: threadIdx.x and blockIdx.x are only known at runtime, while a shared array declared like this must have a size known at compile time (a legal static declaration would be something like __shared__ int shm1[THX];).

If you change your kernel code to:

__global__ void add_in_gpu(int (*A)[DIM], int (*B1)[DIM], int (*C)[DIM])
{
    int threadx, thready, blockx, blocky;

    // block index
    blockx = blockIdx.x;
    blocky = blockIdx.y;

    // thread index
    threadx = threadIdx.x;
    thready = threadIdx.y;

    (*C)[blockx*THX + threadx] = (*A)[blockx*THX + threadx] + (*B1)[blockx*THX + threadx];
    __syncthreads();
}

everything will work like it should.

You use shared memory when:

  • you need to communicate between the threads of a block, or

  • you need to access the same value from global memory from several threads in a block (see the sketch below).
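To make the second case concrete, here is a hypothetical sketch (not from this thread) of a 1D smoothing kernel where shared memory does pay off, because each loaded value is reused by up to three threads of the block:

    // each block stages THX elements plus a one-element halo on each side;
    // every staged value is then read by up to three neighbouring threads
    __global__ void smooth(const int *in, int *out, int n)
    {
        __shared__ int tile[THX + 2];

        int g = blockIdx.x * blockDim.x + threadIdx.x;   // global index
        int l = threadIdx.x + 1;                         // local index, offset by halo

        tile[l] = (g < n) ? in[g] : 0;                   // pad out-of-range with zeros
        if (threadIdx.x == 0)
            tile[0] = (g > 0) ? in[g - 1] : 0;           // left halo
        if (threadIdx.x == THX - 1)
            tile[THX + 1] = (g + 1 < n) ? in[g + 1] : 0; // right halo
        __syncthreads();

        if (g < n)
            out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3;
    }

Without shared memory, each element would be fetched from global memory up to three times; with the staging tile it is fetched once per block.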

You are right, Denis. I tried the code that you gave me and it is working fine, but I have two issues here:

  1. Again the program is limited in the size it can handle, this time to 500,000 elements in the array.

  2. And my main concern here is to improve the efficiency of the program by making use of shared memory. The matrix multiplication program among the sample projects is written in a similar way, and I'm just trying to implement the same concept. Isn't that possible?

thank you,

Here is the .cubin info, jordy:

for array dimension: 200

architecture {sm_10}

abiversion {0}

code {

name = _Z10add_in_gpuPA200_iS0_S0_S0_

lmem = 0

smem = 1632

reg = 4

bar = 0

bincode  {

	0x10000005 0x0403c780 0x10004c01 0x0023c780 

	0xa0000405 0x04000780 0x40008101 0x00000013 

	0x20000209 0x04000780 0x30020401 0xc4100780 

	0x2000c805 0x04200780 0xd00e0205 0x80c00780 

	0x00020409 0xc0000780 0x2000ca0d 0x04200780 

	0x08001001 0xe4204780 0xd00e0605 0x80c00780 

	0x20108409 0x0000000f 0x00020405 0xc0000780 

	0x0801a001 0xe4204780 0x861ffe03 0000000000 

	0x00000009 0xc0000780 0x1500e004 0x2100ec00 

	0x2800d005 0x04204780 0xd00e0005 0xa0c00780 

	0x861ffe03 0000000000 0xf0000001 0xe0000001 

}

}

for array dimension: 3000

architecture {sm_10}

abiversion {0}

code {

name = _Z10add_in_gpuPA3000_iS0_S0_S0_

lmem = 0

smem = 24032

reg = 4

bar = 0

bincode  {

	0x10000005 0x0403c780 0x10004c01 0x0023c780 

	0xa0000405 0x04000780 0x40008101 0x00000013 

	0x20000209 0x04000780 0x30020401 0xc4100780 

	0x2000c805 0x04200780 0xd00e0205 0x80c00780 

	0x00020409 0xc0000780 0x2000ca0d 0x04200780 

	0x08001001 0xe4204780 0xd00e0605 0x80c00780 

	0x20008409 0x000000bf 0x00020405 0xc0000780 

	0x08178001 0xe4204780 0x861ffe03 0000000000 

	0x00000009 0xc0000780 0x1500e004 0x2100ec00 

	0x2800d005 0x04204780 0xd00e0005 0xa0c00780 

	0x861ffe03 0000000000 0xf0000001 0xe0000001 

}

}

thank you,

This basically means that your CPU program is having some problem.

Take a look at your CPU code. Probably you are overshooting a memcpy or something like that…
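One plausible culprit at these sizes (an assumption, since the thread never confirms it, but consistent with the host-side read error quoted earlier): A, B, C and P are stack arrays, and four arrays of roughly 61,000 ints come close to the default 1 MB stack on Windows. A sketch of heap-allocating them instead:

    // heap-allocate the host arrays instead of declaring int A[DIM]; on the
    // stack; four stack arrays of ~61,000 ints are ~1 MB, enough to overflow
    // the default Windows stack
    int *A = (int *)malloc(DIM * sizeof(int));
    int *B = (int *)malloc(DIM * sizeof(int));
    int *C = (int *)malloc(DIM * sizeof(int));
    if (!A || !B || !C) {
        fprintf(stderr, "host allocation failed\n");
        return 1;
    }

    /* ... use the arrays as before; they decay to int* just like before ... */

    free(A);
    free(B);
    free(C);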

You're most likely exceeding the maximum size of the block and/or grid. Blocks cannot be larger than 512 threads, and the maximum grid size is 65535 in each dimension (the grid is 2D).

Your code won't benefit from using shared memory, because each value is read only once.

The program works fine for 80,000 elements in the array, but beyond 80,000 it gives unpredictable results, and for 90,000 it doesn't execute at all. For 90,000 elements that is only 352 blocks, so why doesn't it execute?

And when is it most preferable to make use of shared memory?

The cubin content for array dimension = 80,000 is:

architecture {sm_10}

abiversion {0}

code {

name = _Z10add_in_gpuPA80000_iS0_S0_S0_

lmem = 0

smem = 32

reg = 3

bar = 0

bincode  {

	0x10000005 0x0403c780 0x10004c01 0x0023c780 

	0xa0000405 0x04000780 0x40008101 0x00000013 

	0x20000201 0x04000780 0x30020009 0xc4100780 

	0x2102e804 0x2102ea00 0xd00e0205 0x80c00780 

	0xd00e0001 0x80c00780 0x20008200 0x2102ec04 

	0xd00e0201 0xa0c00780 0x861ffe03 0000000000 

	0xf0000001 0xe0000001 

}

}

and for array dimension = 90,000

architecture {sm_10}

abiversion {0}

code {

name = _Z10add_in_gpuPA90000_iS0_S0_S0_

lmem = 0

smem = 32

reg = 3

bar = 0

bincode  {

	0x10000005 0x0403c780 0x10004c01 0x0023c780 

	0xa0000405 0x04000780 0x40008101 0x00000013 

	0x20000201 0x04000780 0x30020009 0xc4100780 

	0x2102e804 0x2102ea00 0xd00e0205 0x80c00780 

	0xd00e0001 0x80c00780 0x20008200 0x2102ec04 

	0xd00e0201 0xa0c00780 0x861ffe03 0000000000 

	0xf0000001 0xe0000001 

}

}

thank you all.