limit of computation

I'm writing an array addition program in CUDA that makes use of shared memory. It works fine for arrays of 60,000 elements each, but beyond that (e.g. 61,000) it doesn't give a proper result.

Is there any limit on the size of the arrays that we can compute? If so, what is the limit?
Can anybody help me out in this regard?

You really need to post your code so we can see what is wrong; such a general question is unanswerable.

It could be that you are running up against the maximum number of blocks, or something like that. Just post your code and we can have a look.

OK… here is the device function to add two arrays:

__global__ void add_in_gpu(int (*A)[DIM], int (*B1)[DIM], int (*C)[DIM])
{
    int threadx, thready, blockx, blocky;

    // block index
    blockx = blockIdx.x;
    blocky = blockIdx.y;

    // thread index
    threadx = threadIdx.x;
    thready = threadIdx.y;

    // shared memory variable
    __shared__ int shm[2];

    // addition of arrays A & B1
    shm[0] = (*A)[blockx*THX + threadx];
    shm[1] = (*B1)[blockx*THX + threadx];
    (*C)[blockx*THX + threadx] = shm[0] + shm[1];

    __syncthreads();
}

Here A and B1 are the two arrays to be added, with the result put in C. THX is the size of the block. Execution parameters are:

    dim3 threads(THX, THY);
    dim3 blocks(DIM/threads.x + 1, 1);

THX = 256, THY = 1, DIM = number of elements in the array

Maybe a stupid question, but what does the [DIM] do in this code?

__global__ void add_in_gpu(int(*A)[DIM],int(*B1)[DIM],int(*C)[DIM])

why not:

__global__ void add_in_gpu(int *A, int *B, int *C)

And I don't think the shared memory is needed here. I think you can just say C[index] = A[index] + B[index].

Also take a look at the .cubin file of your .cu to see if you go over some sort of boundary.

And while you are at it, take a look at cudaGetErrorString or something like that, to see whether your kernel gives you an error.

But besides that: if you divide 61,000 by 256 you get 238.28, so what are you doing with the remaining part? Try your code with multiples of 256 first and see if it still goes wrong, e.g. with an array of 65,536 elements.
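For illustration, here is a minimal sketch of the flat-pointer kernel described above. The n parameter and the bounds check are additions, not part of the original code: with DIM/THX + 1 blocks, the last block launches threads past the end of the array unless they are masked off.

__global__ void add_simple(const int *A, const int *B, int *C, int n)
{
    // one element per thread, flat 1D indexing
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // mask off the extra threads of the last block
    if (idx < n)
        C[idx] = A[idx] + B[idx];
}

A common way to round the grid size up is (DIM + THX - 1) / THX, which also avoids launching a fully idle extra block when DIM happens to be an exact multiple of THX.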

Thanks for the reply, jordy. Yes, you are right, I can use *A; it's just that I'm using a pointer to an array, so it really doesn't make any difference. And I'm making use of shared memory to increase the program's efficiency; if I use a plain A[index], the program takes more time to execute as the dimension of the array increases.

I checked the code with multiples of 256. It works fine for any value below 60,000, in fact it works for 60,928, but for 61,184 it gives the following pop-up message:

The instruction at “0x1000d6db” referenced memory at “0x000000004”.The memory could not be “read”.

click on OK to terminate the program

And I'm new to CUDA; I really don't know how to view the .cubin file.

After the kernel invocation, add the following:

    cudaError rc;
    rc = cudaThreadSynchronize();
    if (cudaSuccess != rc) {
        printf("Kernel launch failure, error: %s", cudaGetErrorString(rc));
    }
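A small addition on top of that (the calls are standard CUDA runtime API; rc is just reused from above): cudaThreadSynchronize() reports errors that occur while the kernel runs, while cudaGetLastError() also catches invalid launch configurations:

    // also check for launch-configuration errors (bad grid/block size, etc.)
    rc = cudaGetLastError();
    if (cudaSuccess != rc) {
        printf("Kernel launch error: %s\n", cudaGetErrorString(rc));
    }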

You can use "nvcc -cubin" to compile your code into a .cubin file, or you can ask nvcc to report resource usage by adding the "--ptxas-options=-v" command-line option.

    shm[0] = (*A)[blockx*THX + threadx];
    shm[1] = (*B1)[blockx*THX + threadx];
    (*C)[blockx*THX + threadx] = shm[0] + shm[1];

can be substituted by

    (*C)[blockx*THX + threadx] = (*A)[blockx*THX + threadx] + (*B1)[blockx*THX + threadx];

which should be faster (you will probably not notice, since your kernel is memory-bound anyway).

Also, your kernel cannot work correctly as written: all threads of a block overwrite the same two shared-memory slots at the same time, so what gets put in C can be the A and B values that belong to another thread of the same block.
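For what it's worth, a sketch of what race-free staging would look like, with each thread writing its own slot (hypothetical, and it still buys nothing here, since every value is read exactly once):

    // each thread stages into its own slot, so there is no race
    __shared__ int sA[THX];
    __shared__ int sB[THX];

    sA[threadx] = (*A)[blockx*THX + threadx];
    sB[threadx] = (*B1)[blockx*THX + threadx];
    __syncthreads();

    (*C)[blockx*THX + threadx] = sA[threadx] + sB[threadx];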

I had tried that; again the same problem, but this time it works only for 2,000 elements in the array. Beyond that it gives the following error:

ptxas error : Entry function '_Z10add_in_gpuPA2100_iS0_S0_S0_' uses too much shared data (0x20d0 bytes + 0x20f0 bytes system, 0x4000 max)

Please post your .cubin information.

This can be obtained by compiling your .cu file with "nvcc -cubin [filename].cu". There is something strange happening here, because you only use one shared array of two ints, so that should be 8 bytes :S

And if you do it in one line (the correct way), you should remove the __shared__ int shm[2]; definition.

But according to the error message you get, I think you are actually calling the kernel wrong. Can you post the line where you call the kernel? My guess is that you pass a variable as the third launch-configuration argument (which specifies the amount of shared memory used by the kernel that is not statically defined); when that gets too large, you can expect this error message.

I think it is best if you post the code you are currently using (including the code where you call the kernel).
Oh, and by the way: the way you wrote your code means you have to recompile every time your input size changes, which is not necessary at all.
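For reference, a sketch of the launch-time shared-memory mechanism described above (the kernel name and the staging through shared memory are illustrative only):

    // The optional third launch parameter gives the size in bytes of
    // dynamically allocated shared memory, declared as an unsized extern array.
    __global__ void add_dyn(const int *A, const int *B, int *C, int n)
    {
        extern __shared__ int buf[];          // sized by the launch configuration
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            buf[threadIdx.x] = A[i] + B[i];   // stage the sum in shared memory
            C[i] = buf[threadIdx.x];
        }
    }

    // launch with THX * sizeof(int) bytes of dynamic shared memory per block:
    // add_dyn<<<grids, threads, THX * sizeof(int)>>>(device_a, device_b, device_c, DIM);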

But I have only two arguments in my launch configuration; I'm not dynamically allocating the shared memory:

add_in_gpu<<<grids,threads>>>(device_a,device_b,device_c);

Thanks, everybody.

Well, then you still have a shared-memory definition in your kernel, since you get the error at compile time.

Here goes my complete code for your reference:

/* This program implements the addition of two arrays using threads in the GPU. */

// includes, system
#include <stdio.h>
#include <cutil.h>

// define the dimensions
#define DIM 2000 // array dimension
#define THX 8
#define THY 1

// Device code
__global__ void add_in_gpu(int (*A)[DIM], int (*B1)[DIM], int (*C)[DIM])
{
    int threadx, thready, blockx, blocky;

    // block index
    blockx = blockIdx.x;
    blocky = blockIdx.y;

    // thread index
    threadx = threadIdx.x;
    thready = threadIdx.y;

    // variables in shared memory
    __shared__ int shm1[blockx*THX + threadx];
    __shared__ int shm2[blockx*THX + threadx];

    shm1[blockx*THX + threadx] = (*A)[blockx*THX + threadx];
    shm2[blockx*THX + threadx] = (*B1)[blockx*THX + threadx];
    __syncthreads();

    (*C)[blockx*THX + threadx] = shm1[blockx*THX + threadx] + shm2[blockx*THX + threadx];
    __syncthreads();
}

/************************************** Main Program ********************************************/

int main()
{
    // define the grid and threads
    dim3 threads(THX, THY);
    dim3 grids(DIM/threads.x + 1, 1);

    // define dimensions
    int (*device_a)[DIM];
    int (*device_b)[DIM];
    int (*device_c)[DIM];
    int A[DIM];
    int B[DIM], C[DIM], P[DIM];
    int i, iter = 50;

    // create the timer
    unsigned int timer = 0;
    CUT_SAFE_CALL(cutCreateTimer(&timer));

    // initialize the arrays A & B
    for (i = 0; i < DIM; i++)
    {
        A[i] = i + 1;
        B[i] = i + 2;
    }

    // print the arrays A & B
    printf("\n Array A\n\n");
    for (i = 0; i < DIM; i++)
        printf("\t%d", A[i]);
    printf("\n");

    printf("\n Array B\n\n");
    for (i = 0; i < DIM; i++)
        printf("\t%d", B[i]);

    // allocate memory on the GPU
    int size = sizeof(int) * DIM;
    cudaMalloc((void**)&device_a, size);
    cudaMalloc((void**)&device_b, size);
    cudaMalloc((void**)&device_c, size);

    // copy memory from host to device
    cudaMemcpy(device_a, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(device_b, B, size, cudaMemcpyHostToDevice);

    // start the timer and specify the number of iterations
    CUT_SAFE_CALL(cutStartTimer(timer));
    for (int i = 0; i < iter; i++)
    {
        // invoke the kernel
        add_in_gpu<<<grids,threads>>>(device_a, device_b, device_c);
    }

    // stop the timer and fetch the timer value
    CUT_SAFE_CALL(cutStopTimer(timer));

    // copy the result back to the host
    cudaMemcpy(C, device_c, size, cudaMemcpyDeviceToHost);

    // print the resultant array
    printf("\n");
    printf("\n The sum of two arrays in GPU\n\n");
    for (i = 0; i < DIM; i++)
    {
        printf("\t%d", C[i]);
    }
    printf("\n\nGPU Processing time: %f (ms)\n", cutGetTimerValue(timer));
    printf("\n");

    // free device memory
    cudaFree(device_a);
    cudaFree(device_b);
    cudaFree(device_c);
}

thank you all once again.

Again, you are using shared memory. You should not, because you do not need it, and you are using it the wrong way: threadIdx.x and blockIdx.x are only known at runtime, while a shared array declared like this must have a size known at compile time (a legal static declaration would be something like __shared__ int shm1[THX];).

If you change your kernel code to:

__global__ void add_in_gpu(int (*A)[DIM], int (*B1)[DIM], int (*C)[DIM])
{
    int threadx, thready, blockx, blocky;

    // block index
    blockx = blockIdx.x;
    blocky = blockIdx.y;

    // thread index
    threadx = threadIdx.x;
    thready = threadIdx.y;

    (*C)[blockx*THX + threadx] = (*A)[blockx*THX + threadx] + (*B1)[blockx*THX + threadx];
    __syncthreads();
}

everything will work like it should.

You use shared memory when:

  • you need to communicate between the threads of a block, or

  • you need to access the same value from global memory from several threads in a block (see the sketch below).
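To make the second case concrete, here is a hypothetical sketch (not from this thread) of a 1D smoothing kernel where shared memory does pay off, because each loaded value is reused by up to three threads of the block:

    // each block stages THX elements plus a one-element halo on each side;
    // every staged value is then read by up to three neighbouring threads
    __global__ void smooth(const int *in, int *out, int n)
    {
        __shared__ int tile[THX + 2];

        int g = blockIdx.x * blockDim.x + threadIdx.x;   // global index
        int l = threadIdx.x + 1;                         // local index, offset by halo

        tile[l] = (g < n) ? in[g] : 0;                   // pad out-of-range with zeros
        if (threadIdx.x == 0)
            tile[0] = (g > 0) ? in[g - 1] : 0;           // left halo
        if (threadIdx.x == THX - 1)
            tile[THX + 1] = (g + 1 < n) ? in[g + 1] : 0; // right halo
        __syncthreads();

        if (g < n)
            out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3;
    }

Without shared memory, each element would be fetched from global memory up to three times; with the staging tile it is fetched once per block.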

You are right, Denis. I tried the code that you gave me and it is working fine, but I have two issues here:

  1. Again the program is limited in the size it can handle, this time to 500,000 elements in the array.

  2. And my main concern here is to improve the efficiency of the program by making use of shared memory. The matrix multiplication program among the sample projects is written in a similar way, and I'm just trying to implement the same concept. Isn't that possible?

thank you,

Here is the .cubin info, jordy:

for array dimension: 200

architecture {sm_10}

abiversion {0}

code {

name = _Z10add_in_gpuPA200_iS0_S0_S0_

lmem = 0

smem = 1632

reg = 4

bar = 0

bincode  {

	0x10000005 0x0403c780 0x10004c01 0x0023c780 

	0xa0000405 0x04000780 0x40008101 0x00000013 

	0x20000209 0x04000780 0x30020401 0xc4100780 

	0x2000c805 0x04200780 0xd00e0205 0x80c00780 

	0x00020409 0xc0000780 0x2000ca0d 0x04200780 

	0x08001001 0xe4204780 0xd00e0605 0x80c00780 

	0x20108409 0x0000000f 0x00020405 0xc0000780 

	0x0801a001 0xe4204780 0x861ffe03 0000000000 

	0x00000009 0xc0000780 0x1500e004 0x2100ec00 

	0x2800d005 0x04204780 0xd00e0005 0xa0c00780 

	0x861ffe03 0000000000 0xf0000001 0xe0000001 

}

}

for array dimension: 3000

architecture {sm_10}

abiversion {0}

code {

name = _Z10add_in_gpuPA3000_iS0_S0_S0_

lmem = 0

smem = 24032

reg = 4

bar = 0

bincode  {

	0x10000005 0x0403c780 0x10004c01 0x0023c780 

	0xa0000405 0x04000780 0x40008101 0x00000013 

	0x20000209 0x04000780 0x30020401 0xc4100780 

	0x2000c805 0x04200780 0xd00e0205 0x80c00780 

	0x00020409 0xc0000780 0x2000ca0d 0x04200780 

	0x08001001 0xe4204780 0xd00e0605 0x80c00780 

	0x20008409 0x000000bf 0x00020405 0xc0000780 

	0x08178001 0xe4204780 0x861ffe03 0000000000 

	0x00000009 0xc0000780 0x1500e004 0x2100ec00 

	0x2800d005 0x04204780 0xd00e0005 0xa0c00780 

	0x861ffe03 0000000000 0xf0000001 0xe0000001 

}

}

thank you,

This basically means that your CPU program is having some problem.

Take a look at your CPU code. Probably you are overshooting a memcpy or something like that…
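One plausible culprit at these sizes (an assumption, since the thread never confirms it, but consistent with the host-side read error quoted earlier): A, B, C and P are stack arrays, and four arrays of roughly 61,000 ints come close to the default 1 MB stack on Windows. A sketch of heap-allocating them instead:

    // heap-allocate the host arrays instead of declaring int A[DIM]; on the
    // stack; four stack arrays of ~61,000 ints are ~1 MB, enough to overflow
    // the default Windows stack
    int *A = (int *)malloc(DIM * sizeof(int));
    int *B = (int *)malloc(DIM * sizeof(int));
    int *C = (int *)malloc(DIM * sizeof(int));
    if (!A || !B || !C) {
        fprintf(stderr, "host allocation failed\n");
        return 1;
    }

    /* ... use the arrays as before; they decay to int* just like before ... */

    free(A);
    free(B);
    free(C);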

You're most likely exceeding the maximum size of the block and/or grid. Blocks cannot be larger than 512 threads, and the maximum grid size is 65535 in each dimension (the grid is 2D).

Your code won't benefit from using shared memory, because each value is read only once.

The program works fine for 80,000 elements in the array, but beyond 80,000 it gives unpredictable results, and for 90,000 it doesn't execute at all. For 90,000 elements that is only 352 blocks, so why doesn't it execute?

And when is it most preferable to make use of shared memory?

The cubin content for array dimension = 80,000 is:

architecture {sm_10}

abiversion {0}

code {

name = _Z10add_in_gpuPA80000_iS0_S0_S0_

lmem = 0

smem = 32

reg = 3

bar = 0

bincode  {

	0x10000005 0x0403c780 0x10004c01 0x0023c780 

	0xa0000405 0x04000780 0x40008101 0x00000013 

	0x20000201 0x04000780 0x30020009 0xc4100780 

	0x2102e804 0x2102ea00 0xd00e0205 0x80c00780 

	0xd00e0001 0x80c00780 0x20008200 0x2102ec04 

	0xd00e0201 0xa0c00780 0x861ffe03 0000000000 

	0xf0000001 0xe0000001 

}

}

and for array dimension = 90,000

architecture {sm_10}

abiversion {0}

code {

name = _Z10add_in_gpuPA90000_iS0_S0_S0_

lmem = 0

smem = 32

reg = 3

bar = 0

bincode  {

	0x10000005 0x0403c780 0x10004c01 0x0023c780 

	0xa0000405 0x04000780 0x40008101 0x00000013 

	0x20000201 0x04000780 0x30020009 0xc4100780 

	0x2102e804 0x2102ea00 0xd00e0205 0x80c00780 

	0xd00e0001 0x80c00780 0x20008200 0x2102ec04 

	0xd00e0201 0xa0c00780 0x861ffe03 0000000000 

	0xf0000001 0xe0000001 

}

}

thank you all.