limit of computation

You are using more shared memory than is allowed…

I think you already solved this problem by now, by not using any shared memory inside the kernel?

Also, for testing purposes it is good to put this code after the kernel invocation:

cudaThreadSynchronize();
cudaError_t error = cudaGetLastError();
if (error != cudaSuccess)
    printf("error :%s\n", cudaGetErrorString(error));

If you run your application and you don’t see anything, you may be getting an error from the kernel; try it.

Yeah, but that again is limited to 80,000 elements in the array. I even tried testing whether there is any error from the kernel, but as soon as I execute the kernel a window pops up saying the program encountered a problem and has to be closed; it doesn’t display any error message.

thank you,

Because you are still using shared memory, even though everybody keeps telling you not to.

Look at post #15 in this thread.

No, I’m not using shared memory. This is my device code:

{
    int threadx, blockx;
    blockx = blockIdx.x;
    threadx = threadIdx.x;
    (*C)[blockx*THX+threadx] = (*A)[blockx*THX+threadx] + (*B)[blockx*THX+threadx];
    __syncthreads();
}

As can be seen in his .cubin, he uses 24k of shared memory. This is too much, but I don’t think it causes this kind of trouble. I also had problems with too much shared memory in the beginning, and I never saw this behavior. The only problem I encountered while using too much shared memory was that the output of my computation didn’t make sense.

When you use too much shared memory, there is no computation (but there is an error raised if you check with CUT_CHECK_ERROR), so the results in your output will be the values that happened to be in the memory when the memory was allocated.
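For illustration, a minimal sketch of that check; it assumes cutil.h from the SDK is included and uses a placeholder kernel name:

// Sketch: check for an error right after a kernel call (placeholder kernel and arguments).
someKernel<<<grid, block>>>(d_in, d_out);
cudaThreadSynchronize();                  // make sure the kernel has actually finished
CUT_CHECK_ERROR("someKernel failed");     // cutil macro that reports the last CUDA error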

The cubin you posted cannot be correct; this is what I get when compiling your device function:

nvcc -c dummy.cu -o dummy.o --ptxas-options=-v
ptxas info : Compiling entry function ‘_Z11dummykernelPfS_S_’
ptxas info : 0 bytes lmem, 40 bytes smem, 13191888 bytes cmem, 3 registers

Also, the fact that the values differ for different array sizes means that you are either posting an old cubin, or there is something else wrong.

Can you please post (between [ code ] [ /code ] tags) your kernel and host code? Then I will try it out.

Maybe if you post all your files with the makefile as an archive, we can take a look at what is going wrong.

That way we do the same as you… and we can make .cubin files ourselves and make sure your code is right.

Thanks for all your suggestions. Here is my complete code:

/* This program implements the addition of two arrays
   using threads in the GPU. */

// includes, system
#include <stdio.h>
#include <cutil.h>

// define the dimensions
#define THX 256
#define THY 1
#define DIM 90000

// Device Code
__global__ void add_in_gpu(int(*A)[DIM], int(*B)[DIM], int(*C)[DIM])
{
    int threadx, blockx;

    // block index
    blockx = blockIdx.x;

    // thread index
    threadx = threadIdx.x;

    (*C)[blockx*THX+threadx] = (*A)[blockx*THX+threadx] + (*B)[blockx*THX+threadx];
    __syncthreads();
}

/************************************** Main Program ********************************************/
int main()
{
    // Define the Grids and Threads
    dim3 threads(THX, THY);
    dim3 grids(DIM/threads.x + 1, 1);

    // define dimensions
    int(*device_b)[DIM];
    int(*device_a)[DIM];
    int(*device_c)[DIM];
    int A[DIM];
    int B[DIM], C[DIM];
    int i, iter = 50;

    // create the timer
    unsigned int timer = 0;
    CUT_SAFE_CALL(cutCreateTimer(&timer));

    // initialize the arrays A & B
    for (i = 0; i < DIM; i++)
    {
        A[i] = 1;
        B[i] = 2;
    }

    // print the arrays A & B
    printf("\n Array A\n\n");
    for (i = 0; i < DIM; i++)
        printf("\t%d", A[i]);
    printf("\n");
    printf("\n Array B\n\n");
    for (i = 0; i < DIM; i++)
        printf("\t%d", B[i]);

    // ALLOCATE MEMORY IN GPU
    int size = sizeof(int) * DIM;
    cudaMalloc((void**)&device_a, size);
    cudaMalloc((void**)&device_b, size);
    cudaMalloc((void**)&device_c, size);

    // COPY MEMORY FROM HOST TO DEVICE
    cudaMemcpy(device_a, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(device_b, B, size, cudaMemcpyHostToDevice);

    // start the timer and specify the no of iterations
    CUT_SAFE_CALL(cutStartTimer(timer));
    for (int i = 0; i < iter; i++)
    {
        // INVOKING KERNEL
        add_in_gpu<<<grids, threads>>>(device_a, device_b, device_c);
        cudaThreadSynchronize();
        cudaError_t error = cudaGetLastError();
        if (error != cudaSuccess)
            printf("error :%s\n", cudaGetErrorString(error));
    }

    // stop the timer and fetch the timer value
    CUT_SAFE_CALL(cutStopTimer(timer));

    // Result is copied to Host
    cudaMemcpy(C, device_c, size, cudaMemcpyDeviceToHost);

    // printing the resultant array
    printf("\n");
    printf("\n The sum of two arrays in GPU\n\n");
    for (i = 0; i < DIM; i++)
    {
        printf("%d\t%d\n", i, C[i]);
    }
    printf("\n\nGPU Processing time: %f (ms)\n", cutGetTimerValue(timer));
    printf("\n");

    // Free Device and Host Memory
    cudaFree(device_a);
    cudaFree(device_b);
    cudaFree(device_c);
}

thank you everyone…

Hi Ashraf…

After testing your complete code, it all works fine for me. I set DIM to 1M and had no problems.

I attach the test log I made with the output of the program.

Ohh, never mind, I cannot attach the log, it is 13MB :P

The sum of two arrays in GPU

0       3
1       3
2       3
3       3
4       3
5       3
6       3
7       3
8       3
9       3
10      3
...
...
...
999990  3
999991  3
999992  3
999993  3
999994  3
999995  3
999996  3
999997  3
999998  3
999999  3

Hi jordy,

Which platform are you executing this program on? Because I execute it on Windows XP with the MS Visual Studio 2005 IDE, but I don’t know why I’m not able to execute it. Is it related to the platform?

I execute this on Linux (Fedora Core 6), CUDA 1.0, on a GTS 320MB.

/home/jcvaneijk/Desktop/testfolder/Ashraf% make
nvcc --compile -I/usr/local/cuda/include -I../../../work/MAKEHOME/.. -I/usr/local/cuda/include -I/usr/X11R6/include -I/usr/local/Trolltech/Qt-4.3.1/include -I/usr/local/Trolltech/Qt-4.3.1/include/QtXml -I/usr/local/Trolltech/Qt-4.3.1/include/QtNetwork -I/usr/local/Trolltech/Qt-4.3.1/include/QtCore main.cu -o release-LX/main.o
"main.cu", line 14: warning: parameter "C" was set but never used
  int (*C)[1000000]){
        ^
        Creating application identification
        Linking executable release-LX/testAshraf
g++ release-LX/.Id.o release-LX/main.o -L../../../work/MAKEHOME/lib/release-LX -L/usr/local/cuda/lib -lcuda -lcudart -o release-LX/testAshraf
        Stripping executable release-LX/testAshraf
strip release-LX/testAshraf
        Making link ../../../work/MAKEHOME/install/release-LX/testAshraf to release-LX/testAshraf

This is my output when I run “make”. It is compiled by our in-house standard makefile, so it uses some special things, but the important parts are the nvcc call and the g++ call after it.

Stripping it down, the commands look something like this:

/home/jcvaneijk/Desktop/testfolder/Ashraf% nvcc --compile -I/usr/local/cuda/include main.cu -o release-LX/main.o
"main.cu", line 14: warning: parameter "C" was set but never used
  int (*C)[1000000]){
        ^
/home/jcvaneijk/Desktop/testfolder/Ashraf% g++ release-LX/main.o -L/usr/local/cuda/lib -lcuda -lcudart -o release-LX/testAshraf

I hope this is helpful

Thank you so much for your help, buddy. I’m actually using Windows XP; I think it’s something related to the memory taken by the GPU for the display, but I’m not sure what the exact problem is. Maybe I’ll get some help in the CUDA on Windows XP section.

Try not to cross-post; they don’t like that. Most people look in all the forums, I think. At least I do, because most of the topics in those forums are not platform dependent.

Edit: typo

What card are you using (how much mem?)

What you can do is add checks (the CUT_CHECK_ERROR macro) after your memory allocations, to see if they fail. Or just check whether the return value of cudaMalloc is non-zero.

But yes, having not enough free (contiguous) memory can also be a problem, of course, when going to big sizes; that never crossed my mind…
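As a small illustration, a sketch of that return-value check, reusing the variable names from the code posted above:

// Sketch: check whether a device allocation succeeded (names follow the posted code).
int *device_a;
int size = sizeof(int) * DIM;
cudaError_t err = cudaMalloc((void**)&device_a, size);
if (err != cudaSuccess)                   // a non-zero return value means the allocation failed
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));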

Ok, this is a very long confusing thread and nobody seems to have a clue what the problem is (myself included).

Here is a guess I haven’t seen anyone else mention:

You declare your array input to the kernel like this: int(*A)[DIM].

  1. WHY??? You just need int *A and then calculate an index into it using a pitch. For coalescing performance you should allocate with cudaMallocPitch so the pitch gets the right value (a sketch of that call follows after this list).

  2. I’m not positive, but I THINK the [DIM] is what is leading to your massive shared memory usage. Kernel arguments are stored in shared memory, so declaring a kernel call with some thousands of pointers as arguments will cause problems.
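For what it is worth, a minimal sketch of what an allocation with cudaMallocPitch looks like; the sizes and names here are made up for illustration, and (as noted further down) it only matters for genuinely 2D data:

// Sketch: pitched allocation for a 2D array; rows/cols are hypothetical.
int rows = 512, cols = 512;
float *d_data;
size_t pitch;                                    // pitch in bytes, chosen by the runtime
cudaMallocPitch((void**)&d_data, &pitch, cols * sizeof(float), rows);
// Reach row r by stepping r * pitch bytes from the base pointer, then index the column:
// float *row = (float*)((char*)d_data + r * pitch);   row[c] = ...;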

I also pointed out the first issue about the (*A) to Ashraf in my first reply, but about memory coalescing I don’t know anything :P

Yeah, you are right, Denis. The problem is with cudaMalloc when I increase the number of elements; I checked and I’m getting an unknown error:

Cuda error in file ‘Arrayadd.cu’ in line 69 : unknown error.

And regarding using (*A) instead of *A, I have changed it, but there is no change in the program’s behavior.

thank you,

Oh wait, I guess I was mistaken in one of my assumptions. You aren’t even using 2D arrays, just 1D. Hence there is no need for cudaMallocPitch.

Thus I am even more confused why you are declaring the kernel arguments the way you are. Why are you trying to pass in an array of 90,000 pointers to integers when all you need is one pointer to index a 1D array of memory?

The kernel should be:

__global__ void add_in_gpu(int *A, int *B, int *C)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    C[idx] = A[idx] + B[idx];
}

And you can allocate the device pointers like this:

int *device_a, *device_b, *device_c;

//ALLOCATE MEMORY IN GPU
int size = sizeof(int) * DIM;
cudaMalloc((void**)&device_a, size);
cudaMalloc((void**)&device_b, size);
cudaMalloc((void**)&device_c, size);

Edit: and there should be no problem mallocing 90,000 ints on the device; that’s only a few hundred kilobytes :) If you want to pin down where errors are occurring, you need to wrap every CUDA call with CUDA_SAFE_CALL and put CUT_CHECK_ERROR after every kernel call.
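As an illustration, a minimal sketch of that wrapping, using the names from the code above (and assuming cutil.h is included):

// Sketch: wrap the runtime calls and check the kernel, reusing the names from above.
CUDA_SAFE_CALL(cudaMalloc((void**)&device_a, size));                      // prints file/line and exits on failure
CUDA_SAFE_CALL(cudaMemcpy(device_a, A, size, cudaMemcpyHostToDevice));
add_in_gpu<<<grids, threads>>>(device_a, device_b, device_c);
CUT_CHECK_ERROR("add_in_gpu kernel failed");                              // reports the last CUDA error, if any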

I was actually using a pointer to a 1D array, but I have changed it to a plain pointer and I still have the same problem. My program is not able to allocate memory for large sizes; I’m wondering if it is something related to the screen resolution?

Thanks.

One more thing: if I add CUT_DEVICE_INIT() to my program, an error is issued at that line, exactly where CUT_DEVICE_INIT() is.