Yeah, but that again is limited to 80,000 elements in the array. I even tried testing whether the kernel reports any error, but as soon as I execute the kernel a window pops up saying the program encountered a problem and has to be closed; it doesn't display any error message.
As can be seen in his .cubin, he uses 24k of shared memory. This is too much, but I don't think it causes this kind of trouble. I also ran into excessive shared memory usage in the beginning and never saw this problem; the only symptom I encountered was that the output of my computation didn't make sense.
When you use too much shared memory, there is no computation (but an error is raised if you check with CUT_CHECK_ERROR), so the results in your output will simply be whatever values happened to be in the memory when it was allocated.
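For reference, the kind of check I mean looks roughly like this (kernel name, arguments and launch configuration are placeholders; CUT_CHECK_ERROR comes from the SDK's cutil.h and reports your message if cudaGetLastError() returns an error):

// launch, then immediately check whether the launch itself failed
my_kernel<<<grid, block>>>(d_in, d_out);
CUT_CHECK_ERROR("my_kernel launch failed");
// wait for the kernel to finish so later copies/checks see its result
cudaThreadSynchronize();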
The cubin you posted cannot be correct; this is what I get when compiling your device function:
nvcc -c dummy.cu -o dummy.o --ptxas-options=-v
ptxas info : Compiling entry function '_Z11dummykernelPfS_S_'
ptxas info : 0 bytes lmem, 40 bytes smem, 13191888 bytes cmem, 3 registers
Also, the fact that the values differ for different array sizes means that you are either posting an old cubin, or there is something else wrong.
Can you please post (between [ code ] [ /code ] tags) your kernel and host code? Then I will try it out.
Thanks for all your suggestions. Here is my complete code:
/* This program implements the addition of two arrays
   using threads on the GPU. */

// includes, system
#include <stdio.h>
#include <cutil.h>

// define the dimensions
#define THX 256
#define THY 1
#define DIM 90000

// Device Code
__global__ void add_in_gpu(int (*A)[DIM], int (*B)[DIM], int (*C)[DIM])
{
    int threadx, blockx;
    // block index
    blockx = blockIdx.x;
    // thread index
    threadx = threadIdx.x;
    (*C)[blockx*THX + threadx] = (*A)[blockx*THX + threadx] + (*B)[blockx*THX + threadx];
    __syncthreads();
}

/************************************** Main Program ********************************************/
int main()
{
    // define the grids and threads
    dim3 threads(THX, THY);
    dim3 grids(DIM/threads.x + 1, 1);

    // define dimensions
    int (*device_b)[DIM];
    int (*device_a)[DIM];
    int (*device_c)[DIM];
    int A[DIM];
    int B[DIM], C[DIM];
    int i, iter = 50;

    // create the timer
    unsigned int timer = 0;
    CUT_SAFE_CALL(cutCreateTimer(&timer));

    // initialize the arrays A & B
    for (i = 0; i < DIM; i++)
    {
        A[i] = 1;
        B[i] = 2;
    }

    // print the arrays A & B
    printf("\n Array A\n\n");
    for (i = 0; i < DIM; i++)
        printf("\t%d", A[i]);
    printf("\n");
    printf("\n Array B\n\n");
    for (i = 0; i < DIM; i++)
        printf("\t%d", B[i]);

    // ALLOCATE MEMORY IN GPU
    int size = sizeof(int) * DIM;
    cudaMalloc((void**)&device_a, size);
    cudaMalloc((void**)&device_b, size);
    cudaMalloc((void**)&device_c, size);

    // COPY MEMORY FROM HOST TO DEVICE
    cudaMemcpy(device_a, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(device_b, B, size, cudaMemcpyHostToDevice);

    // start the timer and specify the no. of iterations
    CUT_SAFE_CALL(cutStartTimer(timer));
    for (int i = 0; i < iter; i++)
    {
        // INVOKING KERNEL
        add_in_gpu<<<grids, threads>>>(device_a, device_b, device_c);
        cudaThreadSynchronize();
        cudaError_t error = cudaGetLastError();
        if (error != cudaSuccess)
            printf("error :%s\n", cudaGetErrorString(error));
    }

    // stop the timer and fetch the timer value
    CUT_SAFE_CALL(cutStopTimer(timer));

    // copy the result back to the host
    cudaMemcpy(C, device_c, size, cudaMemcpyDeviceToHost);

    // print the resultant array
    printf("\n");
    printf("\n The sum of two arrays in GPU\n\n");
    for (i = 0; i < DIM; i++)
    {
        printf("%d\t%d\n", i, C[i]);
    }
    printf("\n\nGPU Processing time: %f (ms)\n", cutGetTimerValue(timer));
    printf("\n");

    // free device memory
    cudaFree(device_a);
    cudaFree(device_b);
    cudaFree(device_c);
}
Which platform are you executing this program on? I run it on Windows XP with the MS Visual Studio 2005 IDE, but I don't know why I'm not able to execute it. Could it be related to the platform?
I execute this on Linux (Fedora Core 6) CUDA 1.0 on a GTS 320MB
/home/jcvaneijk/Desktop/testfolder/Ashraf% make
nvcc --compile -I/usr/local/cuda/include -I../../../work/MAKEHOME/.. -I/usr/local/cuda/include -I/usr/X11R6/include -I/usr/local/Trolltech/Qt-4.3.1/include -I/usr/local/Trolltech/Qt-4.3.1/include/QtXml -I/usr/local/Trolltech/Qt-4.3.1/include/QtNetwork -I/usr/local/Trolltech/Qt-4.3.1/include/QtCore main.cu -o release-LX/main.o
"main.cu", line 14: warning: parameter "C" was set but never used
int (*C)[1000000]){
^
Creating application identification
Linking executable release-LX/testAshraf
g++ release-LX/.Id.o release-LX/main.o -L../../../work/MAKEHOME/lib/release-LX -L/usr/local/cuda/lib -lcuda -lcudart -o release-LX/testAshraf
Stripping executable release-LX/testAshraf
strip release-LX/testAshraf
Making link ../../../work/MAKEHOME/install/release-LX/testAshraf to release-LX/testAshraf
This is my output when I run "make". It is built by our in-house standard makefile, so there are some special things in there, but the important parts are the nvcc call and the g++ link after it.
Stripped down, the build looks something like this:
/home/jcvaneijk/Desktop/testfolder/Ashraf% nvcc --compile -I/usr/local/cuda/include main.cu -o release-LX/main.o
"main.cu", line 14: warning: parameter "C" was set but never used
int (*C)[1000000]){
^
/home/jcvaneijk/Desktop/testfolder/Ashraf% g++ release-LX/main.o -L/usr/local/cuda/lib -lcuda -lcudart -o release-LX/testAshraf
Thank you so much for your help, buddy. I'm actually using Windows XP; I think it's something related to the memory the GPU reserves for the display, but I'm not sure what the exact problem is. Maybe I'll get some help in the CUDA on Windows XP section.
Try not to cross-post; they don't like that. Most people look in all the forums, I think. At least I do, because most of the topics in those forums are not platform dependent.
What you can do is add checks (the CUT_CHECK_ERROR macro) after your memory allocations to see if they fail, or just check whether the return value of cudaMalloc is non-zero.
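For the second option, something along these lines with the plain runtime API (variable names taken from the posted code):

cudaError_t err = cudaMalloc((void**)&device_a, size);
if (err != cudaSuccess)   // non-zero means the allocation failed
    printf("cudaMalloc for device_a failed: %s\n", cudaGetErrorString(err));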
But yes, not having enough free (contiguous) memory can of course also be a problem when going to big sizes; that never crossed my mind…
Ok, this is a very long confusing thread and nobody seems to have a clue what the problem is (myself included).
Here is a guess I haven’t seen anyone else mention:
You declare your array input to the kernel like this: int(*A)[DIM].
WHY??? You just need int *A and then calculate an index into it using a pitch. For coalescing performance you should allocate with cudaMallocPitch to get the pitch to be the right value.
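For the record, the pitched variant only matters for a genuinely 2D layout; it looks roughly like this (d_matrix, width and height are placeholders):

size_t pitch;
int *d_matrix;
// width is given in bytes; the returned pitch is padded so each row starts aligned
cudaMallocPitch((void**)&d_matrix, &pitch, width * sizeof(int), height);
// inside a kernel, step between rows in bytes:
// int *row = (int*)((char*)d_matrix + rowIdx * pitch);
// int value = row[colIdx];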
I'm not positive, but I THINK the [DIM] in the parameter type is what is leading to your massive shared memory usage. Kernel arguments are stored in shared memory, so an oversized argument list will cause problems.
Oh wait, I guess I was mistaken in one of my assumptions. You aren’t even using 2D arrays, just 1D. Hence there is no need for cudaMallocPitch.
Thus I am even more confused about why you are declaring the kernel arguments the way you are. Why declare them as pointers to arrays of 90,000 integers when all you need is a plain int pointer to index a 1D block of memory?
The kernel should be:
__global__ void add_in_gpu(int *A, int *B, int *C)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    C[idx] = A[idx] + B[idx];
}
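One detail to watch with this indexing: the posted launch uses grids(DIM/threads.x + 1, 1), so with DIM = 90000 and 256 threads per block the last block runs 112 threads past the end of the arrays unless the index is guarded. A sketch of the guard (the extra parameter n is only for illustration; with the existing #define DIM the test could simply be idx < DIM):

__global__ void add_in_gpu(int *A, int *B, int *C, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)                      // threads in the partial last block do nothing
        C[idx] = A[idx] + B[idx];
}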
And you can allocate the device pointers like this:
int *device_a, *device_b, *device_c;
//ALLOCATE MEMORY IN GPU
int size=sizeof(int)*DIM;
cudaMalloc((void**)&device_a,size);
cudaMalloc((void**)&device_b,size);
cudaMalloc((void**)&device_c,size);
Edit: and there should be no problem mallocing 90,000 ints on the device. That's only a few hundred kilobytes :) If you want to pin down where errors are occurring, you need to wrap every CUDA call with CUDA_SAFE_CALL and put CUT_CHECK_ERROR after every kernel call.
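Applied to the posted host code, that wrapping would look like this (CUDA_SAFE_CALL is the cutil.h macro that, at least in debug builds, prints the failing file and line and exits if the call returns an error):

CUDA_SAFE_CALL(cudaMalloc((void**)&device_a, size));
CUDA_SAFE_CALL(cudaMalloc((void**)&device_b, size));
CUDA_SAFE_CALL(cudaMalloc((void**)&device_c, size));
CUDA_SAFE_CALL(cudaMemcpy(device_a, A, size, cudaMemcpyHostToDevice));
add_in_gpu<<<grids, threads>>>(device_a, device_b, device_c);
CUT_CHECK_ERROR("add_in_gpu failed");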
I was actually using a pointer to a 1D array, but I have now changed it to a plain pointer and I'm still having the same problem. My program is not able to allocate memory for large sizes; I'm wondering if it is something related to the screen resolution?