Are there memory limitations on Device when using large arrays? Tesla C1060

I have a kernel that needs to be executed 3096 times. When I run this code in emulation, my results match the simple C program's. When I run this loop and kernel on the device, my values get corrupted. Has anyone seen this issue?

If everything looks correct in emulation, how can I determine what fails on the device? Is this a memory issue? How can I resolve it?

Is there a different way to malloc huge arrays? The zonez array in this case has roughly 12.7 million values (4096 x 3096).


    int Nx = 64;
    int Ny = 64;
    int sizez = 3096;

    double *zonez1, *zonez2;
    cudaMalloc((void**) &zonez1, sizeof(double) * Nx * Ny * sizez);
    cudaMalloc((void**) &zonez2, sizeof(double) * Nx * Ny * sizez);

    dim3 block3(32, 16);
    dim3 grid3(2, 4);

    for (num = 0; num < sizez; ++num)
    {
        zonez_create <<< grid3, block3 >>> (zonez1, im1_d, pix1, Nx, Ny, sizez, u1, u2, v1, v2, num);
        zonez_create <<< grid3, block3 >>> (zonez2, im2_d, pix1, Nx, Ny, sizez, u3, u4, v3, v4, num);
    }

    cutStopTimer( timer );  // Stop timer
    float zt1 = cutGetTimerValue(timer);
    printf("zonez_create computation time: %0.3f ms\n\n", zt1);
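(A side note on the timing: kernel launches are asynchronous, so stopping the timer right after the launch loop mostly measures launch overhead rather than kernel time. A sketch of the fix, a synchronize before the stop:

```cuda
cudaThreadSynchronize();   // wait for all 2*3096 queued launches to finish
cutStopTimer( timer );     // now the timer reflects actual kernel execution
```

cudaThreadSynchronize() was the name of this call in the CUDA 2.x runtime used in this thread.)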

    // Test zonez
    double *zonez1_h, *zonez2_h;
    cudaMallocHost((void**) &zonez1_h, sizeof(double) * Nx * Ny * sizez);
    cudaMallocHost((void**) &zonez2_h, sizeof(double) * Nx * Ny * sizez);

    cudaMemcpy(zonez1_h, zonez1, sizeof(double) * Nx * Ny * sizez, cudaMemcpyDeviceToHost);
    cudaMemcpy(zonez2_h, zonez2, sizeof(double) * Nx * Ny * sizez, cudaMemcpyDeviceToHost);

    for (i = 0; i < Nx * Ny * sizez; ++i)
        printf("zonez2[%d] = %lf\n", i, zonez2_h[i]);






    __global__ void zonez_create(double *zone, double *im, int pix1, int Nx, int Ny, int sizez, int *u1, int *u2, int *v1, int *v2, int num)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int z1 = u1[num] + i;
        int z2 = v1[num] + j;

        if (i < Nx && j < Ny && z1 < u2[num] && z2 < v2[num])
            zone[(num*Nx*Ny) + i*Nx + j] = im[z2*pix1 + z1];
    }




101MB is not a huge array. You probably forgot to compile with -arch sm_13.

My run script looks like this:

    nvcc -g -pg -D_DEBUG -o …/obj/cu_pivchk1 …/src/ \
    --host-compilation C -arch sm_13 \
    -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcufft \
    -I/home/vivek/NVIDIA_CUDA_SDK/common/inc/ -L/home/vivek/NVIDIA_CUDA_SDK/lib -lcutil

After debugging for a long time: the results from the kernel execution on the device do not match the C program at some random indexes, while the kernel execution in emulation mode matches the C program completely. The only explanation I can think of is that emulation mode has no physical memory limitation, whereas on the device the array may be too big to fit in memory.

I also tried splitting the kernel calls so that only one kernel is called 3096 times in a single loop. Still no change in results.

Your code looks right. The only thing I might suggest is trying it with floats. Whether the results then match emulation or not, it may help eliminate some possible problems.


Device emulation is NOT complete and will NOT expose race conditions properly. You need to check your code for race conditions; that is what this looks like.

The two cudaMallocHost() allocations of 96 MB each can be a big problem if you don't have a lot of RAM. Consider doing them one by one (freeing one before allocating the next).

Check for errors.
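One way to act on both points above — a sketch reusing the variable names from the first post and the checkCUDAError helper posted further down, with a single pinned buffer instead of two:

```cuda
// Copy and inspect one array at a time, so only one ~96 MB pinned buffer
// is alive at any moment, and check every CUDA call for errors.
double *tmp_h;
size_t bytes = sizeof(double) * Nx * Ny * sizez;

cudaMallocHost((void**) &tmp_h, bytes);
checkCUDAError("cudaMallocHost");

cudaMemcpy(tmp_h, zonez1, bytes, cudaMemcpyDeviceToHost);
checkCUDAError("memcpy zonez1");
/* ... inspect zonez1 values in tmp_h ... */

cudaMemcpy(tmp_h, zonez2, bytes, cudaMemcpyDeviceToHost);
checkCUDAError("memcpy zonez2");
/* ... inspect zonez2 values in tmp_h ... */

cudaFreeHost(tmp_h);   // pinned memory must be released with cudaFreeHost, not free()
checkCUDAError("cudaFreeHost");
```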

I did a cudaFree call before checking the values generated by the kernel.

However, how should I free the cudaMallocHost allocation? I get this error:

Cuda error: cudaFree calls: invalid device pointer.

when I do this:



    checkCUDAError("cudaFree calls");

The checkCUDAError function is:

The checkCUDAError function is:

    void checkCUDAError(const char *msg)
    {
        cudaError_t err = cudaGetLastError();
        if (cudaSuccess != err)
        {
            fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
        }
    }

When I do a free(zonez2_h); instead, I get a segmentation fault.

Any suggestions?

Also, Sarnath: I am using a C1060 on a host machine that has 4 GB of RAM. Should I still run into memory issues?

Jamie: why floats? Any particular reason?

You need to use the matching free — for cudaMallocHost that is:

    cudaFreeHost(zonez1_h);
    cudaFreeHost(zonez2_h);

If you use the wrong one, it will give you an error.

You have plenty of memory on the C1060.

I was thinking, if the problem persists with floats, then it doesn’t have to do with 1.3 capability or -arch sm_13, and that possibility can be eliminated.

Now judging from the posts of Sarnath and others, it looks like it has to do with your memory allocation and freeing. Page-locked memory can be a limited resource, much more limited than regular pageable host memory.

mfatica: any suggestions on whether I should prefer floats to doubles in such large arrays?

Jamie: I understand that I should free host memory after checking the values, etc. But the arrays on the device still need to be passed to other functions, so I cannot free them in the middle of the program. Any suggestions?

They are not large arrays. Doubles will work just fine.

No progress yet freeing memory. Any suggestions?

My values are still not matching. For example, device values:
< zonez2[12509184] = 265568151417664542740425707193393877088933967732712906816850
< zonez2[12509185] = 0.000000
< zonez2[12509186] = -0.000000
< zonez2[12509187] = 0.000000
< zonez2[12509188] = -0.000000
< zonez2[12509189] = 0.000000

C program/emulation values:

zonez2[12509184] = 27.000000
zonez2[12509185] = 27.000000
zonez2[12509186] = 30.000000
zonez2[12509187] = 19.000000
zonez2[12509188] = 33.000000
zonez2[12509189] = 25.000000

From the code in the first post, you are using pointers to host memory in your kernel for u1, u2, v1 and v2.
Unless there are some cudaMallocs that are not shown, it will give you the right results in emulation ( because everything is running in host memory) but incorrect results when running on the GPU.

mfatica: I am using cudaMalloc to initialize all those pointers u1, u2, v1 and v2


    cudaMalloc((void**) &u1, sizeof(int) * sizez);
    cudaMalloc((void**) &u2, sizeof(int) * sizez);
    cudaMalloc((void**) &v1, sizeof(int) * sizez);
    cudaMalloc((void**) &v2, sizeof(int) * sizez);
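One thing worth double-checking here: cudaMalloc only allocates device memory; if the window coordinates are computed on the host, they still have to be copied over before the kernel loop. A hypothetical sketch (u1_h is an assumed host-side array, not shown in the thread):

```cuda
// cudaMalloc leaves u1 uninitialized on the device; the host-side
// coordinate values must be copied in explicitly before any launch.
int *u1_h = (int*) malloc(sizeof(int) * sizez);
/* ... fill u1_h[0..sizez-1] with the window coordinates ... */
cudaMemcpy(u1, u1_h, sizeof(int) * sizez, cudaMemcpyHostToDevice);
/* same for u2, v1, v2 (and u3, u4, v3, v4) */
```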


So is there a limit on single-dimensional arrays on the device? From my tests, my values are correct when I call the kernel 43 times; that is, the values are correct up to 4096*43 = 176128 indexes. When the kernel is called more than 43 times, the values in the array are wrong. I have to call the kernel 3096 times to calculate all the values in the array.

Any suggestions?

mfatica: I think there's a hard limit on 1D arrays on the device, because as long as I stick to this size (43*4096) my values are correct. Though in the next stage, my FFT fails for such a huge 1D array. Any pointers?

There must be something wrong with your code. 4 GB of memory is huge. I don't think there are any hard limits as you describe.

Sarnath: The kernel executes correctly up to a fixed size (i.e. 43*4096) in device mode, which is why I am thinking it may be a hard limit. I arrived at this limit by testing the values obtained from the kernel.

Also, I am passing a 2353x1430 image into the kernel. The kernel computes windows on this large image and gives me back the windows. I have been debugging but I cannot find a solution to this array limit.

Try putting cudaThreadSynchronize() inside the for loop and see if it has any effect.
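Something along these lines — the synchronize after each launch also makes cudaGetLastError() (via the checkCUDAError helper from earlier in the thread) report a failing launch at the exact iteration where it occurs:

```cuda
for (num = 0; num < sizez; ++num)
{
    zonez_create <<< grid3, block3 >>> (zonez1, im1_d, pix1, Nx, Ny, sizez, u1, u2, v1, v2, num);
    cudaThreadSynchronize();                 // block until this launch completes
    checkCUDAError("zonez_create zonez1");   // catches launch and execution errors

    zonez_create <<< grid3, block3 >>> (zonez2, im2_d, pix1, Nx, Ny, sizez, u3, u4, v3, v4, num);
    cudaThreadSynchronize();
    checkCUDAError("zonez_create zonez2");
}
```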