Are there memory limitations on Device when using large arrays? Tesla C1060

vivekv80 · April 1, 2009, 5:53pm

I have a kernel that need to be executed 3096 times. When I run this code in emulation, my results match with the simple C program. When I run this loop and kernel on the device, my values get messed up. Has some one seen this issue?

If everything looks correct in simulation, how can we determine what fails to work on the device. Is this a memory issue?? How can I resolve this ??

Is there a different way to malloc huge arrays?? the zonez array in this case has 12 million values (4096 x 3096).

[codebox]

    int Nx = 64;

    int Ny = 64;

    int sizez = 3096;

double *zonez1, *zonez2;

cudaMalloc((void**) &zonez1, sizeof(double) * Nx * Ny * sizez);

cudaMalloc((void**) &zonez2, sizeof(double) * Nx * Ny * sizez);

    dim3 block3(32,16);

dim3 grid3(2,4);



cutResetTimer(timer);

    cutStartTimer(timer);

for(num = 0; num < sizez; ++num)

{

	zonez_create <<< grid3, block3 >>> (zonez1, im1_d, pix1, Nx, Ny, sizez, u1, u2, v1, v2, num);

	zonez_create <<< grid3, block3 >>> (zonez2, im2_d, pix1, Nx, Ny, sizez, u3, u4, v3, v4, num);

}

cudaThreadSynchronize();



cutStopTimer( timer );  // Stop timer

float zt1 = cutGetTimerValue(timer);

printf("zonez_create computation time: %0.3f ms\n\n", zt1);



// Test zonez

double *zonez1_h, *zonez2_h;

cudaMallocHost((void**) &zonez1_h, sizeof(double) * Nx * Ny * sizez);

cudaMallocHost((void**) &zonez2_h, sizeof(double) * Nx * Ny * sizez);

cudaMemcpy(zonez1_h, zonez1, sizeof(double) * Nx * Ny * sizez, cudaMemcpyDeviceToHost);

cudaMemcpy(zonez2_h, zonez2, sizeof(double) * Nx * Ny * sizez, cudaMemcpyDeviceToHost);



for(i = 0; i < Nx * Ny * sizez; ++i)

{

	printf("zonez2[%d] = %lf\n", i, zonez2_h[i]);

}

[/codebox]

=========================================

Kernel:

[codebox]

global void zonez_create(double *zone, double *im, int pix1, int Nx, int Ny, int sizez, int *u1, int *u2, int *v1, int *v2, int num)

{

int i = blockIdx.x * blockDim.x + threadIdx.x;

int j = blockIdx.y * blockDim.y + threadIdx.y;

int z1 = u1[num]+i;

int z2 = v1[num]+j;

if (i < Nx && j < Ny && z1 < u2[num] && z2 < v2[num])

{

	zone[(num*Nx*Ny)+i*Nx+j] = im[z2*pix1+z1];

}

}

[/codebox]

tmurray · April 1, 2009, 6:00pm

101MB is not a huge array. You probably forgot to compile with -arch sm_13.

vivekv80 · April 1, 2009, 6:21pm

My run script looks like this:

nvcc -g -pg -D_DEBUG -o …/obj/cu_pivchk1 …/src/cu_pivchk1.cu \

–host-compilation C -arch sm_13 \

-I/usr/local/cuda/include -L/usr/local/cuda/lib -lcufft \

-I/home/vivek/NVIDIA_CUDA_SDK/common/inc/ -L/home/vivek/NVIDIA_CUDA_SDK/lib -lcutil

vivekv80 · April 1, 2009, 8:32pm

After debugging for a long time, the results from the kernel execution on the device do not match for some random indexes with the C program. The kernel execution in emulation mode gives a complete match with the results obtained from the C program. The only reason I can think of : in emulation mode, no physical memory limitation?? and in device the array is too big to fit in memory??

I also tried splitting the kernel calls, so that only kernel is called 3096 times in a single loop. Still no change in results.

Jamie_K · April 2, 2009, 3:27am

Your code looks right. The only thing I might suggest is try it with floats. If the results then match emulation (or of they don’t), it may help eliminate some possible problems.

Sarnath · April 2, 2009, 4:53am

@Vivek,

Device emulation is NOT a complete one and will NOT expose race conditions properly. you need to check your code for race conditions. That is what It looks like.

Sarnath · April 2, 2009, 5:00am

The cudaMallocHost() allocation for 96MB twice – can be a big problem if u dont have lot of RAM. Consider doing it one by one (after freeing one)

CHeck for errors

vivekv80 · April 2, 2009, 4:25pm

I did a cudaFree call before checking the values generated by kernel.

However, how should I free the cudaMallocHost call? I get an error:

Cuda error: cudaFree calls: invalid device pointer.

when I do this:

[codebox]

     cudaFree(zonez2_h);

checkCUDAError("cudaFree calls");[/codebox]

The checkCUDAError function is:

[codebox] void checkCUDAError(const char *msg)

{

cudaError_t err = cudaGetLastError();

if( cudaSuccess != err) 

{

    fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );

    exit(-1);

}

}[/codebox]

when I do a free(zonez2_h);

I get a segmentation fault.

ANy suggestions??

Also Sarnath, I am using a C1060 on a host machine that has 4GB RAM. Should I still run into memory issues?

Jamie: why floats? any particular reason?

mfatica · April 2, 2009, 4:33pm

You need to use the proper free:

cudaMallocHost/cudaFreeHost
cudaMalloc/cudaFree
malloc/free

If you use the wrong one, it will give you an error.

You have plenty of memory on the C1060,

Jamie_K · April 2, 2009, 5:14pm

I was thinking, if the problem persists with floats, then it doesn’t have to do with 1.3 capability or -arch sm_13, and that possibility can be eliminated.

Now judging from the posts of Sarnath and others, it looks like it has to do with your memory allocation and freeing. Page-locked memory can be a limited resource, much more limited than regular pageable host memory.

vivekv80 · April 2, 2009, 5:19pm

mfatica: any suggestions on whether I should prefer floats to doubles in such large arrays??

vivekv80 · April 2, 2009, 5:20pm

Jamie: I understand the fact that I should free memory on host after checking the values, etc. But the values on the device needs to be called with the other functions. I cannot free that in the middle of the program. Any suggestions??

mfatica · April 2, 2009, 5:24pm

They are not large arrays. Doubles will work just fine.

vivekv80 · April 2, 2009, 11:53pm

no progress yet freeing memory. ANy suggestions?

My values are still not matching,
for eg: device values:
< zonez2[12509184] = 265568151417664542740425707193393877088933967732712906816850
1275174723125390058306059571648957287645395757720904662978488
7255057533816305053713052474408564412946349599471930538872086
5364486388361036069090439470445556737492729629646210151749950
928582321198337863788221638986120438951496908800.000000
< zonez2[12509185] = 0.000000
< zonez2[12509186] = -0.000000
< zonez2[12509187] = 0.000000
< zonez2[12509188] = -0.000000
< zonez2[12509189] = 0.000000

C program/emulation values:

zonez2[12509184] = 27.000000
zonez2[12509185] = 27.000000
zonez2[12509186] = 30.000000
zonez2[12509187] = 19.000000
zonez2[12509188] = 33.000000
zonez2[12509189] = 25.000000

mfatica · April 3, 2009, 12:09am

From the code in the first post, you are using pointers to host memory in your kernel for u1, u2, v1 and v2.
Unless there are some cudaMallocs that are not shown, it will give you the right results in emulation ( because everything is running in host memory) but incorrect results when running on the GPU.

vivekv80 · April 3, 2009, 4:54pm

mfatica: I am using cudaMalloc to initialize all those pointers u1, u2, v1 and v2

[codebox]

    cudaMalloc((void**) &u1, sizeof(int) * sizez);

cudaMalloc((void**) &u2, sizeof(int) * sizez);



cudaMalloc((void**) &v1, sizeof(int) * sizez);

cudaMalloc((void**) &v2, sizeof(int) * sizez);

[/codebox]

So is there a limit on single dimensional array on the device? From my tests, my values are correct, when I call the kernel 43 times. This means that the values are correct upto 4096*43 = 176128 indexes. When the kernel is called more than 43 times, the values in the array are wrong. I have to call the kernel 3096 times to calculate all the values in the array.

Any suggestions?

vivekv80 · April 7, 2009, 12:20pm

mfatica: I think there’s a hard limit on 1D arrays on the device. because as long as I stick to this value (43*4096) my values are correct. Though in the next stage, my FFT fails for such a huge 1D array. any pointers??

Sarnath · April 8, 2009, 5:57am

There muss be sthg wrong with your code. 4GB Memory is huge. I dont think there r any hardlimits as you say

vivekv80 · April 8, 2009, 2:15pm

Sarnath: The kernel executes correctly for a fixed size (i.e. 43*4096) on device mode and I am thinking it may be a hard limit. Even I came to this limit by testing values obtained from the kernel.

Also, I am passing a (2353x1430) size image into the kernel. So this kernel is actually computing windows on this huge image and giving me back the windows. I have been debugging but I cannot find a solution to this array limit.

Sarnath · April 8, 2009, 2:23pm

Try to put cudaThreadSyncrhonize() inside the FOR loop and see if it has any effect.

Topic		Replies	Views
Kernel configuration and maximum array size problem. CUDA Programming and Performance	8	6567	January 25, 2009
limit of computation CUDA Programming and Performance	44	33137	April 8, 2008
Maximum Threads for Kernel Call CUDA Programming and Performance	38	16738	May 25, 2010
Size limitation for 1D Arrays in CUDA? CUDA Programming and Performance	9	18518	October 17, 2013
Problem with read access violation for large arrays in unified memory CUDA Programming and Performance	16	3315	July 16, 2018
Matrix multiplication ERRORS & few thoughts on CUDA Basic programming errors need correction CUDA Programming and Performance	14	13486	January 24, 2009
find maximum value in an array along with index CUDA Programming and Performance	40	19122	October 11, 2010
GTX460 Max of 1 Block? CUDA Programming and Performance	8	7247	July 3, 2011
[solved] Array not fully transfered to device memory CUDA Programming and Performance	17	2335	July 14, 2019
Working with really large arrays in CUDA (how to prevent negative indexes?) CUDA Programming and Performance	5	2394	November 19, 2019

Are there memory limitations on Device when using large arrays? Tesla C1060

Related topics