Are there memory limitations on Device when using large arrays? Tesla C1060

Sarnath:

My diff output:

CUDA output

< zonez2[299008] = 86.000000

< zonez2[299009] = 65.000000

< zonez2[299010] = 84.000000

< zonez2[299011] = 66.000000

< zonez2[299012] = 71.000000

C output

zonez2[299008] = 73.000000

zonez2[299009] = 76.000000

zonez2[299010] = 72.000000

zonez2[299011] = 76.000000

zonez2[299012] = 67.000000

I did put the cudaThreadSynchronize() after the kernel call, inside the loop. Also, the 299008 value corresponds to 4096x73. Previously I was able to get correct results up to 43x4096 values, and then the values got messed up.

My kernel call:

[codebox]

    for(num = 0; num < 86; ++num)

{

	zonez_create <<< grid3, block3 >>> (zonez2, im2_d, pix1, Nx, Ny, sizez, u3, u4, v3, v4, num);

	cudaThreadSynchronize();

    }

// Test zonez

double *zonez2_h;

cudaMallocHost((void**) &zonez2_h, sizeof(double) * Nx * Ny * sizez);

checkCUDAError("cudaMallocHost calls");

cudaMemcpy(zonez2_h, zonez2, sizeof(double) * Nx * Ny * sizez, cudaMemcpyDeviceToHost);

checkCUDAError("cudaMemcpy calls");

for(i = 0; i < Nx * Ny * 86; ++i)

{

	printf("zonez2[%d] = %lf\n", i, zonez2_h[i]);

}

cudaFreeHost(zonez2_h);

checkCUDAError("cudaFreeHost calls");

[/codebox]

Any suggestions?

In one place you are using “sizez” and in another place you are using “86” (even in the kernel argument you are using sizez…)…

Will that be a problem?

Thanks.
But is that going to make a difference? All the arrays have been initialized with sizez=3096. While testing/debugging, the values I got were not matching and hence I decided to scale down the kernel calls from 3096 to 86. Does that mean I need to change the array initialization?
I will try out your suggestion and compute the kernel for 3096 values and then check the values for the 86x4096 values.

Also, is it possible to call all the kernels in a loop?
I am thinking that if I can compute a kernel 43*4096 times and then just call that kernel 72 times, I am effectively doing 43*72*4096 kernel calls. Has anyone tried it?
for example:

for (num = 0; num < 72; num++)
{
kernel1
cudaThreadSynchronize()
kernel2 …kernel n
}

You own the code. So, you should take a stand on it. I was merely hinting at possible areas of trouble.

Should be possible…

You are missing cudaThreadSynchronize() for the 2nd launch. Your code should be like:

for (num = 0; num < 72; num++)

{

kernel1

cudaThreadSynchronize()

kernel2

cudaThreadSynchronize()

kernel3

cudaThreadSynchronize()

kernel n

cudaThreadSynchronize()

}

This won't affect performance much. If you are continuously queuing kernel launches, the driver MAY reject them, because as of now we don't have any information about the queue the driver maintains for kernel launches. That's why we are trying to play safe here, by not queuing more than one at a time, and seeing if that helps.

There is no need for a cudaThreadSynchronize.

So, say I queue a kernel that will take 30 seconds to complete. Say I queue this kernel in a FOR loop continuously…

Won't the driver reject the kernel launch after some time (due to some buffer getting full)? Or will such kernel calls get blocked after some time?

There will be a synchronize ‘inserted’ when the buffer of kernel launches is full. You can see this when you time a kernel in a for loop without calling cudaThreadSynchronize() before the point where you read off the elapsed time.
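For example, a rough way to see that queuing behaviour (a sketch only; my_kernel, grid3 and block3 here are placeholders, not code from this thread):

[codebox]
#include <sys/time.h>

// wall-clock helper
double wallclock(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

// inside main(), after grid3/block3 are set up:
double t0 = wallclock();

for (int num = 0; num < 72; ++num)
{
    my_kernel <<< grid3, block3 >>> (/* args */);   // asynchronous: only queued
}
printf("after queuing launches: %.3f s\n", wallclock() - t0);

cudaThreadSynchronize();                            // block until the GPU has finished
printf("after synchronize:      %.3f s\n", wallclock() - t0);
[/codebox]

The first number stays tiny as long as the launches only queue up; once the driver's launch buffer fills, the extra time shows up in the loop itself.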

question:

following the above discussion, I modified my kernel calls from 3096 to 43. However, when I try to find the maximum value in 4096 blocks, the device gives me a wrong value.

If I compute the 4096 blocks on the device only once, I get the correct value. Page-locked memory issue??

My code:

[codebox]

__global__ void find_corrmaxz(Complex *zmax, cufftComplex *f4, int *shifted_x, int *shifted_y, int *reg_x, int *reg_y, int Nx, int Ny, int *fftindx, int *fftindy, int num)

{

int i = blockIdx.x * blockDim.x + threadIdx.x;

int j = blockIdx.y * blockDim.y + threadIdx.y;



if(f4[(num*Nx*Ny)+i*Nx+j].x>zmax[num].x)

{

	zmax[num].x = f4[(num*Nx*Ny)+i*Nx+j].x;

	reg_x[num] = j;

	reg_y[num] = i;

	shifted_x[num] = fftindy[j];

	shifted_y[num] = fftindx[i];

    }

}

Complex *zmax;

cudaMalloc((void**) &zmax, sizeof(Complex) * sizez);

int *s_x, *s_y, *r_x, *r_y;

cudaMalloc((void**) &s_x, sizeof(int) * sizez);

cudaMalloc((void**) &s_y, sizeof(int) * sizez);

cudaMalloc((void**) &r_x, sizeof(int) * sizez);

cudaMalloc((void**) &r_y, sizeof(int) * sizez);

    int *fftindx_d, *fftindy_d;

cudaMalloc((void**) &fftindx_d, sizeof(int) * Nx);

cudaMalloc((void**) &fftindy_d, sizeof(int) * Ny);

cudaMalloc((void**) &f4z_d, sizeof(cufftComplex) * Nx * Ny * sizez); // Batch Inverse FFT output

for(num = 0; num < 43; ++num)

{

	find_corrmaxz <<< grid3, block3 >>> (zmax, f4z_d, s_x, s_y, r_x, r_y, Nx, Ny, fftindx_d, fftindy_d, num);

	cudaThreadSynchronize();

}[/codebox]

Results when I compute for a single 4096 block:

zmax = 16049790976.000000, shifted_x = 33, shifted_y = 32… reg_x = 1, reg_y = 0

which means f4[1] is the maximum value.

Results when I compute for 43x4096 blocks:

zmax_h[0] = 14039384064.000000, shifted_x = 55, shifted_y = 31… reg_x = 23, reg_y = 63

zmax_h[1] = 14039384064.000000, shifted_x = 55, shifted_y = 31… reg_x = 23, reg_y = 63

zmax_h[2] = 13200375808.000000, shifted_x = 55, shifted_y = 63… reg_x = 23, reg_y = 31

zmax_h[3] = 14039384064.000000, shifted_x = 55, shifted_y = 31… reg_x = 23, reg_y = 63

zmax_h[4] = 13225515008.000000, shifted_x = 52, shifted_y = 31… reg_x = 20, reg_y = 31

zmax_h[5] = 13200375808.000000, shifted_x = 55, shifted_y = 63… reg_x = 23, reg_y = 31

zmax_h[6] = 14039384064.000000, shifted_x = 55, shifted_y = 31… reg_x = 23, reg_y = 63

zmax_h[7] = 14039384064.000000, shifted_x = 55, shifted_y = 31… reg_x = 23, reg_y = 63

zmax_h[8] = 13391756288.000000, shifted_x = 55, shifted_y = 31… reg_x = 23, reg_y = 63

which means that the maximum value is either f4[4055] or f4[2007] or f4[4087].

So this indicates that the kernel fails to find the max value among 4096 values when it has about 43x4096 values in memory. I am doing a cudaFree and cudaFreeHost wherever possible.

Can someone advise?

When do we use __syncthreads() inside a kernel?
Is using cudaThreadSynchronize() inside a kernel the same?
What is the page-locked memory limit on CUDA devices?

To synchronize the threads inside a block during kernel execution.
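For example, a common pattern is a block-level max reduction: every thread writes a value into shared memory, and __syncthreads() guarantees all writes are visible before the next step reads them. A minimal sketch (assuming 256 threads per block and float data; this is not code from the program above):

[codebox]
// Find the maximum of one 256-element tile per block.
__global__ void block_max(const float *in, float *out)
{
    __shared__ float smax[256];                 // one slot per thread in the block

    int tid = threadIdx.x;
    smax[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                            // all loads done before anyone reads

    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s && smax[tid + s] > smax[tid])
            smax[tid] = smax[tid + s];
        __syncthreads();                        // finish this round before the next
    }

    if (tid == 0)
        out[blockIdx.x] = smax[0];              // one result per block
}
[/codebox]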

You can't use cudaThreadSynchronize() inside a kernel. You can only use it in your host program.

And, that will cause the host to wait for the GPU kernel to complete. Using this feature, you can choose to either wait for the GPU kernel to complete OR do some other work and later come back and synchronize… (CPU + GPU overlap)
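A minimal sketch of that overlap pattern (my_kernel, do_cpu_work and the buffers are placeholders):

[codebox]
// Kernel launches return immediately, so the host can do useful work
// while the GPU is busy, and synchronize only when the results are needed.
my_kernel <<< grid, block >>> (d_data);   // asynchronous launch

do_cpu_work();                            // runs on the CPU while the GPU computes

cudaThreadSynchronize();                  // block until the kernel has finished
cudaMemcpy(h_data, d_data, nbytes, cudaMemcpyDeviceToHost);
[/codebox]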

Page-locked memory resides in your system RAM and NOT on the CUDA device. How much you can allocate depends on the total RAM installed on your computer, the amount of memory pressure on your system (the number of applications that are running) and the operating system you use. It is very difficult to define hard limits.
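Because of that, it is worth checking whether a cudaMallocHost() actually succeeds rather than assuming it did; a sketch in the same style as the code above:

[codebox]
double *buf = NULL;
cudaError_t err = cudaMallocHost((void**) &buf, sizeof(double) * Nx * Ny * sizez);
if (err != cudaSuccess)
{
    // the OS could not pin that much memory; reduce the request or fall back to malloc
    fprintf(stderr, "cudaMallocHost failed: %s\n", cudaGetErrorString(err));
    exit(-1);
}
[/codebox]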

Thanks Sarnath.
I asked those questions because my kernel fails in device mode to find the maximum value in an array. In emulation mode, the results are correct. I cannot debug the kernel on the device as I am on a 32-bit Linux OS.

I am 99.99% sure that it is a bug in your program and not with TESLA.

Few tips to debug without a debugger:

  1. Try reducing the complexity and create a simple program and see if it works.

  2. Return from the middle of your kernel after writing some status value in the global memory. You can examine global memory later and verify if the values are as expected.
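For tip 2, a sketch of the idea (the debug_buf buffer and the early return are illustrative additions, not part of the original kernel):

[codebox]
__global__ void find_corrmaxz_debug(double *f4, double *debug_buf, int Nx, int Ny, int num)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    // record the value this thread sees, then stop early
    debug_buf[i * Nx + j] = f4[(num * Nx * Ny) + i * Nx + j];
    return;

    // ...the rest of the kernel never runs in this debug build...
}
[/codebox]

Copy debug_buf back with cudaMemcpy and print it on the host, the same way zonez2 was checked earlier in the thread.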

Perhaps post the entire program, or better yet a simplified test case that is complete enough for someone else to reproduce your unexpected results.

Ideally, I have 43 blocks, with each block containing 4096 values. Now in C, I just found the highest value in a block and recorded its position (x and y points)

In CUDA, I see that in spite of using a cudaThreadSynchronize() after the kernel call, my positions for the highest value in a block are not correct.
Expected behavior: The kernel launches with one block of 4096 values, finds the highest value and records the position. This is repeated for the remaining kernel calls.

Expected behavior matches the emulation mode.

Implementation behavior: The kernel finds the highest value in the first 5 kernel calls, after which it gives an error (meaning it does not find the highest value). I tried using __syncthreads() inside the kernel but that has not helped.

Any suggestions or advice?

Yes. As someone already suggested, post the simplest, complete example which demonstrates the bug, because posting random snippets of code or pseudocode doesn’t really help if the problem is something subtle, like a race or cache coherency problem.

Do a thorough code review and fix your bug.

As mentioned before:

here’s the complete program that finds the maximum among 4096 values.

[codebox]

// includes, system

#include <stdlib.h>

#include <stdio.h>

#include <string.h>

#include <math.h>

#include <time.h>

#include <assert.h>

// includes, project

#include <cutil.h>

#include <cufft.h>

//functions

double frand (void)

{

double value;

value = ((double) rand()/(RAND_MAX));

return value;

}

//kernels

// Simple utility function to check for CUDA runtime errors

void checkCUDAError(const char *msg)

{

cudaError_t err = cudaGetLastError();

if( cudaSuccess != err) 

{

    fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );

    exit(-1);

}                         

}

__global__ void fft_indx(int *fftind, int n, int inc)

{

int i = blockIdx.x * blockDim.x + threadIdx.x;



if(i<inc)

{

	fftind[i] = i + inc;

}

else

{

	fftind[i] = i - inc;

}

}

__global__ void find_corrmaxz(double *zmax, double *f4, int *shifted_x, int *shifted_y, int *reg_x, int *reg_y, int Nx, int Ny, int *fftindx, int *fftindy, int *edgel_x, int *edger_x, int *edgel_y, int *edger_y, int num)

{

int i = blockIdx.x * blockDim.x + threadIdx.x;

int j = blockIdx.y * blockDim.y + threadIdx.y;



zmax[num] = f4[(num*Nx*Ny)+num];

if(i < Nx && j < Ny )

{

	if(f4[(num*Nx*Ny)+i*Nx+j] >= zmax[num])

	{

		zmax[num] = f4[(num*Nx*Ny)+i*Nx+j];

		reg_x[num] = j;

		reg_y[num] = i;

		shifted_x[num] = fftindy[j];

		shifted_y[num] = fftindx[i];

	}

	

	if(reg_x[num] == 0)

	{

		edgel_x[num] = Nx - 1;

		edger_x[num] = reg_x[num] + 1;

	}

	else if (reg_x[num] == Nx-1)

	{

		edgel_x[num] = reg_x[num] - 1;

		edger_x[num] = 0;

	}

	else

	{

		edgel_x[num] = reg_x[num] - 1;

		edger_x[num] = reg_x[num] + 1;

	}

	if(reg_y[num] == 0)

	{

		edgel_y[num] = Ny - 1;

		edger_y[num] = reg_y[num] + 1;

	}

	else if (reg_y[num] == Ny-1)

	{

		edgel_y[num] = reg_y[num] - 1;

		edger_y[num] = 0;

	}

	else

	{

		edgel_y[num] = reg_y[num] - 1;

		edger_y[num] = reg_y[num] + 1;

	}				

}

// printf("edgel_x[%d] = %d, edger_x[%d] = %d, edgel_y[%d] = %d, edger_y[%d] = %d\n", num, edgel_x[num], num, edger_x[num], num, edgel_y[num], num, edger_y[num]);

}

/************************* Main ************************/

int main()

{

// check the compute capability of the device

int cuda_device = 0;

int num_devices=0;

float elapsed_time;

unsigned int timer;

cutCreateTimer(&timer);

CUDA_SAFE_CALL( cudaGetDeviceCount(&num_devices) );

if(0==num_devices)

{

    printf("your system does not have a CUDA capable device\n");

    return 1;

}



// check if the command-line chosen device ID is within range, exit if not

if( cuda_device >= num_devices )

{

    printf("choose device ID between 0 and %d\n", num_devices-1);

    return 1;

}

cudaSetDevice( cuda_device );

cudaDeviceProp device_properties;

CUDA_SAFE_CALL( cudaGetDeviceProperties(&device_properties, cuda_device) );

if( (1 == device_properties.major) && (device_properties.minor < 1))

{

	printf("%s does not have compute capability 1.1 or later\n\n", device_properties.name);

}



printf("running on: %s\n\n", device_properties.name );



// create CUDA event handles

cudaEvent_t start_event, stop_event;

cudaEventCreate(&start_event);

cudaEventCreate(&stop_event);



cudaEventRecord(start_event, 0);



int Nx = 64;

int Ny = 64;

int sizex = 72;

int sizey = 43;

int sizez = 3096;	

unsigned int seed = 123456789;

int i;

int incx = Nx >> 1;

int incy = Ny >> 1;



double *f1, *f1_d;

cudaMallocHost((void**) &f1, sizeof(double) * Nx * Ny * sizez);

checkCUDAError("cudaMallocHost calls");



cudaMalloc((void**) &f1_d, sizeof(double) * Nx * Ny * sizez);

checkCUDAError("cudaMalloc calls");

// srand(seed);



for(i = 0; i < Nx * Ny * sizez; ++i)

{

	f1[i] = frand();

	// printf("f[%d] = %lf\n",i, f1[i] );

}



cudaMemcpy(f1_d, f1, sizeof(double) * Nx * Ny * sizez, cudaMemcpyHostToDevice);

checkCUDAError("cudaMemcpy calls");





/* Initialize fftindx and fftindy */

int *fftindx_d, *fftindy_d;

cudaMalloc((void**) &fftindx_d, sizeof(int) * Nx);

cudaMalloc((void**) &fftindy_d, sizeof(int) * Ny);



// Check for any CUDA errors

checkCUDAError("cudaMalloc calls");



dim3 block3(32,16);

dim3 grid3(2,4);

fft_indx <<< grid3, block3 >>> (fftindx_d, Nx, incx);

fft_indx <<< grid3, block3 >>> (fftindy_d, Ny, incy);



double *zmax;

double *zmax_h;

cudaMalloc((void**) &zmax, sizeof(double) * sizez);

cudaMallocHost((void**) &zmax_h, sizeof(double) * sizez);

cudaMemset(zmax, 0, sizeof(double) * sizez);



int *s_x, *s_y, *r_x, *r_y;

cudaMalloc((void**) &s_x, sizeof(int) * sizez);

cudaMalloc((void**) &s_y, sizeof(int) * sizez);

cudaMalloc((void**) &r_x, sizeof(int) * sizez);

cudaMalloc((void**) &r_y, sizeof(int) * sizez);

cudaMemset(s_x, 0, sizeof(int) * sizez);

cudaMemset(s_y, 0, sizeof(int) * sizez);

cudaMemset(r_x, 0, sizeof(int) * sizez);

cudaMemset(r_y, 0, sizeof(int) * sizez);



int *edgel_x, *edger_x, *edgel_y, *edger_y;

cudaMalloc((void**) &edgel_x, sizeof(int) * sizez);

cudaMalloc((void**) &edgel_y, sizeof(int) * sizez);

cudaMalloc((void**) &edger_x, sizeof(int) * sizez);

cudaMalloc((void**) &edger_y, sizeof(int) * sizez);

cudaMemset(edgel_x, 0, sizeof(int) * sizez);

cudaMemset(edgel_y, 0, sizeof(int) * sizez);

cudaMemset(edger_x, 0, sizeof(int) * sizez);

cudaMemset(edger_y, 0, sizeof(int) * sizez);



int *s_xh, *s_yh, *r_xh, *r_yh;

cudaMallocHost((void**) &s_xh, sizeof(int) * sizez);

cudaMallocHost((void**) &s_yh, sizeof(int) * sizez);

cudaMallocHost((void**) &r_xh, sizeof(int) * sizez);

cudaMallocHost((void**) &r_yh, sizeof(int) * sizez);



int *edgel_xh, *edger_xh, *edgel_yh, *edger_yh;

cudaMallocHost((void**) &edgel_xh, sizeof(int) * sizez);

cudaMallocHost((void**) &edgel_yh, sizeof(int) * sizez);

cudaMallocHost((void**) &edger_xh, sizeof(int) * sizez);

cudaMallocHost((void**) &edger_yh, sizeof(int) * sizez);



for(int num = 0; num < sizey; ++num)

{

	find_corrmaxz <<< grid3, block3 >>> (zmax, f1_d, s_x, s_y, r_x, r_y, Nx, Ny, fftindx_d, fftindy_d, edgel_x, edger_x, edgel_y, edger_y, num);

	cudaThreadSynchronize();

}

cudaMemcpy(zmax_h, zmax, sizeof(double) * sizez, cudaMemcpyDeviceToHost);

cudaMemcpy(s_xh, s_x, sizeof(int) * sizez, cudaMemcpyDeviceToHost);

cudaMemcpy(s_yh, s_y, sizeof(int) * sizez, cudaMemcpyDeviceToHost);

cudaMemcpy(r_xh, r_x, sizeof(int) * sizez, cudaMemcpyDeviceToHost);

cudaMemcpy(r_yh, r_y, sizeof(int) * sizez, cudaMemcpyDeviceToHost);



cudaMemcpy(edgel_xh, edgel_x, sizeof(int) * sizez, cudaMemcpyDeviceToHost);

cudaMemcpy(edgel_yh, edgel_y, sizeof(int) * sizez, cudaMemcpyDeviceToHost);

cudaMemcpy(edger_xh, edger_x, sizeof(int) * sizez, cudaMemcpyDeviceToHost);

cudaMemcpy(edger_yh, edger_y, sizeof(int) * sizez, cudaMemcpyDeviceToHost);

for(i = 0; i < sizey; ++i)

{

	printf("zmax_h[%d] = %lf, shifted_x = %d, shifted_y = %d... reg_x = %d, reg_y = %d\n", i, zmax_h[i], s_xh[i], s_yh[i], r_xh[i], r_yh[i]);

	printf("edgel_x[%d] = %d, edger_x[%d] = %d, edgel_y[%d] = %d, edger_y[%d] = %d\n", i, edgel_xh[i], i, edger_xh[i], i, edgel_yh[i], i, edger_yh[i]);		

}

cudaFreeHost(s_xh);

cudaFreeHost(s_yh);

cudaFreeHost(r_xh);

cudaFreeHost(r_yh);

cudaFreeHost(zmax_h);

cudaFreeHost(edgel_xh);

cudaFreeHost(edgel_yh);

cudaFreeHost(edger_xh);

cudaFreeHost(edger_yh);

checkCUDAError("cudaFreeHost calls");



// Check elapsed time

cudaEventRecord(stop_event, 0);

cudaEventSynchronize(stop_event);

cudaEventElapsedTime(&elapsed_time, start_event, stop_event);

cudaEventDestroy(start_event);

cudaEventDestroy(stop_event);



cudaFreeHost(f1);

cudaFree(f1_d);

cudaFree(fftindx_d);

cudaFree(fftindy_d);

cudaFree(s_x);

cudaFree(s_y);

cudaFree(r_x);

cudaFree(r_y);

cudaFree(zmax);

cudaFree(edgel_x);

cudaFree(edgel_y);

cudaFree(edger_x);

cudaFree(edger_y);

checkCUDAError("cudaFree calls");



printf("\n\n Elapsed time on GPU = %.5f ms\n", elapsed_time);

return 0;

}

[/codebox]

So the kernel computes the maximum value among 4096 values and then records its position.

Doubts:

  1. Is this the correct way to distribute this kind of data search within a kernel?

  2. I need to run the kernel 3096 times. But I am sticking to 43 times, because the data printed out from the kernel is incorrect after 43. (Page-locked memory limit??)

  3. I haven’t used cudaThreadSynchronize() as that is important only for timing the kernel.

You have a race condition.

If I read this correctly, you have 4096 2D images, each of which is 64x64. For each image, you want to find the value and location of the greatest pixel. For each image, you launch 64x64 threads, and then for each thread, you test if the pixel for this particular thread is greater than the max found so far, and if so, you update the max value and record other information such as the location within the 64x64 image.

This is a classic race condition. Suppose the max is 0.1, and for one thread the pixel is 0.3, and for the other thread the pixel is 0.5. Both will determine that the value is greater than the max. One thread will attempt to set the value to 0.3, and the other will attempt to set the value to 0.5. It’s anyone’s guess as to which will win. And likewise the pixel location and other information will conflict and not always come from the greatest pixel.

The precise scheduling differs between device and emulation, and can even differ from one execution to the next. In emulation, there are far fewer cores, which might force serialization and give correct results even though the algorithm is incorrect as a parallel algorithm.

Rewrite your algorithm to launch one thread for each of the 4096 images (or however many you have). Then within each thread, search the image and record the output to zmax and associated other values. Then the threads will not conflict with each other.

Wow !!!
Thanks for that explanation. I knew I had to rewrite the kernel, but I needed to know the reason.
Doubts:

  1. When you mention launching just one thread to search 64x64 values, what did you mean? I am using
    dim3 block3(32,16);
    dim3 grid3(2,4);

After your suggestion should I be using
dim3 block3(1,1);
dim3 grid3(64,64);

Won’t this violate the 512-threads-per-block limit? Effectively, I am launching a grid of 4096 blocks with one thread each, so I am a bit confused. (I tried the 4096-block grid before writing this and it did not change the results.)

  2. Would shared memory usage benefit this kernel? I have never used shared memory, but after going through the SDK examples, I am thinking of using it.

Thanks again for all the suggestions !!!

What I meant was a total of 4096 threads. Meaning threads per block * number of blocks = 4096. Threads per block must not be greater than 512. Threads per block should be a multiple of 32, preferably 64 or more. Number of blocks should be at least the number of MPs, preferably 2 or 3 times that number or more.

The simplest way to do it would be

int sizez = 4096;

int blocksize = 64;  // should be a multiple of 32

int nblocks = (sizez+blocksize-1)/blocksize; // round up if blocksize does not evenly divide sizez

find_max<<<nblocks, blocksize>>>(etc.)

Then within the kernel, something like this:

int z = blockIdx.x*blockDim.x + threadIdx.x;

if (z < sizez) {

  max = pixel at (0, 0, z) - 100; // force update

  for (int y=0; y < Ny; y++) {

	for (int x=0; x < Nx; x++) {

	  if (pixel at (x, y, z) is greater than max) {

		update max;

		xlocation = x;

		ylocation = y;

	  }

	}

  }

  save max, xlocation, and ylocation to global memory.

}

By the way, this also means eliminating the loop over z. A single kernel can find all the max locations without looping.
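A concrete sketch of that one-thread-per-image approach, matching the double data and 64x64 layout used in the program above (the kernel name, the launch parameters, and the omission of the shifted_x/shifted_y bookkeeping are my own simplifications):

[codebox]
// One thread per 64x64 image: each thread scans its own image serially,
// so no two threads ever write the same zmax/reg entry and there is no race.
__global__ void find_corrmax_all(const double *f4, double *zmax,
                                 int *reg_x, int *reg_y,
                                 int Nx, int Ny, int sizez)
{
    int z = blockIdx.x * blockDim.x + threadIdx.x;
    if (z >= sizez) return;

    const double *img = f4 + z * Nx * Ny;
    double best = img[0];
    int bx = 0, by = 0;

    for (int y = 0; y < Ny; ++y)
        for (int x = 0; x < Nx; ++x)
            if (img[y * Nx + x] > best)
            {
                best = img[y * Nx + x];
                bx = x;
                by = y;
            }

    zmax[z]  = best;      // one writer per index: no conflict
    reg_x[z] = bx;
    reg_y[z] = by;
}

// launch: one pass over all images, no host-side loop over num
// int blocksize = 64;
// int nblocks = (sizez + blocksize - 1) / blocksize;
// find_corrmax_all <<< nblocks, blocksize >>> (f1_d, zmax, r_x, r_y, Nx, Ny, sizez);
[/codebox]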