Maximum limit on the amount of pinned memory using cudaMallocHost()

Is there any limit on the maximum amount of host memory that can be pinned for asynchronous memory transfers? I tried pinning around 300 MB of memory on a machine with 3 GB of RAM and the process gets killed. Do I have to use some kind of workaround to get it working?

AFAIK the amount of memory that can be pinned is system dependent; it is not limited by the GPU. On my old rig I saw the limit somewhere near 500 MB. You could try allocating in small chunks of a few MB each to check whether the allocation size is the problem or whether you are simply hitting your system's limit on pinnable memory.
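If you want to see where the limit is on your machine, something like the sketch below keeps pinning modest chunks until cudaMallocHost() refuses (the 32 MB chunk size and the 1 GB cap are arbitrary choices, just for probing):

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main()
{
    const size_t chunkBytes = 32 << 20;   // 32 MB per allocation (arbitrary probe size)
    const size_t maxBytes   = 1u << 30;   // stop at 1 GB so the box doesn't run out of RAM
    std::vector<void*> chunks;
    size_t total = 0;

    // keep pinning chunks until the driver/OS refuses or the cap is reached
    while (total < maxBytes)
    {
        void* p = 0;
        if (cudaMallocHost(&p, chunkBytes) != cudaSuccess)
            break;
        chunks.push_back(p);
        total += chunkBytes;
    }

    printf("pinned a total of %zu MB before stopping\n", total >> 20);

    // release the pinned buffers again
    for (size_t i = 0; i < chunks.size(); ++i)
        cudaFreeHost(chunks[i]);

    return 0;
}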

Thanks a lot :) … I had a bug in the code and it's solved now … But I am running into another problem with asynchronous transfers…

I ran the asyncAPI sample on a GTX 480 and it gave the following output:

CUDA device [GeForce GTX 480]
time spent executing by the GPU: 33.52
time spent by CPU in CUDA calls: 33.52
CPU executed 15 iterations while waiting for GPU to finish

[asyncAPI] → Test Results:
PASSED

Now the time spent by the CPU in CUDA calls is the same as the time spent executing by the GPU, which clearly shows that the calls are not asynchronous… What can be the problem?

The same code works fine on a Tesla T10 Processor.

How do you deduce the call is not asynchronous? It might well be that the CPU time is not spent in the kernel call, but in a subsequent copy from device to host waiting for the data to become available.
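For instance, you could put a plain host timer around each call separately and see which one actually eats the ~33 ms. A standalone sketch along those lines (the array size, launch configuration and the wallMs() helper are made up for illustration, not taken from the SDK sample):

#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

// host wall-clock time in milliseconds
static double wallMs()
{
    timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

__global__ void increment_kernel(int* g_data, int value)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    g_data[idx] = g_data[idx] + value;
}

int main()
{
    const int n = 16 * 1024 * 1024;           // 16M ints = 64 MB (arbitrary)
    const size_t nbytes = n * sizeof(int);
    int *a = 0, *d_a = 0;
    cudaMallocHost((void**)&a, nbytes);       // pinned host buffer
    cudaMalloc((void**)&d_a, nbytes);

    dim3 threads(512);
    dim3 blocks(n / 512);

    // time the enqueue cost of each call on the host
    double t0 = wallMs();
    cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0);
    double t1 = wallMs();
    increment_kernel<<<blocks, threads>>>(d_a, 1);
    double t2 = wallMs();
    cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0);
    double t3 = wallMs();
    cudaDeviceSynchronize();                  // wait for everything queued so far
    double t4 = wallMs();

    printf("H2D copy enqueue : %.3f ms\n", t1 - t0);
    printf("kernel launch    : %.3f ms\n", t2 - t1);
    printf("D2H copy enqueue : %.3f ms\n", t3 - t2);
    printf("synchronize      : %.3f ms\n", t4 - t3);

    cudaFreeHost(a);
    cudaFree(d_a);
    return 0;
}

If the three enqueues come back in microseconds and all the time sits in the synchronize, the calls are asynchronous after all and the CPU time in your measurement is simply the wait for the result.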

The code is like this:

cutilCheckError( cutStartTimer(timer) );
cudaEventRecord(start, 0);
cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0);
increment_kernel<<<blocks, threads, 0, 0>>>(d_a, value);
cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0);
cudaEventRecord(stop, 0);
cutilCheckError( cutStopTimer(timer) );

// have CPU do some work while waiting for stage 1 to finish
unsigned long int counter = 0;
while( cudaEventQuery(stop) == cudaErrorNotReady )
{
    counter++;
}

cutilSafeCall( cudaEventElapsedTime(&gpu_time, start, stop) );

// print the cpu and gpu times
printf("time spent executing by the GPU: %.2f\n", gpu_time);
printf("time spent by CPU in CUDA calls: %.2f\n", cutGetTimerValue(timer) );
printf("CPU executed %lu iterations while waiting for GPU to finish\n", counter);

Now, since all the calls around the kernel are asynchronous, control should return to the host almost immediately, and the host should keep incrementing the counter until the CUDA work completes. But as seen in the output, control is never really returned to the host; the CUDA calls block the host and behave like synchronous calls. The copy from device to host is also asynchronous, since cudaMemcpyAsync is used rather than cudaMemcpy.

Ah, ok.
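One more thing that might be worth ruling out: cudaMemcpyAsync only returns control to the host immediately when the host buffer is page-locked (allocated with cudaMallocHost); with an ordinary malloc'ed buffer the device-to-host copy behaves essentially like a synchronous one. A small standalone sketch to compare the two cases (the 64 MB size and the enqueueMs() helper are only for illustration):

#include <cstdio>
#include <cstdlib>
#include <sys/time.h>
#include <cuda_runtime.h>

// host wall-clock time in milliseconds
static double wallMs()
{
    timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

// how long the *enqueue* of an async device-to-host copy takes for a given host buffer
static double enqueueMs(int* host, int* dev, size_t nbytes)
{
    double t0 = wallMs();
    cudaMemcpyAsync(host, dev, nbytes, cudaMemcpyDeviceToHost, 0);
    double t1 = wallMs();
    cudaDeviceSynchronize();                  // drain the queue before the next measurement
    return t1 - t0;
}

int main()
{
    const size_t nbytes = 64 << 20;           // 64 MB, arbitrary
    int* d_a = 0;
    cudaMalloc((void**)&d_a, nbytes);

    int* pageable = (int*)malloc(nbytes);     // ordinary pageable host memory
    int* pinned = 0;
    cudaMallocHost((void**)&pinned, nbytes);  // page-locked host memory

    printf("pageable buffer, enqueue: %.3f ms\n", enqueueMs(pageable, d_a, nbytes));
    printf("pinned buffer,   enqueue: %.3f ms\n", enqueueMs(pinned, d_a, nbytes));

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(d_a);
    return 0;
}

If the pageable enqueue takes milliseconds while the pinned one returns almost instantly, that difference is the synchronous fallback at work. If I remember right, the stock asyncAPI sample allocates a with cudaMallocHost, so this would only apply if that allocation was changed along the way.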