Maximum limit on the amount of pinned memory using cudaMallocHost()

Is there any limit on the maximum amount of host memory that can be pinned for asynchronous memory transfers? I tried pinning around 300 MB of memory on a machine with 3 GB of RAM and the process gets killed. Do I have to use some kind of workaround to get it working?

AFAIK the amount of memory that can be pinned is system dependent; it is not limited by the GPU. On my old rig I saw the limit somewhere near 500 MB. You could try allocating in small chunks of a few MB each to check whether the allocation size is the problem or whether you are simply hitting your system's limit on pinnable memory.
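If you want to see where the limit is on your machine, something like the sketch below keeps pinning modest chunks until cudaMallocHost() refuses (the 32 MB chunk size and the 1 GB cap are arbitrary choices, just for probing):

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main()
{
    const size_t chunkBytes = 32 << 20;   // 32 MB per allocation (arbitrary probe size)
    const size_t maxBytes   = 1u << 30;   // stop at 1 GB so the box doesn't run out of RAM
    std::vector<void*> chunks;
    size_t total = 0;

    // keep pinning chunks until the driver/OS refuses or the cap is reached
    while (total < maxBytes)
    {
        void* p = 0;
        if (cudaMallocHost(&p, chunkBytes) != cudaSuccess)
            break;
        chunks.push_back(p);
        total += chunkBytes;
    }

    printf("pinned a total of %zu MB before stopping\n", total >> 20);

    // release the pinned buffers again
    for (size_t i = 0; i < chunks.size(); ++i)
        cudaFreeHost(chunks[i]);

    return 0;
}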

Thanks a lot :) … I had a bug in the code and it's solved now … But I am running into another problem with asynchronous transfers…

I ran the asyncAPI sample on a GTX 480 and it gave the following output:

CUDA device [GeForce GTX 480]
time spent executing by the GPU: 33.52
time spent by CPU in CUDA calls: 33.52
CPU executed 15 iterations while waiting for GPU to finish

[asyncAPI] → Test Results:
PASSED

Now the time spent by the CPU in CUDA calls is the same as the time spent executing by the GPU, which clearly shows that the calls are not asynchronous… What can be the problem?

The same code works fine on a Tesla T10 Processor.

How do you deduce the call is not asynchronous? It might well be that the CPU time is not spent in the kernel call, but in a subsequent copy from device to host waiting for the data to become available.
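For instance, you could put a plain host timer around each call separately and see which one actually eats the ~33 ms. A standalone sketch along those lines (the array size, launch configuration and the wallMs() helper are made up for illustration, not taken from the SDK sample):

#include <cstdio>
#include <sys/time.h>
#include <cuda_runtime.h>

// host wall-clock time in milliseconds
static double wallMs()
{
    timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

__global__ void increment_kernel(int* g_data, int value)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    g_data[idx] = g_data[idx] + value;
}

int main()
{
    const int n = 16 * 1024 * 1024;           // 16M ints = 64 MB (arbitrary)
    const size_t nbytes = n * sizeof(int);
    int *a = 0, *d_a = 0;
    cudaMallocHost((void**)&a, nbytes);       // pinned host buffer
    cudaMalloc((void**)&d_a, nbytes);

    dim3 threads(512);
    dim3 blocks(n / 512);

    // time the enqueue cost of each call on the host
    double t0 = wallMs();
    cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0);
    double t1 = wallMs();
    increment_kernel<<<blocks, threads>>>(d_a, 1);
    double t2 = wallMs();
    cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0);
    double t3 = wallMs();
    cudaDeviceSynchronize();                  // wait for everything queued so far
    double t4 = wallMs();

    printf("H2D copy enqueue : %.3f ms\n", t1 - t0);
    printf("kernel launch    : %.3f ms\n", t2 - t1);
    printf("D2H copy enqueue : %.3f ms\n", t3 - t2);
    printf("synchronize      : %.3f ms\n", t4 - t3);

    cudaFreeHost(a);
    cudaFree(d_a);
    return 0;
}

If the three enqueues come back in microseconds and all the time sits in the synchronize, the calls are asynchronous after all and the CPU time in your measurement is simply the wait for the result.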

The code is like this:

cutilCheckError( cutStartTimer(timer) );
cudaEventRecord(start, 0);
cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0);
increment_kernel<<<blocks, threads, 0, 0>>>(d_a, value);
cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0);
cudaEventRecord(stop, 0);
cutilCheckError( cutStopTimer(timer) );

// have CPU do some work while waiting for stage 1 to finish
unsigned long int counter = 0;
while( cudaEventQuery(stop) == cudaErrorNotReady )
{
    counter++;
}

cutilSafeCall( cudaEventElapsedTime(&gpu_time, start, stop) );

// print the cpu and gpu times
printf("time spent executing by the GPU: %.2f\n", gpu_time);
printf("time spent by CPU in CUDA calls: %.2f\n", cutGetTimerValue(timer) );
printf("CPU executed %lu iterations while waiting for GPU to finish\n", counter);

Now, since all the calls around the kernel are asynchronous, control should return to the host almost immediately, and the host should keep incrementing the counter until the CUDA work completes. But as seen in the output, control is never really returned to the host; the CUDA calls block the host and behave like synchronous calls. The copy from device to host is also asynchronous, since cudaMemcpyAsync is used rather than cudaMemcpy.

Ah, ok.
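One more thing that might be worth ruling out: cudaMemcpyAsync only returns control to the host immediately when the host buffer is page-locked (allocated with cudaMallocHost); with an ordinary malloc'ed buffer the device-to-host copy behaves essentially like a synchronous one. A small standalone sketch to compare the two cases (the 64 MB size and the enqueueMs() helper are only for illustration):

#include <cstdio>
#include <cstdlib>
#include <sys/time.h>
#include <cuda_runtime.h>

// host wall-clock time in milliseconds
static double wallMs()
{
    timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

// how long the *enqueue* of an async device-to-host copy takes for a given host buffer
static double enqueueMs(int* host, int* dev, size_t nbytes)
{
    double t0 = wallMs();
    cudaMemcpyAsync(host, dev, nbytes, cudaMemcpyDeviceToHost, 0);
    double t1 = wallMs();
    cudaDeviceSynchronize();                  // drain the queue before the next measurement
    return t1 - t0;
}

int main()
{
    const size_t nbytes = 64 << 20;           // 64 MB, arbitrary
    int* d_a = 0;
    cudaMalloc((void**)&d_a, nbytes);

    int* pageable = (int*)malloc(nbytes);     // ordinary pageable host memory
    int* pinned = 0;
    cudaMallocHost((void**)&pinned, nbytes);  // page-locked host memory

    printf("pageable buffer, enqueue: %.3f ms\n", enqueueMs(pageable, d_a, nbytes));
    printf("pinned buffer,   enqueue: %.3f ms\n", enqueueMs(pinned, d_a, nbytes));

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(d_a);
    return 0;
}

If the pageable enqueue takes milliseconds while the pinned one returns almost instantly, that difference is the synchronous fallback at work. If I remember right, the stock asyncAPI sample allocates a with cudaMallocHost, so this would only apply if that allocation was changed along the way.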