Concurrent Kernel Execution / Memory Transfer: we can't get it to work...

Hello!

We are working on a project for our course about using CUDA to reduce CPU load for certain fault-detection algorithms. For that, we would like to use the asynchronous, concurrent host/device memory transfer feature.

The documentation states that you need:

  • deviceOverlap to be 1 (see the query sketched just after this list)
  • an asynchronous kernel call on a stream != 0
  • a call to cudaMemcpyAsync() on a different stream != 0
  • nothing on stream 0 between the asynchronous calls
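
For reference, this is a minimal sketch of how we query that flag (device 0 assumed):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // query device 0

        // deviceOverlap == 1 means the device can overlap an async
        // memcpy with kernel execution
        printf("deviceOverlap = %d\n", prop.deviceOverlap);
        return 0;
    }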

We are respecting all this, but we still don't get the expected behavior: mostly, the cudaMemcpyAsync() call only starts after the kernel has finished. We are using CUDA 2.1 on a GeForce 8600 GT under Ubuntu 8.04.

Does anyone here work with deviceOverlap? Is there another pitfall?

Example code would be especially nice :)

There is an example in the SDK, I believe.

You also need the host-side memory you copy from/to to be allocated with cudaMallocHost (pinned memory). Ordinary malloc'ed memory doesn't cut it.
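
Here is a minimal sketch of the whole pattern (the scale2x kernel, sizes, and stream names are made up; error checking omitted; compile as a .cu file):

    #include <cuda_runtime.h>

    __global__ void scale2x(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;
    }

    int main(void)
    {
        const int N = 1 << 20;
        float *h_a, *d_a, *d_b;
        cudaStream_t s1, s2;

        cudaMallocHost((void**)&h_a, N * sizeof(float));  // pinned host memory
        cudaMalloc((void**)&d_a, N * sizeof(float));
        cudaMalloc((void**)&d_b, N * sizeof(float));
        cudaMemset(d_a, 0, N * sizeof(float));
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // kernel on stream s1, async copy into d_b on stream s2;
        // with deviceOverlap == 1 these two should run concurrently
        scale2x<<<N / 256, 256, 0, s1>>>(d_a, N);
        cudaMemcpyAsync(d_b, h_a, N * sizeof(float), cudaMemcpyHostToDevice, s2);

        cudaThreadSynchronize();

        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFreeHost(h_a);
        cudaFree(d_a);
        cudaFree(d_b);
        return 0;
    }

The key points are the cudaMallocHost allocation and the two distinct non-default streams carrying independent work.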

I’m not having any success either. Can anyone spot my problem with this?

Currently, as a small test, I have an area of memory which I'm trying to download, process, and return to the host asynchronously. The kernel I'm running is simply a *2 mapping, so no problems there, but I'm getting unspecified launch failures and errors that look like this:

First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError at memory location 0x1c1efe20..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError at memory location 0x1c1efdd0..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError at memory location 0x1c1efdd0..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError_enum at memory location 0x1c1efdc8..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError_enum at memory location 0x1c1efdc8..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError_enum at memory location 0x1c1efe20..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError_enum at memory location 0x1c1efe04..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError_enum at memory location 0x1c1efe04..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError_enum at memory location 0x1c1efe00..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError_enum at memory location 0x1c1efde8..

Question: when I'm launching the kernel, should I be using the total number of blocks, or the total number of blocks divided by the number of streams?

#define MEM_SIZE   8192
#define CHUNK_SIZE 1024
#define NUM_CHUNKS 8

const unsigned int mem_size = sizeof(float) * MEM_SIZE;

// pinned host memory for the input
float* textStream;
cutilSafeCall(cudaMallocHost((void**)&textStream, sizeof(float) * MEM_SIZE));

// allocate device memory
float* d_idata;
cutilSafeCall(cudaMalloc((void**)&d_idata, mem_size));
cudaMemset(d_idata, 0, mem_size);

float* h_idata = textStream;

// allocate device memory for the result
float* d_odata;
cutilSafeCall(cudaMalloc((void**)&d_odata, mem_size));
cudaMemset(d_odata, 0, mem_size);

// allocate pinned host memory for the result
float* h_odata;
cutilSafeCall(cudaMallocHost((void**)&h_odata, mem_size));
memset(h_odata, 0, mem_size);

Some code to load data into textStream goes here. (NB: it's not a CUDA stream, just a float array.)

cudaStream_t stream[NUM_CHUNKS];
for (int i = 0; i < NUM_CHUNKS; i++) {
    cutilSafeCall(cudaStreamCreate(&stream[i]));
}

int size = CHUNK_SIZE * sizeof(float);

// copy each chunk host -> device in its own stream
for (int i = 0; i < NUM_CHUNKS; i++) {
    cudaMemcpyAsync(d_idata + i*size, h_idata + i*size, size,
                    cudaMemcpyHostToDevice, stream[i]);
}

// one kernel launch per chunk: <<<grid, threads, shared mem, stream>>>
for (int i = 0; i < NUM_CHUNKS; i++) {
    stringMatch<<<16, 64, 256, stream[i]>>>(d_idata + i*size, d_odata + i*size);
}

// copy the results device -> host, again one chunk per stream
for (int i = 0; i < NUM_CHUNKS; i++) {
    cudaMemcpyAsync(h_odata + i*size, d_odata + i*size, size,
                    cudaMemcpyDeviceToHost, stream[i]);
}

cudaThreadSynchronize();

for (int i = 0; i < NUM_CHUNKS; i++) {
    cudaStreamDestroy(stream[i]);
}

You need not only deviceOverlap to be 1, but also compute capability 1.1 or higher. Sometimes the NVIDIA drivers report deviceOverlap as 1 even for compute 1.0 cards; that report is false.
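
So a more defensive check is something like this (a sketch; the canOverlap helper is just an illustrative name):

    #include <cuda_runtime.h>

    // Returns nonzero only if the device reports deviceOverlap AND is
    // compute capability 1.1 or higher, since the flag alone can be
    // misreported on 1.0 hardware.
    int canOverlap(int dev)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        return prop.deviceOverlap &&
               (prop.major > 1 || (prop.major == 1 && prop.minor >= 1));
    }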

I don't know about all the errors you're getting; I suspect they're independent of the streaming issue.
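
That said, one thing in the posted code does look suspect: size is in bytes (CHUNK_SIZE * sizeof(float)), but d_idata + i*size is pointer arithmetic on a float*, which counts in elements. The chunks therefore land at the wrong offsets, and from chunk 2 onward they run past the end of the 8192-float buffers, which alone could explain the unspecified launch failures. A sketch of the fix (offsets in elements; the byte count is only for the copy size):

    int size = CHUNK_SIZE * sizeof(float);   // bytes per chunk, for the copies
    for (int i = 0; i < NUM_CHUNKS; i++) {
        int off = i * CHUNK_SIZE;            // element offset into the float arrays
        cudaMemcpyAsync(d_idata + off, h_idata + off, size,
                        cudaMemcpyHostToDevice, stream[i]);
    }

The same element-offset correction applies to the kernel launches and the device-to-host copies.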

Does the simpleStreams SDK sample work and show overlap on your system?

I'm also not sure I understand your question about how many threadblocks to launch for each kernel. Remember that every kernel is launched in some stream; if you don't provide an explicit stream argument, the kernel is launched in the default (0) stream. The number of threadblocks has nothing to do with the number of streams: specify however many threadblocks are needed to process the data passed to a given kernel launch.
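
In your code, each launch only sees one CHUNK_SIZE slice, so something like this (a sketch reusing your names, assuming CHUNK_SIZE is a multiple of the block size):

    const int threads = 64;
    const int blocks  = CHUNK_SIZE / threads;   // 1024 / 64 = 16 blocks per chunk
    for (int i = 0; i < NUM_CHUNKS; i++) {
        int off = i * CHUNK_SIZE;               // element offset, one chunk per stream
        stringMatch<<<blocks, threads, 256, stream[i]>>>(d_idata + off, d_odata + off);
    }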

Paulius