Concurrent Kernel Execution / Memory Transfer: we can't get it to work...

Hello!

We are working on a project for our course about using CUDA to reduce CPU load for certain fault-detection algorithms. For that, we would like to use the asynchronous, concurrent host/device memory transfer feature.

The documentation states that you need:

  • deviceOverlap to be 1 (see the query sketched just after this list)
  • an asynchronous kernel call on a stream != 0
  • a call to cudaMemcpyAsync() on a different stream != 0
  • nothing on stream 0 between the asynchronous calls
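
For reference, this is a minimal sketch of how we query that flag (device 0 assumed):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);  // query device 0

        // deviceOverlap == 1 means the device can overlap an async
        // memcpy with kernel execution
        printf("deviceOverlap = %d\n", prop.deviceOverlap);
        return 0;
    }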

We are respecting all this, but we still don't get the expected behavior: mostly, the cudaMemcpyAsync() call only starts after the kernel has finished. We are using CUDA 2.1 on a GeForce 8600 GT under Ubuntu 8.04.

Does anyone here work with deviceOverlap? Is there another pitfall?

Example code would be especially nice :)

There is an example in the SDK, I believe.

You also need the host-side memory you copy from/to to be allocated with cudaMallocHost (pinned memory). Ordinary malloc'ed memory doesn't cut it.
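
Here is a minimal sketch of the whole pattern (the scale2x kernel, sizes, and stream names are made up; error checking omitted; compile as a .cu file):

    #include <cuda_runtime.h>

    __global__ void scale2x(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;
    }

    int main(void)
    {
        const int N = 1 << 20;
        float *h_a, *d_a, *d_b;
        cudaStream_t s1, s2;

        cudaMallocHost((void**)&h_a, N * sizeof(float));  // pinned host memory
        cudaMalloc((void**)&d_a, N * sizeof(float));
        cudaMalloc((void**)&d_b, N * sizeof(float));
        cudaMemset(d_a, 0, N * sizeof(float));
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // kernel on stream s1, async copy into d_b on stream s2;
        // with deviceOverlap == 1 these two should run concurrently
        scale2x<<<N / 256, 256, 0, s1>>>(d_a, N);
        cudaMemcpyAsync(d_b, h_a, N * sizeof(float), cudaMemcpyHostToDevice, s2);

        cudaThreadSynchronize();

        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFreeHost(h_a);
        cudaFree(d_a);
        cudaFree(d_b);
        return 0;
    }

The key points are the cudaMallocHost allocation and the two distinct non-default streams carrying independent work.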

I’m not having any success either. Can anyone spot my problem with this?

Currently, as a small test, I have an area of memory which I'm trying to download, process, and return to the host asynchronously. The kernel I'm running is simply a *2 mapping, so no problems there, but I'm getting unspecified launch failures and errors that look like this:

First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError at memory location 0x1c1efe20..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError at memory location 0x1c1efdd0..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError at memory location 0x1c1efdd0..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError_enum at memory location 0x1c1efdc8..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError_enum at memory location 0x1c1efdc8..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError_enum at memory location 0x1c1efe20..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError_enum at memory location 0x1c1efe04..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError_enum at memory location 0x1c1efe04..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError_enum at memory location 0x1c1efe00..
First-chance exception at 0x7c812aeb in scan.exe: Microsoft C++ exception: cudaError_enum at memory location 0x1c1efde8..

Question: when I'm launching the kernel, should I be using the total number of blocks, or the total number of blocks divided by the number of streams?

#define MEM_SIZE   8192
#define CHUNK_SIZE 1024
#define NUM_CHUNKS 8

const unsigned int mem_size = sizeof(float) * MEM_SIZE;

// pinned host memory for the input
float* textStream;
cutilSafeCall(cudaMallocHost((void**)&textStream, sizeof(float) * MEM_SIZE));

// allocate device memory
float* d_idata;
cutilSafeCall(cudaMalloc((void**)&d_idata, mem_size));
cudaMemset(d_idata, 0, mem_size);

float* h_idata = textStream;

// allocate device memory for the result
float* d_odata;
cutilSafeCall(cudaMalloc((void**)&d_odata, mem_size));
cudaMemset(d_odata, 0, mem_size);

// allocate pinned host memory for the result
float* h_odata;
cutilSafeCall(cudaMallocHost((void**)&h_odata, mem_size));
memset(h_odata, 0, mem_size);

Some code to load data into textStream goes here. (NB: it's not a CUDA stream, just a float array.)

cudaStream_t stream[NUM_CHUNKS];
for (int i = 0; i < NUM_CHUNKS; i++) {
    cutilSafeCall(cudaStreamCreate(&stream[i]));
}

int size = CHUNK_SIZE * sizeof(float);

// copy each chunk host -> device in its own stream
for (int i = 0; i < NUM_CHUNKS; i++) {
    cudaMemcpyAsync(d_idata + i*size, h_idata + i*size, size,
                    cudaMemcpyHostToDevice, stream[i]);
}

// one kernel launch per chunk: <<<grid, threads, shared mem, stream>>>
for (int i = 0; i < NUM_CHUNKS; i++) {
    stringMatch<<<16, 64, 256, stream[i]>>>(d_idata + i*size, d_odata + i*size);
}

// copy the results device -> host, again one chunk per stream
for (int i = 0; i < NUM_CHUNKS; i++) {
    cudaMemcpyAsync(h_odata + i*size, d_odata + i*size, size,
                    cudaMemcpyDeviceToHost, stream[i]);
}

cudaThreadSynchronize();

for (int i = 0; i < NUM_CHUNKS; i++) {
    cudaStreamDestroy(stream[i]);
}

You need not only deviceOverlap to be 1, but also compute capability 1.1 or higher. Sometimes the NVIDIA drivers report deviceOverlap as 1 even for compute 1.0 cards; that report is false.
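
So a more defensive check is something like this (a sketch; the canOverlap helper is just an illustrative name):

    #include <cuda_runtime.h>

    // Returns nonzero only if the device reports deviceOverlap AND is
    // compute capability 1.1 or higher, since the flag alone can be
    // misreported on 1.0 hardware.
    int canOverlap(int dev)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        return prop.deviceOverlap &&
               (prop.major > 1 || (prop.major == 1 && prop.minor >= 1));
    }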

I don't know about all the errors you're getting; I suspect they're independent of the streaming issue.
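
That said, one thing in the posted code does look suspect: size is in bytes (CHUNK_SIZE * sizeof(float)), but d_idata + i*size is pointer arithmetic on a float*, which counts in elements. The chunks therefore land at the wrong offsets, and from chunk 2 onward they run past the end of the 8192-float buffers, which alone could explain the unspecified launch failures. A sketch of the fix (offsets in elements; the byte count is only for the copy size):

    int size = CHUNK_SIZE * sizeof(float);   // bytes per chunk, for the copies
    for (int i = 0; i < NUM_CHUNKS; i++) {
        int off = i * CHUNK_SIZE;            // element offset into the float arrays
        cudaMemcpyAsync(d_idata + off, h_idata + off, size,
                        cudaMemcpyHostToDevice, stream[i]);
    }

The same element-offset correction applies to the kernel launches and the device-to-host copies.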

Does the simpleStreams SDK sample work and show overlap on your system?

I'm also not sure I understand your question about how many threadblocks to launch for each kernel. Remember that every kernel is launched in some stream; if you don't provide an explicit stream argument, the kernel is launched in the default (0) stream. The number of threadblocks has nothing to do with the number of streams: specify however many threadblocks are needed to process the data passed to a given kernel launch.
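
In your code, each launch only sees one CHUNK_SIZE slice, so something like this (a sketch reusing your names, assuming CHUNK_SIZE is a multiple of the block size):

    const int threads = 64;
    const int blocks  = CHUNK_SIZE / threads;   // 1024 / 64 = 16 blocks per chunk
    for (int i = 0; i < NUM_CHUNKS; i++) {
        int off = i * CHUNK_SIZE;               // element offset, one chunk per stream
        stringMatch<<<blocks, threads, 256, stream[i]>>>(d_idata + off, d_odata + off);
    }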

Paulius