Overhead of using more than one stream?

Hello all,

I tried to use two streams to execute kernels and memcpys between host and device concurrently.

In my application, two CPU threads generate tasks and push them into the two streams alternately.

Here is my code:

void CUDA_doAsync( GPU_Task_Unit* GTU ){

	unsigned int taskSize = CUDA_SEGMENT_SIZE * GTU->numSeg;
	unsigned int streamID = (GTU->id) % CUDA_NUM_STREAM;

	dim3 dimGrid(taskSize / BLOCK_SIZE, 1);
	dim3 dimBlock(BLOCK_SIZE, 1);

	cutilSafeCall( cudaMemcpyAsync( dTasks[GTU->id], GTU->pStartSeg, taskSize * sizeof(CUDAtask), cudaMemcpyHostToDevice, stream[streamID] ) );
	cutilSafeCall( cudaMemcpyAsync( GTU->stateSend, dFlag_true, sizeof(char), cudaMemcpyDeviceToHost, stream[streamID] ) );

	doKernel<<<dimGrid, dimBlock, 0, stream[streamID]>>>( dTasks[GTU->id], dResults[GTU->id] );

	cutilSafeCall( cudaMemcpyAsync( GTU->result, dResults[GTU->id], taskSize * sizeof(CUDAresult), cudaMemcpyDeviceToHost, stream[streamID] ) );
	cutilSafeCall( cudaMemcpyAsync( GTU->stateTask, dFlag_true, sizeof(bool), cudaMemcpyDeviceToHost, stream[streamID] ) );
}


I expected this to improve the performance of my app.

But it actually slowed my application down slightly.

I cannot understand the reason.

I use a GTX 285, and all the host memory used in cudaMemcpyAsync is allocated with cudaMallocHost().

Also, my application does not need synchronization at the end of the code.

Please give me your advice.


You have two CPU threads? Does this mean they are running in different contexts? In that case, the streams would just add overhead, no? I didn't think one could run two CPU threads on the same context… at least in PyCUDA (which uses the driver API) this is not allowed.

You can check whether your memory is page-locked.

CUDA only supports overlap between kernel execution and an async copy when you use page-locked memory.

Also, if two host threads use the same memory, you have to use CUDA 2.2 Beta to make the memory portable.

Hopefully this is useful to you.
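The portable-memory point above can be sketched like this (a minimal, hypothetical example: cudaHostAlloc() with the cudaHostAllocPortable flag is the CUDA 2.2 API for this, and the buffer type and size here are arbitrary stand-ins — it needs a CUDA-capable machine to run):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    float* hBuf = 0;
    size_t bytes = 1 << 20;  // arbitrary 1 MB example buffer

    // cudaHostAllocPortable pins the memory for *all* host threads in the
    // process, so a buffer allocated by one thread can safely be passed to
    // cudaMemcpyAsync() calls issued by another thread.
    cudaError_t err = cudaHostAlloc((void**)&hBuf, bytes, cudaHostAllocPortable);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* ... hand hBuf to the worker threads here ... */

    cudaFreeHost(hBuf);
    return 0;
}
```

With plain cudaMallocHost(), the pinning is only guaranteed for the allocating thread's context on CUDA versions before 2.2.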

Two threads generate tasks, but only the master thread calls kernels.

So it uses just one context.


Thanks for your kind response.

I used page-locked memory allocated by calling the cudaMallocHost() function.

In my case, the memory is used by two CPU threads, but the kernels are called by just one thread.

Although I changed my code to use cudaHostAllocPortable as you advised, it still shows performance similar to the original code. :'(

By the way, how can I check whether a memcpy and a kernel execution actually overlapped?
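One way to check (a sketch I'd suggest, not from the thread): bracket the whole batch with cudaEvent timestamps and compare the total time of the one-stream and two-stream versions. If copy and kernel really overlap, the two-stream total should be clearly less than the sum of the copy and kernel times. The kernel and sizes below are hypothetical stand-ins for the poster's doKernel and task buffers:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical stand-in for the real doKernel.
__global__ void dummyKernel(float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *h = 0, *d = 0;
    cudaMallocHost((void**)&h, n * sizeof(float));  // pinned: required for async overlap
    cudaMalloc((void**)&d, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the whole async sequence with GPU-side timestamps.
    cudaEventRecord(start, 0);
    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, s);
    dummyKernel<<<n / 256, 256, 0, s>>>(d, n);
    cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, s);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("sequence took %.3f ms\n", ms);  // compare against the two-stream run

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

Another option is the CUDA Visual Profiler, which reports per-call GPU start times, so you can see directly whether a copy and a kernel ran at the same time.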


A memory copy and a kernel can only be executed concurrently when you have two independent pieces of work.

I mean:

you have two input pointers,

you want both to be processed by the same kernel,

and you have two output pointers.

Use two streams to do the host-to-device copies, and then also use the two streams to call the kernel.

You can find a sample of this in the toolkit examples.

It shows that when you have 2 identical jobs on 2 different buffers, or 4 identical jobs on 4 different buffers,

the performance is better than using only one stream (or no streams).

The copy in one stream can be overlapped with the kernel execution in the other stream.

But if too many copy streams are used, the bandwidth is still shared, and so are the kernel resources, so the performance does not scale up linearly.
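The pattern described above can be sketched as follows (a hypothetical example, not the poster's code; the key point is issuing the work breadth-first — all copies, then all kernels — so the copy in one stream can overlap the kernel in the other):

```cuda
#include <cuda_runtime.h>

// Trivial example kernel standing in for the real work.
__global__ void addOne(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

int main(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *hIn[2], *hOut[2], *dIn[2], *dOut[2];
    cudaStream_t stream[2];

    for (int i = 0; i < 2; ++i) {
        cudaMallocHost((void**)&hIn[i], bytes);   // pinned host buffers
        cudaMallocHost((void**)&hOut[i], bytes);
        cudaMalloc((void**)&dIn[i], bytes);
        cudaMalloc((void**)&dOut[i], bytes);
        cudaStreamCreate(&stream[i]);
    }

    // Breadth-first issue order: all H2D copies, then all kernels, then all
    // D2H copies. Stream 1's copy can run while stream 0's kernel executes.
    for (int i = 0; i < 2; ++i)
        cudaMemcpyAsync(dIn[i], hIn[i], bytes, cudaMemcpyHostToDevice, stream[i]);
    for (int i = 0; i < 2; ++i)
        addOne<<<n / 256, 256, 0, stream[i]>>>(dIn[i], dOut[i], n);
    for (int i = 0; i < 2; ++i)
        cudaMemcpyAsync(hOut[i], dOut[i], bytes, cudaMemcpyDeviceToHost, stream[i]);

    cudaThreadSynchronize();  // wait for both streams (CUDA 2.x-era API)

    for (int i = 0; i < 2; ++i) {
        cudaStreamDestroy(stream[i]);
        cudaFree(dIn[i]);  cudaFree(dOut[i]);
        cudaFreeHost(hIn[i]);  cudaFreeHost(hOut[i]);
    }
    return 0;
}
```

Note that a GTX 285 (compute capability 1.3) has a single copy engine, so at most one copy can overlap one kernel at a time; more streams cannot overlap more copies on that hardware.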