Overhead of using more than one stream?

BlueKDS · April 12, 2009, 8:55am

Hello all,

I tried to use two streams to execute a kernel and memcpy between host and device concurrently.

In my application, two CPU threads generate tasks, and each generated task is pushed into one of the two streams, alternating between them.

The following is my code.

void CUDA_doAsync( GPU_Task_Unit* GTU ){
	// Number of elements in this task, and the stream it is assigned to (alternates by task id)
	unsigned int taskSize = CUDA_SEGMENT_SIZE * GTU->numSeg;
	unsigned int streamID = (GTU->id)%CUDA_NUM_STREAM;

	dim3 dimGrid(taskSize/BLOCK_SIZE,1);
	dim3 dimBlock(BLOCK_SIZE, 1);

	// Upload the task data to the device
	cutilSafeCall( cudaMemcpyAsync( dTasks[GTU->id], GTU->pStartSeg, taskSize * sizeof(CUDAtask), cudaMemcpyHostToDevice, stream[streamID] ) );
	// Copy a device-side flag into GTU->stateSend; it arrives only after the upload above has completed
	cutilSafeCall( cudaMemcpyAsync( GTU->stateSend, dFlag_true, sizeof(char), cudaMemcpyDeviceToHost, stream[streamID] ) );

	doKernel<<<dimGrid, dimBlock, 0, stream[streamID]>>>( dTasks[GTU->id], dResults[GTU->id] );

	// Download the results, then copy the flag into GTU->stateTask once they have arrived
	cutilSafeCall( cudaMemcpyAsync( GTU->result, dResults[GTU->id], taskSize * sizeof(CUDAresult), cudaMemcpyDeviceToHost, stream[streamID] ) );
	cutilSafeCall( cudaMemcpyAsync( GTU->stateTask, dFlag_true, sizeof(bool), cudaMemcpyDeviceToHost, stream[streamID] ) );
}

I expected this to increase the performance of my application.

Instead, it slowed my application down slightly.

I cannot understand the reason.

I use a GTX 285, and all host memory used in cudaMemcpyAsync is allocated with cudaMallocHost().

Also, my application does not need to synchronize at the end of the code.

Please give me your advice.

Thanks.

gatoatigrado · April 12, 2009, 8:11pm

You have two CPU threads? Does this mean they are running on different contexts? In that case, the streams would just add overhead, no? I didn’t think one could run two CPU threads on the same context… at least in PyCUDA (which uses the driver API) this is not allowed.

darot · April 13, 2009, 2:40am

You can check whether your memory is page-locked.

CUDA only supports overlap between kernel execution and an asynchronous copy when you use page-locked memory.

Also, if two host threads use the same memory, you have to use the CUDA 2.2 beta to make the memory portable.

Hopefully this is useful to you.
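
As a rough illustration of both points (this is not code from the thread): the sketch below asks the runtime how many asynchronous copy engines the device has, and allocates page-locked memory with the portable flag. The helper names asyncCopyEngineCount and allocPortablePinned are made up for this example. It is written against a current toolkit; in the CUDA 2.x releases of the time, the overlap capability was reported through the deviceOverlap field of cudaDeviceProp instead.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: returns the number of asynchronous copy engines on
// `device`, or -1 if the query fails (for example, no CUDA device present).
// A result of 0 means copies cannot overlap with kernel execution.
int asyncCopyEngineCount(int device) {
    int engines = 0;
    cudaError_t err =
        cudaDeviceGetAttribute(&engines, cudaDevAttrAsyncEngineCount, device);
    return (err == cudaSuccess) ? engines : -1;
}

// Hypothetical helper: allocates page-locked host memory that is usable from
// every CUDA context (cudaHostAllocPortable). Returns NULL on failure.
// Release the buffer with cudaFreeHost().
void* allocPortablePinned(size_t bytes) {
    void* p = NULL;
    if (cudaHostAlloc(&p, bytes, cudaHostAllocPortable) != cudaSuccess) {
        cudaGetLastError();  // reset the error state before returning
        return NULL;
    }
    return p;
}
```

If asyncCopyEngineCount(0) returns 0, no choice of streams will make a copy overlap with a kernel on that device, so this is worth checking before tuning anything else.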

BlueKDS · April 13, 2009, 3:34am

Two threads generate tasks, but only the master thread calls kernels.

So it uses just one context.

Thanks.

BlueKDS · April 13, 2009, 7:29am

Thanks for your kind response.

I used page-locked memory allocated by calling cudaMallocHost() function.

In my case, the memory is used by two CPU threads, but the kernels are launched by just one thread.

Although I changed my code to use cudaHostAllocPortable, following your advice, it still shows similar performance to the original code. :'(

By the way, how can I check whether the memcpy and the kernel execution overlap?

Thanks.

darot · April 14, 2009, 12:05am

A memory copy and a kernel can only run concurrently when you have two independent pieces of work to do.

I mean:

You have two input pointers.

You want them to be processed by the same kernel.

You have two output pointers.

Use two streams to do the host-to-device copies, and then use the same two streams to launch the kernels.

You can find a sample of this among the toolkit examples.

It shows the case where you have 2 identical jobs on 2 different buffers, or 4 identical jobs on 4 different buffers.

The performance will be better than with only one stream (or no stream).

The copy in one stream can be overlapped with the kernel execution in the other stream.

But if too many copy streams are used, the bandwidth is still shared, and so are the kernel resources, so the performance does not scale linearly.
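
A minimal sketch of the pattern described above, assuming a page-locked host buffer and a placeholder kernel (scaleKernel and runChunked are invented names, not taken from the toolkit sample or from the code earlier in the thread). Each chunk gets its own stream, the operations are issued stage by stage across the streams, and CUDA events time the whole batch. Issue order can matter on some devices, so treat the ordering here as one reasonable choice and not as the only correct one.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical): doubles each element of its chunk.
__global__ void scaleKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Splits n floats into numStreams equal chunks; each chunk is uploaded,
// processed and downloaded in its own stream. hostData must be page-locked.
// Returns the elapsed time in milliseconds, or -1.0f on any error.
float runChunked(float* hostData, int n, int numStreams) {
    const int kMaxStreams = 8;
    if (numStreams < 1 || numStreams > kMaxStreams || n % numStreams != 0)
        return -1.0f;
    const int chunk = n / numStreams;
    const size_t chunkBytes = chunk * sizeof(float);

    float* devData = NULL;
    if (cudaMalloc((void**)&devData, n * sizeof(float)) != cudaSuccess) {
        cudaGetLastError();  // reset the error state
        return -1.0f;
    }

    cudaStream_t streams[kMaxStreams];
    for (int s = 0; s < numStreams; ++s) cudaStreamCreate(&streams[s]);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    // Stage 1: all uploads. Stage 2: all kernels. Stage 3: all downloads.
    for (int s = 0; s < numStreams; ++s)
        cudaMemcpyAsync(devData + s * chunk, hostData + s * chunk, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
    for (int s = 0; s < numStreams; ++s)
        scaleKernel<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(
            devData + s * chunk, chunk);
    for (int s = 0; s < numStreams; ++s)
        cudaMemcpyAsync(hostData + s * chunk, devData + s * chunk, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);

    // Wait for every stream to finish before recording the stop event.
    cudaError_t err = cudaDeviceSynchronize();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    if (err == cudaSuccess) err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    for (int s = 0; s < numStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(devData);
    return (err == cudaSuccess) ? ms : -1.0f;
}
```

One way to check for overlap with this sketch is to compare runChunked(h, n, 1) against runChunked(h, n, 2) on the same pinned buffer: if the two-stream time is clearly lower, some copy and kernel work ran concurrently. A profiler timeline shows the same thing more directly.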
