I’ve never tried using streams to overlap memory operations with kernel execution…
I wonder: if I have N blocks (where N is considerably large, say 30–40 thousand) and each block
uses M bytes of global memory (where M is, say, 1–2 KB),
would I get any performance benefit from splitting my kernel launch across K streams?
i.e., something like:

    grid = dim3(N / K)
    for i = 0 to K-1
        memcpyAsync(stream[i], HostToDev)
    for i = 0 to K-1  // launch K kernels asynchronously
        kernel<<< grid, threads, 0, stream[i] >>>()
    for i = 0 to K-1
        memcpyAsync(stream[i], DevToHost)
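Fleshing that pseudocode out, a minimal sketch might look like the following (kernel, h_in/h_out, d_in/d_out, totalBytes, and threads are placeholders, not anything from a real API; it assumes the host buffers are pinned via cudaMallocHost/cudaHostAlloc, since async copies from pageable memory do not overlap with kernel execution):

    // Sketch: split the work across K streams so H2D copies, kernel
    // execution, and D2H copies from different chunks can overlap.
    const int K = 4;                           // number of streams (assumption)
    const size_t chunkBytes = totalBytes / K;  // per-stream chunk (assumes K divides evenly)

    cudaStream_t stream[K];
    for (int i = 0; i < K; ++i)
        cudaStreamCreate(&stream[i]);

    dim3 grid(N / K);                          // each launch covers N/K blocks
    for (int i = 0; i < K; ++i) {
        size_t off = i * (chunkBytes / sizeof(float));
        cudaMemcpyAsync(d_in + off, h_in + off, chunkBytes,
                        cudaMemcpyHostToDevice, stream[i]);
        kernel<<< grid, threads, 0, stream[i] >>>(d_in + off, d_out + off);
        cudaMemcpyAsync(h_out + off, d_out + off, chunkBytes,
                        cudaMemcpyDeviceToHost, stream[i]);
    }
    cudaThreadSynchronize();                   // wait for all streams to finish

    for (int i = 0; i < K; ++i)
        cudaStreamDestroy(stream[i]);

Whether you issue copy/kernel/copy per stream in one loop (as here) or in three separate loops (as in the pseudocode) can matter: on devices with a single copy engine, the breadth-first order can avoid serializing the overlap.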
After playing around with some of the new features in CUDA 2.2 (such as cudaHostAlloc with the cudaHostAllocMapped flag), I’ve found that using zero-copy mapped memory gives better overall performance than manually programming the asynchronous copies or doing everything in one sweep.
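For reference, the zero-copy setup looks roughly like this (a sketch; kernel, grid, threads, and totalBytes are placeholders). The kernel reads and writes the mapped host memory directly over PCIe, so no explicit memcpy calls are needed at all:

    float *h_data, *d_data;

    // Enable mapping of pinned host allocations into the device address
    // space; in CUDA 2.2 this must be called before any other CUDA call.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Page-locked host allocation, mapped into the device address space
    cudaHostAlloc((void**)&h_data, totalBytes, cudaHostAllocMapped);

    // Obtain the device-side pointer aliasing the same memory
    cudaHostGetDevicePointer((void**)&d_data, h_data, 0);

    // Kernel accesses host memory directly; copies happen implicitly
    kernel<<< grid, threads >>>(d_data);
    cudaThreadSynchronize();   // ensure the kernel's writes are visible to the host

    cudaFreeHost(h_data);

Note that zero-copy tends to win when each byte is touched only once or twice; data that is reused many times on the device is still better off copied into global memory first.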