Overlapping kernel execution and data transfer

I am trying to test overlapping kernel execution and data transfer using streams.

After compiling some example codes from ‘Professional CUDA C Programming’ by Cheng, I visualized the result on nvvp.
(‘simpleMultiAddDepth.cu’, ‘simpleMultiAddBreath.cu’ at chapter06)

However, there is no overlapping at all. All streams are serialized.

I don’t know why it doesn’t work although it seems there’s nothing wrong in the code, and there is no error message either.

Should I enable something or set environment variables for concurrent kernel execution before??

Thank you.

You should ensure that your GPU can even perform overlapping kernel execution and data transfer.

NVIDIA has a CUDA sample called deviceQuery, run it and observe the line stating “Concurrent copy and kernel execution:”. If it says some sort of positive remark then we can move from there.

For example: I’m on a K5100M and mine states “Yes with 2 copy engine(s)”

Look for something like that.

Here’s my device query, and I think this GPU can perform overlapping kernel execution and data transfer.

Device 0: “GeForce GTX 750”
CUDA Driver Version / Runtime Version 8.0 / 7.5
CUDA Capability Major/Minor version number: 5.2
Total amount of global memory: 971 MBytes (1017774080 bytes)
( 4) Multiprocessors, (128) CUDA Cores/MP: 512 CUDA Cores
GPU Max Clock rate: 1238 MHz (1.24 GHz)
Memory Clock rate: 2505 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 1048576 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GTX 750
Result = PASS

Yeah so it definitely can.

I don’t have that example in front of me but I guess maybe I can ask some quick questions to see if you are trying to do concurrent processing and transfer.

Are you using cudaMemcpyAsync? This is important as overlapping data transfers won’t work without it.

Second question, are you using multiple streams so that kernel execution and data transfer is overlapping?

Lastly, how are you determining that it isn’t overlapping? Are you using the Visual Profiler? If not you should.

Yes, I am using all you asked.

I am determining overlapping on nvvp.

I saw the statement, “With so many threads, the available hardware resources may become the primary limiting factor for concurrency as they prevent launching eligible kernels.”, and reduced the size of data and block dimension.

By doing so, I got concurrent kernel execution and overlapping between kernels and memory transfers.

However, I still don’t have overlapping between memory transfer although I am using cudaMemcpyAsync.

Is it still because of the GPU resources??

Thank you.

Maybe I’m missing what you are saying, you won’t get overlapping memory transfer with inputs. You get overlap when you are outputting the last data sets output and inputting the next data sets input.


/*          Simple Text Based Diagram                                              */
/*                                                                                 */
/*          [ Input ][Compute][Output ][ Input ][Compute][Output ]...              */
/*                   [ Input ][Compute][Output ]...                                */
/*                            [ Input ][Compute][Output ]...                       */

all gpus i seen have at most one copy-in device and one copy-out device. so, you can overlap computations, copying from device to ram and copying back, but you can’t overlap two copy operations in the same direction. and this doesn’t make sense anyway since speed in each direction is limited by PCI-E throughput, i.e. 12 GB/s or so

in addition,

  • to overlap copy-in and copy-out, GPU should have dual-copy-engine (GeForce GTX9xx or later).
  • host memory haould be allocated by cudaHostAlloc/cudaMallocHost rather than usual malloc/new.
  • copy-in/kernel/copyout sequences should be executed individual streams

Here’s some of my code.

__global__ void vecAddGpu(int *a, int *b, int *sum, const int N, const int R)
  int tid = blockDim.x * blockIdx.x + threadIdx.x;

  if(tid < N)
    for(int i=0; i<R; i++)
      sum[tid] = a[tid] + b[tid]

int nsize = 512 * (1<<5);
int dimx = 128;
int repeat = 256;
int nbyte = sizeof(int)*nsize;

int isize = nsize/NSTREAM;
int ibyte = sizeof(int)*isize;
dim3 Db(dimx,1)
dim3 Dg((ibyte + Db.x - 1)/Db.x,1)

CHECK( cudaHostAlloc((void**)&h_a, nbyte, cudaHostAllocDefault) );
CHECK( cudaHostAlloc((void**)&h_b, nbyte, cudaHostAllocDefault) );
CHECK( cudaHostAlloc((void**)&gpuRef, nbyte, cudaHostAllocDefault) );

CHECK( cudaMalloc((void**)&d_a, nbyte) );
CHECK( cudaMalloc((void**)&d_b, nbyte) );
CHECK( cudaMalloc((void**)&d_sum, nbyte) );

cudaStream_t streams[NSTREAM];
for(int i=0; i<NSTREAM; i++)

for(int i=0; i<NSTREAM; i++){
  int ioffset = i * isize;
  CHECK( cudaMemcpyAsync(&d_a[ioffset], &h_a[ioffset], ibyte, cudaMemcpyHostToDevice, streams[i]) );
  CHECK( cudaMemcpyAsync(&d_b[ioffset], &h_b[ioffset], ibyte, cudaMemcpyHostToDevice, streams[i]) );

  vecAddGpu<<<Dg, Db, 0, streams[i]>>>(&d_a[ioffset], &d_b[ioffset], &d_sum[ioffset], isize, repeat);

  CHECK( cudaMemcpyAsync(&gpuRef[ioffset], &d_sum[ioffset], ibyte, cudaMemcpyDeviceToHost, streams[i]) );

I think there is no problem on the code, but it was just my fault to see overlapping in data transfer in the same directions, which is fool.

But, initially I had no overlapping at all when I had relatively large amount of data to process.

I was suspicious about it as a possible cause, and reduced it a lot. Then, I started seeing overlapping.

How to determine or check the appropriate mount of data size to see overlapping in a given GPU?

I would be great if it is easy to check such things since GPU resources are limiting factor of overlapping, as I know.

Thank you.

@Yongsk, if you use the Visual Profiler you can see a lot of those things that you are looking for. It seems like you are using it, you just need to look into more of the data that is there.

Also it might not be due to the large amount of data you have to process (maybe @txbob or @njuffa if they can correct me on this.) I don’t think data size has any implications. Maybe if the input is substantially larger than the output might be a reason why this is occurring (in this case the output is only overlapping for a very brief period.)