CUDA Streams: Start at the same time

Hi,

I implemented streams in my CUDA code as shown below.

    PT1<<<gride, blocke>>>(dvxdx, dvydy, dvxdy, dvydx, d_vx, d_vy, d_alpha, d_beta, d_index, nbe);
    cudaDeviceSynchronize();

    PT1_Etanbe<<<gride, blocke, 0, stream1>>>(Eta_nbe, d_etan, d_areas, nbe);

    PT1_x<<<gride, blocke, 0, stream2>>>(dvxdx, dvydy, dvxdy, dvydx, d_vx, d_vy, d_alpha, d_beta, d_index, kvx, d_etan, d_Helem, d_areas, d_isice, nbe);

    PT1_y<<<gride, blocke, 0, stream3>>>(dvxdx, dvydy, dvxdy, dvydx, d_vx, d_vy, d_alpha, d_beta, d_index, kvy, d_etan, d_Helem, d_areas, d_isice, nbe);

I am looking to run the kernels in streams 1, 2 and 3 simultaneously. The qdrep file shows that the kernels in those streams don’t begin at the same time and there is not much overlap in time. What am I missing?

Thanks for any information you can provide,
Anjali

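This is a very common question. Kernels launched into different streams only run concurrently when the GPU has free resources for them. If each of your kernels fully occupies the GPU, there is no reason to expect overlap or concurrency.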
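One rough way to check whether a kernel is likely to fill the GPU on its own is to compare its grid size with the number of blocks the device can keep resident at once. The sketch below is only an illustration under assumptions: dummyKernel is a hypothetical stand-in (substitute PT1_x, PT1_y, and so on), and n and threadsPerBlock are made-up values that should be replaced with nbe and the real block size. The occupancy figure is an estimate, not a guarantee about scheduling.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; replace with the kernel you want to check.
__global__ void dummyKernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

// Number of blocks needed to cover n elements (ceiling division).
int blocksNeeded(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main()
{
    const int n = 1 << 20;            // assumed problem size
    const int threadsPerBlock = 256;  // assumed block size
    const int gridBlocks = blocksNeeded(n, threadsPerBlock);

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("No usable CUDA device found\n");
        return 1;
    }

    // How many blocks of this kernel can be resident on one SM at a time.
    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, dummyKernel, threadsPerBlock, 0) != cudaSuccess) {
        printf("Occupancy query failed\n");
        return 1;
    }

    // Upper bound on blocks resident across the whole device at once.
    const int residentCapacity = blocksPerSM * prop.multiProcessorCount;
    printf("grid blocks: %d, resident capacity: %d\n",
           gridBlocks, residentCapacity);

    if (gridBlocks >= residentCapacity)
        printf("One launch can fill the GPU; little overlap is expected.\n");
    else
        printf("One launch leaves room; other streams may overlap.\n");
    return 0;
}
```

If the grid is much larger than the resident capacity, a second kernel in another stream typically only gets blocks scheduled as the first one drains, which matches a timeline with little overlap.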

Thank you for your reply. Does that mean if I want to achieve full concurrency I would need to run the streams on multiple GPUs (each stream on a different GPU)?

That should work, provided the data each kernel needs is accessible from the GPU it runs on.
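A minimal sketch of the multi-GPU idea, under stated assumptions: scaleKernel is a hypothetical placeholder for the real kernels, deviceFor is an illustrative helper that spreads three kernels round-robin over however many GPUs are present, and each kernel gets its own buffer allocated on the device it runs on. In the real code the input arrays would have to be copied or otherwise made accessible to each GPU, and that transfer cost may outweigh the gain.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel standing in for PT1_Etanbe, PT1_x, PT1_y.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = factor * i;
}

// Illustrative helper: pick a device for kernel k, round-robin.
int deviceFor(int kernelIndex, int deviceCount)
{
    return kernelIndex % deviceCount;
}

int main()
{
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount == 0) {
        printf("No usable CUDA device found\n");
        return 1;
    }

    const int numKernels = 3;         // one per original stream
    const int n = 1 << 20;            // assumed problem size
    const int threads = 256;          // assumed block size
    const int blocks = (n + threads - 1) / threads;

    cudaStream_t streams[numKernels];
    float *buffers[numKernels];

    // Launch phase: nothing here blocks the host, so work on
    // different GPUs can proceed at the same time.
    for (int k = 0; k < numKernels; ++k) {
        cudaSetDevice(deviceFor(k, deviceCount));
        cudaStreamCreate(&streams[k]);  // stream belongs to the current device
        cudaMalloc(&buffers[k], n * sizeof(float));
        scaleKernel<<<blocks, threads, 0, streams[k]>>>(buffers[k], n, k + 1.0f);
    }

    // Wait phase: synchronize and clean up on the matching device.
    for (int k = 0; k < numKernels; ++k) {
        cudaSetDevice(deviceFor(k, deviceCount));
        cudaError_t err = cudaStreamSynchronize(streams[k]);
        printf("kernel %d on device %d: %s\n",
               k, deviceFor(k, deviceCount), cudaGetErrorString(err));
        cudaFree(buffers[k]);
        cudaStreamDestroy(streams[k]);
    }
    return 0;
}
```

With a single GPU this falls back to three streams on device 0, which reproduces the original situation; the kernels only spread out when more than one device is available.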