My streams are not running concurrently

Hi to all!
I am trying to gain experience with CUDA streams, and I am trying to run 3 very simple streams concurrently, just to see how streams work.
The code is presented below:
//-----------------------------------------------------------------------------

cudaStream_t stream1;
cudaStream_t stream2;
cudaStream_t stream3;

cudaStreamCreate (&stream1);
cudaStreamCreate (&stream2);
cudaStreamCreate (&stream3);

cudaMemcpyAsync( d_img, img, 384 * 512 * sizeof(int), cudaMemcpyHostToDevice, stream1 );
s1<<<1,1,0,stream1>>>(d_img, d_wrkx);
cudaMemcpyAsync( wrkx, d_wrkx, 384 * 512 * sizeof(int), cudaMemcpyDeviceToHost, stream1 );

cudaMemcpyAsync( d_img, img, 384 * 512 * sizeof(int), cudaMemcpyHostToDevice, stream2 );
s1<<<1,1,0,stream2>>>(d_img, d_wrkx);
cudaMemcpyAsync( wrkx, d_wrkx, 384 * 512 * sizeof(int), cudaMemcpyDeviceToHost, stream2 );

cudaMemcpyAsync( d_img, img, 384 * 512 * sizeof(int), cudaMemcpyHostToDevice, stream3 );
s1<<<1,1,0,stream3>>>(d_img, d_wrkx);
cudaMemcpyAsync( wrkx, d_wrkx, 384 * 512 * sizeof(int), cudaMemcpyDeviceToHost, stream3 );

cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);
cudaStreamSynchronize(stream3);

//---------------------------------------------------------------------------------------

I use the visual profiler nvprof to see the results of the execution, and I see that the streams are not running in parallel. Instead, I get serial execution, and I cannot understand why. Each kernel uses just 1 thread.
I am using a Jetson X1.

It would be very useful if you could help me achieve fully concurrent streams.
Thank you all!

Have you allocated the host buffers using pinned memory?

Yes.
The allocation of the host arrays is:

cudaMallocHost((void **)&img, 512 * 384 * sizeof(int));
cudaMallocHost((void **)&wrkx, 512 * 384 * sizeof(int));
cudaMallocHost((void **)&wrky, 512 * 384 * sizeof(int));

or

cudaHostAlloc((void **)&img, 512 * 384 * sizeof(int), cudaHostAllocMapped);
cudaHostAlloc((void **)&wrkx, 512 * 384 * sizeof(int), cudaHostAllocMapped);
cudaHostAlloc((void **)&wrky, 512 * 384 * sizeof(int), cudaHostAllocMapped);

Whether I use the first allocation or the second, the result is the same: the streams do not run in parallel.

I suggest you provide the entire code example, as well as pictures from the visual profiler (so that we know exactly which parts of the execution are not parallel). The code snippet looks suspicious, but maybe you just omitted some parts.

in particular:

  1. the example in Programming Guide :: CUDA Toolkit Documentation uses NON-OVERLAPPING parts of the buffers in different streams, while you use the same buffers in every stream
  2. you should allocate all buffers before execution starts
  3. data copies may not overlap because the GPU has a limited number of copy engines
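
To illustrate points 1 and 2, here is a minimal sketch, not a drop-in fix for the posted code. It assumes a hypothetical kernel addOne standing in for s1 (whose source was not posted), and the buffer names h_in, h_out, d_in and d_out are made up for the example. Each stream gets its own pinned host buffers and its own device buffers, and everything is allocated before any asynchronous work is issued:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the s1 kernel: writes in[i] + 1 to out[i].
__global__ void addOne(const int *in, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1;
}

// Runs the same copy-in / kernel / copy-out pipeline in three streams,
// each with private buffers. Returns true if all results are correct.
bool runThreeStreams()
{
    const int kStreams = 3;
    const int n = 384 * 512;               // elements per stream, as in the question
    const size_t bytes = n * sizeof(int);

    cudaStream_t stream[kStreams];
    int *h_in[kStreams], *h_out[kStreams]; // pinned host buffers (assumed names)
    int *d_in[kStreams], *d_out[kStreams]; // device buffers (assumed names)

    // Allocate and initialise everything before issuing asynchronous work.
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&stream[s]);
        cudaMallocHost((void **)&h_in[s], bytes);
        cudaMallocHost((void **)&h_out[s], bytes);
        cudaMalloc((void **)&d_in[s], bytes);
        cudaMalloc((void **)&d_out[s], bytes);
        for (int i = 0; i < n; ++i) h_in[s][i] = i + s;
    }

    // No buffer is shared between streams, so the streams are independent.
    for (int s = 0; s < kStreams; ++s) {
        cudaMemcpyAsync(d_in[s], h_in[s], bytes, cudaMemcpyHostToDevice, stream[s]);
        addOne<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_in[s], d_out[s], n);
        cudaMemcpyAsync(h_out[s], d_out[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
    }

    // Wait for all streams and pick up any asynchronous error.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int s = 0; s < kStreams && ok; ++s)
        for (int i = 0; i < n; ++i)
            if (h_out[s][i] != i + s + 1) { ok = false; break; }

    for (int s = 0; s < kStreams; ++s) {
        cudaFree(d_in[s]);
        cudaFree(d_out[s]);
        cudaFreeHost(h_in[s]);
        cudaFreeHost(h_out[s]);
        cudaStreamDestroy(stream[s]);
    }
    return ok;
}

int main()
{
    printf("%s\n", runThreeStreams() ? "results OK" : "FAILED");
    return 0;
}
```

Whether the copies and kernels actually overlap in the profiler still depends on the device, as point 3 says; the sketch only removes the shared-buffer dependency between streams.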

also take into account Programming Guide :: CUDA Toolkit Documentation

PS: note that this forum has a ‘code’ tag for code samples; it’s the last button at the top of the message edit box

txbob and BulatZiganshin, thank you for your answers. Finally it works. I don’t know what was wrong, but with the same code it now works!

BulatZiganshin, thank you for your advice. I am sorry, but I am new to CUDA programming and to programming forums in general.

Now I have another problem. I use the same code I posted, with one difference: I no longer launch my kernels with 1 block of 1 thread, but with 512 blocks of 512 threads each:

dim3 block (512);
dim3 grid(512) ;

When I use nvvp, I see that my kernels are not executing concurrently. I searched for an explanation and found that this is normal, because the GPU does not have the resources to do that. My question is:
If I have 3 independent kernels and each kernel needs 512x512 threads, is there any other way to execute these 3 kernels concurrently? I experimented a little on my GPU and found that the maximum number of threads per kernel that still gives me fully concurrent kernels is 1024. What should I do if I need more?

Thank you in advance!

The machine does not have infinite capacity.

WHY do you need to run them concurrently? If they run concurrently and take 3 seconds in total, is that any better than running them sequentially, each in 1 second?

512*512=256K threads, more than any GPU can run simultaneously. So it’s reasonable that one kernel finishes its work before the next kernel starts. They may overlap slightly at the end of the first kernel’s execution, when the number of remaining active threads in that kernel is less than the GPU can run concurrently, provided your code fulfills all the requirements I mentioned above.
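
To put rough numbers on this, here is a minimal sketch that compares the launch size with the number of threads the device can hold at once. It assumes device 0 is the GPU in question; residentThreadCapacity and wavesNeeded are hypothetical helper names. The result is only an upper bound, since registers and shared memory can lower the real number of resident threads:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Upper bound on threads resident on the whole GPU at one time.
long long residentThreadCapacity(int smCount, int maxThreadsPerSm)
{
    return (long long)smCount * maxThreadsPerSm;
}

// How many "waves" of resident threads a launch needs, rounded up.
long long wavesNeeded(long long totalThreads, long long capacity)
{
    return (totalThreads + capacity - 1) / capacity;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("could not query device 0\n");
        return 1;
    }

    long long capacity = residentThreadCapacity(prop.multiProcessorCount,
                                                prop.maxThreadsPerMultiProcessor);
    long long launch = 512LL * 512LL;   // 512 blocks of 512 threads

    printf("resident thread capacity: %lld\n", capacity);
    printf("threads per kernel launch: %lld\n", launch);
    printf("waves needed per kernel: %lld\n", wavesNeeded(launch, capacity));
    return 0;
}
```

As an illustration with made-up figures, a device with 2 multiprocessors holding 2048 threads each has a capacity of 4096, so one 512x512 launch already needs 64 waves by itself, which leaves no room for a second kernel until the first one is nearly done.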

Yes, I understand why it is reasonable! Thank you for this.
I was wondering whether the total execution time of the three kernels could be 1 second, where 1 second is the execution time of one kernel, but I now understand that this is impossible.
Additionally, I know that when I launch a kernel with 256K threads, that doesn’t mean 256K threads will run exactly in parallel; instead, the threads are scheduled onto the available hardware. I was wondering whether something like this would happen across multiple kernels too. For example:
-1024 threads from 1st kernel
-512 threads from 3rd kernel
-1024 threads from 2nd kernel
-etc…

I see that it makes no difference to the total time, but I would like to understand the hardware side of streams.

I have found some information about CUDA streams from the software side, but nothing from the hardware side.
Could you please recommend some material on the hardware side of CUDA streams?