Call kernel multiple times, trying to use streams

Hello, I have a kernel that I am calling in a loop, something like:

int NX = 80, NY = 80;

dim3 BlocksDim( BlocksPerGridX, BlocksPerGridY );           // 16 * 16
dim3 ThreadsPerBlock( ThreadsPerBlockX, ThreadsPerBlockY ); // 16 * 16

for ( int i = 0; i < NX * NY; i++ )
{
    gpuErrchk( cudaMemcpy( devX, X[ i ], Count[ i ] * sizeof( float ), cudaMemcpyHostToDevice ) );
    gpuErrchk( cudaMemcpy( devY, Y[ i ], Count[ i ] * sizeof( float ), cudaMemcpyHostToDevice ) );

    myKernel<<< BlocksDim, ThreadsPerBlock >>>(
        Count[ i ],  // size is: NX * NY
        devX,        // size is: the maximum value returned from 'Count' (around 1-100)
        devY,        // -||-
        devV,        // -||-
        devResult ); // size is: BlocksDim * ThreadsPerBlock
}

I want to ask whether I can use streams in this situation, and how to use them. I am trying something like:

const int NbOfStreams = 32;
const int N = BlocksDim * ThreadsPerBlock;

const int StreamSizeResult  = N / NbOfStreams;
const int StreamBytesResult = StreamSizeResult * sizeof( float );

const int StreamSize  = MaxCount / NbOfStreams;
const int StreamBytes = StreamSize * sizeof( float );

// allocate the host memory for X, Y, Count

// create events and streams
cudaEvent_t startEvent, stopEvent;
cudaStream_t stream[ NbOfStreams ];
checkCuda( cudaEventCreate( &startEvent ) );
checkCuda( cudaEventCreate( &stopEvent ) );

for ( int i = 0; i < NbOfStreams; i++ )
    cudaStreamCreate( &stream[ i ] );

for ( int i = 0; i < NbOfStreams; i++ )
{
    int offset = i * StreamSizeResult;

    cudaMemcpyAsync( &devX[ offset ], &X[ offset ], StreamBytes, cudaMemcpyHostToDevice, stream[ i ] );
    cudaMemcpyAsync( &devY[ offset ], &Y[ offset ], StreamBytes, cudaMemcpyHostToDevice, stream[ i ] );

    myKernel<<< StreamSizeResult / ThreadsPerBlock, ThreadsPerBlock, 0, stream[ i ] >>>(
        Count[ i ],
        devX,
        devY,
        devV,
        devResult );

    cudaMemcpyAsync( &host_Result[ offset ], &devResult[ offset ], StreamBytesResult, cudaMemcpyDeviceToHost, stream[ i ] );
}

I have difficulty setting up the above properly. I am not sure about the stream sizes, etc.

I used one size (StreamSizeResult) for devResult and another (StreamSize) for devX and devY.

I am not sure how to put it all together. Any help is appreciated.

Thanks!

You generally need three conditions to benefit from streams:
a) multiple tasks (memory copies, kernels)
b) dependencies between tasks that allow some of the tasks to be moved 'out of sequence' and instead run concurrently with other tasks
c) idle capacity
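To illustrate the basic pattern those conditions enable, here is a minimal, self-contained sketch (not your exact kernel; `myKernel` here is a placeholder) of the canonical copy/compute/copy overlap across streams. One assumption worth stressing: the host buffers must be pinned (allocated with cudaMallocHost), otherwise cudaMemcpyAsync degrades to effectively synchronous behaviour and you get no overlap.

```cuda
// Hedged sketch: overlapping H2D copy, kernel, and D2H copy with streams.
// myKernel and the sizes are placeholders, not the kernel from the question.
#include <cuda_runtime.h>

__global__ void myKernel(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = 2.0f * x[i];   // placeholder work
}

int main()
{
    const int nStreams   = 4;
    const int chunk      = 1 << 20;            // elements per stream
    const int n          = nStreams * chunk;
    const int chunkBytes = chunk * sizeof(float);

    float *hX, *hY, *dX, *dY;
    cudaMallocHost(&hX, n * sizeof(float));    // pinned host memory: required
    cudaMallocHost(&hY, n * sizeof(float));    // for truly async copies
    cudaMalloc(&dX, n * sizeof(float));
    cudaMalloc(&dY, n * sizeof(float));

    cudaStream_t stream[nStreams];
    for (int i = 0; i < nStreams; i++)
        cudaStreamCreate(&stream[i]);

    for (int i = 0; i < nStreams; i++) {
        int off = i * chunk;                   // each stream owns its own slice
        cudaMemcpyAsync(dX + off, hX + off, chunkBytes,
                        cudaMemcpyHostToDevice, stream[i]);
        myKernel<<<(chunk + 255) / 256, 256, 0, stream[i]>>>(dX + off, dY + off, chunk);
        cudaMemcpyAsync(hY + off, dY + off, chunkBytes,
                        cudaMemcpyDeviceToHost, stream[i]);
    }
    cudaDeviceSynchronize();                   // wait for all streams to finish

    for (int i = 0; i < nStreams; i++) cudaStreamDestroy(stream[i]);
    cudaFreeHost(hX); cudaFreeHost(hY);
    cudaFree(dX); cudaFree(dY);
    return 0;
}
```

Operations issued into the same stream still serialize; overlap only happens between different streams, which is why each stream gets its own disjoint slice of the data.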

You may equally chop a kernel up into smaller sub-kernels, by chopping its kernel dimensions, but only if the dependencies of the kernel's work permit this - i.e. the new sub-kernels must be able to run independently of each other. And you need a good reason for this, such as:
a) attempting to reduce the memory footprint
b) attempting to reduce device dead/idle gaps that normally form when the device and host need to synchronize with each other, and/or when the device kernel dimensions change across a sequence of kernels to adapt to the work
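In your loop, the iterations already look independent (each uses its own X[ i ], Y[ i ], Count[ i ]), so one way to apply this is to round-robin the iterations over a few streams. A hedged sketch, under stated assumptions: X[ i ] and Y[ i ] point at pinned host buffers, Count[ i ] <= MaxCount, and devX/devY/devResult become per-stream buffers so concurrent iterations do not overwrite each other (the kernel body and the devV argument are placeholders, not your actual code):

```cuda
// Hedged sketch: round-robining the NX*NY independent iterations from the
// question over a few streams, with one device buffer set per stream.
#include <cuda_runtime.h>
#include <cstddef>

__global__ void myKernel(int count, const float *x, const float *y,
                         const float *v, float *result)
{
    // Placeholder body; assumes result can hold at least 'count' elements.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) result[i] = x[i] + y[i];
}

// X and Y are arrays of host pointers, as in the question; MaxCount and
// ResultSize are assumed upper bounds on Count[i] and the result size.
void runAll(float **X, float **Y, int *Count, int NX, int NY,
            int MaxCount, int ResultSize)
{
    const int nStreams = 4;
    cudaStream_t stream[nStreams];
    float *devX[nStreams], *devY[nStreams], *devResult[nStreams];

    for (int s = 0; s < nStreams; s++) {
        cudaStreamCreate(&stream[s]);
        cudaMalloc(&devX[s], MaxCount * sizeof(float));
        cudaMalloc(&devY[s], MaxCount * sizeof(float));
        cudaMalloc(&devResult[s], ResultSize * sizeof(float));
    }

    for (int i = 0; i < NX * NY; i++) {
        int s = i % nStreams;  // round-robin; same-stream work still serializes,
                               // so reusing devX[s] across iterations is safe
        cudaMemcpyAsync(devX[s], X[i], Count[i] * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        cudaMemcpyAsync(devY[s], Y[i], Count[i] * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        myKernel<<<(Count[i] + 255) / 256, 256, 0, stream[s]>>>(
            Count[i], devX[s], devY[s], /* devV: placeholder */ nullptr,
            devResult[s]);
        // a cudaMemcpyAsync of devResult[s] back to a per-iteration host
        // slot would go here, on stream[s]
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; s++) {
        cudaStreamDestroy(stream[s]);
        cudaFree(devX[s]); cudaFree(devY[s]); cudaFree(devResult[s]);
    }
}
```

With Count[ i ] only around 1-100 elements, each transfer and kernel is tiny, so launch overhead may dominate; batching many small iterations into fewer, larger kernels may matter more than streams here.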

This would then guide you in your setup, e.g.:

"I have difficulty setting up the above properly. I am not sure about the stream sizes, etc."

OK, thank you for the information.