concurrent kernels

JA_rgen · April 29, 2011, 3:32pm

Hi, I am trying to speed up my application through concurrent kernel execution. I am using a GTX 580, but when I increase the number of kernels it gets slower. Does anybody have an idea? Here is my awful code :-)

int nkernels = 8;
int tile = N / nkernels;          // input elements handled by each kernel
int offset = blocks / nkernels;   // thread blocks (and output elements) per kernel
int nstreams = nkernels + 1;      // one stream per kernel, plus one that waits on all of them

dim3 dimBlock(threads, 1, 1);
dim3 dimGrid(offset, 1, 1);

cudaStream_t *streams = (cudaStream_t*) malloc(nstreams * sizeof(cudaStream_t));
for (int i = 0; i < nstreams; i++)
    cudaStreamCreate(&(streams[i]));

cudaEvent_t *kernelEvent = (cudaEvent_t*) malloc(nkernels * sizeof(cudaEvent_t));
for (int i = 0; i < nkernels; i++)
    cudaEventCreateWithFlags(&(kernelEvent[i]), cudaEventDisableTiming);

for (int i = 0; i < nkernels; i++)
{
    // reduces one slice of the 1D array to an array of partial results; the final reduction is done on the CPU
    reduceKernel <<< dimGrid, dimBlock, 0, streams[i] >>> (&d_idata[i * tile], &d_odata[offset * i], tile);
    cudaEventRecord(kernelEvent[i], streams[i]);
    cudaStreamWaitEvent(streams[nstreams - 1], kernelEvent[i], 0);
}

// release resources
for (int i = 0; i < nkernels; i++) {
    cudaStreamDestroy(streams[i]);
    cudaEventDestroy(kernelEvent[i]);
}
cudaStreamDestroy(streams[nstreams - 1]);   // the extra waiting stream has no matching event
free(streams);
free(kernelEvent);

cudaMemcpy(h_tmp, d_odata, blocks * sizeof(computetype), cudaMemcpyDeviceToHost);

The result is correct, but the compute time is longer than with a single kernel launch.

YDD · April 29, 2011, 3:46pm

It doesn’t make sense to launch multiple copies of the same kernel if you can just do one big kernel launch… there’s an overhead on each launch, so why would you want to keep paying it? If you can overlap transfers to and from the GPU with kernels, that would help the overall wall time. Also, if you had a bunch of different small kernels, none of which could keep the GPU occupied, then it might make sense to try having more than one kernel active at once. But if you can do everything in one kernel, why wouldn’t you?
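A minimal sketch of the transfer/kernel overlap mentioned above, not the original poster's code. It assumes a hypothetical stand-in kernel `scaleKernel`, an illustrative array size and stream count, and pinned host memory (which asynchronous copies need in order to overlap with kernel execution). Each stream copies its chunk in, processes it, and copies it back, so the copies of one chunk can run while another chunk is being computed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel (assumption): doubles each element in place.
__global__ void scaleKernel(float *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= 2.0f;
}

int main()
{
    const int N = 1 << 22;          // total elements (illustrative value)
    const int nstreams = 4;         // number of chunks/streams (illustrative value)
    const int chunk = N / nstreams; // elements per stream
    const int threads = 256;
    const int blocks = (chunk + threads - 1) / threads;

    float *h_data = NULL;
    float *d_data = NULL;
    // Pinned (page-locked) host memory so cudaMemcpyAsync can overlap with kernels.
    cudaMallocHost((void**)&h_data, N * sizeof(float));
    cudaMalloc((void**)&d_data, N * sizeof(float));
    for (int i = 0; i < N; i++)
        h_data[i] = 1.0f;

    cudaStream_t streams[nstreams];
    for (int i = 0; i < nstreams; i++)
        cudaStreamCreate(&streams[i]);

    // Per stream: copy chunk in, process it, copy it back.
    // Work in different streams may overlap; work within one stream stays ordered.
    for (int i = 0; i < nstreams; i++) {
        float *h = h_data + i * chunk;
        float *d = d_data + i * chunk;
        cudaMemcpyAsync(d, h, chunk * sizeof(float), cudaMemcpyHostToDevice, streams[i]);
        scaleKernel<<<blocks, threads, 0, streams[i]>>>(d, chunk);
        cudaMemcpyAsync(h, d, chunk * sizeof(float), cudaMemcpyDeviceToHost, streams[i]);
    }

    // Wait for all streams to finish before reading the results on the host.
    cudaDeviceSynchronize();
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));
    else
        printf("first element after processing: %f\n", h_data[0]);

    for (int i = 0; i < nstreams; i++)
        cudaStreamDestroy(streams[i]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```

Whether this beats one large copy followed by one large kernel depends on the ratio of transfer time to compute time, so it is worth timing both variants on the target card.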

JA_rgen · May 2, 2011, 5:56am

Thanks for your answer.

I see the error in my reasoning now. I will try to overlap my kernels instead.
