Concurrent Kernel execution overhead

Well, I'm new to CUDA and I'm experimenting with concurrent kernel execution, but so far my results are unsatisfactory...

As a simple example I have created a kernel that does the following:

__global__ void Teste(int *A)
{
	int idx = blockIdx.x*blockDim.x + threadIdx.x;

	__shared__ int kk[nelement];

	kk[threadIdx.x] = idx*idx;

	for(int i = 0; i < nelement; i++)
		kk[threadIdx.x] = idx*idx;

	__syncthreads();        // the result doesn't matter; the loop is only there to keep the kernel a little busy

	A[idx] = kk[threadIdx.x];
}

After this I've created n streams, where n is variable and depends on the purpose of the test.

So the host code for this is:

#include "cuda.h"

#include "stdio.h"

#define nKern 1

#define nelement 32

int *a_d;

	int *a_h;

	cudaStream_t *streams = (cudaStream_t*) malloc((nKern+1) * sizeof(cudaStream_t));	

	cudaMallocHost((void **)&a_h,nKern*sizeof(int)*nelement); 	

	cudaMalloc((void **)&a_d, sizeof(int) * nelement * nKern);

	

	

    	for(int i = 0; i < nKern+1; i++)

        	cudaStreamCreate(&(streams[i]));

	

	cudaEvent_t start_event, stop_event;

        cudaEventCreate(&start_event) ;

	cudaEventCreate(&stop_event) ;

	cudaEventRecord(start_event, 0);

	cudaEvent_t *kernelEvent;

	kernelEvent = (cudaEvent_t*) malloc(nKern * sizeof(cudaEvent_t));

	for(int i = 0; i < nKern; i++)

        cudaEventCreateWithFlags(&(kernelEvent[i]), cudaEventDisableTiming);

	for(int i=0; i < nKern; i++ )

	{

		Teste <<< 1,nelement, 0, streams[i] >>> (&a_d[i*nelement]);

		cudaEventRecord(kernelEvent[i], streams[i]);

		cudaStreamWaitEvent(streams[nKern], kernelEvent[i],0);

	}

	

	

	float elapsed_time;	

	cudaMemcpyAsync(a_h,a_d,nKern*sizeof(int)*nelement,cudaMemcpyDeviceToHost, streams[nKern]);

		

	cudaEventRecord(stop_event, 0) ;

	cudaEventSynchronize(stop_event) ;

    	cudaEventElapsedTime(&elapsed_time, start_event, stop_event) ;

		

	

	for(int i=0; i< nelement*nKern;i++)

		printf(" %d \n",a_h[i] );

	

	for(int i = 0; i < nKern; i++) {

		cudaStreamDestroy(streams[i]);

		cudaEventDestroy(kernelEvent[i]);

    	}	

	

	printf("%f ms \n", elapsed_time);

    	cudaEventDestroy(start_event);

    	cudaEventDestroy(stop_event);

    	cudaFreeHost(a_h);

    	cudaFree(a_d);

	free(streams);

    	cudaThreadExit();

One thing that I don't understand is why I get different execution times when setting different values of nKern (1...15). I have a GTX 480 on openSUSE 11.3 (64-bit).

So I would appreciate it if anyone could explain why I don't get the same execution time for nKern = 1 and nKern = 2...15.
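
A quick sanity check (just a side sketch, not part of the timing code above) would be to ask the runtime whether the card reports concurrent kernel support at all:

// side sketch: query whether the device reports concurrent kernel support
// (a Fermi card such as the GTX 480 should print 1 here)
#include "cuda_runtime.h"
#include "stdio.h"

int main()
{
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);      // device 0
	printf("concurrentKernels = %d\n", prop.concurrentKernels);
	return 0;
}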

best regards

Bruno Faria

As you are doubling the number of kernels you are also doubling the memory accesses that have to be shared with the other kernels. Hence you will see the computing time gradually increase with the number of kernels.

Before anything else, thanks for your fast reply!

I do understand what you have said, but in a case where only shared memory is used, shouldn't the kernels' times be equal? I'm thinking this because each SM has its own shared memory resource!

Thanks in advance
Bruno Faria

I believe your kernel is far too short to run several instances in parallel. You are basically just measuring the setup overhead of the kernel invocation. Do some real work in each kernel so that it runs for at least several milliseconds (better still, for about a second, so that you can see the result even without measurement) and does not finish before the second kernel starts.
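
Something along these lines (just a sketch; the TesteLong name and the iteration count are made up here) keeps every block busy long enough for overlap between streams to become visible, and the final store stops the compiler from removing the loop:

// sketch of a heavier kernel: the loop carries a dependency and the result
// is stored, so the work cannot be optimized away
__global__ void TesteLong(int *A, int iterations)
{
	int idx = blockIdx.x*blockDim.x + threadIdx.x;
	int acc = idx;

	for(int i = 0; i < iterations; i++)
		acc = acc*3 + i;        // artificial work with a loop-carried dependency

	A[idx] = acc;                   // store the result so the loop is kept
}

Launched for example as TesteLong <<< 1, nelement, 0, streams[i] >>> (&a_d[i*nelement], 1<<22), each instance should run long enough that later launches can overlap with it, instead of everything finishing inside the launch overhead.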

Yes, that kernel is too short, but it is just a small example that I put together to better understand concurrent kernels!
I've been working on a project that simulates the interaction between two different kinds of particles. In that project I have a mechanism that arranges the particle interactions, but it is limited to a single block. So I thought to myself: why not use concurrent kernels to run many independent samples of interactions? In that context I have tested the system with and without streams, and the streams version only gives a 1.5x improvement.
So I created the little version above to see what happens. That is why I'm asking for your help: in my project everything is independent and the memory used is primarily shared!

Thanks for your help!
Bruno Faria