kernel launch overhead for GTX 280

Hi all,
I measured the kernel launch overhead for the GTX 280 at around 60 microseconds, but I’ve seen papers reporting this time as around 10 microseconds (including thread synchronization). Does this kernel launch overhead depend on the size of the grid and thread blocks? What is the standard way to measure this kernel launch overhead?
Thanks

No replies to this post?!! Please, guys…

A very long time ago, I looked at the launch overhead of the 8800 GTX as a function of the number of blocks and found that it increases linearly. I haven’t looked at it since then. It is a very simple benchmark to perform: just time an empty kernel with an increasing number of blocks and plot the results. I suggest you try it and post the plot!
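Something along these lines should do it; a minimal sketch using CUDA events (the kernel name, the block size of 128 and the repeat count are arbitrary choices):

#include <stdio.h>

__global__ void empty_kernel() {}

int main()
{
	const int repeats = 1000;
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	// One untimed launch so one-time driver/runtime setup is not counted.
	empty_kernel<<<1, 1>>>();
	cudaThreadSynchronize();

	for (int blocks = 1; blocks <= 16384; blocks *= 2) {
		cudaEventRecord(start, 0);
		for (int i = 0; i < repeats; ++i)
			empty_kernel<<<blocks, 128>>>();
		cudaEventRecord(stop, 0);
		cudaEventSynchronize(stop);

		float ms = 0.0f;
		cudaEventElapsedTime(&ms, start, stop);
		printf("%6d blocks: %f us per launch\n", blocks, 1000.0f * ms / repeats);
	}

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	return 0;
}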

I found this paper, http://www.netlib.org/lapack/lawnspdf/lawn202.pdf, which says in section 3.1 that the kernel launch time is around 10 microseconds for older GPUs. I measured this time as 60 microseconds on a GTX 280. I assumed the launch time on the 280 must be lower than, or at least equal to, that of older GPUs; am I right?

The launch overhead (for small kernels) has steadily improved as CUDA has matured. Current hardware with CUDA 2.3 typically gives a launch overhead of around 7-8 microseconds.

You aren’t running on Windows Vista or Windows 7, by chance? The overhead on those platforms is much higher compared to Linux.

My lunch time is around 45 minutes to an hour, but then I don’t really like to eat kernels as I am not a vegetarian. ;-)

Christian

I’m running on Windows XP64 and CUDA 2.3! Could this be the reason? :blink:

Good one!

XP64 launch overhead is <10 microseconds, depending on your PCIe bandwidth. Vista/Win7 is… not.

I did this test a week ago; here are some numbers on a GTX 280 (ignore the data size column, it is an empty function call):

Empty kernel call only:
datasize(bytes) blockdim griddim time(s)
1 (1,1,1) (1,1,1) 0.000015
2 (2,1,1) (1,1,1) 0.000014
4 (4,1,1) (1,1,1) 0.000016
8 (8,1,1) (1,1,1) 0.000015
16 (16,1,1) (1,1,1) 0.000015
32 (32,1,1) (1,1,1) 0.000015
64 (64,1,1) (1,1,1) 0.000014
128 (128,1,1) (1,1,1) 0.000015
256 (256,1,1) (1,1,1) 0.000016
512 (512,1,1) (1,1,1) 0.000015
1024 (512,1,1) (2,1,1) 0.000015
2048 (512,1,1) (4,1,1) 0.000014
4096 (512,1,1) (8,1,1) 0.000015
8192 (512,1,1) (16,1,1) 0.000014
16384 (512,1,1) (32,1,1) 0.000017
32768 (512,1,1) (64,1,1) 0.000016
65536 (512,1,1) (128,1,1) 0.000017
131072 (512,1,1) (256,1,1) 0.000018
262144 (512,1,1) (512,1,1) 0.000022
524288 (512,1,1) (1024,1,1) 0.000029
1048576 (512,1,1) (2048,1,1) 0.000039
2097152 (512,1,1) (4096,1,1) 0.000064
4194304 (512,1,1) (8192,1,1) 0.000115
8388608 (512,1,1) (16384,1,1) 0.000212

Do these times include the cudaThreadSynchronize() too?

Would it be possible to share your code so I can do some comparison?

I inserted my code here. For a GTX 280 on Windows XP 64, the resulting time including cudaThreadSynchronize() is 0.48 ms, and excluding cudaThreadSynchronize() it is 0.039 ms!! Is that a lot?! :blink:

Thanks.

#include <stdio.h>
#include <cutil.h>

__global__ void HelloCUDA2(float* device_result1)
{
	// Intentionally empty: only the launch overhead is being measured.
}

int main(int argc, char* argv[])
{
	float *d_A;
	int N = 1500;
	CUDA_SAFE_CALL( cudaMalloc((void**) &d_A, sizeof(float) * N * N) );

	unsigned int timer = 0;
	CUT_SAFE_CALL( cutCreateTimer(&timer) );
	CUT_SAFE_CALL( cutStartTimer(timer) );

	HelloCUDA2<<<(N + 127) / 128, 128>>>(d_A);
	CUDA_SAFE_CALL( cudaThreadSynchronize() );

	CUT_SAFE_CALL( cutStopTimer(timer) );
	printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
	CUT_SAFE_CALL( cutDeleteTimer(timer) );

	CUDA_SAFE_CALL( cudaFree(d_A) );
	return 0;
}
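For comparison, the launch can also be timed with CUDA events, which have roughly half-microsecond resolution and measure the interval between the two events on the device. A minimal sketch, using the same kernel and sizes as above but the plain runtime API instead of the cutil helpers:

#include <stdio.h>

__global__ void HelloCUDA2(float* device_result1)
{
	// Empty kernel: only the launch is of interest.
}

int main(int argc, char* argv[])
{
	float *d_A;
	int N = 1500;
	cudaMalloc((void**)&d_A, sizeof(float) * N * N);

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	// Record an event, launch, record another event, then wait for the stop event.
	cudaEventRecord(start, 0);
	HelloCUDA2<<<(N + 127) / 128, 128>>>(d_A);
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);
	printf("Time between events: %f ms\n", ms);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	cudaFree(d_A);
	return 0;
}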

In my case it is much more than 10 microseconds!! I inserted my code in reply to gshi's post… please take a look and see if I am doing something wrong!?!

Thank you.

First kernel launch takes longer.

Try cooking your kernel first. No, this is not another food joke.

Cooking means launching it once (e.g. after the cudaMalloc) and doing a cudaThreadSynchronize(). Only then do your timing.
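Applied to the code posted above, a cooked version would look roughly like this (a sketch with the same cutil timer; the only new part is the extra launch plus synchronize before the timer starts):

#include <stdio.h>
#include <cutil.h>

__global__ void HelloCUDA2(float* device_result1) {}

int main(int argc, char* argv[])
{
	float *d_A;
	int N = 1500;
	CUDA_SAFE_CALL( cudaMalloc((void**) &d_A, sizeof(float) * N * N) );

	// "Cook" the kernel: one untimed launch absorbs the one-time setup cost.
	HelloCUDA2<<<(N + 127) / 128, 128>>>(d_A);
	CUDA_SAFE_CALL( cudaThreadSynchronize() );

	unsigned int timer = 0;
	CUT_SAFE_CALL( cutCreateTimer(&timer) );
	CUT_SAFE_CALL( cutStartTimer(timer) );

	// Timed launch of the already-cooked kernel.
	HelloCUDA2<<<(N + 127) / 128, 128>>>(d_A);
	CUDA_SAFE_CALL( cudaThreadSynchronize() );

	CUT_SAFE_CALL( cutStopTimer(timer) );
	printf("Processing time: %f (ms)\n", cutGetTimerValue(timer));
	CUT_SAFE_CALL( cutDeleteTimer(timer) );

	CUDA_SAFE_CALL( cudaFree(d_A) );
	return 0;
}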

Yes, the first kernel launch needs to be outside the loop. Here is the code:

#include <stdlib.h>
#include <stdio.h>
#include <cuda.h>
#include <unistd.h>
#include <sys/time.h>
#include <sched.h>

/* Synchronize, then check for and report any pending CUDA error. */
#define CUERR  do{ cudaError_t err; \
		cudaThreadSynchronize(); \
		if ((err = cudaGetLastError()) != cudaSuccess) { \
			printf("ERROR: CUDA error: %s, line %d\n", cudaGetErrorString(err), __LINE__); \
			exit(-1); \
		}}while(0)

/* Wall-clock time in seconds. */
double
gettime(void)
{
	struct timeval t;
	gettimeofday(&t, NULL);
	return t.tv_sec + 0.000001 * t.tv_usec;
}

/* Empty kernel: only the launch overhead is measured. */
__global__ void
empty_kernel(void* p, int len)
{
}

int
kernel_launch_test()
{
	char* A = NULL;
	char* hostA = NULL;
	double t0, t1;
	unsigned int max_data_size = 1 << 29;
	int i;
	int repeat_num = 1000;

	char hostname[64];
	gethostname(hostname, 64);
	printf("Running on host %s\n", hostname);

	/* CPU affinity mask for core 0 (prepared but not applied here). */
	cpu_set_t cpuset;
	CPU_ZERO(&cpuset);
	CPU_SET(0, &cpuset);

	int device = 1;   /* GPU index to test */
	cudaSetDevice(device); CUERR;

	int datasize = 1;
	cudaMallocHost((void**)&hostA, max_data_size); CUERR;
	cudaMalloc((void**)&A, max_data_size); CUERR;
	cudaMemcpy(A, hostA, datasize, cudaMemcpyHostToDevice); CUERR;

	/* Warm-up: the first launch is kept outside the timed loop. */
	empty_kernel<<<1, 1>>>(A, datasize);

	printf("datasize(bytes)  blockdim	  griddim		time(s)\n");

	while (datasize <= max_data_size) {
		int blocksize = (datasize <= 512) ? datasize : 512;
		int gridsize_x = datasize / blocksize;
		int gridsize_y = 1;
		if (gridsize_x == 0) {
			gridsize_x = 1;
		}
#define K64 (32*1024)
		/* Split the grid into 2D once the x dimension exceeds 32K blocks. */
		if (gridsize_x > K64) {
			gridsize_y = gridsize_x / K64;
			gridsize_x = K64;
		}
		dim3 griddim(gridsize_x, gridsize_y);

		/* Time repeat_num launches and report the average per launch.
		   Note: CUERR synchronizes after every launch, so that cost is included. */
		t0 = gettime();
		for (i = 0; i < repeat_num; i++) {
			//cudaMemcpy(A, hostA, datasize, cudaMemcpyHostToDevice); CUERR;
			empty_kernel<<<griddim, blocksize>>>(A, datasize); CUERR;
		}
		cudaThreadSynchronize(); CUERR;
		t1 = gettime();

		double gpu_time = t1 - t0;
		printf("%10d\t(%d,%d,%d) \t(%d,%d,%d) \t%f\n",
			   datasize,
			   blocksize, 1, 1,
			   griddim.x, griddim.y, griddim.z,
			   gpu_time / repeat_num);
		fflush(stdout);

		datasize *= 2;
	}

	cudaFreeHost(hostA); CUERR;
	cudaFree(A); CUERR;
	cudaThreadExit();
	return 0;
}

int main()
{
	kernel_launch_test();
}

Thanks, I had completely forgotten this point…

I cooked the kernel, and now it is close to 20 microseconds! That is still large, isn’t it?!

This also depends on your CPU speed, I would assume. A lower-clocked CPU might take 20 microseconds, a faster one only 15 microseconds, to perform the same task.

Thanks for the code. BTW, what is your platform: Linux or Windows?!

I tried to compile your code, but it cannot find sched.h and unistd.h!
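unistd.h, sys/time.h and sched.h are Linux headers, so that code will only build as-is on Linux. As a rough sketch for a Windows build, you could drop the gethostname and CPU-affinity lines and swap gettime() for a QueryPerformanceCounter-based version, for example:

#include <windows.h>

/* Wall-clock time in seconds; Windows replacement for the gettimeofday() version. */
double gettime(void)
{
	LARGE_INTEGER freq, now;
	QueryPerformanceFrequency(&freq);
	QueryPerformanceCounter(&now);
	return (double)now.QuadPart / (double)freq.QuadPart;
}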

Mine is an AMD Phenom 9850 at 2.5 GHz with 4 GB RAM, and the PCIe bandwidth is 8 GB/s.

I just want to be sure the GPU is running at its best efficiency!