Exactly how fast is this baby? (pretty fast, and don't call me baby)

Hey guys, I’d like to know something about kernel launches:

If I launch my kernel as follows:

kernel<<<A, B, 0>>>();
cudaThreadSynchronize();

Will there always be EXACTLY A*B kernel executions, no matter what the device?

I thought it would, but this one algorithm I made is absurdly fast: it apparently loops through 57 billion kernel launches per second on a GTX 260, and that's not even optimised code! I'm actually wondering whether all of those kernels are really executing :wacko:

I’ll check the profiler output in the meantime

Thx!

57 billion kernel launches per second

Wow, that is about 20 per CPU clock cycle ;) Maybe there is some error in your speed calculation?
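As a rough sanity check of that ratio (assuming a host CPU around 3 GHz, which is my assumption, not a figure from this thread), the claimed rate works out to roughly 19 launches per CPU clock cycle:

#include <cstdio>

int main()
{
    const double claimed_launches_per_s = 57e9;   // figure quoted above
    const double cpu_clock_hz           = 3e9;    // assumed ~3 GHz host CPU
    // 57e9 / 3e9 ≈ 19 "launches" per CPU clock cycle
    printf("launches per CPU cycle: %.1f\n", claimed_launches_per_s / cpu_clock_hz);
    return 0;
}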

Well here is my code:

////////////////////////////////////////////////////////////////////////////////////////////////////

#define GRID_SIZE   (1024*64-1)
#define THREAD_SIZE 512
#define LOOP_N_SIZE 1024

// kernel declaration:
__global__ static void x_gen(word temp) { /* code */ … }

// main code snippet:
unsigned int timer = 0;
CUT_SAFE_CALL( cutCreateTimer( &timer) );
CUT_SAFE_CALL( cutStartTimer( timer) );

for (int i = 0; i < LOOP_N_SIZE; i++)
{
	x_gen<<<GRID_SIZE, THREAD_SIZE, 0>>>(0);
	CUT_CHECK_ERROR("Kernel execution failed\n");
	CUDA_SAFE_CALL( cudaThreadSynchronize() );
}

CUT_SAFE_CALL( cutStopTimer( timer) );

cout << "Processing time: " << cutGetTimerValue( timer ) << "ms\n";
cout << fixed << showpoint << setprecision(2)
     << "x's generated per millisecond: "
     << double(GRID_SIZE) * double(THREAD_SIZE) * double(LOOP_N_SIZE) /
        double(cutGetTimerValue( timer ))
     << endl;

////////////////////////////////////////////////////////////////////////////////////////////////////

// output: 35m per MILLISECOND!!?

Processing time: 970.456ms
x's generated per millisecond: 35405237.44

OK, I didn't understand you in your first post; I thought that one kernel launch meant one call of x_gen<<<GRID_SIZE, THREAD_SIZE, 0>>>(0) for all threads. 35 billion thread executions per second is possible: counting all the cores, a GTX 260 does about 192 * 1.2G ≈ 230 billion shader clock cycles per second, which is only about 6 or 7 cycles per thread at that rate, so if each thread does very little work and writes almost nothing, it can get through threads that fast ;)
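To make the launch-vs-thread distinction concrete, here is a minimal sketch (my own illustration, not code from this thread; the counter and kernel names are made up) that counts how many threads actually execute for a single <<<grid, block>>> launch. It should print grid * block, i.e. 65535 * 512 = 33,553,920 for the configuration above:

#include <cstdio>
#include <cuda_runtime.h>

__device__ unsigned int g_count = 0;     // device-side counter, zeroed at load time

// Every thread that actually runs increments the counter exactly once.
__global__ void count_threads()
{
    atomicAdd(&g_count, 1u);
}

int main()
{
    const int grid  = 1024 * 64 - 1;     // same shape as the benchmark above
    const int block = 512;

    count_threads<<<grid, block>>>();
    cudaDeviceSynchronize();             // cudaThreadSynchronize() on old toolkits

    unsigned int count = 0;
    cudaMemcpyFromSymbol(&count, g_count, sizeof(count));
    printf("threads executed: %u (expected %d)\n", count, grid * block);
    return 0;
}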

Could you post your kernel code?

OK, I found out what the problem was (I think). I commented out my entire kernel code and the execution time was the same. Essentially, I wasn't returning any values from the kernel to the calling code because I just wanted to benchmark, so I'm guessing the smart NVIDIA compiler compiled out my entire kernel, since no results were actually being written anywhere. Who would have thought…

Don’t worry, almost everyone faces this problem sooner or later :)

Yes, if your kernel writes nothing to global memory, it will be optimized out and your code will just launch empty kernels.
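For anyone who hits this while benchmarking, a minimal sketch of the usual workaround (my own illustration, not the original x_gen; the output buffer and the stand-in arithmetic are made up) is to have every thread store its result to global memory, so the compiler cannot discard the work:

#include <cuda_runtime.h>

// Benchmark-friendly variant: each thread writes its result out, so the
// compiler has to keep the computation (at the cost of some memory traffic).
__global__ void x_gen_bench(unsigned int seed, unsigned int *out)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int x = seed ^ (idx * 2654435761u);   // stand-in for the real work
    out[idx] = x;                                  // this store keeps the kernel alive
}

int main()
{
    const int grid = 1024 * 64 - 1, block = 512;
    unsigned int *d_out = 0;
    cudaMalloc(&d_out, size_t(grid) * block * sizeof(unsigned int));

    x_gen_bench<<<grid, block>>>(0, d_out);
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}

With the store in place the timing also includes the memory writes, so the numbers will be lower, but at least they measure work that actually happened.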