Hey guys, I’d like to know something about kernel launches:
If I launch my kernel as follows:
kernel<<<A, B, 0>>>();
cudaThreadSynchronize();
Will there always be EXACTLY A*B kernel executions, no matter what device it runs on?
I thought it would, but this one algorithm I made is absurdly fast and apparently loops through 57 billion kernel launches per second on a GTX 260, and that's not even optimised code! I'm actually wondering if all of those kernels are really executing :wacko:
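One way to check this is to count the threads that actually run with an atomic counter. A minimal sketch (the count_threads kernel and the A/B values here are made up for illustration; it should print A*B on any device if every thread really executes):

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: every thread that actually runs
// increments the counter exactly once.
__global__ void count_threads(unsigned int *counter)
{
    atomicAdd(counter, 1u);
}

int main()
{
    unsigned int h_count = 0, *d_count;
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemcpy(d_count, &h_count, sizeof(h_count), cudaMemcpyHostToDevice);

    const int A = 1024, B = 256;          // example grid and block sizes
    count_threads<<<A, B, 0>>>(d_count);
    cudaThreadSynchronize();              // cudaDeviceSynchronize() on newer toolkits

    cudaMemcpy(&h_count, d_count, sizeof(h_count), cudaMemcpyDeviceToHost);
    printf("threads executed: %u (expected %d)\n", h_count, A * B);
    cudaFree(d_count);
    return 0;
}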
OK, I didn't understand you in your first post, as I thought that one kernel launch is one call to x_gen<<<GRID_SIZE, THREAD_SIZE, 0>>>(0) covering all of its threads. 35G thread launches per second is possible: the GTX 260 performs 192 * 1.2G clock cycles per second if you count all the cores, so if a thread doesn't do too much and doesn't write much, threads can be processed that fast ;)
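Back-of-the-envelope, assuming the stock specs of the original GTX 260 (192 SPs at a ~1.24 GHz shader clock):

192 * 1.24e9 ≈ 2.4e11 instruction issues per second
2.4e11 / 35e9 threads per second ≈ 6-7 issues per thread

So a rate like that is only plausible if each thread does almost nothing.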
OK, I found out what the problem was (I think). I commented out my entire kernel code and the execution time was the same. Essentially, I wasn't writing any results back for the calling code to read because I just wanted to benchmark, so I'm guessing the smart NVIDIA compiler compiled out my entire kernel since no results were actually being produced! Who would have thought…
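For anyone hitting the same thing: the usual workaround when benchmarking is to give the kernel an observable side effect, typically a store to global memory, so the compiler can't prove the work is dead. A sketch with a made-up busy_loop kernel:

// Hypothetical benchmark kernel: the store into `out` is an observable
// side effect, so the compiler cannot discard the loop as dead code.
__global__ void busy_loop(float *out, int iters)
{
    float acc = 0.0f;
    for (int i = 0; i < iters; ++i)
        acc += i * 0.5f;                              // the work being timed
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc; // keep the result live
}

A single store per thread is enough to keep the body alive; another common trick is to make the store conditional on a flag that is never true at runtime, so the write never actually happens but the compiler still can't optimise the work away.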