I’ve been trying to trace this all day… Now slightly frustrated. :)
Timing the launch of a single null kernel shows it taking over 1 ms on this machine. Scaled up, launching the same null kernel (which just returns immediately, for debugging) over a grid of 10240 blocks x 128 threads takes 16 ms. That is 100x slower than MATLAB doing the full operation. Surely this can't be right?
From what I've been reading, the time to launch a kernel should be on the order of 20 us?
If I butcher one of the examples in the CUDA 7.0 installation and do the same timing of a single kernel launch, I get the same result.
I’ve only just started out with CUDA, so I don’t know what information I should provide, but here is what I can start with:
MSVC 2013 Professional C++
Core i7 at 2.8 GHz
Code generation is set as “compute_20,sm_20;compute_30,sm_30”
The GPU is running at about 50 degC, and has 60% memory free.
This is the way I am timing the launch:
cudaDeviceSynchronize();
cudaEventRecord(start);
kSumSq <<< 1, 1 >>>(d_mean_sq, d_in, n_est, estlen);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&times, start, stop);
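In case it helps, here is a self-contained version of the test I'm running, reduced to a null kernel (a minimal sketch: the kernel name and the `ms` variable are stand-ins for my real code, and the untimed warm-up launch is something I added after reading that one-off initialisation can inflate the first measurement):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Null kernel: returns immediately, so any measured time is launch overhead.
__global__ void kNull() {}

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Untimed warm-up launch, to exclude one-off context/JIT setup costs
    // from the measurement.
    kNull<<<1, 1>>>();
    cudaDeviceSynchronize();

    // Timed launch at the problem size from my test (10240 blocks x 128 threads).
    cudaEventRecord(start);
    kNull<<<10240, 128>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("launch + null-kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```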
This code has run much faster in the past (35 ms for 262144 x 128 launches running my full kernel code), but I cannot see what is different now.
If anyone could give any advice on how to go about tracking down the cause, I'd be grateful.
Kind regards, Kevin