The canonical way to trigger CUDA context initialization used to be a call to cudaFree(0), and as far as I know that has not changed. Any performance measurements of CUDA APIs should exclude context initialization time. I would have thought that this is common knowledge eight years into CUDA's public existence, but maybe not.
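For anyone unsure what that looks like in practice, here is a minimal sketch, not anyone's actual benchmark code. The kernel `scale`, the array size, and the launch configuration are hypothetical placeholders I made up for illustration. The point is only the ordering: force context creation with cudaFree(0), do an untimed warm-up launch, and only then time the work with CUDA events.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; it stands in for whatever is being measured.
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main()
{
    const int n = 1 << 20;          // assumed problem size, for illustration only
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Force context creation up front so that its cost is not attributed
    // to the first "real" CUDA API call that follows.
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "context init failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float *d_x = nullptr;
    err = cudaMalloc(&d_x, n * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemset(d_x, 0, n * sizeof(float));

    // Untimed warm-up launch, so one-time launch overhead is also excluded.
    scale<<<blocks, threads>>>(d_x, 2.0f, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Timed region: only the kernel, with the data already on the device.
    cudaEventRecord(start);
    scale<<<blocks, threads>>>(d_x, 2.0f, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}
```

In a real measurement you would probably repeat the timed region several times and report the minimum or median rather than a single sample.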
NVIDIA may want to consider adding a sticky post to these forums pointing this out.
Just installed the latest and greatest v358.50 driver.
Of my three mex functions, two are within 10% of their timings with the v344 driver on the Titan Black (one is 10% slower).
The third, which is the most I/O intensive (largest number of inputs), is still 30% slower (7.1 ms vs. 5.5 ms) on the Titan Black with the v358 driver. This timing is taken AFTER the data has been transferred to the GPU. The ONLY difference is the driver.
Now I will re-install the Titan X and see if that offers any improvement.
And our entire algorithm, running in MATLAB with a mix of native gpuArray functions and mex CUDA functions, takes 2.5x longer on the Titan X with v358 than it did on the Titan Black with v344.
So it seems GeForce driver development still has a long way to go for Titan X.