Because of the general issue of kernel launch overhead (an overhead that has shrunk only very slowly over the past 15 years), a long-standing recommendation for CUDA programmers has been to avoid kernels with extremely short execution times.
Have you already explored ways of stuffing more work into each kernel, e.g. by exposing additional parallelism at the expense of an increased total count of floating-point operations? Such trade-offs are obviously not available for all iterative algorithms.
An orthogonal approach to increasing performance, independent of the driver used, would be to investigate ways of reducing the number of steps required in the iterative algorithm, e.g. via some sort of convergence acceleration scheme, or via changes to the floating-point precision used in intermediate steps. Again, this is applicable to some iterative algorithms and not others.
Does the CPU have to do intermediate steps between the kernel runs?
If not, and you use the kernel launches for grid-wide synchronization, you could try to launch a cooperative kernel instead and do grid-wide synchronization on the device.
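For illustration, a rough sketch of that pattern (the kernel name, body, and parameters below are placeholders, not anything from your code):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// run the whole iteration loop on the device instead of relaunching a kernel per step
__global__ void iterateKernel(float* data, int numSteps)
{
    cg::grid_group grid = cg::this_grid();
    for (int step = 0; step < numSteps; ++step)
    {
        // ... one iteration's worth of work for this thread on 'data' ...
        grid.sync();   // grid-wide barrier replaces a kernel relaunch
    }
}

// cooperative launch: all blocks must be resident on the device at the same time
void runIterations(float* data, int numSteps, dim3 grid, dim3 block, cudaStream_t stream)
{
    void* args[] = { &data, &numSteps };
    cudaLaunchCooperativeKernel((void*)iterateKernel, grid, block, args, 0, stream);
}

The catch is that the grid must be small enough for all blocks to be resident simultaneously (cudaOccupancyMaxActiveBlocksPerMultiprocessor helps to size it), and the device must support cooperative launches (cudaDevAttrCooperativeLaunch).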
Do all threads in the kernel need the result of the previous iteration of all other threads? Or is it only a local dependence?
If the dependence is only local, you can compute on overlapping areas and synchronize only locally (e.g. block-/SM-wide).
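In that local case, each block can load its tile plus a halo into shared memory and synchronize with __syncthreads() only, with no grid-wide barrier needed. (Rough sketch assuming a 1D 3-point stencil that has nothing to do with your actual algorithm; TILE and HALO are made-up parameters.)

#define TILE 256   // threads per block, one output element per thread (assumed)
#define HALO 1     // overlap with the neighboring tiles

__global__ void localStencil(const float* in, float* out, int n)
{
    __shared__ float tile[TILE + 2 * HALO];

    int gid = blockIdx.x * TILE + threadIdx.x;   // element owned by this thread
    int lid = threadIdx.x + HALO;                // its position inside the shared tile

    // load the tile plus its halo (clamped at the borders)
    tile[lid] = (gid < n) ? in[gid] : 0.0f;
    if (threadIdx.x < HALO) {
        int left  = blockIdx.x * TILE - HALO + threadIdx.x;
        int right = blockIdx.x * TILE + TILE + threadIdx.x;
        tile[threadIdx.x]               = (left >= 0) ? in[left]  : 0.0f;
        tile[TILE + HALO + threadIdx.x] = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();   // block-local barrier only

    if (gid < n)
        out[gid] = 0.25f * tile[lid - 1] + 0.5f * tile[lid] + 0.25f * tile[lid + 1];
}

// launch example: localStencil<<<(n + TILE - 1) / TILE, TILE, 0, stream>>>(in, out, n);

With a wider halo you can even run several iterations per launch with only block-local synchronization, at the price of redundant computation in the overlapping regions.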
I am very familiar with the topic. The current algorithm is already the result of a long effort of profiling and merging kernels wherever that was beneficial. I do not think I can optimize it any further.
Same answer: the kernel pipeline is optimal (AFAIK) and uses as much parallelism as possible.
Now, don't be fooled by the short execution times that I exhibit. Don't forget that they occur when my input data is light. For a "regular" input data stream, I can't complain at all about performance under WDDM (kernel launch times are then negligible compared with processing times).
My grievance with WDDM appeared when I observed what happens when the input data stream results in smaller packets of data to process.
In that case, there is clearly a bottleneck that is due to WDDM alone. And it is an "artificial" limitation, since switching to TCC fixes it.
That's why I think the topic is not my algorithm, but the runtime+driver system, which is not optimal compared to what it could be if only NVidia released a TCC mode for GeForce.
The history of the TCC driver strongly indicates that this is not going to happen. We need to deal with reality.
A kernel that takes a performance hit from WDDM also takes a performance hit from kernel launch overhead in general, and increasingly so as faster GPUs are deployed in the future. That was the motivation behind my questions: are there steps that could be taken to reduce this code's overall exposure to the launch latency issue?
Duly noted :-) The famous meme "on the internet nobody knows that you are a dog" has a corollary: on the internet nobody knows that you are an expert.
Yes, that's why I had hopes for MCDM and WDDM low-latency. (Yes, I am artificially increasing the number of links to that thread, just in case somebody at NVidia notices it and takes the time to answer.)
And you are right: there could have been some avenues that I did not even know I could explore. But that did not happen this time; it seems I am still bound by the WDDM limitation :-(
I might still try to create "super kernels" that just gather work from sub-kernels, but at the cost of more memory for the temporary buffers, which will have to be duplicated for sub-kernels that run concurrently (rather than pipelined). This is a solution I wanted to avoid because of the heavy template instantiation and the (very) long compile times, which will become even more critical with those super kernels.
(And this is a LOT of work and glue code compared to the very few opportunities where it could take place.)
A follow-up, just to add some interesting information.
As I said, I still had the opportunity to merge a few kernels, with something like this:
// pack of pointers passed by value as a kernel argument (element type assumed for illustration)
struct Ptrs { float* p[4]; };
// device-side worker shared by both kernels (body elided)
__device__ inline void _myKernel(float* dst, const float* src) { /* ... */ }

// original kernel: one instance per launch
__global__ void myKernel(float* dst, const float* src) { _myKernel(dst, src); }
myKernel<<<dim3(W, H, 1), dim3(w, h, 1), 0, stream>>>(dst, src);

// creating a "super kernel" to run a few instances in parallel when the input allows it,
// using the grid.z dimension to select the instance
__global__ void myMultiKernel(Ptrs dsts, Ptrs srcs) { _myKernel(dsts.p[blockIdx.z], srcs.p[blockIdx.z]); }
myMultiKernel<<<dim3(W, H, N), dim3(w, h, 1), 0, stream>>>(dsts, srcs);
The cost is hidden in building the "dsts" and "srcs" arguments to gather buffers together, in using more buffers since temporary storage cannot be shared, and probably in less efficient L2 cache usage (I am not sure about that).
Since my goal is to prove that WDDM kernel launches have too much latency, I expected a small but visible improvement, because I was then able to save a few kernel launches.
The first result was terrible: my overall performance was cut in half! Not an improvement, a huge regression.
It seems that the extra bytes used by the "larger" arguments of my myMultiKernel instances, and the implied greater pressure on registers, were a performance killer.
Just to be sure, I replaced those "larger" structures (to be honest, arrays of 4 pointers instead of 1) with a __device__ __constant__ symbol to which I pre-cudaMemcpy the arguments before launching the kernel, thus reducing the number of bytes of arguments for the actual multiKernel launch.
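Roughly, the constant-memory variant looks like this (simplified sketch, not my actual code; the Ptrs struct, the element type and the launch wrapper are placeholders):

struct Ptrs { float* p[4]; };   // same pack of 4 pointers as in the earlier snippet

__constant__ Ptrs c_dsts;       // kernel arguments now live in constant memory
__constant__ Ptrs c_srcs;

__device__ inline void _myKernel(float* dst, const float* src) { /* ... same worker as before ... */ }

__global__ void myMultiKernel()   // no per-launch argument bytes to marshal
{
    _myKernel(c_dsts.p[blockIdx.z], c_srcs.p[blockIdx.z]);
}

void launchMulti(const Ptrs& dsts, const Ptrs& srcs, dim3 grid, dim3 block, cudaStream_t stream)
{
    // copy the pointer packs to the __constant__ symbols just before the launch
    cudaMemcpyToSymbolAsync(c_dsts, &dsts, sizeof(Ptrs), 0, cudaMemcpyHostToDevice, stream);
    cudaMemcpyToSymbolAsync(c_srcs, &srcs, sizeof(Ptrs), 0, cudaMemcpyHostToDevice, stream);
    myMultiKernel<<<grid, block, 0, stream>>>();
}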
In that case, I get the expected behaviour: a small performance improvement compared to the non-multiKernel version.
So, while trying to challenge the limits of WDDM, I encountered another subtle hidden cost regarding kernel arguments.