Which GPU for best performance with TCC and CUDA cores (no tensors)

Are all those kernels necessary to be separate kernels? Are all those kernels dependent on the results of each other?

Yes, this is a complex iterative numerical algorithm with many steps.
I expected that CUDA graphs could help, but they did not make it run faster.
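For context, the usual way to try this is to capture the per-iteration kernel sequence with stream capture and replay it with a single cudaGraphLaunch; a minimal sketch (kernelA/kernelB and their arguments are placeholders, not the actual pipeline):

cudaGraph_t graph;
cudaGraphExec_t graphExec;

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
kernelA<<<gridA, blockA, 0, stream>>>(/* args */);
kernelB<<<gridB, blockB, 0, stream>>>(/* args */);
// ... rest of the per-iteration kernel sequence ...
cudaStreamEndCapture(stream, &graph);
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

// replay the whole captured sequence with one launch per iteration
for (int i = 0; i < numIterations; ++i)
    cudaGraphLaunch(graphExec, stream);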

It happens to be a very good test case to exhibit the problem of WDDM vs TCC.

Because of the general issue of kernel launch overhead (this overhead has shrunk very slowly over the past 15 years), a long-standing general recommendation for CUDA programmers is to avoid kernels with extremely short execution times.

Have you already explored ways of stuffing more work into each kernel, e.g. by exposing additional parallelism at the expense of an increased count of total floating-point operations? The possibility of such trade-offs obviously does not apply to all iterative algorithms.

An orthogonal approach to increase performance independent of the driver used would investigate ways of reducing the number of steps required in the iterative algorithm, e.g. via some sort of convergence acceleration scheme, or changes to the floating-point precision used in intermediate steps. Again, this is applicable to some iterative algorithms and not others.

Does the CPU have to do intermediate steps between the kernel runs?

If not, and you use the kernel launches for grid-wide synchronization, you could try to launch a cooperative kernel instead and do grid-wide synchronization on the device.
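A minimal sketch of that idea with cooperative groups, assuming the whole per-iteration state lives in device memory (iterateKernel and its parameters are only illustrative):

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// one cooperative launch replaces many small per-step launches;
// grid.sync() provides the grid-wide barrier on the device
__global__ void iterateKernel(float* data, int n, int numSteps)
{
    cg::grid_group grid = cg::this_grid();
    for (int step = 0; step < numSteps; ++step) {
        // ... process this thread's portion of 'data' for one step ...
        grid.sync();
    }
}

// host side: must use cudaLaunchCooperativeKernel, and the whole grid
// has to be resident on the device at once (co-residency requirement)
void* args[] = { &data, &n, &numSteps };
cudaLaunchCooperativeKernel((void*)iterateKernel, dimGrid, dimBlock, args, 0, stream);

The limitation is that all steps share one grid/block configuration, and the grid size is capped by what fits on the device concurrently.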


Do all threads in the kernel need the result of the previous iteration of all other threads? Or is it only a local dependence?

If the dependence is only local, you can work on overlapping areas and synchronize only locally (e.g. block-/SM-wide).

I am very familiar with the topic. The current algorithm is already the result of extensive profiling and of merging the kernels that could benefit from it. I think I cannot optimize it further.

Same answer: the kernel pipeline is optimal (AFAIK) and uses as much parallelism as possible.

Now, don't be fooled by the short execution times that I exhibit. Don't forget that they occur when my input data is light. For a "regular" input data stream, I can't complain at all about performance under WDDM (kernel launch times are then negligible compared with processing times).
My grief against WDDM appeared when I observed what happens when the input data stream results in smaller packets of data to process.

In that case, there is clearly a bottleneck that is only due to WDDM. And it is an "artificial" limitation, since switching to TCC fixes it.

That's why I think that the real topic is not my algorithm, but the runtime+driver system, which is not optimal compared to what it could be, if only NVidia released a TCC mode for GeForce.

The history of the TCC driver strongly indicates that this is not going to happen. We need to deal with reality.

A kernel that takes a performance hit from WDDM also takes a performance hit from kernel launch overhead in general, and increasingly so as faster GPUs are deployed in the future. That was the motivation behind my questions: are there steps that could be taken to reduce this code's general exposure to the launch latency issue?

Duly noted :-) The famous meme "on the internet nobody knows that you are a dog" has a corollary: On the internet nobody knows that you are an expert.

Yes, that's why I had hopes for MCDM and WDDM low-latency (yes, I am artificially increasing the number of links to that thread just in case somebody at NVidia notices it and takes the time to answer).

And you are right: there could have been some avenues that I did not even know I could explore. But that did not happen this time; it seems I am still bound to the WDDM limitation :-(

I might still try to create "super kernels" that just gather work from sub-kernels, but at the cost of more memory for the temporary buffers, which will have to be duplicated for some concurrent (rather than pipelined) sub-kernels. This is a solution I wanted to avoid because of a lot of template instantiation and (very) long compile times, which will become even more critical with those super kernels.
(And this is a LOT of work and glue code compared to the very few opportunities where it could take place.)

You can also file a bug report (enhancement request) with Nvidia asking for the WDDM low-latency drivers.

I just did (4976844). Thanks for reminding me of that possibility (I was not sure that it would not be overlooked).


A follow-up, just to add some interesting information.
As I said, I still had the opportunity to merge a few kernels, with something like this:

// original scheme: one kernel instance per launch (element types and the count of 4 are placeholders)
__device__ inline void _myKernel(float* dst, const float* src) { /* ... actual work ... */ }
__global__ void myKernel(float* dst, const float* src) { _myKernel(dst, src); }
myKernel<<<gridSize(W, H, 1), blockSize(w, h, 1), 0, stream>>>(dst, src);

// "super kernel": run a few instances (N <= 4) in parallel when the input allows it,
// using the grid.z dimension to select the instance; the arguments are passed by value
struct DstPtrs { float*       p[4]; };
struct SrcPtrs { const float* p[4]; };
__global__ void myMultiKernel(DstPtrs dsts, SrcPtrs srcs)
{ _myKernel(dsts.p[blockIdx.z], srcs.p[blockIdx.z]); }
myMultiKernel<<<gridSize(W, H, N), blockSize(w, h, 1), 0, stream>>>(dsts, srcs);

The cost is hidden in building the "dsts" and "srcs" arguments to gather buffers together, in using more buffers since temporary storage cannot be shared, and probably in less efficient L2 cache usage (not sure about that).

Since my goal is to prove that WDDM kernel launches have too much latency, I expected a small but visible improvement, because I was then able to save a few kernel launches.

The first result was terrible: my overall performance was cut in half! Not an improvement, a huge regression.
It seems that the extra bytes used by the "larger" arguments of my myMultiKernel instances, and the implied greater pressure on registers, were a performance killer.

Just to be sure, I replaced those "larger" structures (to be honest, arrays of 4 pointers instead of 1) with a __device__ __constant__ symbol to which I pre-cudaMemcpy the arguments before launching the kernel, thus reducing the number of argument bytes for the actual multiKernel launch.
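Schematically, that variant looks like this (a simplified sketch reusing the structs from above; the symbol names and the async copy call are illustrative, not the exact code):

__device__ __constant__ DstPtrs c_dsts;   // arguments now live in constant memory
__device__ __constant__ SrcPtrs c_srcs;

__global__ void myMultiKernel()           // replaces the parameter-heavy version above
{ _myKernel(c_dsts.p[blockIdx.z], c_srcs.p[blockIdx.z]); }

// host side: copy the gathered pointers into the constant symbols on the same
// stream, so the copies are ordered before the kernel launch
DstPtrs h_dsts = { /* gathered dst pointers */ };
SrcPtrs h_srcs = { /* gathered src pointers */ };
cudaMemcpyToSymbolAsync(c_dsts, &h_dsts, sizeof(h_dsts), 0, cudaMemcpyHostToDevice, stream);
cudaMemcpyToSymbolAsync(c_srcs, &h_srcs, sizeof(h_srcs), 0, cudaMemcpyHostToDevice, stream);
myMultiKernel<<<gridSize(W, H, N), blockSize(w, h, 1), 0, stream>>>();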

In that case, I get the expected behaviour: a small performance improvement compared to the non-multiKernel version.

So, while trying to challenge the limits of WDDM, I encountered another subtle hidden cost regarding kernel arguments.


That is good to keep in mind! As is your solution of using constant memory.