Concurrent execution of CUDA and Tensor cores

“CUDA core” is a marketing term. It is basically an execution unit performing single-precision fused multiply-adds. A GPU also has double-precision execution units, various integer execution units, multifunction units, and (another marketing name) Tensor cores. This is not dissimilar from the execution units of various kinds in CPUs. Due to limited issue bandwidth, on both CPUs and GPUs, the execution units are never all 100% busy at the same time.

I don’t know exactly how nvidia-smi measures utilization, but presumably it reports 100% busy if instructions (of any kind, targeting any execution unit) are issuing in every single cycle. Whether that is a true aggregate across all warp schedulers or just sampled, I do not know.

Post scriptum: I notice belatedly that Robert Crovella has already pointed to an answer on Stack Overflow regarding nvidia-smi’s utilization metric.
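
For reference, the NVML documentation describes the GPU utilization figure as the percent of time over the past sample period during which one or more kernels was executing on the GPU. Below is a minimal Python sketch of that definition. The function name `time_based_utilization` and the kernel time intervals are made up for illustration; this only models the documented definition and says nothing about how the driver actually collects the data.

```python
def time_based_utilization(kernel_intervals, window_start, window_end):
    """Percent of the sample window during which at least one kernel was running.

    kernel_intervals: list of (start, end) times of kernel executions (hypothetical data).
    """
    # Keep only intervals that overlap the window, clipped to the window bounds.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in kernel_intervals
        if end > window_start and start < window_end
    )
    busy = 0.0
    cursor = window_start  # end of the time already counted as busy
    for start, end in clipped:
        start = max(start, cursor)  # do not count overlapping kernels twice
        if end > start:
            busy += end - start
            cursor = end
    return 100.0 * busy / (window_end - window_start)


# Hypothetical one-second sample window.
# Back-to-back kernels, no matter how few threads each one uses:
print(time_based_utilization([(0.0, 0.5), (0.5, 1.0)], 0.0, 1.0))    # 100.0
# Kernels running only half of the time:
print(time_based_utilization([(0.0, 0.25), (0.5, 0.75)], 0.0, 1.0))  # 50.0
```

Note that nothing in this model looks at how many execution units a kernel occupies: only whether some kernel was resident during each part of the window.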

That is scary if true. If I run a small, repeated, continuously fed multiply of a single FP scalar value that only uses one of the 16384 cores, is it really true that it’ll show the entire GPU as being 100% busy?

The analogy with AVX isn’t valid. A single processor core has many “execution units”: ALUs, floating-point units, fetchers, etc. If ANY of those are executing user code over the full measurement interval, then THAT SINGLE CORE is deemed to be 100% busy. However, the OS doesn’t say that the entire “SYSTEM” or a single multi-core socket is busy. That is not how it works. I should note that 100% busy on a single core doesn’t mean 100% efficient. Performance guys know that you need to try to keep as many of the component execution units in the core as busy as possible. Compilers try to do this.

I’m still new, but I think I’ve heard that there are smaller blocks of CUDA processors that work together. It may be that you can’t tell the difference between one core in a block being busy vs. the entire block. I have no problem with that. But one ?block? is not the entire set of 16384 cores. I’m NOT going to argue about whether the multiplier and the adder are both active for a non-fused operation. You shouldn’t argue about AVX not being in use. They are part of the SAME core.

If I had a way to guarantee my FP operation was being done on a Tensor core, I could check how it works. But using Tensor cores appears to require more learning than the trivial usage of CUDA cores. I will learn this, but I have a long list to learn… Perhaps my next experiment is to create a medium-sized pair of matrices and multiply them in a loop. Medium-sized so they don’t use all the CUDA cores. Then we will see if it says 100% busy, or 50% busy if I size the arrays correctly.

Thanks, this is what I would assume based on my similar knowledge of CPUs. But these execution units are per core, or perhaps per ?block? of cores. One core or block of cores being busy, even with idle execution units, is still a busy core or block. If there were 1024 blocks, each with 16 cores, and only one block was busy, then roughly 0.001 busy (about 0.1%) should be reported. That is how it works for Linux and regular CPUs. One busy core, no matter how efficiently it is used, is only one out of many, and the total busyness of the GPU should be shown appropriately.
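
The arithmetic behind that expectation can be sketched in a few lines of Python. This assumes the hypothetical layout from the post (1024 blocks of 16 cores, which is not a description of any real GPU), and the helper name `aggregate_busy_fraction` is invented for illustration. It models CPU-style accounting, where per-unit busy fractions are averaged into a system-wide total.

```python
def aggregate_busy_fraction(busy_fraction_per_block):
    """Average per-block busy fractions into one system-wide figure,
    similar to how per-core CPU usage is averaged into a total."""
    return sum(busy_fraction_per_block) / len(busy_fraction_per_block)


NUM_BLOCKS = 1024  # hypothetical block count from the post (1024 * 16 = 16384 cores)

blocks = [0.0] * NUM_BLOCKS
blocks[0] = 1.0    # exactly one block is busy for the whole interval

print(aggregate_busy_fraction(blocks))  # 0.0009765625, i.e. about 0.1%
```

Under this kind of accounting, one fully busy block out of 1024 contributes 1/1024 to the total, regardless of how many execution units inside that block are actually doing work.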

Thanks! Experiments are good. I’ll take a look.

Everybody learns differently, but I always advocate hands-on exploration to gain a more intuitive understanding of a hardware platform.

I think it is also helpful to be able to read the machine code (SASS) produced by the compiler (cuobjdump --dump-sass). While NVIDIA provides a bare minimum of documentation for the hardware instructions, much of it should be understandable for someone previously exposed to low-level programming.

While it is possible to write fairly well optimized code without the profiler (after all, CUDA programmers did just that for several years before the profiler became available in 2012 or so), getting acquainted with the CUDA profiler earlier rather than later is highly recommended. The profiler output is not always trivial to interpret (true for all profilers in my experience; I recall my struggles with VTune). For questions in that regard there is a dedicated sub-forum for profiler issues: Visual Profiler and nvprof - NVIDIA Developer Forums

I am like you, but at the beginning of my GPU/DNN/AI journey. I spent 40+ years on the more conventional side of things and am now retired. 100% of my effort now goes into learning NVIDIA hardware and NN/AI. Thanks for the help.

I have been retired for a number of years after previously being on the team that created CUDA. I did processor design, low-level programming, and software optimization before that, and the switch to massively parallel programming was challenging for me: my first exposure to that was being tasked with the creation of the initial version of CUBLAS.

I am not sure the human brain is generally well-suited to mentally mapping thousands of threads of execution to thousands of pieces of data, but CUDA is more intuitive than alternative approaches, I would claim. Good luck!

I think this is of course very important and valuable… To make maximum use of GPU resources, it is important to let the Tensor cores and the CUDA cores work together. If only the Tensor cores are working while the CUDA cores are resting, that is a great pity…