Programming Efficiently with the NVIDIA CUDA 11.3 Compiler Toolchain

Originally published at: Programming Efficiently with the NVIDIA CUDA 11.3 Compiler Toolchain | NVIDIA Technical Blog

The CUDA 11.3 release of the CUDA C++ compiler toolchain incorporates new features aimed at improving developer productivity and code performance. NVIDIA is introducing cu++filt, a standalone demangler tool that allows you to decode mangled function names to aid source code correlation. Starting with this release, the NVRTC shared library versioning scheme is relaxed to…
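For readers unfamiliar with demangling, here is a minimal sketch of the idea in plain C++. It does not use the new tool itself. It assumes a GCC or Clang host toolchain and relies on `abi::__cxa_demangle` from `<cxxabi.h>`. The helper name `demangle` is hypothetical and only for illustration.

```cpp
#include <cxxabi.h>   // abi::__cxa_demangle (GCC/Clang, Itanium C++ ABI)
#include <cstdlib>    // std::free
#include <string>

// Hypothetical helper, not part of the CUDA toolkit: decodes an
// Itanium-ABI mangled symbol into a readable signature.
// If the name cannot be demangled, the input is returned unchanged.
std::string demangle(const char* mangled) {
    int status = 0;
    // With a null output buffer, __cxa_demangle allocates the result
    // with malloc, so the caller must release it with free.
    char* out = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || out == nullptr) {
        return mangled;
    }
    std::string result(out);
    std::free(out);
    return result;
}
```

For example, `demangle("_Z6kernelPfi")` should give `kernel(float*, int)`, which is the kind of correlation between a symbol in a binary or profile and the function in the source that a demangler provides.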

CUDA 11.3 significantly improves the performance of Ampere/Turing/Volta Tensor Core kernels.

298 TFLOPS was recorded on an A100 when benchmarking the FP16 GEMM from CUTLASS, an open source CUDA DL/HPC library (GitHub - NVIDIA/cutlass: CUDA Templates for Linear Algebra Subroutines). This is 14% higher than with CUDA 11.2. FP32 (via TF32) GEMM improves by 39% and can reach 143 TFLOPS. The same speedup applies to the CONV kernels.

Also, see the discussion here: CUDA 11.3 significantly improved the performance of CUTLASS · Discussion #241 · NVIDIA/cutlass · GitHub

How do I use the toolkit to build CUDA for my custom Yocto-built x86-64 OS, with support for an NVIDIA GPU card, using my Ubuntu x86-64 host system?
Thanks.