GPU CUDA cores or Tensor cores

Hi,
I’m using a 3x512x512 input tensor for a convolution layer.

When the kernel is 7x7, it runs at ~2.5ms and nvprof shows this:
GPU activities: 100.00% 26.878ms 10 2.6878ms 2.5754ms 2.9829ms void fused::fusedConvolutionReluKernel<fused::SrcChwcPtr_FltTex_Reader<float, int=1, int=1, int=1>, fused::KpqkPtrWriter<float, int=1, int=1>, float, float, int=3, int=4, int=1, int=7, int=7, int=1, int=1>(fused::ConvolutionParams<floatSrcType, int=1, int=1Type>)

When the kernel is 5x5, it runs at ~1.2ms and nvprof shows this:
GPU activities: 100.00% 11.011ms 10 1.1011ms 1.0990ms 1.1058ms trt_volta_scudnn_128x32_relu_small_nn_v1

  • How do I know which implementation runs on the CUDA cores and which on Tensor cores?

  • Why the different implementation paths? Does it matter, or does scudnn mean it runs on the CUDA cores (slower on Xavier) and fusedConvolutionReluKernel on the Tensor cores?

    I understand that 5x5 should run faster. I’m mostly baffled as to how to know I’m using the Tensor cores and not the slower CUDA cores.

thanks
Eyal

Hi,

1. Tensor Cores support two types of operations: HMMA and IMMA.
https://docs.nvidia.com/cuda/profiler-users-guide/index.html#metrics-reference-7x

Please profile it with nvprof and check the following metrics:

tensor_precision_fu_utilization : The utilization level of the multiprocessor function units that execute tensor core instructions, on a scale of 0 to 10. (HMMA)
tensor_int_fu_utilization : The utilization level of the multiprocessor function units that execute tensor core int8 instructions, on a scale of 0 to 10. This metric is only available for devices with compute capability 7.2. (IMMA)
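A sketch of how this check could look in practice. The application name `./my_app` and the sample metric output below are illustrative assumptions, not captured from a real run; any utilization level above 0 indicates the kernel issued Tensor Core instructions:

```shell
# Collect the Tensor Core utilization metrics with nvprof (hypothetical app name):
#   nvprof --metrics tensor_precision_fu_utilization,tensor_int_fu_utilization ./my_app
#
# Illustrative sample of the metric lines nvprof might report:
sample_output="tensor_precision_fu_utilization  Low (3)
tensor_int_fu_utilization        Idle (0)"

# If either metric reports a level above 0, the kernel used Tensor Cores.
level=$(printf '%s\n' "$sample_output" | grep -o '([0-9]*)' | tr -d '()' | sort -n | tail -1)
if [ "$level" -gt 0 ]; then
  echo "Tensor Cores in use (utilization level $level)"
else
  echo "CUDA cores only"
fi
```

A level of 0 on both metrics for a kernel means it ran entirely on the regular CUDA cores.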

2. In general, TensorRT or cuDNN will automatically select an optimized implementation for you.
This selection takes the GPU layout and capacity into consideration, so different input sizes may lead to different algorithms.

Thanks.