Hello, NVIDIA experts:
I want to test the performance of the TF32 tensor cores, so I created two tests with cuBLAS, enabling TF32 through this cuBLAS interface:
cublasSetMathMode(blas_handle, CUBLAS_TF32_TENSOR_OP_MATH)
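For reference, here is a minimal sketch of the kind of test I mean (the matrix size n = 4096, the single cublasSgemm call, and the omission of error checking and timing are simplifications, not my exact benchmark):

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Minimal TF32 GEMM sketch: C = alpha * A * B + beta * C, column-major.
int main() {
    const int n = 4096;  // illustrative size, not the benchmark's
    float *A, *B, *C;
    cudaMalloc(&A, sizeof(float) * n * n);
    cudaMalloc(&B, sizeof(float) * n * n);
    cudaMalloc(&C, sizeof(float) * n * n);

    cublasHandle_t blas_handle;
    cublasCreate(&blas_handle);
    // Allow cuBLAS to down-convert FP32 inputs to TF32 on the tensor cores.
    cublasSetMathMode(blas_handle, CUBLAS_TF32_TENSOR_OP_MATH);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(blas_handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);
    cudaDeviceSynchronize();  // wait before reading a timer

    cublasDestroy(blas_handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

In the real test, the GEMM would be repeated in a loop and timed (e.g. with CUDA events) to compute achieved TFLOPS as 2*n^3 / time.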
I ran the tests on two types of CUDA GPU: an RTX A4000 and an RTX 3090.
First, let's check the parameters of the RTX A4000:
Device 0: "NVIDIA RTX A4000"
CUDA Driver Version / Runtime Version 11.6 / 11.3
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 16109 MBytes (16891379712 bytes)
(48) Multiprocessors, (128) CUDA Cores/MP: 6144 CUDA Cores
GPU Max Clock rate: 1560 MHz (1.56 GHz)
Memory Clock rate: 7001 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
And then, let's check the test result:
Second, let's check the parameters of the RTX 3090:
Device 0: "NVIDIA GeForce RTX 3090"
CUDA Driver Version / Runtime Version 11.6 / 11.3
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 24268 MBytes (25447170048 bytes)
(82) Multiprocessors, (128) CUDA Cores/MP: 10496 CUDA Cores
GPU Max Clock rate: 1695 MHz (1.70 GHz)
Memory Clock rate: 9751 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
And then, let's check the test result:
An important assumption about the TF32 tensor cores in Ampere:
each SM holds 4 tensor cores, and each tensor core can execute 128 TF32 FMAs per cycle.
Based on the above tests and assumption, we can find the following patterns:
1) In CUDA-core utilization, the RTX 3090 is better than the RTX A4000;
2) In tensor-core utilization, the RTX A4000 is better than the RTX 3090;
3) Both the RTX 3090 and the RTX A4000 show poor tensor-core utilization.
So, I have the following questions:
1) Is my assumption about the TF32 tensor cores right?
2) Why is the RTX 3090 better than the RTX A4000 on the CUDA cores, while at the same time the RTX A4000 is better than the RTX 3090 on the tensor cores?
3) If my assumption is right, why are both the RTX 3090 and the RTX A4000 poor on the tensor cores?
Would anyone like to tell me the secret?