TF32 GEMM sample very slow compared to generic GEMM

On a 3090, the tf32TensorCoreGemm sample from cuda-samples runs far slower than the other tensor core GEMM samples.

Something must be wrong with the example somewhere, right? I'm citing the throughput figures from https://arxiv.org/pdf/2206.02874.pdf for comparison.

Edit: the TFLOPS figure is much lower than it should be for the tf32 example. Output from the four samples:

tf32TensorCoreGemm:

Initializing...
GPU Device 0: "Ampere" with compute capability 8.6

M: 8192 (16 x 512)
N: 8192 (16 x 512)
K: 4096 (8 x 512)
Preparing data for GPU...
Required shared memory size: 72 Kb
Computing using high performance kernel = 0 - compute_tf32gemm_async_copy
Time: 214.088699 ms
TFLOPS: 2.57

cudaTensorCoreGemm:

Initializing...
GPU Device 0: "Ampere" with compute capability 8.6

M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm 
Time: 2.307072 ms
TFLOPS: 59.57

immaTensorCoreGemm:

Initializing...
GPU Device 0: "Ampere" with compute capability 8.6

M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm_imma 
Time: 1.036288 ms
TOPS: 132.63

bf16TensorCoreGemm:

Initializing...
GPU Device 0: "Ampere" with compute capability 8.6

M: 8192 (16 x 512)
N: 8192 (16 x 512)
K: 8192 (16 x 512)
Preparing data for GPU...
Required shared memory size: 72 Kb
Computing using high performance kernel = 0 - compute_bf16gemm_async_copy
Time: 20.146175 ms
TFLOPS: 54.58
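(Sanity check: the reported figures are consistent with the printed problem sizes and times, e.g. 2 × 8192 × 8192 × 4096 FLOP ÷ 214 ms ≈ 2.57 TFLOPS, so the TFLOPS calculation itself looks fine and the tf32 kernel really is running that slowly.)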

Some pointers as to which of these numbers you consider wrong, or in conflict with each other, would go a long way toward orienting readers of this thread.

I figured repeating the subject line in the post body would be a waste; I've edited the original post instead.

I haven’t read the paper thoroughly/carefully. I imagine the code base you’re referring to is identified in the paper but I didn’t see an actual reference to it. (Yes, I can see they say "We provide … " but I didn’t see anything beyond that. Probably I missed it, although searching for “github” in the paper didn’t turn anything relevant up.)

I don’t know if there is a problem in the codebase or not. If it were me, and I were interested in TF32 throughput benchmarking on a 3090, I would just write a small program around a single cuBLAS call; it’s probably also possible to use CUTLASS. I don’t have a 3090, and I don’t imagine that the 3090 is actually somehow crippled in TF32 performance, or that it fails to meet its published specifications.

The peak TF32 (non-sparse) throughput for the 3090 seems to be ~35 TFLOPS.
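For what it's worth, a minimal sketch of the single-cuBLAS-call benchmark I have in mind might look something like this (untested; it just times one cublasSgemm on uninitialized device buffers with the TF32 math mode enabled):

```
// Minimal TF32 GEMM timing sketch (untested). Needs CUDA 11+ / cuBLAS with
// CUBLAS_TF32_TENSOR_OP_MATH. Buffers are left uninitialized, which is fine
// for a pure timing run.
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int M = 8192, N = 8192, K = 4096;   // same shape as the tf32 sample
    float *A, *B, *C;
    cudaMalloc((void**)&A, sizeof(float) * M * K);
    cudaMalloc((void**)&B, sizeof(float) * K * N);
    cudaMalloc((void**)&C, sizeof(float) * M * N);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Allow cuBLAS to use TF32 tensor cores for FP32 GEMM.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

    const float alpha = 1.0f, beta = 0.0f;
    // Warm-up so the timed run excludes one-time setup costs.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                &alpha, A, M, B, K, &beta, C, M);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                &alpha, A, M, B, K, &beta, C, M);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double tflops = 2.0 * M * N * K / (ms * 1e-3) / 1e12;
    printf("Time: %f ms, TFLOPS: %.2f\n", ms, tflops);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Compile with something like nvcc tf32_bench.cu -lcublas and compare the result against the sample's number.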

Oh, I’m very sorry; I only linked to the paper because they were able to get much higher TF32 throughput on a 3070 Ti than the ~3 TFLOPS of the cuda-samples TF32 run. The point was just that the expected throughput drop on a 3070 Ti is from 257.7 ops/clk/SM at FP16 to 126 ops/clk/SM at TF32, so I was guessing the low number couldn’t be explained by crippled performance or a lack of tensor core support for TF32.

I was curious whether anyone had a guess as to why the TF32 cuda-sample is so much slower than all of the rest; it seems like there must be some critical problem for the number to come out that low.

Since you mentioned CUTLASS: I’m trying to run a very wide, forward-only convolutional network with the random numbers for the filters generated on demand. It seems Philox (via curand_normal4) only generates random numbers at about twice the rate of global memory. Do you think it’s worth keeping the generated filters in shared memory using custom wmma code, or would CUTLASS work for this use case? Or is this compute-bound enough that it might instead be worth filling an array with random numbers in a kernel and using cuDNN? Very curious what your intuition might be!
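To make that last option concrete, this is roughly what I mean by filling an array with random numbers in a kernel first (a rough, untested sketch): each thread runs its own Philox stream and writes normals with curand_normal4, and the resulting buffer would then be handed to cuDNN as the filter weights.

```
// Rough, untested sketch of pre-generating N(0,1) filter weights with Philox.
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void fill_normal(float4 *out, size_t n4, unsigned long long seed) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;

    curandStatePhilox4_32_10_t state;
    // One subsequence per thread keeps the streams independent.
    curand_init(seed, /*subsequence=*/i, /*offset=*/0, &state);

    // Grid-stride loop: each curand_normal4 call yields four normal samples.
    for (size_t j = i; j < n4; j += stride)
        out[j] = curand_normal4(&state);
}

int main() {
    const size_t n = 1 << 24;              // number of floats, multiple of 4
    float4 *d_filters;
    cudaMalloc((void**)&d_filters, n * sizeof(float));
    fill_normal<<<256, 256>>>(d_filters, n / 4, 1234ULL);
    cudaDeviceSynchronize();
    // ... hand d_filters to the convolution (e.g. cuDNN) as weights ...
    cudaFree(d_filters);
    return 0;
}
```

The obvious downside is the extra global-memory round trip for the weights, which is exactly the bandwidth cost that generating on the fly was meant to avoid.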

I profiled both the bf16 and the tf32 cuda-samples examples:

bf16TensorCoreGemm: [profiler screenshot]

tf32TensorCoreGemm: [profiler screenshot]

I guess it all makes sense now.