TF32 GEMM sample very slow compared to generic GEMM

On a 3090, the tf32TensorCoreGemm sample from cuda-samples runs far slower than the other tensor core GEMM samples.

Something must be wrong with the example somewhere, right? I'm citing the throughput figures from https://arxiv.org/pdf/2206.02874.pdf for comparison.

Edit: the TFLOPS figure is much lower than it should be for the tf32 example. Output from the four samples:

tf32TensorCoreGemm:

Initializing...
GPU Device 0: "Ampere" with compute capability 8.6

M: 8192 (16 x 512)
N: 8192 (16 x 512)
K: 4096 (8 x 512)
Preparing data for GPU...
Required shared memory size: 72 Kb
Computing using high performance kernel = 0 - compute_tf32gemm_async_copy
Time: 214.088699 ms
TFLOPS: 2.57

cudaTensorCoreGemm:

Initializing...
GPU Device 0: "Ampere" with compute capability 8.6

M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm 
Time: 2.307072 ms
TFLOPS: 59.57

immaTensorCoreGemm:

Initializing...
GPU Device 0: "Ampere" with compute capability 8.6

M: 4096 (16 x 256)
N: 4096 (16 x 256)
K: 4096 (16 x 256)
Preparing data for GPU...
Required shared memory size: 64 Kb
Computing... using high performance kernel compute_gemm_imma 
Time: 1.036288 ms
TOPS: 132.63

bf16TensorCoreGemm:

Initializing...
GPU Device 0: "Ampere" with compute capability 8.6

M: 8192 (16 x 512)
N: 8192 (16 x 512)
K: 8192 (16 x 512)
Preparing data for GPU...
Required shared memory size: 72 Kb
Computing using high performance kernel = 0 - compute_bf16gemm_async_copy
Time: 20.146175 ms
TFLOPS: 54.58
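(Sanity check: the reported figures are consistent with the printed problem sizes and times, e.g. 2 × 8192 × 8192 × 4096 FLOP ÷ 214 ms ≈ 2.57 TFLOPS, so the TFLOPS calculation itself looks fine and the tf32 kernel really is running that slowly.)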

Some pointers as to which of these numbers you consider wrong, or in conflict with each other, would go a long way toward orienting readers of this thread.

I figured repeating the subject line in the post body would be a waste; I've edited the original post instead.

I haven’t read the paper thoroughly/carefully. I imagine the code base you’re referring to is identified in the paper but I didn’t see an actual reference to it. (Yes, I can see they say "We provide … " but I didn’t see anything beyond that. Probably I missed it, although searching for “github” in the paper didn’t turn anything relevant up.)

I don’t know if there is a problem in the codebase or not. If it were me, and I were interested in TF32 throughput benchmarking on a 3090, I would just write a small program around a single cuBLAS call; it’s probably also possible to use CUTLASS. I don’t have a 3090, and I don’t imagine that the 3090 is actually somehow crippled in TF32 performance, or that it fails to meet its published specifications.

The peak TF32 (non-sparse) throughput for the 3090 seems to be ~35 TFLOPS.
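For what it's worth, a minimal sketch of the single-cuBLAS-call benchmark I have in mind might look something like this (untested; it just times one cublasSgemm on uninitialized device buffers with the TF32 math mode enabled):

```
// Minimal TF32 GEMM timing sketch (untested). Needs CUDA 11+ / cuBLAS with
// CUBLAS_TF32_TENSOR_OP_MATH. Buffers are left uninitialized, which is fine
// for a pure timing run.
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int M = 8192, N = 8192, K = 4096;   // same shape as the tf32 sample
    float *A, *B, *C;
    cudaMalloc((void**)&A, sizeof(float) * M * K);
    cudaMalloc((void**)&B, sizeof(float) * K * N);
    cudaMalloc((void**)&C, sizeof(float) * M * N);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Allow cuBLAS to use TF32 tensor cores for FP32 GEMM.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

    const float alpha = 1.0f, beta = 0.0f;
    // Warm-up so the timed run excludes one-time setup costs.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                &alpha, A, M, B, K, &beta, C, M);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                &alpha, A, M, B, K, &beta, C, M);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double tflops = 2.0 * M * N * K / (ms * 1e-3) / 1e12;
    printf("Time: %f ms, TFLOPS: %.2f\n", ms, tflops);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Compile with something like nvcc tf32_bench.cu -lcublas and compare the result against the sample's number.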

Oh, I’m very sorry; I only linked to the paper because they were able to get much higher TF32 throughput on a 3070 Ti than the ~3 TFLOPS of the cuda-samples TF32 run. The point was just that the expected throughput drop on a 3070 Ti is from 257.7 ops/clk/SM at FP16 to 126 ops/clk/SM at TF32, so I was guessing the low number couldn’t be explained by crippled performance or a lack of tensor core support for TF32.

I was curious whether anyone had a guess as to why the TF32 cuda-sample is so much slower than all of the rest; it seems like there must be some critical problem for the number to come out that low.

Since you mentioned CUTLASS: I’m trying to run a very wide, forward-only convolutional network with the random numbers for the filters generated on demand. It seems Philox (via curand_normal4) only generates random numbers at about twice the rate of global memory. Do you think it’s worth keeping the generated filters in shared memory using custom wmma code, or would CUTLASS work for this use case? Or is this compute-bound enough that it might instead be worth filling an array with random numbers in a kernel and using cuDNN? Very curious what your intuition might be!
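To make that last option concrete, this is roughly what I mean by filling an array with random numbers in a kernel first (a rough, untested sketch): each thread runs its own Philox stream and writes normals with curand_normal4, and the resulting buffer would then be handed to cuDNN as the filter weights.

```
// Rough, untested sketch of pre-generating N(0,1) filter weights with Philox.
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void fill_normal(float4 *out, size_t n4, unsigned long long seed) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;

    curandStatePhilox4_32_10_t state;
    // One subsequence per thread keeps the streams independent.
    curand_init(seed, /*subsequence=*/i, /*offset=*/0, &state);

    // Grid-stride loop: each curand_normal4 call yields four normal samples.
    for (size_t j = i; j < n4; j += stride)
        out[j] = curand_normal4(&state);
}

int main() {
    const size_t n = 1 << 24;              // number of floats, multiple of 4
    float4 *d_filters;
    cudaMalloc((void**)&d_filters, n * sizeof(float));
    fill_normal<<<256, 256>>>(d_filters, n / 4, 1234ULL);
    cudaDeviceSynchronize();
    // ... hand d_filters to the convolution (e.g. cuDNN) as weights ...
    cudaFree(d_filters);
    return 0;
}
```

The obvious downside is the extra global-memory round trip for the weights, which is exactly the bandwidth cost that generating on the fly was meant to avoid.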

I profiled both the bf16 and the tf32 cuda-samples examples:

bf16TensorCoreGemm: [profiler screenshot]

tf32TensorCoreGemm: [profiler screenshot]

I guess it all makes sense now.