NVIDIA_TF32_OVERRIDE=0 not disabling TF32 in cublas

In spite of setting NVIDIA_TF32_OVERRIDE=0, I see the following.

I tensorflow/stream_executor/cuda/cuda_blas.cc:1760] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.

How should TF32 be disabled, then?
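For reference, this is how the override was being applied. A minimal sketch of the two usual ways to set it, assuming a TensorFlow script (`train.py` is a placeholder name, not from this thread):

```shell
# Disable TF32 math for a single run only.
NVIDIA_TF32_OVERRIDE=0 python train.py

# Or export it for the whole shell session, so every
# subsequent process inherits it.
export NVIDIA_TF32_OVERRIDE=0
python train.py
```

Note the variable must be set in the environment of the process that loads cuBLAS; setting it after the library has initialized has no effect.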

How do you know you’re executing on Tensor Cores?

You could try changing CUBLAS_TF32_TENSOR_OP_MATH here to CUBLAS_DEFAULT_MATH

Thanks for confirming that there is no way to turn off TF32 in CUBLAS without rebuilding TF.

That’s not exactly what I said.

I merely offered a suggestion to verify what you’re seeing.

I’m curious: how do you know NVIDIA_TF32_OVERRIDE=0 is not working? It’s possible there is a bug.
Can you provide a minimal reproducer?

Ahh, I see. Because the following is logged. Doesn’t that line imply TF32 is being used, or is it a spurious log?

“I tensorflow/stream_executor/cuda/cuda_blas.cc:1760] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.”

I think I understand the confusion now.
You’re seeing a runtime log that is triggered simply because the data type is float32.
Setting NVIDIA_TF32_OVERRIDE=0 doesn’t make that log message go away.
You need to profile your code with and without the override to see which kernels actually run.
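One way to do that comparison is with Nsight Systems; a sketch, assuming `nsys` is installed and `train.py` stands in for your own script:

```shell
# Capture a profile with TF32 allowed (the default)...
nsys profile -o tf32_on python train.py

# ...and one with the override disabling TF32.
NVIDIA_TF32_OVERRIDE=0 nsys profile -o tf32_off python train.py

# Summarize each report and compare the GPU kernel names/timings.
nsys stats tf32_on.nsys-rep
nsys stats tf32_off.nsys-rep
```

If the override is taking effect, the matmul kernels (and typically their runtimes) should differ between the two reports.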

You might try utilizing cuBLAS logging to see which kernels are being called.
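For reference, cuBLASLt logging is controlled by environment variables; a minimal sketch (the script name is a placeholder):

```shell
# Level 5 = API trace; log to a file instead of stdout.
export CUBLASLT_LOG_LEVEL=5
export CUBLASLT_LOG_FILE=cublaslt.log

NVIDIA_TF32_OVERRIDE=0 python train.py

# Look for TF32 in the logged heuristics/compute types.
grep -i tf32 cublaslt.log
```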

With CUBLASLT_LOG_LEVEL=5, I only see the following API calls with both NVIDIA_TF32_OVERRIDE=0 and NVIDIA_TF32_OVERRIDE=1:

[cublasLtCreate]
[cublasLtCtxInit]
[cublasLtSSSMatmulAlgoGetHeuristic]
[cublasLtSSSMatmul]

Both seem to use the following:

2022-01-12 03:29:14][cublasLt][1420][Api][cublasLtSSSMatmulAlgoGetHeuristic] Adesc=[type=R_32F rows=200 cols=200 ld=200] Bdesc=[type=R_32F rows=200 cols=128 ld=200] Cdesc=[type=R_32F rows=200 cols=128 ld=200] preference=[maxWavesCount=0.0 gaussianModeMask=3M_MODE_DISALLOWED pointerModeMask=0 maxWorkspaceSizeinBytes=4194304 minBytesAlignmentA=16 minBytesAlignmentB=16 minBytesAlignmentC=16 minBytesAlignmentD=16 smCountTarget=108] computeDesc=[computeType=COMPUTE_32F_FAST_TF32 scaleType=R_32F]

This might help: tf.config.experimental.enable_tensor_float_32_execution | TensorFlow Core v2.7.0