Bfloat16 has worse performance than float16 for conv2d in PyTorch

Hi,

I just compared the performance of my PyTorch model with different parameter data types, and I found that bfloat16 gives worse performance than float16. Is this expected? Here is my experimental setup.

stable-ubuntu2004
Python version: 3.8.13
Torch version: 1.11.0
OFED: 5.4.3.0.3.0
32 A100 GPUs
Cuda: V11.3.109
NCCL version: 2.10.3

For further investigation, I used a simple model to compare the performance of bfloat16 and float16 (this job used 8 A100 GPUs with NCCL version 2.8.4; the other settings are the same as above).

Exp:

import time
import torch
import torch.nn as nn

# dtype under test; the comparison was run with float16, bfloat16, and float32
dtype = torch.bfloat16

input = torch.randn(20, 16, 500, 1000, device="cuda", dtype=dtype)
m = nn.Conv2d(16, 33, 3, stride=2).to(device="cuda", dtype=dtype)

torch.cuda.synchronize()
t = time.time()

for _ in range(1000):
    output = m(input)
torch.cuda.synchronize()

time_elapse = time.time() - t
print(f"time_elapse = {time_elapse}")

For this experiment, the execution time for float16 / bfloat16 / float32 was 2.1 / 3.8 / 3.2 s. It seems that for conv2d, the performance of bfloat16 was even worse than float32. Is this expected?
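A wall-clock comparison like this can be skewed by one-time costs (cuDNN algorithm selection, kernel loading on the first calls). Below is a minimal sketch of a per-dtype timing loop with warm-up iterations and CUDA events, assuming the same Conv2d shapes as above; bench_conv is just an illustrative helper, not part of the original experiment.

import torch
import torch.nn as nn

def bench_conv(dtype, iters=1000, warmup=50):
    # illustrative helper; same shapes as the experiment above
    x = torch.randn(20, 16, 500, 1000, device="cuda", dtype=dtype)
    m = nn.Conv2d(16, 33, 3, stride=2).to(device="cuda", dtype=dtype)
    for _ in range(warmup):
        m(x)  # warm-up: cuDNN algorithm selection, kernel loading
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        m(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / 1000.0  # elapsed_time is in milliseconds

for dt in (torch.float32, torch.float16, torch.bfloat16):
    print(dt, f"{bench_conv(dt):.3f} s")

Using CUDA events avoids counting host-side overhead, and the warm-up keeps the first-iteration costs out of the measurement.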

On the other hand, I profiled my experiment using NVIDIA Nsight Systems. The diagram showed that different kernels were applied for float16 and bfloat16. Did I run the model with bfloat16 correctly? Is bfloat16 supported by kernels similar to those used for float16?

bfloat16 profiling result

float16 profiling result

It’s expected that the kernels would probably be different. The types are not interchangeable. But I wouldn’t be able to explain all the differences without studying the torch source code. NVIDIA doesn’t develop, support, or maintain torch.

For non-tensor-core ops (perhaps the "elementwise_kernel" entries), the A100 (e.g., Table 1) has twice the FP16 throughput compared to BF16, so I think it's possible that in some cases FP16 might have somewhat higher performance than BF16. I don't think that explains what you're seeing, however.
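As a rough way to look at the non-tensor-core case in isolation, you could time an elementwise op in both dtypes (the sizes below are arbitrary). Note that most elementwise kernels are memory-bandwidth bound, so the 2x peak-rate gap between FP16 and BF16 may not show up in practice.

import torch

def bench_elementwise(dtype, iters=200):
    # arbitrary sizes, just for a rough per-dtype comparison
    x = torch.randn(1 << 24, device="cuda", dtype=dtype)
    y = torch.randn(1 << 24, device="cuda", dtype=dtype)
    for _ in range(10):
        x * y + x  # warm-up
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        x * y + x
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)  # milliseconds

for dt in (torch.float16, torch.bfloat16):
    print(dt, f"{bench_elementwise(dt):.1f} ms")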

If you’d like support for torch, one option is the pytorch discussion forum. You can also ask framework questions (including about pytorch) on the frameworks forum.

pytorch is likely doing the conv2d op using cudnn (for the CUDA backend, anyway). If you want to develop a standalone cudnn test case, you could ask about it on the cudnn forum.
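Before building a standalone test, you can at least confirm from the PyTorch side that the cuDNN backend is active and which version is linked (a quick check, not a full reproduction):

import torch

print(torch.backends.cudnn.enabled)    # True if the cuDNN backend is enabled
print(torch.backends.cudnn.version())  # linked cuDNN version, e.g. 8200 for 8.2.x
# cuDNN autotuning can change which convolution algorithm (and kernel) gets picked:
torch.backends.cudnn.benchmark = True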


How can we tell whether an op is a tensor-core op or a non-tensor-core op?

Here are more details from the profiling results.

bfloat16

float16

  1. The naming of kernels provided by NVIDIA libraries (e.g. cuBLAS) often follows a convention when Tensor Cores (TC) are used: look for "mma" in the name (a profiler-based sketch is shown after this list).

  2. The profiler (Nsight Compute) can identify TC ops. You can profile the app kernel by kernel in Nsight Compute and determine, on a kernel-by-kernel basis, whether TC instructions were executed as part of the kernel.

  3. Using the kernel name, you could run a binary inspection utility (cuobjdump -sass ...) on the library that holds that kernel and check whether there are Tensor Core instructions (again, with "mma" in them) in the SASS dump.
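To apply the naming heuristic from point 1 without opening Nsight, torch.profiler can list the CUDA kernel names directly from Python. A small sketch follows; the "mma" substring check is only a heuristic based on the naming convention, not a guarantee.

import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

x = torch.randn(20, 16, 500, 1000, device="cuda", dtype=torch.bfloat16)
m = nn.Conv2d(16, 33, 3, stride=2).to(device="cuda", dtype=torch.bfloat16)
m(x)  # warm-up so one-time init kernels don't clutter the trace

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    m(x)

for evt in prof.key_averages():
    # cuDNN/cuBLAS kernels that use Tensor Cores usually have "mma"/"xmma" in the name
    print(evt.key, "mma" in evt.key.lower())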