How to make tensor cores work?

NRJJJ · May 17, 2023, 1:22pm

Hi!
I’m trying to make my network use the Tensor Cores on my A100. To debug, I’ve created a simple model with a single convolutional layer.

import torch.nn as nn
import torch
import nvidia_dlprof_pytorch_nvtx
from triton.testing import do_bench
import contextlib

class TestNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(64, 512, kernel_size=(3,3), padding=1)

    def forward(self, x):
        return self.conv1(x)

def run_step(opt, model):
    # with torch.cuda.amp.autocast():
    opt.zero_grad()
    x = torch.randn(64,64,128,128, dtype=torch.float16, requires_grad=True).to(device)
    model = model.half()
    out = model(x)
    loss = out.sum()
    loss.backward()
    opt.step()


if __name__ == '__main__':
    device = 'cuda:0'
    do_dlprof = True
    do_benchmark = False
    do_profiler = False
    profiler_path = ...
    torch.backends.cudnn.benchmark = True
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    if do_dlprof:
        nvidia_dlprof_pytorch_nvtx.init()

    model = TestNet()
    opt = torch.optim.Adam(model.parameters())
    model = model.to(device)

    torch.manual_seed(123)
    dlprof_ctx = torch.autograd.profiler.emit_nvtx(enabled=do_dlprof)
    with dlprof_ctx:
        profiler_ctx = torch.profiler.profile(
            activities=[torch.profiler.ProfilerActivity.CPU,
                        torch.profiler.ProfilerActivity.CUDA],
            schedule=torch.profiler.schedule(skip_first=0, wait=0, warmup=5, active=10, repeat=1),
            on_trace_ready=torch.profiler.tensorboard_trace_handler(dir_name=profiler_path),
            record_shapes=True,
            profile_memory=False,
            with_stack=True,
        ) if do_profiler else contextlib.nullcontext()
        with profiler_ctx as p:
            for i in range(15):
                if do_benchmark:
                    do_bench(lambda: run_step(opt, model), warmup=10, rep=10)
                else:
                    run_step(opt, model)
                if do_profiler:
                    p.step()

I’ve already tried:

- making the dimensions of the input tensor and the in/out channels divisible by 8
- mixed precision / half precision
- switching to TF32

But no matter what I do, the DLProf profiler reports that Tensor Cores are not used for the convolution operation.
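The divisibility check from the list above can be sketched as a small helper. This is only an illustration: the function name `check_conv_channels` is hypothetical, and the multiple of 8 is taken from the post as a rule of thumb for FP16, not as a hard requirement of any particular cuDNN version.

```python
def check_conv_channels(in_channels: int, out_channels: int, multiple: int = 8) -> list:
    """Return the names of the channel arguments that are NOT multiples of `multiple`.

    Hypothetical helper: the default of 8 is the rule of thumb quoted in the post
    for FP16 convolutions, not a value read from the libraries.
    """
    offenders = []
    if in_channels % multiple != 0:
        offenders.append("in_channels")
    if out_channels % multiple != 0:
        offenders.append("out_channels")
    return offenders


# The layer from the post: nn.Conv2d(64, 512, kernel_size=(3, 3), padding=1)
print(check_conv_channels(64, 512))  # -> []
# A typical RGB input layer would be flagged:
print(check_conv_channels(3, 512))   # -> ['in_channels']
```

An empty list means the channel counts already satisfy the rule of thumb, which is the case for the layer in the script above.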

[Image: Xnip2023-05-17_21-19-52, 1920×980, 160 KB]

At the same time, the PyTorch profiler reports that Tensor Cores were used.

[Image: image (3), 2414×1378, 362 KB]

If it helps, the name of the convolution kernel used is ‘cutlass_tensorop_f16_s16816fprop_optimized_f16_128x128_32x3_nhwc’.
Why are the results of the two profilers inconsistent, and are the Tensor Cores actually being used?
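One rough way to cross-check a profiler's verdict is to look at the kernel name itself. The sketch below is a heuristic only: the substring list is an assumption (not an official NVIDIA list), it is not exhaustive, and a name match does not prove that Tensor Cores were actually used. The function name `looks_like_tensor_core_kernel` is hypothetical.

```python
# Assumed, non-exhaustive substrings that often appear in Tensor Core kernel names:
# "tensorop" (CUTLASS), "hmma"/"imma"/"xmma" (MMA instruction families), and
# instruction-shape tags such as "s16816" or "h884". This is a heuristic, not a spec.
TENSOR_CORE_HINTS = (
    "tensorop", "hmma", "imma", "xmma",
    "s16816", "h16816", "s1688", "h1688", "s884", "h884",
)


def looks_like_tensor_core_kernel(kernel_name: str) -> bool:
    """Return True if the kernel name contains one of the assumed hint substrings."""
    lowered = kernel_name.lower()
    return any(hint in lowered for hint in TENSOR_CORE_HINTS)


# Kernel name quoted in the post:
name = "cutlass_tensorop_f16_s16816fprop_optimized_f16_128x128_32x3_nhwc"
print(looks_like_tensor_core_kernel(name))                             # -> True
print(looks_like_tensor_core_kernel("vectorized_elementwise_kernel"))  # -> False
```

By this heuristic, the kernel name quoted above is consistent with what the PyTorch profiler reported.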

NRJJJ · May 18, 2023, 12:17pm

The problem was caused by the environment setup.
I had installed all the libraries in my working conda environment, but DLProf didn’t log any information about kernel usage. After I switched to the NVIDIA Docker image, as recommended in the DLProf installation guide, kernel information started being logged, DLProf reported Tensor Core usage consistently with the PyTorch profiler, and the problem went away.
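For anyone comparing a conda environment against the container, it may help to print the library versions in both places and diff the output. This is a minimal sketch, not part of the original fix: the function name `environment_snapshot` is hypothetical, and it only reports what PyTorch itself exposes; it does not inspect DLProf.

```python
import importlib.util
import platform
import sys


def environment_snapshot() -> dict:
    """Collect basic version info; PyTorch details are added only if torch is installed."""
    info = {"python": platform.python_version(), "executable": sys.executable}
    if importlib.util.find_spec("torch") is None:
        info["torch"] = None  # PyTorch is not installed in this environment
        return info

    import torch

    info["torch"] = torch.__version__
    info["cuda_runtime"] = torch.version.cuda      # None on CPU-only builds
    info["cudnn"] = torch.backends.cudnn.version()  # None if cuDNN is unavailable
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in environment_snapshot().items():
        print(f"{key}: {value}")
```

Running this once in the conda environment and once inside the container gives two listings that can be compared line by line.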

