Case study: [TensorCore backed Conv] What makes a huge "Stall Wait"?

Hi all,

I’ve built a single-layer NN model via tensorrt API[1], and ran the generated engine under Nsight-Compute on GTX2080, which gives the following diagram:

[img] (「Warp cycles per issued instruction」: 6.77 / 「Stall Wait」: 4.20 )

  • A huge "stall wait" was present.
  • The target kernel "trt_turing_h1688cudnn_128x128_ldg8_relu_exp_small_nhwc_tn_v1()" is from libnvinfer

The documenthttps://docs.nvidia.com/nsight-compute/pdf/NsightCompute.pdf describes “Stall Wait” as somewhat of data-dependency between related instructions, am I right?
Q1:What makes this huge “Stall Wait”?

Nsight-compute can sample stalls line-by-line:

  • Most of these "Stall Wait" is associated to HMMA.1688.F16 instructions
  • "HMMA.1688.F16" seems the instructions utilizing TensorCore.

[img] (HMMA.1688.F16 R12, R176, R198, R12 …)
Q2: Are these “Stalls Wait” caused by accumulating?

  • E.g. "R12 = R176 ﹒R198 + R12", may be stalled for R12, if some other thread "read-accumulate-write" R12

Q3: Is there anyway to optimize out part of these “Stalls Wait”?


  1. My single-layer NN model
// a NN models contains just one convolution
void buildSingleConvNN()
{
    Dims3 inputDims = {128, 128, 46};
    Dims3 outputDims = {128, 128, 46};
    DimsHW kernelSize = {3, 3};
    DimsHW stride = {1, 1};
    DimsHW padding = { kernelSize.h() / 2, kernelSize.w() / 2 };

    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();
    
    // input
    ITensor* data = network->addInput("Input", DataType::kFLOAT, inputDims);

    int nInputChannels = inputDims.d[0];
    int nOutputChannels = nKernels = outputDims.d[0];

    // convolution
    Weights convWeights = genWeights(kernelSize * kernelSize * nInputChannels * nKernels);
    Weights biasWeights = { DataType::kFLOAT, nullptr, 0 };
    IConvolutionLayer* conv = network->addConvolution(*data, nOutputChannels,
        kernelSize, convWeights, biasWeights);
    conv->setStride(stride);
    conv->setPadding(padding);
    
    // output
    conv->getOutput(0)->setName("Output");
    network->markOutput(*conv->getOutput(0));
    
    builder->setMaxBatchSize(1);
    builder->setMaxWorkspaceSize(1 * (1 << 30) /* 1G */);
    builder->setFp16Mode(true);
    ICudaEngine* engine = builder->buildCudaEngine(*network);
    
    // dump engine
    const char* name = "Conv.cu.eng";
    std::ofstream engineStream;
    engineStream.open(name, std::ios::binary | std::ios::out);

    IHostMemory* serialized = engine->serialize();
    engineStream.write((char*) serialized->data(), serialized->size());
    engineStream.close();
    serialized->destroy();
}