Hi all,
I’ve built a single-layer NN model via the TensorRT API [1] and ran the generated engine under Nsight Compute on an RTX 2080, which gives the following diagram:
[img] ("Warp cycles per issued instruction": 6.77 / "Stall Wait": 4.20)
- A huge "Stall Wait" is present.
- The target kernel, "trt_turing_h1688cudnn_128x128_ldg8_relu_exp_small_nhwc_tn_v1()", comes from libnvinfer.
The document https://docs.nvidia.com/nsight-compute/pdf/NsightCompute.pdf describes "Stall Wait" as a stall on a fixed-latency data dependency between instructions; is my understanding right?
Q1: What causes this huge "Stall Wait"?
Nsight Compute can sample stalls line by line:
- Most of the "Stall Wait" samples are associated with HMMA.1688.F16 instructions.
- "HMMA.1688.F16" appears to be the instruction that uses the Tensor Cores.
[img] (HMMA.1688.F16 R12, R176, R198, R12 …)
Q2: Are these "Stall Wait" cycles caused by the accumulation?
- E.g. "R12 = R176 ﹒R198 + R12" may stall on R12 if an earlier in-flight instruction in the same warp is still in its "read-accumulate-write" of R12 (registers are private per thread, so the dependency would be within one warp's instruction stream rather than across threads).
Q3: Is there any way to optimize away part of these "Stall Wait" cycles?
- My single-layer NN model:
// a NN model that contains just one convolution
void buildSingleConvNN()
{
    Dims3 inputDims = {128, 128, 46};   // CHW (implicit batch)
    Dims3 outputDims = {128, 128, 46};
    DimsHW kernelSize = {3, 3};
    DimsHW stride = {1, 1};
    DimsHW padding = { kernelSize.h() / 2, kernelSize.w() / 2 };

    IBuilder* builder = createInferBuilder(gLogger);
    INetworkDefinition* network = builder->createNetwork();

    // input
    ITensor* data = network->addInput("Input", DataType::kFLOAT, inputDims);
    int nInputChannels = inputDims.d[0];
    int nOutputChannels = outputDims.d[0];

    // convolution; genWeights() is a helper (not shown) that fills a
    // Weights struct of the requested element count
    Weights convWeights = genWeights(kernelSize.h() * kernelSize.w()
                                     * nInputChannels * nOutputChannels);
    Weights biasWeights = { DataType::kFLOAT, nullptr, 0 };
    IConvolutionLayer* conv = network->addConvolution(*data, nOutputChannels,
                                                      kernelSize, convWeights,
                                                      biasWeights);
    conv->setStride(stride);
    conv->setPadding(padding);

    // output
    conv->getOutput(0)->setName("Output");
    network->markOutput(*conv->getOutput(0));

    builder->setMaxBatchSize(1);
    builder->setMaxWorkspaceSize(1 * (1 << 30) /* 1 GiB */);
    builder->setFp16Mode(true);
    ICudaEngine* engine = builder->buildCudaEngine(*network);

    // dump engine
    const char* name = "Conv.cu.eng";
    std::ofstream engineStream;
    engineStream.open(name, std::ios::binary | std::ios::out);
    IHostMemory* serialized = engine->serialize();
    engineStream.write((char*) serialized->data(), serialized->size());
    engineStream.close();
    serialized->destroy();

    // release TensorRT objects (pre-TensorRT-8 ownership model)
    engine->destroy();
    network->destroy();
    builder->destroy();
}
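For reference, the dumped engine was profiled roughly as follows (report names are placeholders; trtexec ships with TensorRT and replays a serialized engine):

```shell
# Replay the serialized engine and capture a full Nsight Compute report.
# "Conv.cu.eng" is the file written by buildSingleConvNN() above.
ncu --set full -o conv_report trtexec --loadEngine=Conv.cu.eng

# The per-line stall sampling shown above comes from the SourceCounters
# section of the report, viewable in the Nsight Compute UI:
ncu-ui conv_report.ncu-rep
```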