Low processor efficiency with almost same CUDA kernels

I have created three synthetic CUDA kernels that do almost nothing but arithmetic operations. All three kernels are identical except for the number of operations each performs: Kernel #1 does 8 operations, Kernel #2 does 16, and Kernel #3 does 32. Here are the implementations of all three kernels.

Kernel #1:

#ifndef kernelWGSXMAPIXLLXOPS8_H_
#define kernelWGSXMAPIXLLXOPS8_H_

__global__ void WGSXMAPIXLLXOPS8 (const float *GIn, float *GOut, const float M, const float N, const float P) {

  int gid = blockIdx.x * blockDim.x + threadIdx.x;

  float MF = (float) M;
  float NF = (float) N;
  float PF = (float) P;

  for (int lcdd = 0; lcdd < 1; lcdd++) {
    float temp1 = 1.0;
    temp1 = temp1 * MF + temp1;
    temp1 = temp1 * MF + temp1;
    temp1 = temp1 * MF + temp1;
    ... // 8 FMA operations
    temp1 = temp1 * MF + temp1;

    GOut[gid] = temp1;
  }

}

void WGSXMAPIXLLXOPS8_wrapper (const float *GIn, float *GOut,
                               const float M, const float N, const float P,
                               int numBlocks, int threadPerBlock) {
        WGSXMAPIXLLXOPS8<<<numBlocks, threadPerBlock>>> (GIn, GOut, M, N, P); 
}


#endif

Kernel #2:

#ifndef kernelWGSXMAPIXLLXOPS16_H_
#define kernelWGSXMAPIXLLXOPS16_H_

__global__ void WGSXMAPIXLLXOPS16 (const float *GIn, float *GOut, const float M, const float N, const float P) {

  int gid = blockIdx.x * blockDim.x + threadIdx.x;

  float MF = (float) M;
  float NF = (float) N;
  float PF = (float) P;

  for (int lcdd = 0; lcdd < 1; lcdd++) {
    float temp1 = 1.0;
    temp1 = temp1 * MF + temp1;
    temp1 = temp1 * MF + temp1;
    ... // 16 FMA operations
    temp1 = temp1 * MF + temp1;
    temp1 = temp1 * MF + temp1;

    GOut[gid] = temp1;
  }

}

void WGSXMAPIXLLXOPS16_wrapper (const float *GIn, float *GOut,
                               const float M, const float N, const float P,
                               int numBlocks, int threadPerBlock) {
        WGSXMAPIXLLXOPS16<<<numBlocks, threadPerBlock>>> (GIn, GOut, M, N, P); 
}

#endif

Kernel #3:

#ifndef kernelWGSXMAPIXLLXOPS32_H_
#define kernelWGSXMAPIXLLXOPS32_H_

__global__ void WGSXMAPIXLLXOPS32 (const float *GIn, float *GOut, const float M, const float N, const float P) {

  int gid = blockIdx.x * blockDim.x + threadIdx.x;

  float MF = (float) M;
  float NF = (float) N;
  float PF = (float) P;

  for (int lcdd = 0; lcdd < 1; lcdd++) {
    float temp1 = 1.0;
    temp1 = temp1 * MF + temp1;
    temp1 = temp1 * MF + temp1;
    temp1 = temp1 * MF + temp1;
    temp1 = temp1 * MF + temp1;
    temp1 = temp1 * MF + temp1;
    ... // 32 FMA operations
    temp1 = temp1 * MF + temp1;
    temp1 = temp1 * MF + temp1;

    GOut[gid] = temp1;
  }

}

void WGSXMAPIXLLXOPS32_wrapper (const float *GIn, float *GOut,
                               const float M, const float N, const float P,
                               int numBlocks, int threadPerBlock) {
        WGSXMAPIXLLXOPS32<<<numBlocks, threadPerBlock>>> (GIn, GOut, M, N, P); 
}

#endif

The total number of threads has been set to 16384, and the block size is 256. I have calculated the total GFlops of each of these kernels; they come out to 20.44, 56.53, and 110.12 GFlops (a sketch of this calculation is included at the end of this post). I tried to come up with an explanation, but nothing came to mind, so I profiled with nvprof and monitored all the metrics; most of them are close across the three kernels. Here are some of the metrics that look important to me (the three values are for kernels 1, 2, and 3):

sm_efficiency_instance:   14.99, 16.78, 19.82 %
ipc_instance:             0.57 , 0.93 , 1.53   
inst_replay_overhead:     0.399, 0.268, 0.165
dram_write_throughput:    18.08, 17.72, 16.9 GB/s
issued_ipc:               0.99 , 1.18 , 1.52
issue_slot_utilization:   19.48, 24.64, 33.76 %
stall_exec_dependency:    21.84, 26.38, 42.95 %

As is clear, all three have roughly the same dram_write_throughput, since they all write the same amount of data to DRAM and the total number of threads is the same. What I don't understand is sm_efficiency. My kernels all do the same kind of arithmetic, so why is their sm_efficiency not the same? And why does doing more arithmetic in the same kernel increase the efficiency? My understanding is that all of them should have the same difficulty finding enough warps to keep the SMs occupied.

Can anyone help me understand the difference in GFlops using the metrics above?
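For reference, the timing and GFlops calculation look roughly like the sketch below. This is a minimal sketch, not the exact benchmark code: it assumes CUDA event timing of a single launch of Kernel #1 (64 blocks of 256 threads), uses placeholder values for M, N, and P, an assumed header name, and counts each FMA line as 2 floating-point operations.

#include <cstdio>
#include <cuda_runtime.h>
#include "kernelWGSXMAPIXLLXOPS8.h"   // assumed header name for Kernel #1

int main () {
  const int totalThreads   = 16384;
  const int threadPerBlock = 256;
  const int numBlocks      = totalThreads / threadPerBlock;   // 64 blocks
  const int opsPerThread   = 8;                               // 8 FMA lines in Kernel #1

  float *GIn, *GOut;
  cudaMalloc ((void **) &GIn,  totalThreads * sizeof(float));
  cudaMalloc ((void **) &GOut, totalThreads * sizeof(float));

  cudaEvent_t start, stop;
  cudaEventCreate (&start);
  cudaEventCreate (&stop);

  cudaEventRecord (start);
  WGSXMAPIXLLXOPS8_wrapper (GIn, GOut, 1.3f, 1.4f, 1.5f, numBlocks, threadPerBlock);  // placeholder M, N, P
  cudaEventRecord (stop);
  cudaEventSynchronize (stop);

  float ms = 0.0f;
  cudaEventElapsedTime (&ms, start, stop);

  // Each "temp1 = temp1 * MF + temp1" line is one FMA = 2 floating-point operations.
  double flops  = (double) totalThreads * opsPerThread * 2.0;
  double gflops = flops / (ms * 1.0e-3) / 1.0e9;
  printf ("time = %f ms, GFlops = %f\n", ms, gflops);

  cudaFree (GIn);
  cudaFree (GOut);
  return 0;
}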

My guess is that you may be observing what is known as the tail effect.

Can you quickly throw 16x the number of threads at the GPU to see if the difference in sm_efficiency shrinks a lot? (See the sketch after the blog link below.)

Christian

This NVIDIA blog entry talks a bit about the tail effect:

https://devblogs.nvidia.com/cuda-pro-tip-minimize-the-tail-effect/
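As a concrete example of the 16x suggestion, the launch could be scaled like this. This is only a sketch: it reuses the GIn/GOut buffers and the wrapper from the question, keeps the block size at 256, and uses placeholder values for M, N, and P.

// Same block size, 16x the total threads (16384 -> 262144),
// so the grid grows from 64 to 1024 blocks.
const int threadPerBlock = 256;
const int totalThreads   = 16384 * 16;                       // 262144 threads
const int numBlocks      = totalThreads / threadPerBlock;    // 1024 blocks

// GIn and GOut must now be allocated with totalThreads * sizeof(float) bytes.
WGSXMAPIXLLXOPS8_wrapper (GIn, GOut, 1.3f, 1.4f, 1.5f, numBlocks, threadPerBlock);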

Looking at these numbers from the original post:

20.44, 56.53, and 110.12 GFlops.
dram_write_throughput: 18.08, 17.72, 16.9 GB/s

it looks like you are bound by memory bandwidth.

Cross-posting:

https://stackoverflow.com/questions/49723378/low-processor-efficiency-with-almost-same-cuda-kernels

There seem to be at least 2 issues, both related to different kinds of latency:

  1. At the very small kernel sizes (e.g. 16384 total threads) the kernel execution time is short, so the measurement is clouded by e.g. kernel launch latency (one way to reduce this is to warm up and average over many launches, as sketched after this list).
  2. The kernel sizes, being very small, do not saturate the GPU with as much parallel work as it could handle, so things like IPC and sm_efficiency are lower than they could be, and the stall reason "exec dependency" is relatively high.
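For the first issue, a minimal sketch of a more robust measurement: warm up once, then time many back-to-back launches and average. It reuses the buffers and launch parameters from the question; the repetition count is a placeholder.

const int reps = 1000;

WGSXMAPIXLLXOPS8_wrapper (GIn, GOut, 1.3f, 1.4f, 1.5f, numBlocks, threadPerBlock);  // warm-up launch
cudaDeviceSynchronize ();

cudaEvent_t start, stop;
cudaEventCreate (&start);
cudaEventCreate (&stop);

cudaEventRecord (start);
for (int r = 0; r < reps; r++)
  WGSXMAPIXLLXOPS8_wrapper (GIn, GOut, 1.3f, 1.4f, 1.5f, numBlocks, threadPerBlock);
cudaEventRecord (stop);
cudaEventSynchronize (stop);

float totalMs = 0.0f;
cudaEventElapsedTime (&totalMs, start, stop);
float msPerLaunch = totalMs / reps;   // average per-launch time, with launch latency amortized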

Any time you see an sm_efficiency that low, a possible conclusion is that not enough parallel work has been exposed to the GPU, so neither compute throughput nor memory bandwidth is the limiting factor; instead, latency is the limiting factor for performance.

This is consistent with the analysis-driven optimization logic:

http://www.nvidia.com/content/GTC-2010/pdfs/2012_GTC2010.pdf (slide 46 and beyond)

and can be rectified simply by exposing more work to the GPU.

The first two optimization priorities for any CUDA programmer are to expose enough parallel work to the GPU, and to make efficient use of the memory subsystem(s).