I have created three synthetic CUDA kernels, which are almost all doing only arithmetic operations. All three kernels are the same, except each of them does a different number of operations. Kernel #1 does 8 operations, Kernel #2 does 16 operations and Kernel #3 does 32. Here are the implementations of CUDA kernel for all three.

Kernel #1:

```
#ifndef kernelWGSXMAPIXLLXOPS8_H_
#define kernelWGSXMAPIXLLXOPS8_H_
__global__ void WGSXMAPIXLLXOPS8 (const float *GIn, float *GOut, const float M, const float N, const float P) {
int gid = blockIdx.x * blockDim.x + threadIdx.x;
float MF = (float) M;
float NF = (float) N;
float PF = (float) P;
for (int lcdd = 0; lcdd < 1; lcdd++) {
float temp1 = 1.0;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
... // 8 FMA operations
temp1 = temp1 * MF + temp1;
GOut[gid] = temp1;
}
}
void WGSXMAPIXLLXOPS8_wrapper (const float *GIn, float *GOut,
const float M, const float N, const float P,
int numBlocks, int threadPerBlock) {
WGSXMAPIXLLXOPS8<<<numBlocks, threadPerBlock>>> (GIn, GOut, M, N, P);
}
#endif
```

Kernel #2:

```
#ifndef kernelWGSXMAPIXLLXOPS16_H_
#define kernelWGSXMAPIXLLXOPS16_H_
__global__ void WGSXMAPIXLLXOPS16 (const float *GIn, float *GOut, const float M, const float N, const float P) {
int gid = blockIdx.x * blockDim.x + threadIdx.x;
float MF = (float) M;
float NF = (float) N;
float PF = (float) P;
for (int lcdd = 0; lcdd < 1; lcdd++) {
float temp1 = 1.0;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
... // 16 FMA operations
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
GOut[gid] = temp1;
}
}
void WGSXMAPIXLLXOPS16_wrapper (const float *GIn, float *GOut,
const float M, const float N, const float P,
int numBlocks, int threadPerBlock) {
WGSXMAPIXLLXOPS16<<<numBlocks, threadPerBlock>>> (GIn, GOut, M, N, P);
}
#endif
```

Kernel #3:

```
#ifndef kernelWGSXMAPIXLLXOPS32_H_
#define kernelWGSXMAPIXLLXOPS32_H_
__global__ void WGSXMAPIXLLXOPS32 (const float *GIn, float *GOut, const float M, const float N, const float P) {
int gid = blockIdx.x * blockDim.x + threadIdx.x;
float MF = (float) M;
float NF = (float) N;
float PF = (float) P;
for (int lcdd = 0; lcdd < 1; lcdd++) {
float temp1 = 1.0;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
```

```
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
... // 32 FMA operations
temp1 = temp1 * MF + temp1;
temp1 = temp1 * MF + temp1;
GOut[gid] = temp1;
}
}
void WGSXMAPIXLLXOPS32_wrapper (const float *GIn, float *GOut,
const float M, const float N, const float P,
int numBlocks, int threadPerBlock) {
WGSXMAPIXLLXOPS32<<<numBlocks, threadPerBlock>>> (GIn, GOut, M, N, P);
}
#endif
```

The total number of threads have been set to 16384, and block size is 256. I have calculated the total GFlops of each of these kernels and are equal to 20.44, 56.53, and 110.12 GFlops. I was trying to come up with an explanation, but nothing comes to my mind. So I tried using nvprof and monitored all metrics. All metrics are almost equal, Here are some of the metrics that look important to me (I also included results for kernel 1 to 3):

```
sm_efficiency_instance: 14.99, 16.78, 19.82 %
ipc_instance: 0.57 , 0.93 , 1.53
inst_replay_overhead: 0.399, 0.268, 0.165
dram_write_throughput: 18.08, 17.72, 16.9 GB/s
issued_ipc: 0.99 , 1.18 , 1.52
issue_slot_utilization: 19.48, 24.64, 33.76 %
stall_exec_dependency: 21.84, 26.38, 42.95 %
```

As it’s clear, both of them have the same dram_write_throughput, since all are writing same amount of data to the DRAM, and the total number of threads is the same. What I don’t understand is sm_efficiency. My kernels are all doing arithmetics (the same), how come their sm_efficiency is not the same. Also, why having more arithmetic in the same kernel increases the efficiency? My understanding is, all of them should have the same problem for finding warps to locate on SM.

Can anyone help me to understand the difference of GFlops, using below metrics?