Verify CUDA core peak FP32 performance

I want to verify CUDA core peak FP32 throughput on the RTX 3090. For example, the 32-bit floating-point add throughput is 128 results per clock cycle per SM on SM 8.6 (CUDA C Programming Guide 12.3, chapter 8.4.1).

From the whitepaper we know the boost clock is 1.7 GHz, so I locked the GPU clock to 1700 MHz:
nvidia-smi --lock-gpu-clocks=1700

So the 3090 FP32 add throughput is 128 (per SM) * 82 (SM count) * 1.7 (GHz) / 1000 = 17.8 TFLOPS. (FMA counts as 2 FLOPs, giving 35.6 TFLOPS, which matches the 3090 spec.)
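For reference, a minimal worked version of that arithmetic (not from the attached test code, it only spells out the units):

// Peak FP32 rates for the RTX 3090 (GA102, SM 8.6) at the 1.7 GHz locked clock.
double fadd_per_sm_per_clk = 128.0;   // FP32 add results per SM per cycle
double sm_count            = 82.0;
double clock_ghz           = 1.7;
double add_tflops = fadd_per_sm_per_clk * sm_count * clock_ghz / 1000.0; // ~17.8 TFLOPS
double fma_tflops = add_tflops * 2.0;                                    // ~35.6 TFLOPS (2 FLOPs per FMA)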

I wrote a program like the one below to verify the CUDA core FP32 add throughput:

__global__ void gpu_compute_peak(float* in, float* out, long long int *clock, unsigned int *loop_time)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    if (idx < NUM_ELEMENTS) {
        float input = in[idx];
        unsigned int loop = loop_time[idx];
        long long int start, end;
        unsigned int i = 0;
        // Inline PTX: time a loop of 8 dependent FADDs per iteration with %clock64.
        asm volatile(
            ".reg .pred p1;\n\t"               // predicate register must be declared
            ".reg .u32 t1;\n\t"
            ".reg .f32 f1;\n\t"
            ".reg .f32 f2;\n\t"
            "mov.u32 t1, %3;\n\t"
            "mov.f32 f1, %4;\n\t"
            "mov.u64 %1, %%clock64;\n\t"       // start timestamp
            "mov.f32 f2, 0f3F000000;\n\t"      // f2 is set but otherwise unused
            "$__LOOP:\n\t"
            "add.u32 t1, t1, 8;\n\t"           // 8 FADDs per iteration
            "add.f32 f1, f1, 0f3F000000;\n\t"  // += 0.5f
            "add.f32 f1, f1, 0f3F800000;\n\t"  // += 1.0f
            "add.f32 f1, f1, 0f3F000000;\n\t"
            "add.f32 f1, f1, 0f3F800000;\n\t"
            "add.f32 f1, f1, 0f3F000000;\n\t"
            "add.f32 f1, f1, 0f3F800000;\n\t"
            "add.f32 f1, f1, 0f3F000000;\n\t"
            "add.f32 f1, f1, 0f3F800000;\n\t"
            "setp.ne.s32 p1, t1, %5;\n\t"
            "@p1 bra $__LOOP;\n\t"
            "mov.u64 %2, %%clock64;\n\t"       // end timestamp
            "mov.f32 %0, f1;\n\t"
            : "=f"(input), "=l"(start), "=l"(end) : "r"(i), "f"(input), "r"(loop)
        );
        clock[idx] = end - start;
        out[idx] = input;
    }
}

loop_time is always 0x10000, and NUM_ELEMENTS equals the thread count, which is 10496 (82 blocks * 128 threads).

The throughput is then computed as: TFLOPS = loop_time * num_threads / clock_delta * boost_clock (GHz) / 1000.
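A sketch of that host-side math (variable names assumed here, since the attached perf_test code is not reproduced in this post); each loop iteration issues 8 FADDs and advances t1 by 8, so every thread executes exactly loop_time FADD instructions:

// Hypothetical host-side reduction of the per-thread clock64 deltas copied back from the device.
long long totalClocks = 0;
for (int i = 0; i < NUM_ELEMENTS; ++i)
    totalClocks += h_clock[i];
double cyclesPerThread = (double)totalClocks / NUM_ELEMENTS;
double addsPerCycle    = (double)loop_time * NUM_ELEMENTS / cyclesPerThread; // FADDs retired per GPU cycle
double tflops          = addsPerCycle * 1.7 / 1000.0;                        // at the locked 1.7 GHz clock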

I get 4.2 TFLOPS, which is far less than 17.8. I don't know where I went wrong; can someone help me? The attachment is the test code.

Many thanks.

perf_test.txt (4.0 KB)

Perhaps mix different float registers instead of using only f1?

What is the stall reason according to the profiler?

I don't know the reason; the dumped SASS seems correct. Does anybody have an idea?

When you properly utilize the GPU, i.e. use 82 * 2048 threads instead of 82 * 128, you get the correct values.

threadsPerBlock: 128
blocksPerGrid: 1312
event: time: 0.73216ms
event: 15032.034137 G elements per second
clock: total clock: 99542356896 each thread clock: 592739.834794
clock: each clock 18567.764557 number
clock: 31565.199747 G elements per second

@striker159 Many thanks.

I will try it later, but I can't figure out why. Can you give me some explanation? Maybe I need to read the corresponding section.

GPU RTX 3090 is SM8.6.

The primary issue is that the test code is launching 1 thread block/SM which results in only 1 warp per SM sub-partition. The test sequence is a sequence of dependent FADDs. The dependent instruction issue latency for FADD is 4 cycles so each SM sub-partition only has 1 warp that can issue an instruction every 4 cycles. The two options to improve the throughput are:

  1. Increase the warps per SM sub-partition to at least 4 to cover dependent latency.
  2. Increase the warp instruction level parallelism by avoiding a sequence of dependent instructions.

SM8.6 supports up to 1536 threads/SM. As stated above, as you increase to 512 threads/SM (as 1-4 thread blocks), the performance will increase. The method of measuring elapsed cycles would need to be changed for handling more threads per block or multiple blocks per SM.
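For example, a minimal launch-configuration change along those lines (array names assumed, not taken from the attached code) would be to launch 4 blocks of 128 threads per SM, giving each SM sub-partition 4 warps to cover the 4-cycle dependent-issue latency of FADD:

// Hypothetical launch configuration: 4 warps per SM sub-partition instead of 1.
const int smCount         = 82;             // RTX 3090
const int threadsPerBlock = 128;
const int blocksPerGrid   = smCount * 4;    // 512 threads per SM
gpu_compute_peak<<<blocksPerGrid, threadsPerBlock>>>(d_in, d_out, d_clock, d_loop);

NUM_ELEMENTS would also need to grow to blocksPerGrid * threadsPerBlock.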

From godbolt.org NVCC 12.3.1 -arch=sm_86

gpu_compute_peak(float*, float*, long long*, unsigned int*):
MOV R1, c[0x0][0x28] 
S2R R0, SR_TID.X 
S2R R3, SR_CTAID.X 
IMAD R0, R3, c[0x0][0x0], R0 
ISETP.GT.AND P0, PT, R0, 0x28ff, PT 
@P0 EXIT 
MOV R5, 0x4 
ULDC.64 UR4, c[0x0][0x118] 
IMAD.WIDE R2, R0, R5, c[0x0][0x160] 
IMAD.WIDE R8, R0, R5, c[0x0][0x178] 
LDG.E R11, [R2.64] 
LDG.E R6, [R8.64] 
MOV R7, RZ 

// 64-bit clock read
CS2R R2, SR_CLOCKLO
ISETP.GT.AND P0, PT, R6, RZ, PT 
BSSY B1, `(.L_x_0) 
BSSY B0, `(.L_x_1) 
@!P0 BRA `(.L_x_2) 
IADD3 R4, R6, -R7, RZ 
BSSY B2, `(.L_x_3) 
PLOP3.LUT P0, PT, PT, PT, PT, 0x80, 0x0 
ISETP.GT.AND P1, PT, R4, 0x18, PT 
@!P1 BRA `(.L_x_4) 
PLOP3.LUT P0, PT, PT, PT, PT, 0x8, 0x0 
IADD3 R4, R6, -0x18, RZ 
.L_x_5:

// initial start of sequence
FADD R11, R11, 0.5 
IADD3 R7, R7, 0x20, RZ 
FADD R11, R11, 1 
ISETP.GE.AND P1, PT, R7, R4, PT 

// sequence of dependent instructions
FADD R11, R11, 0.5 
FADD R11, R11, 1 
FADD R11, R11, 0.5 
FADD R11, R11, 1 
FADD R11, R11, 0.5 
FADD R11, R11, 1 
FADD R11, R11, 0.5 
FADD R11, R11, 1

A standard technique used in microbenchmarks is to simply use multiple sequences of dependent instructions. Here that would mean summing into multiple accumulators in round-robin fashion. When using CUDA C++ code rather than PTX inline assembly, one would want to sum across the accumulators at the end and store the result to global memory to prevent the compiler from removing the code during dead-code elimination.
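A minimal CUDA C++ sketch of that technique (hypothetical names, not the poster's attached code): four independent accumulators give each warp four independent FADD dependency chains, and the final sum is stored so the loop cannot be removed as dead code.

__global__ void fp32_add_ilp(const float* in, float* out, int iters)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    float a0 = in[idx];
    float a1 = a0 + 1.0f, a2 = a0 + 2.0f, a3 = a0 + 3.0f;
    for (int i = 0; i < iters; ++i) {
        // four independent dependency chains -> 4-way instruction-level parallelism
        a0 += 0.5f;
        a1 += 0.5f;
        a2 += 0.5f;
        a3 += 0.5f;
    }
    out[idx] = a0 + a1 + a2 + a3;   // keep the results live for the compiler
}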

Wow, thanks @Greg @njuffa.
I changed threadsPerBlock to 128 * 4 and NUM_ELEMENTS to 10496 * 4, and the result seems correct.

threadsPerBlock: 512
blocksPerGrid: 82
event: time: 0.18432ms
event: 14927.644218 G elements per second
clock: total clock: 12604116704 each thread clock: 300212.383384
clock: each clock 9165.056394 number
clock: 15580.595870 G elements per second

Please point out if I am wrong; I will dig into this and update later.

If all input operands are registers, latency is caused by register dependencies, i.e., some of the input operands are written by some previous instruction(s) whose execution has not completed yet. In this case, the latency is equal to the execution time of the previous instruction and the warp schedulers must schedule instructions of other warps during that time. Execution time varies depending on the instruction. On devices of compute capability 7.x, for most arithmetic instructions, it is typically 4 clock cycles. This means that 16 active warps per multiprocessor (4 cycles, 4 warp schedulers) are required to hide arithmetic instruction latencies (assuming that warps execute instructions with maximum throughput, otherwise fewer warps are needed). If the individual warps exhibit instruction-level parallelism, i.e. have multiple independent instructions in their instruction stream, fewer warps are needed because multiple independent instructions from a single warp can be issued back to back.

It seems the CUDA C Programming Guide has already mentioned the above.
Thanks for all your help.

The PTX language machine model has infinite registers. Normally you would try to rewrite registers as seldom as possible, but put results into new registers instead. The only exceptions (I can think of now) are loops and conditional assignments. It is the job of the assembler to consider the number of available registers and register latency. Depending on the GPU architecture the registers are also stored into banks (+ an additional register reuse cache). Using register operands from the same bank has an additional penalty.

