Verify CUDA core peak FP32 performance

I want to verify CUDA core peak FP32 throughput on the RTX 3090. For example, the 32-bit floating-point add throughput is 128 results per clock cycle per SM on SM 8.6 (CUDA C Programming Guide 12.3, chapter 8.4.1).

From the whitepaper we know the boost clock is 1.7 GHz, so I locked the GPU clock to 1700 MHz:
nvidia-smi --lock-gpu-clocks=1700

So the 3090 FP32 add throughput is 128 (per SM) * 82 (SM count) * 1.7 (GHz) / 1000 = 17.8 TFLOPS. (FMA counts as 2 FLOPs, giving 35.6 TFLOPS, which matches the 3090 spec.)
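For reference, a minimal worked version of that arithmetic (not from the attached test code, it only spells out the units):

// Peak FP32 rates for the RTX 3090 (GA102, SM 8.6) at the 1.7 GHz locked clock.
double fadd_per_sm_per_clk = 128.0;   // FP32 add results per SM per cycle
double sm_count            = 82.0;
double clock_ghz           = 1.7;
double add_tflops = fadd_per_sm_per_clk * sm_count * clock_ghz / 1000.0; // ~17.8 TFLOPS
double fma_tflops = add_tflops * 2.0;                                    // ~35.6 TFLOPS (2 FLOPs per FMA)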

I wrote a program like the one below to verify the CUDA core FP32 add throughput:

__global__ void gpu_compute_peak(float* in, float* out, long long int *clock, unsigned int *loop_time)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    if (idx < NUM_ELEMENTS) {
        float input = in[idx];
        unsigned int loop = loop_time[idx];
        long long int start, end;
        unsigned int i = 0;
        // Inline PTX: time a loop of 8 dependent FADDs per iteration with %clock64.
        asm volatile(
            ".reg .pred p1;\n\t"               // predicate register must be declared
            ".reg .u32 t1;\n\t"
            ".reg .f32 f1;\n\t"
            ".reg .f32 f2;\n\t"
            "mov.u32 t1, %3;\n\t"
            "mov.f32 f1, %4;\n\t"
            "mov.u64 %1, %%clock64;\n\t"       // start timestamp
            "mov.f32 f2, 0f3F000000;\n\t"      // f2 is set but otherwise unused
            "$__LOOP:\n\t"
            "add.u32 t1, t1, 8;\n\t"           // 8 FADDs per iteration
            "add.f32 f1, f1, 0f3F000000;\n\t"  // += 0.5f
            "add.f32 f1, f1, 0f3F800000;\n\t"  // += 1.0f
            "add.f32 f1, f1, 0f3F000000;\n\t"
            "add.f32 f1, f1, 0f3F800000;\n\t"
            "add.f32 f1, f1, 0f3F000000;\n\t"
            "add.f32 f1, f1, 0f3F800000;\n\t"
            "add.f32 f1, f1, 0f3F000000;\n\t"
            "add.f32 f1, f1, 0f3F800000;\n\t"
            "setp.ne.s32 p1, t1, %5;\n\t"
            "@p1 bra $__LOOP;\n\t"
            "mov.u64 %2, %%clock64;\n\t"       // end timestamp
            "mov.f32 %0, f1;\n\t"
            : "=f"(input), "=l"(start), "=l"(end) : "r"(i), "f"(input), "r"(loop)
        );
        clock[idx] = end - start;
        out[idx] = input;
    }
}

loop_time is always 0x10000, and NUM_ELEMENTS equals the thread count, which is 10496 (82 blocks * 128 threads).

The throughput is then computed as: TFLOPS = loop_time * num_threads / clock_delta * boost_clock (GHz) / 1000.
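A sketch of that host-side math (variable names assumed here, since the attached perf_test code is not reproduced in this post); each loop iteration issues 8 FADDs and advances t1 by 8, so every thread executes exactly loop_time FADD instructions:

// Hypothetical host-side reduction of the per-thread clock64 deltas copied back from the device.
long long totalClocks = 0;
for (int i = 0; i < NUM_ELEMENTS; ++i)
    totalClocks += h_clock[i];
double cyclesPerThread = (double)totalClocks / NUM_ELEMENTS;
double addsPerCycle    = (double)loop_time * NUM_ELEMENTS / cyclesPerThread; // FADDs retired per GPU cycle
double tflops          = addsPerCycle * 1.7 / 1000.0;                        // at the locked 1.7 GHz clock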

I get 4.2 TFLOPS, which is far less than 17.8. I don't know where I went wrong; can someone help me? The attachment is the test code.

Many thanks.

perf_test.txt (4.0 KB)

Perhaps mix different float registers instead of using only f1?

What is the stall reason according to the profiler?

I don't know the reason; the dumped SASS seems correct. Does anybody have an idea?

When you properly utilize the GPU, i.e. use 82 * 2048 threads instead of 82 * 128, you get the correct values.

threadsPerBlock: 128
blocksPerGrid: 1312
event: time: 0.73216ms
event: 15032.034137 G elements per second
clock: total clock: 99542356896 each thread clock: 592739.834794
clock: each clock 18567.764557 number
clock: 31565.199747 G elements per second

@striker159 Many thanks.

I will try it later, but I can't figure out why. Can you give me some explanation? Maybe I need to read the corresponding section.

GPU RTX 3090 is SM8.6.

The primary issue is that the test code is launching 1 thread block/SM which results in only 1 warp per SM sub-partition. The test sequence is a sequence of dependent FADDs. The dependent instruction issue latency for FADD is 4 cycles so each SM sub-partition only has 1 warp that can issue an instruction every 4 cycles. The two options to improve the throughput are:

  1. Increase the warps per SM sub-partition to at least 4 to cover dependent latency.
  2. Increase the warp instruction level parallelism by avoiding a sequence of dependent instructions.

SM8.6 supports up to 1536 threads/SM. As stated above, as you increase to 512 threads/SM (as 1-4 thread blocks), the performance will increase. The method of measuring elapsed cycles would need to be changed for handling more threads per block or multiple blocks per SM.
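For example, a minimal launch-configuration change along those lines (array names assumed, not taken from the attached code) would be to launch 4 blocks of 128 threads per SM, giving each SM sub-partition 4 warps to cover the 4-cycle dependent-issue latency of FADD:

// Hypothetical launch configuration: 4 warps per SM sub-partition instead of 1.
const int smCount         = 82;             // RTX 3090
const int threadsPerBlock = 128;
const int blocksPerGrid   = smCount * 4;    // 512 threads per SM
gpu_compute_peak<<<blocksPerGrid, threadsPerBlock>>>(d_in, d_out, d_clock, d_loop);

NUM_ELEMENTS would also need to grow to blocksPerGrid * threadsPerBlock.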

From godbolt.org NVCC 12.3.1 -arch=sm_86

gpu_compute_peak(float*, float*, long long*, unsigned int*):
MOV R1, c[0x0][0x28] 
S2R R0, SR_TID.X 
S2R R3, SR_CTAID.X 
IMAD R0, R3, c[0x0][0x0], R0 
ISETP.GT.AND P0, PT, R0, 0x28ff, PT 
@P0 EXIT 
MOV R5, 0x4 
ULDC.64 UR4, c[0x0][0x118] 
IMAD.WIDE R2, R0, R5, c[0x0][0x160] 
IMAD.WIDE R8, R0, R5, c[0x0][0x178] 
LDG.E R11, [R2.64] 
LDG.E R6, [R8.64] 
MOV R7, RZ 

// 64-bit clock read
CS2R R2, SR_CLOCKLO
ISETP.GT.AND P0, PT, R6, RZ, PT 
BSSY B1, `(.L_x_0) 
BSSY B0, `(.L_x_1) 
@!P0 BRA `(.L_x_2) 
IADD3 R4, R6, -R7, RZ 
BSSY B2, `(.L_x_3) 
PLOP3.LUT P0, PT, PT, PT, PT, 0x80, 0x0 
ISETP.GT.AND P1, PT, R4, 0x18, PT 
@!P1 BRA `(.L_x_4) 
PLOP3.LUT P0, PT, PT, PT, PT, 0x8, 0x0 
IADD3 R4, R6, -0x18, RZ 
.L_x_5:

// initial start of sequence
FADD R11, R11, 0.5 
IADD3 R7, R7, 0x20, RZ 
FADD R11, R11, 1 
ISETP.GE.AND P1, PT, R7, R4, PT 

// sequence of dependent instructions
FADD R11, R11, 0.5 
FADD R11, R11, 1 
FADD R11, R11, 0.5 
FADD R11, R11, 1 
FADD R11, R11, 0.5 
FADD R11, R11, 1 
FADD R11, R11, 0.5 
FADD R11, R11, 1

A standard technique used in microbenchmarks is to simply use multiple sequences of dependent instructions. Here that would mean summing into multiple accumulators in round-robin fashion. When using CUDA C++ code rather than PTX inline assembly, one would want to sum across the accumulators at the end and store the result to global memory to prevent the compiler from removing the code during dead-code elimination.
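A minimal CUDA C++ sketch of that technique (hypothetical names, not the poster's attached code): four independent accumulators give each warp four independent FADD dependency chains, and the final sum is stored so the loop cannot be removed as dead code.

__global__ void fp32_add_ilp(const float* in, float* out, int iters)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    float a0 = in[idx];
    float a1 = a0 + 1.0f, a2 = a0 + 2.0f, a3 = a0 + 3.0f;
    for (int i = 0; i < iters; ++i) {
        // four independent dependency chains -> 4-way instruction-level parallelism
        a0 += 0.5f;
        a1 += 0.5f;
        a2 += 0.5f;
        a3 += 0.5f;
    }
    out[idx] = a0 + a1 + a2 + a3;   // keep the results live for the compiler
}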

Wow, thanks @Greg @njuffa.
I changed threadsPerBlock to 128 * 4 and NUM_ELEMENTS to 10496 * 4, and the result seems correct.

threadsPerBlock: 512
blocksPerGrid: 82
event: time: 0.18432ms
event: 14927.644218 G elements per second
clock: total clock: 12604116704 each thread clock: 300212.383384
clock: each clock 9165.056394 number
clock: 15580.595870 G elements per second

Please point out if I am wrong; I will dig into this and update later.

If all input operands are registers, latency is caused by register dependencies, i.e., some of the input operands are written by some previous instruction(s) whose execution has not completed yet. In this case, the latency is equal to the execution time of the previous instruction and the warp schedulers must schedule instructions of other warps during that time. Execution time varies depending on the instruction. On devices of compute capability 7.x, for most arithmetic instructions, it is typically 4 clock cycles. This means that 16 active warps per multiprocessor (4 cycles, 4 warp schedulers) are required to hide arithmetic instruction latencies (assuming that warps execute instructions with maximum throughput, otherwise fewer warps are needed). If the individual warps exhibit instruction-level parallelism, i.e. have multiple independent instructions in their instruction stream, fewer warps are needed because multiple independent instructions from a single warp can be issued back to back.

It seems the CUDA C Programming Guide has already mentioned the above.
Thanks for all your help.

The PTX language machine model has infinite registers. Normally you would try to rewrite registers as seldom as possible, but put results into new registers instead. The only exceptions (I can think of now) are loops and conditional assignments. It is the job of the assembler to consider the number of available registers and register latency. Depending on the GPU architecture the registers are also stored into banks (+ an additional register reuse cache). Using register operands from the same bank has an additional penalty.

