Advanced NVIDIA CUDA Kernel Optimization Techniques: Handwritten PTX

Originally published at: Advanced NVIDIA CUDA Kernel Optimization Techniques: Handwritten PTX | NVIDIA Technical Blog

As accelerated computing continues to drive application performance in all areas of AI and scientific computing, there’s a renewed interest in GPU optimization techniques to ensure applications obtain the best possible performance. As an application developer, there are many ways to program GPUs, up and down the software stack. In this post, we introduce some…

I have a question. If I implement a top-k operator with k = 2 using the following code, the PTX that the compiler generates is nearly identical to the handwritten PTX. I believe the performance benefit described in the paper comes from the specialized algorithm for top-k with k = 2, rather than from writing the PTX by hand.
So what is the real benefit of handwritten PTX? Thank you in advance for your answers.

CUDA code:

    float a, b, c, o1, o2;
    a = A[i];
    b = A[i + 1];
    c = C[i];
    float mx = max(b, c);  // larger of b and c
    bool p = a >= c;       // one predicate, reused by both selects
    o1 = p ? mx : a;
    o2 = p ? a : c;
    A[i] = o1;
    A[i + 1] = o2;

The PTX generated for it:

	max.f32 	%f4, %f2, %f3;
	setp.ge.f32 	%p1, %f1, %f3;
	selp.f32 	%f5, %f4, %f1, %p1;
	selp.f32 	%f6, %f1, %f3, %p1;
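
For comparison, here is a minimal sketch of how the same four instructions could be written by hand as inline PTX next to the plain version. It is not the code from the blog post: the function names `top2_step_cxx` and `top2_step_ptx` and the `HD` macro are hypothetical, chosen only for illustration, and the inline-asm part is compiled only under nvcc.

```cpp
#include <math.h>

// HD is a hypothetical helper macro so the plain version also builds as host C++.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Plain version: the same arithmetic as the snippet above, on scalar values.
HD inline void top2_step_cxx(float a, float b, float c, float& o1, float& o2) {
    float mx = fmaxf(b, c);  // larger of b and c
    bool p = a >= c;         // one predicate, reused by both selects
    o1 = p ? mx : a;
    o2 = p ? a : c;
}

#ifdef __CUDACC__
// Handwritten version (illustrative): the same max/setp/selp/selp sequence
// as inline PTX. Results are built in scratch registers and copied to the
// outputs last, so no output is written while an input is still needed.
__device__ inline void top2_step_ptx(float a, float b, float c, float& o1, float& o2) {
    asm("{\n\t"
        ".reg .pred p;\n\t"
        ".reg .f32 mx, t1, t2;\n\t"
        "max.f32 mx, %3, %4;\n\t"      // mx = max(b, c)
        "setp.ge.f32 p, %2, %4;\n\t"   // p  = (a >= c)
        "selp.f32 t1, mx, %2, p;\n\t"  // t1 = p ? mx : a
        "selp.f32 t2, %2, %4, p;\n\t"  // t2 = p ? a : c
        "mov.f32 %0, t1;\n\t"
        "mov.f32 %1, t2;\n\t"
        "}"
        : "=f"(o1), "=f"(o2)
        : "f"(a), "f"(b), "f"(c));
}
#endif
```

Compiling a kernel that calls each version with `nvcc -ptx` and diffing the output is one way to check whether the handwritten sequence differs from what the compiler already emits.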

Does DeepSeek use handwritten PTX to improve its performance? If so, how do they implement it?