Advanced NVIDIA CUDA Kernel Optimization Techniques: Handwritten PTX

jwitsoe · July 2, 2025, 8:43pm

Originally published at: Advanced NVIDIA CUDA Kernel Optimization Techniques: Handwritten PTX | NVIDIA Technical Blog

As accelerated computing continues to drive application performance in all areas of AI and scientific computing, there’s a renewed interest in GPU optimization techniques to ensure applications obtain the best possible performance. As an application developer, there are many ways to program GPUs, up and down the software stack. In this post, we introduce some…

huqingqing0309 · August 27, 2025, 9:29am

I have a question. If I implement a top-k operator with k = 2 using the following code, the PTX the compiler generates is nearly identical to the handwritten PTX. I believe the performance benefit described in the paper comes from the specialized algorithm for top-k when k = 2, rather than from the manually written PTX.
So what is the real benefit of handwritten PTX? Thank you in advance for your answers.

CUDA code:

    float a, b, c, o1, o2;
    a = A[i];
    b = A[i + 1];
    c = C[i];
    float mx = max(b, c);
    bool p = a >= c;
    o1 = p ? mx : a;
    o2 = p ? a : c; 
    A[i] = o1;
    A[i+1] = o2;

Its PTX code:

    max.f32     %f4, %f2, %f3;
    setp.ge.f32 %p1, %f1, %f3;
    selp.f32    %f5, %f4, %f1, %p1;
    selp.f32    %f6, %f1, %f3, %p1;
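
For anyone who wants to reproduce this comparison, here is a minimal sketch that puts the C++ logic and an inline-PTX equivalent side by side in one file. The names `top2_cpp`, `top2_ptx`, and `compare_top2` are hypothetical, and the inline assembly is not taken from the blog post: it is written only to mirror the four instructions listed above, with one extra `mov` so that the outputs are written only after the inputs are no longer needed.

```cuda
#include <cmath>

// Plain CUDA C++ version: same logic as the snippet above.
__host__ __device__ inline void top2_cpp(float a, float b, float c,
                                         float &o1, float &o2) {
    float mx = fmaxf(b, c);
    bool p = a >= c;
    o1 = p ? mx : a;
    o2 = p ? a : c;
}

// Inline-PTX version (device only). Operands: %0=o1, %1=o2, %2=a, %3=b, %4=c.
__device__ inline void top2_ptx(float a, float b, float c,
                                float &o1, float &o2) {
    asm("{\n\t"
        ".reg .f32 mx;\n\t"
        ".reg .pred p;\n\t"
        "max.f32 mx, %3, %4;\n\t"      // mx = max(b, c)
        "setp.ge.f32 p, %2, %4;\n\t"   // p  = (a >= c)
        "selp.f32 mx, mx, %2, p;\n\t"  // mx = p ? mx : a  (kept in a temporary)
        "selp.f32 %1, %2, %4, p;\n\t"  // o2 = p ? a : c   (last read of the inputs)
        "mov.f32 %0, mx;\n\t"          // o1 = mx
        "}"
        : "=f"(o1), "=f"(o2)
        : "f"(a), "f"(b), "f"(c));
}

// Runs both versions on the same inputs and counts disagreements.
// `in` holds n packed (a, b, c) triples; inputs are assumed to be NaN-free.
__global__ void compare_top2(const float *in, int *mismatches, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float a = in[3 * i], b = in[3 * i + 1], c = in[3 * i + 2];
    float x1, x2, y1, y2;
    top2_cpp(a, b, c, x1, x2);
    top2_ptx(a, b, c, y1, y2);
    if (x1 != y1 || x2 != y2) atomicAdd(mismatches, 1);
}
```

Compiling the file with `nvcc -ptx` and reading the body of `compare_top2` shows what the compiler emits for each path, which is one way to check whether the two really come out nearly identical on a given toolkit version and target architecture.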

chun.lin.yang.yang · November 10, 2025, 10:09am

Does DeepSeek use handwritten PTX to improve its performance? If so, how do they implement it?
