Why does performance deteriorate when PTX code written with the pipeline in mind is used repeatedly?

kindongsy · April 28, 2023, 3:04am

Hello, I’m a student doing research on NVIDIA GPUs.
I took the PTX produced by the compiler, rewrote it with the instruction pipeline in mind, and implemented it as inline PTX.
When this code is used repeatedly, performance degrades, so I would like to ask about it.

[Environment]

RTX 3090 (Ampere)
Visual Studio 2022
CUDA Runtime 12.1
CUDA C language

An instruction is processed in several stages. If the next instruction cannot start until the previous one has finished completely, the hardware for the other stages sits idle while one stage is busy. Pipelining improves speed by letting different stages work on different instructions at the same time.
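As a rough illustration of the paragraph above, the idealized textbook cycle counts with and without pipelining can be written out. This is only a sketch under simplifying assumptions (every stage takes one cycle, no stalls, no dependencies between instructions); it is not a model of how an Ampere SM actually schedules instructions, and the function names are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Idealized model, assumed for illustration only: each instruction passes
// through `stages` stages and every stage takes exactly one cycle.

// Without pipelining, an instruction starts only after the previous one has
// left the last stage, so the cost is instructions * stages.
constexpr std::uint64_t cycles_unpipelined(std::uint64_t instructions, std::uint64_t stages) {
    return instructions * stages;
}

// With pipelining, a new instruction enters the first stage every cycle.
// The first instruction needs `stages` cycles; each further one adds one cycle.
constexpr std::uint64_t cycles_pipelined(std::uint64_t instructions, std::uint64_t stages) {
    return instructions == 0 ? 0 : stages + (instructions - 1);
}

// Small self-check of the two formulas for 4 instructions and 5 stages.
inline void check_pipeline_model() {
    assert(cycles_unpipelined(4, 5) == 20);
    assert(cycles_pipelined(4, 5) == 8);
}
```

With 4 instructions and 5 stages, the ideal pipelined count is 8 cycles against 20 without overlap. Real hardware only approaches this when consecutive instructions do not have to wait on each other's results.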

First, I confirmed that the code rewritten for the pipeline delivered nearly three times the performance of the compiler-generated PTX.

However, I also found that using it repeatedly reduces performance significantly, and it eventually becomes slower than the compiler-generated code.
I wonder why this is. (I could see that the compiler-generated code repeats the same pattern.)

Attached are the code implemented in plain CUDA C and the code implemented with inline PTX.

[CUDA C implementation]

		{ h = (h + 0x428a2f98d728ae22 + w0_t);
		  h = (h + ((((e) >> (14)) | ((e) << (64 - (14)))) ^ (((e) >> (18)) | ((e) << (64 - (18)))) ^ (((e) >> (41)) | ((e) << (64 - (41))))) + ((g) ^ ((e) & ((f) ^ (g)))));
		  d += h;
		  h = (h + (((((a) >> (28)) | ((a) << (64 - (28))))) ^ (((a) >> (34)) | ((a) << (64 - (34)))) ^ (((a) >> (39)) | ((a) << (64 - (39))))) + (((a) & (b)) | ((c) & ((a) ^ (b))))); }
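The expanded expression above is easier to follow once its parts are named. The sketch below splits it into a rotate helper and four sub-expressions, then checks the result against the expanded form on arbitrary sample inputs. The names `rotr64`, `big_sigma1`, `big_sigma0`, `ch`, `maj`, and `round_step` are not from the post; they follow SHA-512 naming, which is an inference from the constants and rotation amounts rather than something the post states.

```cpp
#include <cassert>
#include <cstdint>

// Lets the same source build as plain C++ or as CUDA device code.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

using u64 = std::uint64_t;

// Rotate right by n bits, valid for 0 < n < 64.
HD inline u64 rotr64(u64 x, unsigned n) { return (x >> n) | (x << (64 - n)); }

// The four sub-expressions of the round (names assumed, see lead-in).
HD inline u64 big_sigma1(u64 e) { return rotr64(e, 14) ^ rotr64(e, 18) ^ rotr64(e, 41); }
HD inline u64 big_sigma0(u64 a) { return rotr64(a, 28) ^ rotr64(a, 34) ^ rotr64(a, 39); }
HD inline u64 ch(u64 e, u64 f, u64 g) { return g ^ (e & (f ^ g)); }
HD inline u64 maj(u64 a, u64 b, u64 c) { return (a & b) | (c & (a ^ b)); }

// One round: updates d and h in place, like the braced block in the post.
HD inline void round_step(u64 a, u64 b, u64 c, u64& d, u64 e, u64 f, u64 g, u64& h,
                          u64 k, u64 w) {
    h += k + w;
    h += big_sigma1(e) + ch(e, f, g);
    d += h;
    h += big_sigma0(a) + maj(a, b, c);
}

// Compares round_step with the expanded expression on arbitrary sample inputs.
inline bool matches_expanded_form() {
    u64 a = 0x0123456789abcdefULL, b = 0xfedcba9876543210ULL, c = 0x0f1e2d3c4b5a6978ULL,
        d = 0x1111111111111111ULL, e = 0x8000000000000001ULL, f = 0x00ff00ff00ff00ffULL,
        g = 0xdeadbeefcafef00dULL, h = 0x7fffffffffffffffULL, w0_t = 0x42ULL;
    u64 d2 = d, h2 = h;
    round_step(a, b, c, d2, e, f, g, h2, 0x428a2f98d728ae22ULL, w0_t);

    // Expanded form, copied from the post.
    h = (h + 0x428a2f98d728ae22ULL + w0_t);
    h = (h + ((((e) >> (14)) | ((e) << (64 - (14)))) ^ (((e) >> (18)) | ((e) << (64 - (18)))) ^ (((e) >> (41)) | ((e) << (64 - (41))))) + ((g) ^ ((e) & ((f) ^ (g)))));
    d += h;
    h = (h + (((((a) >> (28)) | ((a) << (64 - (28))))) ^ (((a) >> (34)) | ((a) << (64 - (34)))) ^ (((a) >> (39)) | ((a) << (64 - (39))))) + (((a) & (b)) | ((c) & ((a) ^ (b)))));
    return d == d2 && h == h2;
}

// Aborts if the two forms disagree on the sample inputs.
inline void check_round_step() { assert(matches_expanded_form()); }
```

Because unsigned 64-bit addition wraps and is associative, grouping the additions differently from the post does not change the result.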

[PTX generated by the compiler]

		"add.u64			rd0, h, 0x428a2f98d728ae22;			\n\t"
		
		"xor.b64			rd6, f, g;			\n\t"
		
		"shl.b64			lhs, e, 50;			\n\t"
		"shr.b64			rhs, e, 14;			\n\t"
		"add.u64			rd1, lhs, rhs;			\n\t"
		
		"shl.b64			lhs, e, 46;			\n\t"
		"shr.b64			rhs, e, 18;			\n\t"
		"add.u64			rd2, lhs, rhs;			\n\t"
		
		"and.b64			rd7, e, rd6;			\n\t"
		
		"shl.b64			lhs, e, 23;			\n\t"
		"shr.b64			rhs, e, 41;			\n\t"
		"add.u64			rd4, lhs, rhs;			\n\t"
		
		"xor.b64			rd3, rd1, rd2;			\n\t"
		"xor.b64			rd8, g, rd7;			\n\t"
		"add.u64			h, w0_t, rd0;			\n\t"

		
		"shl.b64			lhs, a, 36;			\n\t"
		"shr.b64			rhs, a, 28;			\n\t"
		"xor.b64			rd5, rd3, rd4;			\n\t"
		"add.u64			rd1, lhs, rhs;			\n\t"
		
		
		"shl.b64			lhs, a, 30;			\n\t"
		"shr.b64			rhs, a, 34;			\n\t"
		"add.u64			rd2, lhs, rhs;			\n\t"
		
		"add.u64			rd9, rd5, rd8;			\n\t"
		"shl.b64			lhs, a, 25;			\n\t"
		"shr.b64			rhs, a, 39;			\n\t"
		"xor.b64			rd3, rd1, rd2;			\n\t"
		"add.u64			rd4, lhs, rhs;			\n\t"
		"xor.b64			rd6, a, b;			\n\t"
		"and.b64			rd8, a, b;			\n\t"
		"add.u64			h, h, rd9;			\n\t"
		"xor.b64			rd5, rd3, rd4;			\n\t"
		"and.b64			rd7, c, rd6;			\n\t"
		"add.u64			d, d, h;			\n\t"
		"or.b64				rd9, rd7, rd8;			\n\t"
		"add.u64			rd10, rd5, rd9;			\n\t"
		"add.u64			h, h, rd10;			\n\t"
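In the listing above, each rotation is built from `shl`, `shr`, and `add`, while the CUDA C source combines the two shifted halves with `|`. The two are interchangeable here because the shifted halves never have a set bit in the same position, so the addition cannot produce a carry. A small host-side sketch of that equivalence, with hypothetical function names:

```cpp
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;

// Rotation as written in the CUDA C source: the two halves are OR-ed together.
constexpr u64 rotr_or(u64 x, unsigned n) { return (x >> n) | (x << (64 - n)); }

// Rotation as it appears in the PTX listing: shl, shr, then add.
constexpr u64 rotr_add(u64 x, unsigned n) { return (x << (64 - n)) + (x >> n); }

// Checks both forms for the six rotation amounts used in the round.
inline bool rotations_agree(u64 x) {
    const unsigned amounts[] = {14, 18, 41, 28, 34, 39};
    for (unsigned n : amounts) {
        if (rotr_or(x, n) != rotr_add(x, n)) return false;
    }
    return true;
}

// Aborts if the forms disagree on two sample values (arbitrary choices).
inline void check_rotations() {
    assert(rotations_agree(0x0123456789abcdefULL));
    assert(rotations_agree(~0ULL));
}
```

The shift pairs in the listing (50/14, 46/18, 23/41, 36/28, 30/34, 25/39) each sum to 64, which is what makes every pair a rotation.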

[My inline PTX code, written with the pipeline in mind]

"add.u64			rd0, h, 0x428a2f98d728ae22;			\n\t"

		"xor.b64			rd6, f, g;			\n\t"

		"shl.b64			lhs0, e, 50;			\n\t"
		"shr.b64			rhs0, e, 14;			\n\t"
		"add.u64			rd1, lhs0, rhs0;			\n\t"

		"shl.b64			lhs1, e, 46;			\n\t"
		"shr.b64			rhs1, e, 18;			\n\t"
		"add.u64			rd2, lhs1, rhs1;			\n\t"

		"and.b64			rd7, e, rd6;			\n\t"

		"shl.b64			lhs2, e, 23;			\n\t"
		"shr.b64			rhs2, e, 41;			\n\t"
		"add.u64			rd4, lhs2, rhs2;			\n\t"

		"xor.b64			rd3, rd1, rd2;			\n\t"
		"xor.b64			rd8, g, rd7;			\n\t"
		"add.u64			h, w0_t, rd0;			\n\t"


		"shl.b64			lhs3, a, 36;			\n\t"
		"shr.b64			rhs3, a, 28;			\n\t"
		"xor.b64			rd5, rd3, rd4;			\n\t"
		"add.u64			rd1, lhs3, rhs3;			\n\t"


		"shl.b64			lhs4, a, 30;			\n\t"
		"shr.b64			rhs4, a, 34;			\n\t"
		"add.u64			rd2, lhs4, rhs4;			\n\t"

		"add.u64			rd9, rd5, rd8;			\n\t"
		"shl.b64			lhs5, a, 25;			\n\t"
		"shr.b64			rhs5, a, 39;			\n\t"
		"xor.b64			rd3, rd1, rd2;			\n\t"
		"add.u64			rd4, lhs5, rhs5;			\n\t"
		"xor.b64			rd6, a, b;			\n\t"
		"and.b64			rd8, a, b;			\n\t"
		"add.u64			h, h, rd9;			\n\t"
		"xor.b64			rd5, rd3, rd4;			\n\t"
		"and.b64			rd7, c, rd6;			\n\t"
		"add.u64			d, d, h;			\n\t"
		"or.b64				rd9, rd7, rd8;			\n\t"
		"add.u64			rd10, rd5, rd9;			\n\t"
		"add.u64			h, h, rd10;			\n\t"

[CUDA C implementation, repeated]

		{ h = (h + 0x428a2f98d728ae22 + w0_t);
		  h = (h + ((((e) >> (14)) | ((e) << (64 - (14)))) ^ (((e) >> (18)) | ((e) << (64 - (18)))) ^ (((e) >> (41)) | ((e) << (64 - (41))))) + ((g) ^ ((e) & ((f) ^ (g)))));
		  d += h;
		  h = (h + (((((a) >> (28)) | ((a) << (64 - (28))))) ^ (((a) >> (34)) | ((a) << (64 - (34)))) ^ (((a) >> (39)) | ((a) << (64 - (39))))) + (((a) & (b)) | ((c) & ((a) ^ (b))))); }

		{ g = (g + 0x7137449123ef65cd + w1_t);
		  g = (g + ((((d) >> (14)) | ((d) << (64 - (14)))) ^ (((d) >> (18)) | ((d) << (64 - (18)))) ^ (((d) >> (41)) | ((d) << (64 - (41))))) + ((f) ^ ((d) & ((e) ^ (f)))));
		  c += g;
		  g = (g + (((((h) >> (28)) | ((h) << (64 - (28))))) ^ (((h) >> (34)) | ((h) << (64 - (34)))) ^ (((h) >> (39)) | ((h) << (64 - (39))))) + (((h) & (a)) | ((b) & ((h) ^ (a))))); }

		{ f = (f + 0xb5c0fbcfec4d3b2f + w2_t);
		  f = (f + ((((c) >> (14)) | ((c) << (64 - (14)))) ^ (((c) >> (18)) | ((c) << (64 - (18)))) ^ (((c) >> (41)) | ((c) << (64 - (41))))) + ((e) ^ ((c) & ((d) ^ (e)))));
		  b += f;
		  f = (f + (((((g) >> (28)) | ((g) << (64 - (28))))) ^ (((g) >> (34)) | ((g) << (64 - (34)))) ^ (((g) >> (39)) | ((g) << (64 - (39))))) + (((g) & (h)) | ((a) & ((g) ^ (h))))); }

		{ e = (e + 0xe9b5dba58189dbbc + w3_t);
		  e = (e + ((((b) >> (14)) | ((b) << (64 - (14)))) ^ (((b) >> (18)) | ((b) << (64 - (18)))) ^ (((b) >> (41)) | ((b) << (64 - (41))))) + ((d) ^ ((b) & ((c) ^ (d)))));
		  a += e;
		  e = (e + (((((f) >> (28)) | ((f) << (64 - (28))))) ^ (((f) >> (34)) | ((f) << (64 - (34)))) ^ (((f) >> (39)) | ((f) << (64 - (39))))) + (((f) & (g)) | ((h) & ((f) ^ (g))))); }
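The four blocks above are the same round with the variable roles shifted by one position each time. The sketch below writes that pattern once as four explicit calls and once as a loop over an eight-word state, then checks that both give the same result. The names `round_step`, `four_rounds_unrolled`, and `four_rounds_loop` are hypothetical, and the sample state and message words are arbitrary values chosen only for the check.

```cpp
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;

inline u64 rotr64(u64 x, unsigned n) { return (x >> n) | (x << (64 - n)); }

// One round written against named roles; updates d and h in place.
inline void round_step(u64 a, u64 b, u64 c, u64& d, u64 e, u64 f, u64 g, u64& h,
                       u64 k, u64 w) {
    h += k + w;
    h += (rotr64(e, 14) ^ rotr64(e, 18) ^ rotr64(e, 41)) + (g ^ (e & (f ^ g)));
    d += h;
    h += (rotr64(a, 28) ^ rotr64(a, 34) ^ rotr64(a, 39)) + ((a & b) | (c & (a ^ b)));
}

// The four constants that appear in the listing above.
static const u64 K[4] = {0x428a2f98d728ae22ULL, 0x7137449123ef65cdULL,
                         0xb5c0fbcfec4d3b2fULL, 0xe9b5dba58189dbbcULL};

// Unrolled form: the same call, with the roles shifted by one each round.
inline void four_rounds_unrolled(u64 s[8], const u64 w[4]) {
    u64 &a = s[0], &b = s[1], &c = s[2], &d = s[3], &e = s[4], &f = s[5], &g = s[6], &h = s[7];
    round_step(a, b, c, d, e, f, g, h, K[0], w[0]);
    round_step(h, a, b, c, d, e, f, g, K[1], w[1]);
    round_step(g, h, a, b, c, d, e, f, K[2], w[2]);
    round_step(f, g, h, a, b, c, d, e, K[3], w[3]);
}

// Loop form: in round i, role r is stored in slot (r + 8 - i) % 8.
inline void four_rounds_loop(u64 s[8], const u64 w[4]) {
    for (unsigned i = 0; i < 4; ++i) {
        auto at = [&](unsigned role) -> u64& { return s[(role + 8 - i) % 8]; };
        round_step(at(0), at(1), at(2), at(3), at(4), at(5), at(6), at(7), K[i], w[i]);
    }
}

// Runs both forms on the same arbitrary inputs and compares the states.
inline bool loop_matches_unrolled() {
    u64 s1[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    u64 s2[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    const u64 w[4] = {10, 20, 30, 40};
    four_rounds_unrolled(s1, w);
    four_rounds_loop(s2, w);
    for (int i = 0; i < 8; ++i) {
        if (s1[i] != s2[i]) return false;
    }
    return true;
}

// Aborts if the two forms disagree on the sample inputs.
inline void check_four_rounds() { assert(loop_matches_unrolled()); }
```

Each round reads the word that the previous round just finished writing, so the rounds form one long dependency chain regardless of how the temporaries are named.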

[My inline PTX code, written with the pipeline in mind, repeated]

"add.u64			rd0, h, 0x428a2f98d728ae22;			\n\t"

		"xor.b64			rd6, f, g;			\n\t"

		"shl.b64			lhs0, e, 50;			\n\t"
		"shr.b64			rhs0, e, 14;			\n\t"
		"add.u64			rd1, lhs0, rhs0;			\n\t"

		"shl.b64			lhs1, e, 46;			\n\t"
		"shr.b64			rhs1, e, 18;			\n\t"
		"add.u64			rd2, lhs1, rhs1;			\n\t"

		"and.b64			rd7, e, rd6;			\n\t"

		"shl.b64			lhs2, e, 23;			\n\t"
		"shr.b64			rhs2, e, 41;			\n\t"
		"add.u64			rd4, lhs2, rhs2;			\n\t"

		"xor.b64			rd3, rd1, rd2;			\n\t"
		"xor.b64			rd8, g, rd7;			\n\t"
		"add.u64			h, w0_t, rd0;			\n\t"


		"shl.b64			lhs3, a, 36;			\n\t"
		"shr.b64			rhs3, a, 28;			\n\t"
		"xor.b64			rd5, rd3, rd4;			\n\t"
		"add.u64			rd1, lhs3, rhs3;			\n\t"


		"shl.b64			lhs4, a, 30;			\n\t"
		"shr.b64			rhs4, a, 34;			\n\t"
		"add.u64			rd2, lhs4, rhs4;			\n\t"

		"add.u64			rd9, rd5, rd8;			\n\t"
		"shl.b64			lhs5, a, 25;			\n\t"
		"shr.b64			rhs5, a, 39;			\n\t"
		"xor.b64			rd3, rd1, rd2;			\n\t"
		"add.u64			rd4, lhs5, rhs5;			\n\t"
		"xor.b64			rd6, a, b;			\n\t"
		"and.b64			rd8, a, b;			\n\t"
		"add.u64			h, h, rd9;			\n\t"
		"xor.b64			rd5, rd3, rd4;			\n\t"
		"and.b64			rd7, c, rd6;			\n\t"
		"add.u64			d, d, h;			\n\t"
		"or.b64				rd9, rd7, rd8;			\n\t"
		"add.u64			rd10, rd5, rd9;			\n\t"
		"add.u64			h, h, rd10;			\n\t"

"add.u64			rd20, g, 0x7137449123ef65cd;			\n\t"

		"xor.b64			rd26, e, f;			\n\t"

		"shl.b64			lhs10, d, 50;			\n\t"
		"shr.b64			rhs10, d, 14;			\n\t"
		"add.u64			rd21, lhs10, rhs10;			\n\t"

		"shl.b64			lhs11, d, 46;			\n\t"
		"shr.b64			rhs11, d, 18;			\n\t"
		"add.u64			rd22, lhs11, rhs11;			\n\t"

		"and.b64			rd27, d, rd26;			\n\t"

		"shl.b64			lhs12, d, 23;			\n\t"
		"shr.b64			rhs12, d, 41;			\n\t"
		"add.u64			rd24, lhs12, rhs12;			\n\t"

		"xor.b64			rd23, rd21, rd22;			\n\t"
		"xor.b64			rd28, f, rd27;			\n\t"
		"add.u64			g, w1_t, rd20;			\n\t"


		"shl.b64			lhs13, h, 36;			\n\t"
		"shr.b64			rhs13, h, 28;			\n\t"
		"xor.b64			rd25, rd23, rd24;			\n\t"
		"add.u64			rd21, lhs13, rhs13;			\n\t"


		"shl.b64			lhs14, h, 30;			\n\t"
		"shr.b64			rhs14, h, 34;			\n\t"
		"add.u64			rd22, lhs14, rhs14;			\n\t"

		"add.u64			rd29, rd25, rd28;			\n\t"
		"shl.b64			lhs15, h, 25;			\n\t"
		"shr.b64			rhs15, h, 39;			\n\t"
		"xor.b64			rd23, rd21, rd22;			\n\t"
		"add.u64			rd24, lhs15, rhs15;			\n\t"
		"xor.b64			rd26, h, a;			\n\t"
		"and.b64			rd28, h, a;			\n\t"
		"add.u64			g, g, rd29;			\n\t"
		"xor.b64			rd25, rd23, rd24;			\n\t"
		"and.b64			rd27, b, rd26;			\n\t"
		"add.u64			c, c, g;			\n\t"
		"or.b64				rd29, rd27, rd28;			\n\t"
		"add.u64			rd30, rd25, rd29;			\n\t"
		"add.u64			g, g, rd30;			\n\t"

"add.u64			rd40, f, 0xb5c0fbcfec4d3b2f;			\n\t"

		"xor.b64			rd46, d, e;			\n\t"

		"shl.b64			lhs20, c, 50;			\n\t"
		"shr.b64			rhs20, c, 14;			\n\t"
		"add.u64			rd41, lhs20, rhs20;			\n\t"

		"shl.b64			lhs21, c, 46;			\n\t"
		"shr.b64			rhs21, c, 18;			\n\t"
		"add.u64			rd42, lhs21, rhs21;			\n\t"

		"and.b64			rd47, c, rd46;			\n\t"

		"shl.b64			lhs22, c, 23;			\n\t"
		"shr.b64			rhs22, c, 41;			\n\t"
		"add.u64			rd44, lhs22, rhs22;			\n\t"

		"xor.b64			rd43, rd41, rd42;			\n\t"
		"xor.b64			rd48, e, rd47;			\n\t"
		"add.u64			f, w2_t, rd40;			\n\t"


		"shl.b64			lhs23, g, 36;			\n\t"
		"shr.b64			rhs23, g, 28;			\n\t"
		"xor.b64			rd45, rd43, rd44;			\n\t"
		"add.u64			rd41, lhs23, rhs23;			\n\t"


		"shl.b64			lhs24, g, 30;			\n\t"
		"shr.b64			rhs24, g, 34;			\n\t"
		"add.u64			rd42, lhs24, rhs24;			\n\t"

		"add.u64			rd49, rd45, rd48;			\n\t"
		"shl.b64			lhs25, g, 25;			\n\t"
		"shr.b64			rhs25, g, 39;			\n\t"
		"xor.b64			rd43, rd41, rd42;			\n\t"
		"add.u64			rd44, lhs25, rhs25;			\n\t"
		"xor.b64			rd46, g, h;			\n\t"
		"and.b64			rd48, g, h;			\n\t"
		"add.u64			f, f, rd49;			\n\t"
		"xor.b64			rd45, rd43, rd44;			\n\t"
		"and.b64			rd47, a, rd46;			\n\t"
		"add.u64			b, b, f;			\n\t"
		"or.b64				rd49, rd47, rd48;			\n\t"
		"add.u64			rd50, rd45, rd49;			\n\t"
		"add.u64			f, f, rd50;			\n\t"

"add.u64			rd60, e, 0xe9b5dba58189dbbc;			\n\t"

		"xor.b64			rd66, c, d;			\n\t"

		"shl.b64			lhs30, b, 50;			\n\t"
		"shr.b64			rhs30, b, 14;			\n\t"
		"add.u64			rd61, lhs30, rhs30;			\n\t"

		"shl.b64			lhs31, b, 46;			\n\t"
		"shr.b64			rhs31, b, 18;			\n\t"
		"add.u64			rd62, lhs31, rhs31;			\n\t"

		"and.b64			rd67, b, rd66;			\n\t"

		"shl.b64			lhs32, b, 23;			\n\t"
		"shr.b64			rhs32, b, 41;			\n\t"
		"add.u64			rd64, lhs32, rhs32;			\n\t"

		"xor.b64			rd63, rd61, rd62;			\n\t"
		"xor.b64			rd68, d, rd67;			\n\t"
		"add.u64			e, w3_t, rd60;			\n\t"


		"shl.b64			lhs33, f, 36;			\n\t"
		"shr.b64			rhs33, f, 28;			\n\t"
		"xor.b64			rd65, rd63, rd64;			\n\t"
		"add.u64			rd61, lhs33, rhs33;			\n\t"


		"shl.b64			lhs34, f, 30;			\n\t"
		"shr.b64			rhs34, f, 34;			\n\t"
		"add.u64			rd62, lhs34, rhs34;			\n\t"

		"add.u64			rd69, rd65, rd68;			\n\t"
		"shl.b64			lhs35, f, 25;			\n\t"
		"shr.b64			rhs35, f, 39;			\n\t"
		"xor.b64			rd63, rd61, rd62;			\n\t"
		"add.u64			rd64, lhs35, rhs35;			\n\t"
		"xor.b64			rd66, f, g;			\n\t"
		"and.b64			rd68, f, g;			\n\t"
		"add.u64			e, e, rd69;			\n\t"
		"xor.b64			rd65, rd63, rd64;			\n\t"
		"and.b64			rd67, h, rd66;			\n\t"
		"add.u64			a, a, e;			\n\t"
		"or.b64				rd69, rd67, rd68;			\n\t"
		"add.u64			rd70, rd65, rd69;			\n\t"
		"add.u64			e, e, rd70;			\n\t"

For reference, the measured performance is the average over 130,000 runs.
In addition, when the code is repeated four times, the values in the registers are all different.

The performance figure I care about is the number of times the algorithm runs per second (number of outputs / second).
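To make the metric concrete, here is a minimal sketch of the conversion from a run count and an elapsed time to runs per second. The function name `runs_per_second` is hypothetical, and the elapsed times in the usage note are made-up values, not measurements from this post.

```cpp
#include <cassert>
#include <cstdint>

// Runs per second from a run count and an elapsed time in milliseconds.
// In a CUDA program the elapsed time would typically come from
// cudaEventElapsedTime, which reports milliseconds as a float.
inline double runs_per_second(std::uint64_t runs, double elapsed_ms) {
    assert(elapsed_ms > 0.0);  // a zero or negative time gives no meaningful rate
    return static_cast<double>(runs) / (elapsed_ms / 1000.0);
}
```

For example, 130,000 runs completed in a hypothetical 500 ms would correspond to 260,000 runs per second.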

Please reply.

Thank you.

njuffa · April 28, 2023, 3:49am

Please don’t post code as pictures. To create a proper code block with syntax highlighting, insert the code text and precede and follow it with lines consisting of three backticks: ```

It is unclear what you are measuring. I have investigated numerous code generation issues in the CUDA toolchain over the years and I have no idea what “PTX with pipeline” might mean. For issues related to code generation it is best to post a minimal self-contained example that others can cut&paste, build, run, and profile.

Performance investigations relating to code generation must be based on the code that is actually executed by the GPU. That is machine code, better known as SASS. For SASS from offline compilation, use cuobjdump --dump-sass to disassemble SASS.

ptxas in both its offline and online incarnations is an optimizing compiler that translates PTX into SASS. PTX serves as both a virtual ISA and a compiler intermediate representation. Except for very limited cases that are specifically concerned with the interaction between the nvvm compiler that produces PTX and the ptxas compiler that consumes it, basing performance investigations on an analysis of PTX code is futile.

Because ptxas is an optimizing compiler, performing code rewrites at the PTX level is about as effective as performing code rewrites at the HLL level, that is, hardly at all. Even if such a re-write delivers positive change, that success is very brittle and could easily disappear or even turn into something counterproductive with the next version of the toolchain.

kindongsy · April 28, 2023, 5:53am

Thank you for your answer.
Based on what you said, I will post the question again.

Thank you.

njuffa · April 28, 2023, 6:07am

It would probably be useful to profile the various code variants you are investigating and look for any significant differences in profiler metrics regarding memory throughput, instruction throughput, stalls, etc.

kindongsy · April 28, 2023, 6:26am

Thank you for your reply. I am planning to check what you suggested using Nsight Compute (Visual Studio Edition).
