ptxas is slow

Dear CUDA developers,
We want to generate extremely big kernels for a new kind of AI library.

We have read through the ptxas compiler flags and the advice about them on this forum, without success.
We are able to go very low level: our own very minimal compiler generates the PTX code for our kernels with ease. But we are essentially stuck, because we could not go any lower, e.g. down to SASS.
To reach SASS, or to generate the .cubin files, we have to use ptxas.

We realized that we are actually stuck because we cannot build even a 400_000-instruction kernel fast enough (around 38 seconds), and ptxas build times scale linearly from there to extremely long builds. The goal would be to generate 100_000_000-instruction kernels in around 0.1 sec. Linear extrapolation from 38 s for 400_000 instructions gives roughly 9_500 s for 100_000_000, so we would need a speedup on the order of 100_000x, which essentially means doing nothing beyond reading the code and mapping the instructions to the hardware-specific encoding, with the addresses calculated directly.

Is there any way to tell ptxas not to optimize anything, i.e. to just generate the .cubin from the PTX without any optimization passes?
Any help would be appreciated. Please let us know what the maximum is that we can achieve here. We are open to any idea; we just have to solve this issue!

Have you tried -O0?

Compiling 100_000_000 instructions kernels in around 0.1 sec sounds … ambitious, to say the least. Likely outside the realm of possibility [*]. When you compile with ptxas -O0, what kind of compilation speeds (lines / second) are you seeing on your build system? I am assuming your build system is an x86 system with extremely high single-thread performance? Available data from SPEC suggests that a platform based on either Intel Xeon E-2488 or AMD Ryzen 9 7950X might be a good choice.
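
For reference, the kind of measurement I have in mind is simply timing a plain ptxas invocation on your generated PTX (the file names and the sm_89 target below are placeholders):

    # time an unoptimized build of one generated kernel
    time ptxas -O0 -arch=sm_89 big_kernel.ptx -o big_kernel.cubin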

Am I correct to assume that you are talking about JIT compilation, i.e. the PTX is produced and fed to the ptxas-equivalent part of the CUDA driver at system memory speeds, without ever touching a file system?
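
If it is the JIT path, the optimization level can be passed to the driver directly. A minimal sketch with the CUDA driver API (error checking omitted; ptx_image is assumed to be a NUL-terminated PTX string produced by your generator, and a CUDA context is assumed to be current):

    #include <cuda.h>

    // JIT-compile in-memory PTX with the optimizer dialed down to level 0
    CUmodule jit_ptx_unoptimized(const char *ptx_image)
    {
        CUjit_option opt[] = { CU_JIT_OPTIMIZATION_LEVEL };
        void *val[] = { (void *)(size_t)0 };   // levels 0..4; 0 = fastest compile
        CUmodule module;
        cuModuleLoadDataEx(&module, ptx_image, 1, opt, val);
        return module;
    }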


[*] Compiling one GPU kernel is a single-threaded activity. On a very fast modern x86 processor a single core might execute with IPC=3 at 5 GHz, or 15e9 instructions per second. Processing 100M lines of PTX code in 0.1 second equates to processing 1e9 lines of PTX code in one second. Can one process a line of PTX code in a mere 15 instructions? No.

Probably the fastest compiler that ever existed was Borland’s Pascal compiler, a single-pass compiler designed for maximum compilation speed from the start. It could process Pascal code at about the same LOC/second rate as Borland’s x86 assembler (a separate product written by a different author)! Borland Pascal eventually became the basis for a product called Delphi by the mid-1990s. Some years back, someone ran Delphi in a 16-bit VM on modern x86-64 hardware and observed a throughput in excess of 1 MLOC/second. With the fastest available x86 machines today one might be looking at 2.0 - 2.5 MLOC/second.

Given that ptxas is a true compiler with a multi-stage design, it seems highly unlikely that it could ever achieve, let alone exceed, this kind of speed. In addition, modern GPUs task this backend compiler with additional work beyond traditional code generation and register allocation: each generated SASS instruction comprises an 8-byte instruction encoding plus an associated 8-byte op-steering control block.

Have you tried -O0?

Yes. We also tried many more flags from here: NVIDIA CUDA Compiler Driver. We even tried explicitly writing some of them out just to make sure the flags were really set correctly. But across many tries, we could only get something like 10-30% speedup overall on top of -O0.


Your estimates are pretty close, by the way, for a single core. We know that compiling 100_000_000 instructions in 0.1 sec is too ambitious; it would basically have to be a “mapping from an array plus a very small calculation (1-3 operations)” per instruction (dict lookups alone are already 5-10x slower, so they by themselves would cost 0.3 s).
Yes, our main server has an AMD Ryzen 9 7950X3D.

I understand that we are overambitious; it would be okay for that big kernel to take a few more seconds to compile, as we would still be talking about a serious speedup.

In that case, if ptxas cannot be improved further:
Do I see correctly that the only remaining option is to write our own basic SASS compiler?

Do you know whether there is any chance we could write our own SASS compiler, a much more basic one that only covers the 20-30 instructions we need for our problem set?
Has anyone tried a minimal dummy version of this for a single (latest) architecture?
I saw that there is very little documentation on this topic; I guess a lot of the work would be reverse engineering things, which is extremely risky and hard.

Some people, notably Scott Gray, reverse engineered the SASS code for the Maxwell architecture, if I recall correctly. To my recollection, he reverse engineered it enough for his purposes, but not completely, and then wrote a SASS assembler (not a PTX to SASS compiler) based on this information. See GitHub repository.

But the hardware ISA changes with every major GPU architecture (maybe less so in recent years, as there seems to be some convergence among designs), and that is particularly true for the “hidden” control block that accompanies each instruction. My understanding is that properly filling in the control block is critical both for correct execution and for performance. I have no insight into how complex the algorithms are for filling in this information.

I would think it is a tall order to attempt to do this without the detailed hardware documentation that NVIDIA keeps very close to their chest, and then do it again for each new GPU architecture. So my take on creating a ptxas replacement is (with apologies to Knuth):

  1. Don’t do it.
  2. (Experts only) Don’t do it yet.

If you find specific scenarios where ptxas is unexpectedly slow, you could file enhancement requests with NVIDIA, of course. These so-called RFEs can be filed through the bug reporting system. Simply prefix the synopsis with "RFE: " so it is easy to see that this is an enhancement request rather than a functional bug.

Perhaps you could compile/assemble small PTX kernels with ptxas and somehow try to fuse the SASS code - most SASS instructions can execute at any (aligned) memory address. To consider: the initialization code at the beginning is needed only once, absolute jump addresses have to be relocated, and registers should not be overwritten.

PTX can be written in a modular/structured fashion. There is nothing that prevents you from writing PTX functions, calling those functions from a main kernel routine, compiling those functions separately, and linking them together (see the sketch below). Sure, that doesn’t solve the problem if you have 400_000 lines of PTX to compile for the first time, but if you have any sort of code reuse, structure, or modularity, then perhaps not everything needs to be recompiled every time.
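
As a rough sketch of that flow with the driver API's runtime linker (error checking omitted; a CUDA context is assumed to be current, and the file names are made up for illustration):

    #include <cuda.h>

    // Link separately compiled PTX modules into one cubin at run time, so
    // unchanged helper modules do not have to be recompiled every time.
    CUmodule link_modules(void)
    {
        CUlinkState state;
        void *cubin; size_t cubin_size;
        CUmodule module;

        cuLinkCreate(0, NULL, NULL, &state);
        // rarely changing device functions (could also be cached as cubin)
        cuLinkAddFile(state, CU_JIT_INPUT_PTX, "helpers.ptx", 0, NULL, NULL);
        // freshly generated main kernel that calls into the helpers
        cuLinkAddFile(state, CU_JIT_INPUT_PTX, "main_kernel.ptx", 0, NULL, NULL);
        cuLinkComplete(state, &cubin, &cubin_size);
        cuModuleLoadData(&module, cubin);   // load before destroying the state,
        cuLinkDestroy(state);               // which owns the cubin image
        return module;
    }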

These tips are extremely helpful!

Sadly, small PTX kernels aren’t an option for the use case we want to build. We construct a graph-like calculation flow from kernels, essentially hardcoding the calculation into one kernel. This allows us to make computations 1000 or even a million times more efficient when it comes to ML models, and new model architectures can emerge from this as we see what is more and more efficient. Because the computation is an unstructured graph, we cannot compose the final big kernel out of modular/structured small ones. And to achieve this we have to build these very big networks, which in many cases are a million times more efficient than the dense version.

Thank you all for the help! I think with this we can already start working on it.
We would also be happy if our work could become public in time, like Scott Gray’s work.

On whether it is recommended to work on a new SASS compiler or not: I totally understand the advice. Looking into these things exposes endless possible mistakes that we cannot avoid beforehand, a classic “we do not know what we do not know” situation. We will think twice before doing it, for sure.

I have knowledge of three different industry projects (each in a different application area) that build custom processing pipelines for the GPU on the fly using PTX JIT compilation, and they were all designed in a modular fashion. All of them dynamically restructure their processing pipelines in real time based on user input, so it is not immediately clear to me why that could not work in your case.

Obviously you have detailed knowledge of your use case and I do not, but as a general design issue (in both hardware and software) monolithic designs tend to run into hard problems when trying to scale up by orders of magnitude, while modular approaches often make that easier.

I understand (or I hope I understand) why you think it is possible to do our work in a modular fashion.

Actually, we do have a modular approach; we ran a test of it, but it is memory bound. There is a way to describe the whole computation with pointers to pointers: basically, for each node of the graph, apply the “calculation pointed to by the node” to the input vectors and write the result into the vector that holds the node’s value. There are many things that can be optimized here, so this test was actually a really big project in itself: batching operands and layers, and much more.

So it is possible to do the same thing, but we lose a lot of speed to the memory-bandwidth limit. The “network/graph” we want to calculate uses too much pointer arithmetic, which ultimately gives us a memory-bound kernel. In our case we measured literally 100x or 200x slower speeds than the theoretical peak because of the memory bottleneck.
I don’t know yet how we could make this faster. Of course it would be nice to save some of the writes and reads, but to my knowledge that is not possible this way.
Even if it is fast, our mindset is “max or nothing”.

Of course. For the other type of problem there are numerous libraries already.

All in all, instead of using a vector to store the instructions, we could lay the whole instruction stream down directly in a kernel, map values to registers, and save the memory reads just by using the kernel’s own instructions.
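
To illustrate what I mean, here is a toy contrast (not our real code; the two-operation “graph” is made up):

    // Interpreter style: every step first loads an instruction word from
    // global memory, so the kernel is memory bound.
    struct Insn { int op; int dst; };   // hypothetical encoding

    __global__ void interpret(const Insn *prog, int n, float *v)
    {
        for (int i = 0; i < n; ++i) {
            Insn ins = prog[i];                  // one extra load per step
            if (ins.op == 0) v[ins.dst] += 1.0f;
            else             v[ins.dst] *= 2.0f;
        }
    }

    // Generated style: the same "program" hardcoded as straight-line code;
    // the instruction stream itself carries the graph and the loads vanish.
    __global__ void generated(float *v)
    {
        v[3] += 1.0f;
        v[7] *= 2.0f;
        v[3] *= 2.0f;
    }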

Or maybe my knowledge is limited; I can never be sure.
Thank you for your advice!

It is difficult to advise on code one has not seen. I do not know what organization you work for, if any. If it is a well-known one, and the project is of some importance, maybe try to get in touch with NVIDIA’s customer-facing engineers at a GPU Developer Conference. That organization within NVIDIA (commonly known as DevTech) includes some engineers with deep insights into and extensive experience with optimizing memory-bound code.

My other standard piece of advice is to keep searching through the latest publications and pre-prints (e.g. arXiv) to see whether anybody is working on something similar. In my experience, even if one is working on cutting-edge technology, there are typically two or three other people (or teams) working on similar technology.

Depending on how desperate you are, and given that you are already using the CPU with the highest reported performance in the gcc portion of SPEC CPU, you could try overclocking your build system for an incremental performance boost. This is popular with hobbyists (gamers) but also done professionally in the high-frequency trading segment of the financial world. There used to be a Danish company called Asetek that offered systems with active refrigeration (using a miniature compressor in the bottom of a super-sized tower) for people who always need to be ahead of the general performance curve. The machines were very expensive but did the job. I am not sure whether the company is still in business; it has been 20 years since I last looked into this.

Another angle to research would be to see whether there is any ARM64-based platform available that is supported for CUDA and that could offer potentially faster compilation times, but possibly is not reported for SPEC CPU (yet). My impression of the ARM64 market is that the focus is more on increasing core count rather than boosting single-core performance, but I don’t claim to have extensive knowledge of the ARM64 market.

Hi havlik,
the main issue with very large kernels is that they are larger than the instruction caches. This leads to additional memory accesses and stalls waiting for the instructions to arrive.

A memory-bound kernel means that you have compute resources to spare, e.g. enough to reshuffle data between registers and within shared memory, or to run code in a more flexible way.

For code with lots of pointers and nodes it is sometimes difficult to even out the workload across the 32 lanes within a warp; e.g. linked lists can have different lengths. Also, memory accesses are often not coalesced.
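
A toy sketch of the linked-list case (the node layout is made up):

    // Each lane walks its own list: the whole warp runs as long as the longest
    // list, and the node loads are scattered (not coalesced).
    struct Node { float val; int next; };   // 'next' is an index, -1 ends the list

    __global__ void sum_lists(const Node *nodes, const int *heads, float *out)
    {
        int lane = blockIdx.x * blockDim.x + threadIdx.x;
        float acc = 0.0f;
        for (int i = heads[lane]; i >= 0; i = nodes[i].next)   // divergent trip count
            acc += nodes[i].val;                               // scattered access
        out[lane] = acc;
    }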

Often such code potentially needs too much data in each step of the algorithm, greater than the size of shared memory.

But if in your case you can solve this with a longer specifically generated kernel, it means that the amount of data is limited and can stay localized. This means that probably a modular kernel without pointers would work.

Just guessing without knowing more.

Best,
Sebastian

I understand that you would need more information to give more help on how we could formulate the problem in a modular way.

Overclocking is only a very slight improvement; that’s my problem, as I need extremely fast compilation compared to the current state. 20-30-50% doesn’t count. We need something like a 100x if not 1000x speedup from this stage, which would only be feasible if we could reduce the work required to convert each instruction to SASS to something like 30-200 instructions.

There are people on our team who have already written really performant code for certain problems; that’s why we were discussing that it could actually be possible, if our understanding of the compilation step that ptxas performs is correct.

Also, thank you for the ARM64-based platform tip. And indeed, we didn’t even check AMD. We will look into these for sure!

Getting back to the point: ptxas is slow, and currently it is an extreme amount of work to write something similar. I would say NVIDIA should start to be more open with ptxas; if they opened the source, or anything else that helps people write an alternative to ptxas, that would actually boost the CUDA ecosystem and could bring a lot of benefit to the world.

Thank you for the great points also!

Instruction caches are definitely a big problem. The advantage we could get is that we wouldn’t have to use pointers to “map the instructions”, so we could save a really crazy amount of reads, and we could also get more-or-less coalesced reads of the instructions. Based on our test, where we didn’t need too many instructions to map one small problem described with vectors, we could get 100-200x speedups.

On the instruction cache size: I saw that only limited information is available, but I think it doesn’t matter that much, since we will definitely run out of that small memory (I guess), so we will have to spend memory bandwidth on instructions too. We accept that, I guess.

Yes, we have 100 or even 1000 times more compute resources left than when we work with arrays that describe each of the instructions. Hmm… I have never written it that way. So we could have an array of just pointers to the instructions to be run… I will check this variant; I don’t know yet whether it can be used in some way.

It is crazy: we have tested these things a lot and still get plenty of new ideas from you! I really thank you and appreciate all of your help!!

Sorry to rain on your parade: Based on the scant information supplied, that is just not going to happen with an approach based on PTXAS JIT compilation any time soon. Probably not even within the next decade: Moore’s Law is dead, meaning single-thread performance will be largely stagnant; improvements from compilers are in the low single digit percent per year, see Proebsting’s Law. You may want to look into an FPGA-based approach instead of using GPUs. I last worked with FPGAs in the 1990s on algorithms that did not run fast enough on processors any which way people tried.

I think you are correct in your assessment of instruction cache misses in CUDA kernels. I do not know where GPU ICache sizes are these days (8 KB?), but in my experience I have never seen ICache misses affect performance by more than 3%.

The Dissecting the NVidia Turing T4 GPU via Microbenchmarking paper (https://arxiv.org/pdf/1903.07486.pdf) shows on page 26 how just two threads (within different warps, for the last two charts within different SMs) can influence each other, even when only executing long strings of NOP operations.

In the extreme case (when both instruction streams are read from global memory), the computation slows down by 10x. Not sure how much further it would go (if speed drops further at all) with more than one warp. More warps could help hide latency (faster) or would share bandwidth (slower).

The measured sizes were: L0 (local to each scheduler = SM partition = processing block): 16 KB; L1 (local to each SM): 46 KB; beyond that, instruction fetches go through the L2 (shared with data accesses) and then to global memory.

One SASS instruction has a size of 16 bytes.

A general ptxas assembler probably won’t and can’t be sped up by 100x or 1000x. But if you just want to customize some values in the machine code, e.g. register numbers, offsets, or the order of (simple) operations, then you could generate a SASS program quickly. It also would not have to be done single-threaded. Others and I have manually created or modified existing SASS machine words, even in sequence, many times for testing purposes.
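
As a rough host-side sketch of that patching idea (the patch table, masks, and bit positions are pure assumptions; the real SASS field encodings are undocumented and would have to be reverse engineered per architecture):

    #include <algorithm>
    #include <cstdint>
    #include <thread>
    #include <vector>

    // One fix-up: which 8-byte word of the template to touch, and which bits.
    struct Patch { size_t word; uint64_t mask; uint64_t bits; };

    // Patch register numbers / offsets into a pre-assembled SASS template.
    // The patches here touch distinct words, so the threads do not race.
    void apply_patches(uint64_t *code, const std::vector<Patch> &patches,
                       unsigned nthreads)
    {
        std::vector<std::thread> pool;
        size_t chunk = (patches.size() + nthreads - 1) / nthreads;
        for (unsigned t = 0; t < nthreads; ++t)
            pool.emplace_back([&, t] {
                size_t lo = t * chunk;
                size_t hi = std::min(patches.size(), lo + chunk);
                for (size_t i = lo; i < hi; ++i)
                    code[patches[i].word] =
                        (code[patches[i].word] & ~patches[i].mask) | patches[i].bits;
            });
        for (auto &th : pool) th.join();
    }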

The SASS code is a bit more complicated than many other machine languages, as it also encodes dependencies and waiting states. But those values could be chosen conservatively by you.

One actual problem is that SASS is not officially documented by Nvidia. And any unofficial documentation is either not accepted by some industry regulations to be used within products (especially safety-critical products), not legal to use depending on local law (e.g. reverse engineering is allowed only in some countries, or could necessitate two isolated teams, one reverse engineering and the other implementing the assembler), and/or not guaranteed by Nvidia in any way, even within the same architecture generation, and always prone to change.

Basically this would be a small interpreter running on the GPU, which itself possibly still fits into the instruction caches (at least the L1 one with about 4000 SASS instructions).

Then the task is to stream the instructions fast enough. Hopefully all your warps execute the same kernel. You could also use much smaller instruction words than the 16 bytes of SASS. Or even possibly compress and unpack instructions (in a parallel fashion, separately on each SM).
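
A packed format could be as simple as this made-up 32-bit word (4-bit opcode, two 14-bit per-lane operand slots), staged through shared memory so each instruction costs 4 bytes of global traffic instead of 16:

    // Hypothetical packed interpreter: one shared instruction stream, per-lane
    // data, SIMT style. The encoding and operation set are invented for the sketch.
    __global__ void packed_interp(const unsigned *stream, int n, float *v)
    {
        __shared__ unsigned tile[1024];
        for (int base = 0; base < n; base += 1024) {
            int todo = min(1024, n - base);
            for (int i = threadIdx.x; i < todo; i += blockDim.x)
                tile[i] = stream[base + i];          // coalesced staging load
            __syncthreads();
            for (int i = 0; i < todo; ++i) {
                unsigned w   = tile[i];              // decode from shared memory
                unsigned op  = w >> 28;
                unsigned dst = ((w >> 14) & 0x3FFF) * blockDim.x + threadIdx.x;
                unsigned src = (w & 0x3FFF)         * blockDim.x + threadIdx.x;
                if (op == 0) v[dst] += v[src];
                else         v[dst] *= v[src];
            }
            __syncthreads();
        }
    }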

The warps within an SM can stream through the instruction stream together, either via the L1 cache or via shared memory (or distributed shared memory within thread block clusters, if you have access to Hopper H100 GPUs).

Depending on whether you optimize for instruction word size or interpretation speed, you would use more complex specific instructions (e.g. with a jump table) or more arithmetic instructions, where the instruction word contains immediate data like offsets, etc. Immediate data could also be read from a separate stream (then it is basically just normal data). Examples of existing flexible, more arithmetic-leaning SASS/PTX instructions with interesting immediates are LOP3 (PTX ISA 8.3) and PRMT (PTX ISA 8.3), so you could create instructions in a similar style. Lots of different operations can be calculated with single instructions. If, with those more flexible instructions, you need fewer distinct instructions overall, your interpreter could potentially be implemented in a much smaller kernel, thus fitting better into the L0 and L1 instruction caches.
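
For instance, LOP3 evaluates any 3-input Boolean function selected by an 8-bit immediate lookup table; from CUDA C++ it can be reached via inline PTX:

    // lop3.b32 computes an arbitrary 3-input logical op chosen by the 8-bit
    // immediate LUT; 0x96 is the truth table of a ^ b ^ c.
    __device__ unsigned int xor3(unsigned int a, unsigned int b, unsigned int c)
    {
        unsigned int r;
        asm("lop3.b32 %0, %1, %2, %3, 0x96;"
            : "=r"(r) : "r"(a), "r"(b), "r"(c));
        return r;
    }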

Another property of the instruction caches and large kernels is that they are only slow if you either just serially execute the large kernel or jump around too much at random. So your interpreted program can have different phases, with a different interpreter in the kernel for each phase. If you synchronize the warps within an SM so that they enter new phases together, the new interpreter with its special routines is loaded into the cache only once per phase, at the time of the phase transition.
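
In skeleton form, the phase idea might look like this (run_arith and run_mem stand in for per-phase decode routines; everything here is a made-up example):

    __device__ void run_arith(unsigned w, float *v) { /* arithmetic-phase decode */ }
    __device__ void run_mem  (unsigned w, float *v) { /* memory-phase decode */ }

    // All threads of a block move between phases together, so each phase's
    // interpreter routine passes through the instruction caches only once.
    __global__ void phased_interp(const unsigned *stream, const int *phase_len,
                                  int n_phases, float *v)
    {
        int pc = 0;
        for (int p = 0; p < n_phases; ++p) {
            __syncthreads();                         // enter each phase together
            int end = pc + phase_len[p];
            if (p & 1) for (; pc < end; ++pc) run_mem  (stream[pc], v);
            else       for (; pc < end; ++pc) run_arith(stream[pc], v);
        }
    }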

You are right that a general ptxas assembler won’t be that fast!
And modification can indeed be done multithreaded!

And any unofficial documentation is either not accepted by some industry regulations to be used within products (especially safety-critical products), not legal to use depending on local law

I didn’t know that it isn’t accepted by regulations for use in products. That is actually terrible for us. I think we have to make a query specifically to Nvidia in this case, or indeed move on to another idea. I guess our hope would have to be AMD in that case, if we start working on creating our own compiler (sadly, I have no previous knowledge of the build process there).

Yes, you are right, and the things you described are a great basis for me.
Then we have to try working on this small compiler / GPU interpreter kernel first and see whether we can reach a speed that makes it worth pursuing.

My only concern with this version is that I don’t know how I could save instructions and use register/local/shared memory instead of global memory. So maybe I could create a compiler that converts the operations to 3- or 4-bit codes and figure it out that way. I am a little disappointed, I cannot hide it, sorry: generating the whole million-instruction sequence directly would have been so much easier that I cannot even explain it in short.
OK, sorry for the long post. But is it possible that we would have wasted memory bandwidth anyway, since the ICache would have to be filled from memory millions of times? Because if that is the case, then we also get an upper speed limit: if we can only load 64 bits per instruction, then 1008.0 GB/s (Nvidia 4090) works out to 126 billion instructions/sec, not counting the memory bandwidth required by the instructions themselves for data. That way we still couldn’t get anywhere close to the 80,000 billion instructions/sec (80 TFLOPS). Am I calculating correctly in this scenario?

Hi havlikmarcell,
whether an unsupported and undocumented way of using Nvidia hardware is allowed depends on the industry; critical ones could be automotive, aerospace, space, nuclear power, critical industrial processes, military, medical technology, …
For those, there are development standards that manufacturers and suppliers have to adhere to.
It also depends on how safety-critical a specific component is.

You were talking about AI in your original post.
AI could be used to detect people walking in front of a self-driving car, or to generate an artful painting to hang on the wall. There can be very different requirements for what is allowed or accepted, and it also matters whether the function depends on the component or whether there are other safeguards.

The idea with AMD is (as far as I understand) to speed up the compiler by a small factor (perhaps 2x?). Not sure that helps you if you want 100x or 1000x.
Given what you described, I would either a) try to modularize the kernel, b) go with the GPU interpreter solution, or c) modify or assemble a subset of SASS instructions in a multithreaded way.

BTW @njuffa, is it possible to create SASS with the GPU directly in global GPU memory and afterwards run it from there? Or does each kernel always have to be uploaded from the CPU?