Lower Level CUDA NVasc


The CUDA webpage lists “Direct driver and assembly level access through CUDA for research and language development” as a CUDA technology feature. “Assembly for computing” (NVasc) is also referred to in the document, “NVIDIA GeForce 8800 GPU Architecture Overview”.

I would very much like to obtain further information on the program. Is this possible?


I’m casting another vote for the release of such information.

It’s not too hard to infer the basic assembler instructions CUDA is capable of. Look at the produced .ptx file. It also contains the C code in comments. Very handy for looking for optimization potential :yes:

But for completeness of reference, I would also like to see the full assembler spec.


Yeah, we would also like at least a basic listing of the opcodes and their function.
Although PTX is far easier to read and interpret through intuition than most CPU instruction sets, having a real reference would be a great help, particularly when chasing down cases where the optimizer is making a bad choice.

John Stone

It might even help if the forum search feature would let you search for “ptx” – but it doesn’t. :thumbsdown:

As NVIDIA promised - the PTX and ISA manual is part of the 1.0 release. Look in the $(CUDA)/doc directory.


Yes, but it is not an assembler in the true sense; in fact, far from it. PTX is an intermediate language that is not that close to the hardware at all. For instance, there is no mention of the carry flag in the PTX spec… So one can’t build wide accumulators using all the bits in the architecture.
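To make the carry point concrete, here is a minimal sketch in plain C (not PTX) of what a wide add looks like when the carry flag is not exposed. The `wide_u64` type and `wide_add` function are hypothetical names made up for illustration. The carry has to be recovered with an extra comparison, which is the overhead being complained about.

```c
#include <stdint.h>

/* A 64-bit unsigned value held as two 32-bit words (hypothetical type). */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} wide_u64;

/* Add two wide values using only 32-bit operations.
   Unsigned addition wraps modulo 2^32, so the low-word sum is smaller
   than an operand exactly when a carry out of the low word occurred. */
wide_u64 wide_add(wide_u64 a, wide_u64 b)
{
    wide_u64 r;
    r.lo = a.lo + b.lo;
    uint32_t carry = (r.lo < a.lo) ? 1u : 0u;  /* carry recovered by compare */
    r.hi = a.hi + b.hi + carry;
    return r;
}
```

With a hardware carry flag, the compare and select would collapse into an add followed by an add-with-carry.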

I would advise against using it directly, as the only patterns that Nvidia has tested are those generated by nvopencc, and even then, since the code base for CUDA is so small (compared with a common C compiler), there are a lot of bugs. Now just look at the relative size: ptxas’s text segment is more than FIVE times the size of nvopencc’s… simple arithmetic will tell you how many bugs you can expect!

I have said before that we really need to be able to see (i.e. in human-readable form) what ptxas is generating to have any chance of tracking down difficult bugs in the toolchain, most especially to figure out where ptxas is not allocating registers minimally.


PS there are a lot of bugs fixed in 1.0, and they were fixed quickly.

Where? I get the following doc directory in 1.0:

It’s in the toolkit, not SDK, doc directory. The toolkit is installed in c:\CUDA by default.


I agree with Eric on the “true” assembly.
Also, a specification of what ptxas actually optimizes would help greatly. For instance, carrying out floating point arithmetic in an exactly specified order is essential for some geometry algorithms; replacing mul and add with a mad, or ( a + b ) + c with a + ( b + c ), may completely break the algorithm. With ptx’s spec out, it’s now possible to bypass nvcc’s optimizations, but ptxas still does something behind our backs, and we still can’t be sure whether our operations are carried out in the desired order.
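A small host-side sketch in plain C shows how much these two rewrites can matter. All function names here are hypothetical. The `volatile` temporaries force each intermediate to be rounded to float so the compiler cannot reassociate or fuse. `fused_like` models a multiply-add with a single rounding at the end, which is an assumption made for illustration and not a statement about what the GPU's mad actually does.

```c
/* (a + b) + c, with the intermediate sum rounded to float. */
float sum_left(float a, float b, float c)
{
    volatile float t = a + b;
    return t + c;
}

/* a + (b + c), with the intermediate sum rounded to float. */
float sum_right(float a, float b, float c)
{
    volatile float t = b + c;
    return a + t;
}

/* Separate mul and add: the product is rounded to float before the add. */
float mul_then_add(float a, float b, float c)
{
    volatile float p = a * b;
    return p + c;
}

/* Single-rounding model of a multiply-add. The product of two floats
   (24-bit significands) is exact in double, so only the double add and
   the final conversion back to float can round. */
float fused_like(float a, float b, float c)
{
    double exact = (double)a * (double)b;
    return (float)(exact + (double)c);
}
```

With a = 1e20f, b = -1e20f, c = 1.0f, `sum_left` gives 1 while `sum_right` gives 0. With a = 0x1.000002p+0f, b = 0x1.fffffcp-1f, c = -1.0f, `mul_then_add` gives exactly 0 while `fused_like` gives a small negative number, so even the sign of a predicate can change.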

I think the documentation does say somewhere that the driver does some final optimization (that’s also the reason why the very first invocation of a kernel may take slightly longer than subsequent ones, you can certainly see that being taken into consideration in the timing of SDK samples).

As far as maintaining the order of operations is concerned, I’m not sure that even the mainstream CPUs from Intel or AMD guarantee that. You can force the compiler not to rearrange the operations, but the CPU may execute instructions (and microinstructions) out of order if it deems it beneficial to performance (just check VTune listings done by execution).

If an algorithm is critically sensitive to the operation order, it is probably not stable to be practical. There are many issues with the IEEE754 standard precision, some of which can be addressed by presorting your data.
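One common reading of "presorting" is to add values in order of increasing magnitude, so small terms accumulate before they meet a large one. The sketch below is plain C under that assumption. The function names are hypothetical, and `sum_presorted` sorts the caller's array in place.

```c
#include <stddef.h>
#include <stdlib.h>

/* Sum in the given order; the volatile accumulator forces every partial
   sum to be rounded to float. */
float sum_in_order(const float *v, size_t n)
{
    volatile float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc = acc + v[i];
    return acc;
}

/* qsort comparator: ascending by absolute value. */
static int by_magnitude(const void *pa, const void *pb)
{
    float a = *(const float *)pa;
    float b = *(const float *)pb;
    float ma = a < 0.0f ? -a : a;
    float mb = b < 0.0f ? -b : b;
    return (ma > mb) - (ma < mb);
}

/* Sort by magnitude (modifies v), then sum smallest first. */
float sum_presorted(float *v, size_t n)
{
    qsort(v, n, sizeof v[0], by_magnitude);
    return sum_in_order(v, n);
}
```

For the input {16777216, 1, 1, 1, 1}, summing in the given order returns 16777216 because each 1 is lost to rounding, while the presorted sum returns 16777220.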


The CPU may indeed execute instructions out of order, but only when they’re independent. As long as the assembly is correct, we always get the correct result.
Also, the sole purpose of the whole operation order thing is usually just to improve stability without losing too much performance. For example, when one translates a geometric primitive and its bounding box into another coordinate system, one usually wants the bounding box to remain valid. If the operations are carried out in exactly the same manner (as written by the programmer), this is always true. But when the program meets an over-enthusiastic optimizing compiler, after some loop unrolling, *+ to mad, CSE and stuff, this may no longer hold. One would have to recompute the bounding box or add some annoying epsilon to keep the bounding box valid. Even worse, when the vertices of a single primitive are translated in even slightly different manners, it may become degenerate, and that would force us to put uber-costly degeneracy checks everywhere. I don’t think this falls into the category of “not stable to be practical”.
That’s what I once encountered in nvcc (I did the loop unrolling by hand, otherwise nvcc stores everything in local memory). Currently I’m able to get around this using volatile shared memory (nvcc still can’t compile my program with -Xopencc -O0), but I’d like to be sure about whether I’m going to meet similar problems in ptxas, or the driver (not very likely, I guess, but I still want to be sure).
I know compiler guys really love to do aggressive optimizations, but the 1st UIUC course of CUDA did mention the importance of correctness, right?

By the way, what do you mean by presorting?

I, too, would like more direct access to the low level, or at least to know the microcode that is generated. Isn’t there some debug option that prints it, like for Cg?

I think we will never get to see the proper ALU code, as this might reveal hardware details NVIDIA would not want to be public. Personally I feel that the PTX is sufficiently low level. Most of the time I would not want to dive into that level of detail anyway. But it is necessary to be able to see what nvopencc has screwed up lately :/

wumpus, Cg translates to ARBfp or DX assembler opcodes, which is also a VM model. So you’ll have to work for NVIDIA to see the proper GPU ASIC binaries. Go ahead, apply for a job… :D


A “Personally I feel that the PTX is sufficiently low level.” from an NVIDIA employee is sufficiently satisfying to me :)
Thanks a lot, I’ll go work on the ptx.

I am not working for NVIDIA.


I personally am very pleased to see the PTX manual included in the CUDA distribution. The idea stated there that it could become an industry standard of sorts could make it an interesting third-party compiler target. Taken together with its more abstract/virtual nature, this might also maintain its long-term stability.

The inclusion of atomic instructions in 0.9/1.0 is also great news. The PTX level is low enough for me too.

There’s always one more thing though, no? Could we please have some PTX example projects included in a future CUDA release?


Personally I think the futures of GPUs and CPUs are converging, and just like Intel publishes every detail on the assembly language of their processors, I’m quite sure that eventually, NVidia will do the same. CUDA is part of this process.

About applying for a job at NVidia, heh, that would certainly be an interesting place to work.

CUDA is a big step in the direction of GP massively parallel programming on chip. Given that we are talking about General Purpose, much more complex patterns are going to be generated by users than the algorithms used in graphics land. This is a new area for Nvidia, and I believe there needs to be a culture change at Nvidia for their CUDA to be successful. The current secretive approach to protection of IP at this level has to change or they will get overrun in the market by AMD, Intel &/or IBM. A lot of information can be deduced from having the hardware in your hands, given time and smart engineering, and the aforementioned companies have plenty of resources, so really it is the smaller developers that suffer from the lack of available information.

The PTX manual is a great step in the right direction.

Just had another example where I wasted hours because ptxas is generating wrong code (the PTX is fine), so bad it caused a system lockup requiring a power off to reset. Takes one back to early M$ NT days. This one is obvious but when there is a subtle problem it could be much worse.


Paul: sorry for mistaking you for an NVIDIA guy :(

Found another problem today: ptxas seems to do instruction scheduling before register allocation.
That’s really inconvenient sometimes…