Programming CUDA at 'assembler' level?

Forgive my limited knowledge. I studied assembler on the ARM microprocessor but only wrote a few programs on a simulator for it.

I have seen CUDA written in C/C++ projects (e.g. in VS2008), but is it possible to write programs at ‘assembler’ level for CUDA, and if so, how would you do it?

The CUDA assembler-level language is called “PTX ISA”. A reference manual can be found here: http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/ptx_isa_2.0.pdf

As for where to write it and what software assembles it, I would presume there is some kind of assembler directive or code block in a CUDA (.cu) file, but I don’t know what that would be.
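Following up on my own guess: nvcc does in fact support embedding PTX directly in a .cu file through an asm() statement, described in NVIDIA’s “Using Inline PTX Assembly in CUDA” note. A minimal sketch (the kernel and function names here are just illustrative):

```cuda
// Adds two integers with the PTX add.s32 instruction instead of C's '+'.
// The "=r"/"r" constraints bind C variables to 32-bit PTX registers.
__device__ int add_via_ptx(int a, int b) {
    int result;
    asm("add.s32 %0, %1, %2;" : "=r"(result) : "r"(a), "r"(b));
    return result;
}

__global__ void add_kernel(const int *a, const int *b, int *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = add_via_ptx(a[i], b[i]);
}
```

The asm() syntax mirrors GCC’s inline assembly; everything inside the string is passed through to the PTX assembler unchanged.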

PTX is not real assembler, actually. It’s an intermediate language between the high-level C-like source and the hardware-level ISA, and using particular PTX instructions doesn’t mean they’ll be translated into real hardware instructions “as is” (32-bit multiplications on SM 1.0–1.3, for example, are expanded into several native instructions). Generally it’s simply pointless to write PTX by hand, as NVIDIA’s C compiler is capable of producing very good code on its own.
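If you want to judge the compiler’s output for yourself, nvcc can emit the PTX it generates (kernel.cu is a placeholder file name here):

```shell
# Emit PTX instead of a binary.
nvcc -ptx kernel.cu -o kernel.ptx

# Or keep all intermediate files (.ptx, .cubin, ...) from a normal build.
nvcc -keep kernel.cu
```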

The same could be said for any assembly language.

Although the tools mostly don’t work with Fermi, for compute capability 1.x devices you can use tools like cudasm (which comes with decuda) to produce cubin files. That gets you much closer to the hardware than PTX.
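A sketch of that workflow (decuda and cudasm are third-party tools, not part of the CUDA toolkit, so the exact flags below are assumptions — check each tool’s own help output):

```shell
# Compile a kernel straight to a cubin (kernel.cu is a placeholder name).
nvcc -cubin kernel.cu

# Disassemble the native instructions (third-party tool; flags may differ).
decuda kernel.cubin > kernel.asm

# ...edit kernel.asm by hand...

# Reassemble into a cubin (cudasm ships with decuda; flags may differ).
cudasm -o patched.cubin kernel.asm
```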

The biggest reason why “assembly” programming CUDA devices is not very useful (unless you are writing a new compiler) is the lack of detailed hardware-level documentation. NVIDIA does not (and is almost certainly never going to) release hardware documentation at the level of Intel and AMD’s architecture optimization guides. Without that information, you are going to have a very hard time doing better than the authors of nvcc and ptxas, who do have access to such internal documents. Otherwise, you are stuck basically coding up your preconceived notion of how you think the hardware works, or doing a very large number of microbenchmarks to try to deduce the rules of the game, which will change with the next generation of card.
