Why does CUDA 2.0 not expose all of PTX ISA 1.3?

Hi !

I’m wondering if anyone knows why CUDA 2.0 does not expose all of the assembler instructions available in PTX ISA 1.3 via built-in analogs?

By inspecting the PTX 1.3 reference manual I found that there are a number of new instructions, namely

vote, addc (addition with carry) and red (parallel reduction)…

While the first has built-in analogs in C (__all and __any), the latter two
seem not to be accessible from the high-level interface …

I find this limitation rather discouraging. In particular, the addc instruction could be really useful for implementing extended-precision arithmetic on CUDA (currently there is no access to the carry flag from C).
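To make the point concrete: without access to the carry flag, extended-precision addition has to recover the carry with an extra comparison. A minimal host-side C sketch of adding two 64-bit values held as 32-bit limbs (the function name is illustrative, not from any CUDA header):

```c
#include <assert.h>
#include <stdint.h>

/* Add two 64-bit values stored as {low, high} 32-bit limbs.
   Without an addc instruction, the carry out of the low limbs
   must be recovered with a comparison: if the 32-bit sum wrapped
   around, it is smaller than either operand. */
static void add64(uint32_t a_lo, uint32_t a_hi,
                  uint32_t b_lo, uint32_t b_hi,
                  uint32_t *r_lo, uint32_t *r_hi)
{
    uint32_t lo = a_lo + b_lo;    /* may wrap around 2^32 */
    uint32_t carry = (lo < a_lo); /* 1 iff the addition wrapped */
    *r_lo = lo;
    *r_hi = a_hi + b_hi + carry;  /* addc would fold this into one step */
}
```

With a real addc the comparison disappears; the carry would simply flow from one instruction into the next.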

Any ideas ?

I think it could be useful if a next version of NVCC contained an asm{}-like statement, such as

ptx {
    ...
}

ptx volatile {
    ...
}

to inline PTX instructions directly in C code (with or without permission for nvcc to alter them).

yes, this would indeed be useful

anyway, I do not understand this policy of NVIDIA’s: NVCC should evolve along with PTX, otherwise new features are simply not available, and the only way to make use of them is to manually edit PTX files (a big headache) and load the cubins through the driver API …


As far as I know, “PTX inlining” has been mentioned and requested for more than a year (probably 1.5 years, since CUDA went public), and it is an extremely useful feature for anyone (including myself) who wants to do some optimization and use PTX-only instructions. I know there are potential problems with inline PTX. For example, when you want to load a value from a variable whose memory space (register/local) is only determined during the CUDA compilation process, what should we put in the PTX: “ld.local.f32 d, [a]” or “mov.f32 d, a”? (Note: a variable can be placed in local memory instead of register space for various reasons…)
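To make the dilemma concrete, here is roughly what the two alternatives look like in PTX 1.x syntax (the register names are made up for illustration):

```ptx
// if nvcc kept the variable in a register, "d = a" is just a move:
mov.f32       %f1, %f2;

// if nvcc spilled the same variable to local memory,
// the identical source-level read becomes a load:
ld.local.f32  %f1, [%r1];    // %r1 holds the local address of a
```

The inline-PTX author cannot know at write time which of the two nvcc will require, which is exactly the problem described above.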

So I guess a more important issue to solve before giving out inline PTX is to add explicit addressing-space specifiers for variables/pointers. For example:

__device__ float someVarInGlobalMemory;

__device__ void someDummyFunc(__constant__ int A, __constant__ int B, __shared__ float C[])
{
    // here we don't rely on nvcc's current automatic memory-space solver,
    // but use our own specifier for parameter C (here we know C is in
    // shared memory for all calls from the global function)
}

__global__ void someDummyKernel(__constant__ int A, __constant__ int B, __constant__ float C[])
{
    // all parameters of a global function are in constant memory space,
    // so maybe it's not required to put __constant__ before A, B, C here
    __shared__ float cache[...];

    __reg__ float x; // force variable x to reside in register space
                     // instead of going to local memory

    ptx {
        ld.const.f32 x, C[threadIdx.x];
        st.shared.f32 cache[threadIdx.x], x;
    }

    someDummyFunc(A, B, cache);
}


But the above approach may create a kind of conflict in CUDA language semantics, especially when some NVIDIA guy claims that CUDA code can be re-targeted to multi-core CPUs, where there is no shared/local/global/constant memory design and no PTX instructions either. (Why not just abandon the multi-core CPU plan, considering there’s already Intel TBB in place?)

Just some random thoughts on inline ptx…any comment/input is welcome!

Wow. That makes perfect sense. That’s why we can’t have inline PTX.

(There’s talk of also putting CUDA on ATI cards and making it a general standard… you really can’t let CUDA code include PTX. Not to mention, it may not even be compatible with future NVIDIA chips. Do you really want to put the x86 zombie curse on GPUs? Let’s not let our code get too architecture-specific, so that 25 years from now we don’t find ourselves backward compatible with obsolete bs.)

EDIT: still… I want my add-with-carry!

well, in contrast to x86, PTX is not a real assembly language, because some mnemonics are mapped to several native GPU instructions (for instance, 32-bit integer multiplication). So ptxas has some freedom to “reinterpret” the PTX code according to the actual architecture.
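The 32-bit multiply case is also why CUDA exposes the __mul24()/__umul24() intrinsics, which map to the hardware’s native 24-bit multiplier, while a full 32-bit multiply may expand to several native instructions on G8x. A host-side C sketch of what __umul24 computes (an emulation for illustration, not the CUDA intrinsic itself):

```c
#include <assert.h>
#include <stdint.h>

/* Emulate the semantics of CUDA's __umul24(): multiply the low
   24 bits of each operand and keep the low 32 bits of the product.
   On G8x this is a single native instruction; a full 32-bit
   multiply is not. */
static uint32_t umul24_emul(uint32_t a, uint32_t b)
{
    uint64_t prod = (uint64_t)(a & 0xFFFFFFu) * (uint64_t)(b & 0xFFFFFFu);
    return (uint32_t)(prod & 0xFFFFFFFFu);
}
```

Note how the high byte of each operand is simply ignored, which is exactly the freedom ptxas does not have when the source says mul.lo.u32.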

In my opinion it would be nice to have something like an open-source CUDA-like compiler which takes full advantage of the underlying architecture and to which interested people can contribute their ideas. But I am not sure if this is doable at all…

The vote instructions are exposed via the __any() and __all() intrinsic functions (see programming guide section 4.4.5).
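For reference, here is a host-side sketch of the warp-vote semantics those intrinsics provide, emulated in plain C over a 32-lane warp (this is an illustration of the semantics, not the device implementation):

```c
#include <assert.h>

#define WARP_SIZE 32

/* __all(p): non-zero iff the predicate is non-zero on every lane.
   __any(p): non-zero iff the predicate is non-zero on at least one lane.
   Here the whole warp's predicates are passed as an array. */
static int warp_all(const int pred[WARP_SIZE])
{
    for (int i = 0; i < WARP_SIZE; ++i)
        if (!pred[i]) return 0;
    return 1;
}

static int warp_any(const int pred[WARP_SIZE])
{
    for (int i = 0; i < WARP_SIZE; ++i)
        if (pred[i]) return 1;
    return 0;
}
```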

‘someone’ could always take the GCC frontend, write their own backend that compiles C/C++ to PTX, then use ptxas to produce a cubin and the driver API to load and launch kernels … a real pain in the ass :/

EDIT: The useful things that REALLY should be added to CUDA are:
0] !!! some switch that forbids the compiler from using local memory !!! ;)
1] tex2DLod to sample textures with mipmaps (and filtering)
2] support for sampling from DXT-compressed textures
3] full support for 3D textures
4] some debugging support (not the cudaemu, but ‘real’ debug support, or at least
a more informative error when a kernel fails, not just an ‘unknown error’ message :))
for example: Kernel XXX PageFault at address XXX, block XXX, thread XXX
(and if possible the line number from the ptx or .cu file); with this information
debugging would be much easier.

yes, that’s what I mentioned in my first post …

it is also interesting that PTX has an ‘addc’ instruction but no ‘subc’; is subtraction implemented via addition?
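In two’s complement that is no real limitation: a - b = a + ~b + 1, so a chained subtract-with-borrow falls out of addc by feeding it the inverted subtrahend and the incoming carry. A C sketch of the identity (function name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Subtract with borrow, built only from addition, the way hardware
   with addc but no subc would do it: a - b = a + ~b + 1.
   carry_in = 1 means "no borrow pending" (it supplies the +1 of the
   identity); the returned carry_out = 1 likewise means "no borrow". */
static uint32_t sub_via_add(uint32_t a, uint32_t b,
                            uint32_t carry_in, uint32_t *carry_out)
{
    uint64_t wide = (uint64_t)a + (uint32_t)~b + carry_in;
    *carry_out = (uint32_t)(wide >> 32);
    return (uint32_t)wide;
}
```

Chaining this across limbs (passing each carry_out as the next carry_in) gives multi-word subtraction with no dedicated subtract instruction at all.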

good idea, maybe provide a backend for the LLVM compiler …

Really, you don’t need the driver API. Look into the code repository: in the runtime API framework, there is a folder next to your exe where all the cubins go. Change a cubin in the filesystem, and your code has a new kernel on the fly.

Yeah, but why does ptx{} have to be portable?

Why can’t I use portable CUDA on a multi-core CPU when I want to write portable code,

and use machine-dependent code (directly in the .cu file) to optimize my kernel to run faster on my fantastic NVIDIA card?


__shared__ ...

__regs__ ...

__yuuuhuuuu!!__ ...

__cchiupilupittuttiii__ ...

__other_cool_stuff_from_nvidia_devices__ ...

    ptx {
        ...
    }

... portable (slow) CUDA code ...
Yeah, that sounds sweet. BUT… it’s really complicated by the memory-space issue. A pointer isn’t just a pointer: some C-level pointers point to global mem, some to shared mem, and a lot of the time it’s pretty hard even for nvcc to tell which is which (I’m sure you’ve all seen the “can’t figure out this pointer, assuming gmem” messages). But perhaps LLVM/clang/gcc already has some facility for this, if other architectures have the same problem? You need support on two levels: the front end to figure out the relationships (hard) and the back end to process this metadata and emit the right instructions.

I wonder what other similar gotchas there are. (And this is really dumb… because NVIDIA should have just carved up the 32-bit address space so as not to have this problem.)
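A sketch of what “carving up” the address space could mean: reserve the top bits of a 32-bit pointer as a space tag, so the backend can tell from the value alone which load/store instruction to emit. This is purely illustrative (the names and the 2-bit encoding are invented here, not how any CUDA hardware works):

```c
#include <assert.h>
#include <stdint.h>

enum mem_space { SPACE_GLOBAL = 0, SPACE_SHARED = 1,
                 SPACE_LOCAL  = 2, SPACE_CONST  = 3 };

/* Pack a 30-bit offset and a 2-bit space tag into one 32-bit "pointer". */
static uint32_t make_ptr(enum mem_space s, uint32_t offset)
{
    return ((uint32_t)s << 30) | (offset & 0x3FFFFFFFu);
}

static enum mem_space ptr_space(uint32_t p)  { return (enum mem_space)(p >> 30); }
static uint32_t       ptr_offset(uint32_t p) { return p & 0x3FFFFFFFu; }
```

The obvious cost: each memory space then covers only a quarter of the 32-bit range, which is what the sarcasm a few replies down is pointing at.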

Because the machine-dependent code would barely be faster. You’d be wasting your own time and stalling future adoption of better tech. Remember, GPUs are so much faster than CPUs not because they started out that way! It’s because with every generation the engineers got to reinvent the architecture and realize better and better ideas. The designs hadn’t even solidified when the first programmable chips came out; profound changes occurred as recently as the G80 generation, when GPUs switched from SIMD to SIMT.

because that would have worked so well once host plus device memory approached or exceeded 4 GB?

So they delayed the inevitable by one generation. And at what cost?

for the same reason, we cannot create a portable language for HPC!

I think you cannot write portable and efficient code at the same time…

Just have a complete set of intrinsic functions, like MMX/SSE2 has and like part of PTX already does. __addc and direct control of setp would be much appreciated.
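For instance, a hypothetical __addc() intrinsic (the name and signature are invented here, mirroring what an add-with-carry instruction would expose) would make a 128-bit add a straight chain. Below it is emulated in plain C so the idea is concrete:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical intrinsic: r = a + b + *carry, with the carry out
   written back to *carry. Emulated in C; on hardware each call
   would be a single addc instruction. */
static uint32_t addc_emul(uint32_t a, uint32_t b, uint32_t *carry)
{
    uint64_t wide = (uint64_t)a + b + *carry;
    *carry = (uint32_t)(wide >> 32);
    return (uint32_t)wide;
}

/* 128-bit add over four 32-bit limbs, least significant limb first. */
static void add128(const uint32_t a[4], const uint32_t b[4], uint32_t r[4])
{
    uint32_t carry = 0;
    for (int i = 0; i < 4; ++i)
        r[i] = addc_emul(a[i], b[i], &carry);
}
```

With the real instruction the loop body would be four addc ops with no comparisons or branches in between.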

it seems the PTX ISA is publicly available, that is, via the Open64 compiler sources; look here:


any volunteers to add new intrinsics to nvcc? ;)