Why does CUDA 2.0 not expose all of PTX ISA 1.3?

Hi!

I’m wondering if anyone knows why CUDA 2.0 does not expose all of the assembler instructions available in PTX ISA 1.3 via built-in analogs?

By inspecting the PTX 1.3 reference manual I found a number of new instructions, namely vote, addc (addition with carry) and red (parallel reduction)…

While the first has built-in analogs in C (__all and __any), the others do not seem to be accessible from the high-level interface…

I find this limitation rather discouraging; in particular, the addc instruction would be really useful for implementing extended-precision arithmetic on CUDA (currently there is no access to the carry flag from C).
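To illustrate what is lost: without a carry flag, even a two-word addition has to recover the carry with a compare. A minimal sketch in plain CUDA C (add64 is a hypothetical helper of mine, just for illustration):

// Add two 64-bit numbers held as (hi, lo) 32-bit pairs.
// The carry out of the low word must be recomputed with a compare,
// because C exposes no carry flag; addc would do this in one instruction.
__device__ void add64(unsigned int ahi, unsigned int alo,
                      unsigned int bhi, unsigned int blo,
                      unsigned int* rhi, unsigned int* rlo)
{
    unsigned int lo = alo + blo;
    unsigned int carry = (lo < alo) ? 1u : 0u;  // detect wrap-around
    *rlo = lo;
    *rhi = ahi + bhi + carry;
}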

Any ideas?

I think it would be useful if a future version of NVCC contained an asm{}-like statement, such as

ptx {
}

and

ptx volatile {
}

to inline PTX instructions directly in C code (with or without permission for nvcc to alter them).

yes, this would indeed be useful

anyway, I do not understand this policy of NVIDIA’s - NVCC should evolve in step with PTX, otherwise new features are simply not available, and the only way to make use of them is to manually edit ptx files (a big headache) and load cubins through the driver API…

amen

As far as I know, “ptx inlining” has been mentioned and requested for more than a year (probably 1.5 years, since CUDA went public), and it is an extremely useful feature for anyone (including myself) who wants to do some optimization and use ptx-only instructions. I know there are some potential problems with inline ptx, such as: when you want to load a value from a variable and the memory address space (register/local) for that variable is determined during the CUDA compilation process, what should we put in the ptx? “ld.local.f32 d, [a]” or “mov.f32 d, a”? (note: a variable could be put into local memory instead of register space for various reasons…)

So I guess a more important issue to solve before offering inline ptx is to add an additional address-space specifier for any variable/pointer. For example:

__device__ float someVarInGlobalMemory;

__device__ void someDummyFunc(__constant__ int A, __constant__ int B, __shared__ float C[])
{
    // here we don't rely on nvcc's current automatic memory-address-space solver,
    // but use our own specifier for parameter C (we know C is in shared memory
    // for all calls from the global function)

    // DO SOMETHING USEFUL
}

__global__ void someDummyKernel(__constant__ int A, __constant__ int B, __constant__ float C[])
{
    // all parameters of a global function are in constant memory space,
    // so maybe __constant__ is not required before A, B, C here
    __shared__ float cache[...];
    __reg__ float x; // force variable x into register space instead of local memory

    ptx
    {
        ld.const.f32 x, C[threadIdx.x];
        st.shared.f32 cache[threadIdx.x], x;
    }

    __syncthreads();
    someDummyFunc(A, B, cache);
}

But the above approach may result in a kind of conflict in CUDA language semantics, especially when NVIDIA folks claim that CUDA code can be re-targeted to multi-core CPUs, where there is no shared/local/global/constant memory design and no ptx instructions either. (Why not just abandon the multi-core CPU plan, considering Intel TBB is already in place?)

Just some random thoughts on inline ptx… any comments/input are welcome!

Wow. That makes perfect sense. That’s why we can’t have inline PTX.

(There’s talk of also putting CUDA on ATI cards and making it a general standard… you really can’t let CUDA code include PTX. Not to mention, it may not even be compatible with future NVIDIA chips. Do you really want to put the x86 zombie curse on GPUs? Let’s not let our code get too architecture-specific, so that 25 years from now we don’t find ourselves backward compatible with obsolete bs.)

EDIT: still… I want my add-with-carry!

well, in contrast to x86, PTX is not a real assembly language, because some mnemonics map to several native GPU instructions (for instance 32-bit integer multiplication). So ptxas has some freedom to “reinterpret” the ptx code according to the actual architecture.
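A concrete example: on G8x the hardware multiplier is 24-bit wide, so a plain 32-bit multiply is one PTX mnemonic but several native instructions, while the real __mul24/__umul24 intrinsics map to a single one. A small sketch:

__global__ void mulDemo(int* out, const int* a, const int* b)
{
    int i = threadIdx.x;
    int full = a[i] * b[i];          // one PTX mul.lo.s32, but several
                                     // native instructions on G8x
    int fast = __mul24(a[i], b[i]);  // single native 24-bit multiply;
                                     // correct only if operands fit in 24 bits
    out[i] = full + fast;
}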

In my opinion, it would be nice to have something like an open-source CUDA-like compiler which takes full advantage of the underlying architecture and to which interested people can contribute their ideas. But I am not sure if this is doable at all…

The vote instructions are exposed via the __any() and __all() intrinsic functions (see programming guide section 4.4.5).
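For example, a minimal use of those intrinsics (they require compute capability 1.2 or higher):

__global__ void voteDemo(const float* in, int* out)
{
    float x = in[threadIdx.x];
    // __all/__any evaluate the predicate across all threads of a warp
    int everyonePositive = __all(x > 0.0f);  // 1 iff true for the whole warp
    int somebodyNegative = __any(x < 0.0f);  // 1 iff true for any thread
    out[threadIdx.x] = everyonePositive - somebodyNegative;
}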

‘someone’ could always take the GCC frontend, write their own backend that compiles C/C++ to PTX, and then use ptxas to produce the cubin and the driver API to load and launch kernels… a real pain in the ass
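The driver-API half of that pipeline is at least mechanical; a minimal sketch of loading and launching a kernel from a cubin (error checking omitted, file and kernel names are placeholders):

#include <cuda.h>

int main(void)
{
    CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fun;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoad(&mod, "mykernel.cubin");   // the (possibly hand-edited) cubin
    cuModuleGetFunction(&fun, mod, "mykernel");
    cuFuncSetBlockShape(fun, 256, 1, 1);    // threads per block
    cuParamSetSize(fun, 0);                 // no kernel parameters in this sketch
    cuLaunchGrid(fun, 64, 1);               // 64 blocks
    cuCtxDetach(ctx);
    return 0;
}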

EDIT: The useful things that REALLY should be added to CUDA are:
0] !!! some switch that forbids the compiler from using local memory !!! ;)
1] tex2DLod to sample textures with mipmaps (and filtering)
2] support for sampling from DXT-compressed textures
3] full support for 3D textures
4] some debugging support (not cudaemu, but ‘real’ debug support), or at least a more informative error when a kernel fails, not just an ‘unknown error’ message - for example: Kernel XXX PageFault at address XXX, block XXX, thread XXX (and if possible the line number from the ptx or cu file); with this information debugging would be much easier.

yes, that’s what I mentioned in my first post …

it is also interesting that PTX has an ‘addc’ instruction but no ‘subr’ - or is subtraction implemented via addition?
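Presumably a separate subtract-with-borrow is redundant once you have addc: a - b is just a + ~b + 1, so the borrow can ride the carry chain. A quick sketch in plain C (sub64 is a hypothetical helper of mine, again emulating the carry with a compare):

// 64-bit subtraction via complement-and-add: a - b == a + ~b + 1.
// With addc, the two word additions would chain through the carry flag.
__device__ void sub64(unsigned int ahi, unsigned int alo,
                      unsigned int bhi, unsigned int blo,
                      unsigned int* rhi, unsigned int* rlo)
{
    unsigned int lo = alo + ~blo + 1u;
    unsigned int carry = (alo >= blo) ? 1u : 0u;  // no borrow => carry out
    *rlo = lo;
    *rhi = ahi + ~bhi + carry;
}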

good idea - maybe provide a backend for the LLVM compiler…

Really, you don’t need the Driver API. Research the code repository. In the Runtime API framework, you have a folder next to your exe where all the cubins go. Change a cubin in the filesystem, and your code has a new kernel on the fly.

Yeah, but why does ptx{} have to be portable?

Why can’t I use portable CUDA on a multi-core CPU when I want to write portable code, and use machine-dependent code to make my kernel run faster on my fantastic NVIDIA card (directly in the .cu file)?

#ifdef NV_TESLA_ARCH
    __shared__ ...
    __regs__ ...
    __yuuuhuuuu!!__ ...
    __cchiupilupittuttiii__ ...
    __other_cool_stuff_from_nvidia_devices__ ...
    ptx {
        .....
    }
#else
    ... portable (slow) CUDA code ...
#endif

Yeah, that sounds sweet. BUT… it’s really complicated by the memory-space issue. A pointer isn’t just “a pointer is a pointer”. Some C-level pointers point to global mem, some to shared mem, and a lot of the time it’s pretty hard for even nvcc to tell which is which (I’m sure you’ve all seen the “can’t figure out this pointer, assuming gmem” messages). But perhaps LLVM/clang/gcc already has some facility for this, if some other architectures have the same problem? You need support on two levels: the front-end to figure out the relationships (hard) and the backend to process this metadata and emit the right instructions.
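To make the problem concrete, here is the kind of code that can trigger that advisory (a contrived example of mine): the same __device__ function receives a shared pointer from one call site and a global pointer from another, so no single address space fits the load:

__device__ float readVal(const float* p)   // which space does p point to?
{
    return *p;   // nvcc must choose ld.shared / ld.global / ld.local here
}

__global__ void ambiguous(const float* gmem, float* out)
{
    __shared__ float smem[256];
    smem[threadIdx.x] = gmem[threadIdx.x];
    __syncthreads();
    // Same function, two different address spaces at the call sites:
    out[threadIdx.x] = readVal(&smem[threadIdx.x]) + readVal(&gmem[threadIdx.x]);
}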

I wonder what other similar gotchas there are. (And this is really dumb… because NVIDIA should have just carved up the 32-bit address space so as not to have this problem.)

Because the machine-dependent code would barely be faster. You’d be wasting your own time and stalling future adoption of better tech. Remember, one reason GPUs are so much faster than CPUs is not that they started out that way! It’s that every generation, engineers got to reinvent themselves and realize better and better ideas. The designs hadn’t even solidified when the first programmable chips came out. Profound changes occurred as recently as the G80 generation, when GPUs switched from SIMD to SIMT.

Because that would have worked so well once host + device memory approached or exceeded 4 GB?

So they delayed the inevitable by one generation. And at what cost?

for the same reason, we cannot create a portable language for HPC!

I think you cannot write portable and efficient code at the same time…

Just have a complete set of intrinsic functions, like MMX/SSE2 does and like part of PTX already has. __addc and direct control of setp would be much appreciated.
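For the record, a sketch of what such an intrinsic could look like; __addc with this signature does not exist in nvcc, it is purely hypothetical:

// Hypothetical intrinsic: returns a + b + *carry, writing the carry out back.
// This does NOT exist in nvcc; it is what is being asked for above.
__device__ unsigned int __addc(unsigned int a, unsigned int b,
                               unsigned int* carry);

// 96-bit addition as it could then be written, chaining the carry:
__device__ void add96(const unsigned int a[3], const unsigned int b[3],
                      unsigned int r[3])
{
    unsigned int c = 0;
    r[0] = __addc(a[0], b[0], &c);   // could compile to add.cc
    r[1] = __addc(a[1], b[1], &c);   // could compile to addc.cc
    r[2] = __addc(a[2], b[2], &c);   // could compile to addc
}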

it seems the PTX code generator is publicly available as part of the Open64 compiler - look here:

NVISA

any volunteers to add new intrinsics to nvcc? ;)