asm inlining in CUDA code?

Is this language feature documented? It appears that one can inline PTX code in CUDA C code.

Here is a small example from Optix header files.

inline __device__ bool rt_report_intersection(unsigned int matlIndex)

  {

	int ret;

	asm volatile("call (%0), _rt_report_intersection, (%1);" :

				 "=r"(ret) :

				 "r"(matlIndex) :

				 );

	return ret;

  }

Whoa, that opens a whole new can of optimization worms (or at least if might make the job of arm wrestling the compiler to make it bend to your will a little easier, any way) External Image

It would probably be good if you didn’t rely on this behavior for the time being.

I think this feature is pretty powerful. Although this does not equivalent to what we can do in C/C++ inline assembly in which we can reference variable names directly, but the syntax is pretty clear (just like printf).

The feature exists for quite a while (at least v2.3 can do the trick). Use the OptiX source as a start, one can define a PTX function and call from CUDA source as follows:

__noinline__ __device__ uint _something(uint x, float y, double z)

{   

	//return z + y + x;

	uint ret;

	__asm("\

	   .reg .u32 %rx, %ry, %rz;\n\

	   mov.u32 %rx, %1;\n\

	   cvt.rzi.u32.f32 %ry, %2;\n\

	   cvt.rzi.u32.f64 %rz, %3;\n\

	   add.u32 %rx, %rx, %ry;\n\

	   add.u32 %rx, %rx, %rz;\n\

	   mov.u32 %0, %rx;" : "=r"(ret) : "r"(x), "f"(y), "d"(z) : );

	return ret;

}

__device__ uint something(uint x, float y, double z)

{   

	uint ret;

	asm ("call (%0), _Z10_somethingjfd, (%1,%2,%3);" : "=r"(ret) : "r"(x), "f"(y), "d"(z) : );

	return ret;

	_something(0, 0.0f, 0.0); // dummy call to avoid compiler optimization

}

This is just a dummy example, but I think it’s enough to show the all the possibilities with PTX inlining.

Parameters can be specified with corresponding types, here’s what I’ve found out:

d => double

f => float

h => short

l => long

m => ??

n => ??

r => int

s => ??

Also, as long as you use the asm parameter list trick, you must specify an output for this asm code (just like the “=r” in above example)

Of course I think it would be great to have variable specified directly within the PTX code, so we shouldn’t rely on this extensively. But for now the trick is convenient enough for us to do some PTX/CUDA hybrid compilation, which is AWESOME. Hope CUDA 3.0 would keep or improve this feature further!

Thanks for figuring this much out. It will be useful, but watch out - tmurray will haunt you in your dreams.

I know this thread is quite old, but I was wondering a couple of things. When using ptx via asm you still have to convert the memory addresses to their respective spaces, correct? Or is there a compiler feature that I could use to force all parameters to use their generic addresses?
If any of this has changed for 3.1, please let me know.
Please don’t kill me tmurray :)
Thanks!
Clamport