Injecting PTX: inject whole PTX functions

I noticed that the C-to-PTX compiler doesn’t seem to do a good job of loading kernel parameters: it generally loads them all into registers at the beginning of the function, rather than at the point where they are needed.

The only way I can see to avoid this, and to make sure each ld.param is issued only when it is actually needed, is to write the entire PTX function by hand.

Is there any way to inject/inline an entire PTX function block? That is, not just asm within a C++ device function, but a whole block of PTX. I know I can call it from my code using asm, but there doesn’t seem to be an easy way to inject the code without splitting up the compilation and manually copying the new PTX into the intermediate PTX file.

A pragma would be nice:

#pragma INJECT_PTX_BEGIN
… pure ptx
#pragma INJECT_PTX_END

Any ideas/thoughts?

Have you compiled the PTX into a device binary and then disassembled it to see which instructions are actually used? PTX is a virtual architecture, so even though the PTX loads the parameters into registers right at the beginning of the kernel, that doesn’t mean the device will execute the kernel that way. The PTX compiler within the CUDA driver parses the PTX, performs register allocation to map the virtual registers onto hardware registers, and may also re-order some of the instructions for efficiency.

tl;dr - PTX is not ‘WYSIWYG’.
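
To make the comparison concrete, here is a minimal sketch of how one might check this. The kernel, the file name late_param.cu and the helper run_demo are made up for illustration; the idea is just a parameter (scale) that is only used on the last line, so you can see where its load ends up at the PTX level versus in the disassembled device code. sm_XX in the comments is a placeholder for whatever architecture your GPU has.

```cuda
// late_param.cu: hypothetical file name, used only for this illustration.
#include <cstdio>
#include <cuda_runtime.h>

// 'scale' is needed only on the last line of the kernel. The generated PTX
// may still show its ld.param near the top of the function body.
__global__ void scaled_copy(const float *in, float *out, int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = in[i];
    v = v * v + 1.0f;      // work that does not involve 'scale'
    out[i] = v * scale;    // first and only use of 'scale'
}

// Hypothetical host helper: runs the kernel once and checks one element.
// Returns 0 on success, 1 on any CUDA error or wrong result.
int run_demo()
{
    const int n = 256;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = float(i);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return 1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return 1;
    }
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    scaled_copy<<<(n + 127) / 128, 128>>>(d_in, d_out, n, 0.5f);

    // This blocking copy also waits for the kernel and reports its errors.
    cudaError_t err =
        cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("out[3] = %f\n", h_out[3]);   // (3*3 + 1) * 0.5
    return h_out[3] == 5.0f ? 0 : 1;
}

int main() { return run_demo(); }

// Compare the two levels (replace sm_XX with your GPU's architecture):
//   nvcc -ptx late_param.cu -o late_param.ptx               # virtual ISA
//   nvcc -cubin -arch=sm_XX late_param.cu -o late_param.cubin
//   cuobjdump -sass late_param.cubin                        # machine code
```

If the machine-code listing shows the parameter being read at or near its point of use, or no extra register being held for it, then the early ld.param in the PTX was not costing anything. What you actually see will depend on the toolkit version and target architecture, so treat this as a way to check rather than a guaranteed result.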
