Preventing code elimination in nvcc?

Hello,

As far as I can tell, simple kernel code like the one below gets completely eliminated by the nvcc/CUDA compiler. Is it possible to turn this “code elimination” off? I already tried -O0, but it doesn’t seem to help; the code is still eliminated. ;)

adder.cu:

__global__ void kernel( int a, int b )
{
    int c;

    c = a + b;
}

adder.ptx:

.version 1.4
.target sm_10, map_f64_to_f32
.entry _Z6kernelii (
	.param .s32 __cudaparm__Z6kernelii_a,
	.param .s32 __cudaparm__Z6kernelii_b)
{
.loc	16	1	0

$LDWbegin__Z6kernelii:
.loc 16 6 0
exit;
$LDWend__Z6kernelii:
} // _Z6kernelii

I’d like to have some simple kernels generated so I can test them in the Visual Studio CUDA debugger or the Visual Profiler…

Perhaps this empty kernel will already do the trick, but I’d much rather see a simple addition done so I know the parameters were passed correctly…

Bye,
Skybuck.

No, anything that does not write to any type of memory gets eliminated by the front-end no matter what the optimization level is.

I think this should be considered as a bug on the front-end’s side.

You can at least preserve the code up to the PTX stage by giving nvcc the flag --opencc-options=-O0. ptxas will still optimize it away though, even with --ptxas-options=-O0.
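As a sketch of the rule stated above (only code that writes to some kind of memory is kept), the addition from adder.cu can be given a visible side effect by storing its result through a pointer. The extra out parameter and the host helper addOnDevice are hypothetical names added for illustration; they are not from the original post, and this is only one way to do it.

```cuda
#include <cuda_runtime.h>

// Same addition as in adder.cu, but the result is stored to global memory,
// so the compiler can no longer treat the computation as dead code.
// 'out' is an added, hypothetical parameter (not in the original kernel).
__global__ void kernel(int a, int b, int* out)
{
    int c = a + b;
    *out = c;
}

// Hypothetical host helper: launches the kernel once with a single thread
// and returns the sum read back from the device, or -1 if a CUDA call fails.
int addOnDevice(int a, int b)
{
    int* d_out = nullptr;
    int h_out = -1;

    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) return -1;

    kernel<<<1, 1>>>(a, b, d_out);
    cudaError_t err = cudaGetLastError();   // catches launch configuration errors

    if (err == cudaSuccess) {
        // A blocking copy on the default stream; it waits for the kernel to finish.
        err = cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    }

    cudaFree(d_out);
    return err == cudaSuccess ? h_out : -1;
}
```

Calling addOnDevice(2, 3) from host code and comparing the result with 5 is then a direct check that both parameters were passed correctly. Note that the -1 error value is ambiguous if the real sum is -1; a production version would return the error separately.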

So this means even hand-writing a PTX file would be useless, since hand-written assembler instructions would be “optimized/eliminated” away?! :(

Assuming ptxas is called before the PTX file is executed by the driver?

Or perhaps ptxas is only for cubins or so?

I tried executing such an unoptimized PTX file, which had 6 registers and some movs… so far, according to the Visual Profiler, it only executed 2 instructions or so. This could be an indication that the runtime environment and/or the driver optimizes the file before running it, so this code gets eliminated…

The CUDA Driver API does contain this enumeration:

	//
	// Level of optimizations to apply to generated code (0 - 4), with 4
	// being the default and highest level of optimizations.
	// Option type: unsigned int
	//
	CU_JIT_OPTIMIZATION_LEVEL,

I am not yet sure what it’s for… or which API function it’s for… something about an “online” compiler…

Ok I see it’s for:

cuModuleLoadDataEx

So perhaps with this API it’s possible to prevent the code elimination from happening when loading the PTX file…

However, there is a little problem: the API is not really for a file, it’s only for “an image”, which is some kind of memory structure.
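A minimal sketch of how the “image” issue could be worked around: for PTX, the image passed to cuModuleLoadDataEx is the PTX text itself as a NULL-terminated string, so the file can simply be read into memory first. The helper names readTextFile and loadPtxUnoptimized are hypothetical, and the code assumes a CUDA context is already current on the calling thread. Whether optimization level 0 actually keeps the dead code is not confirmed anywhere in this thread; this only shows how the option would be passed.

```cuda
#include <cuda.h>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: read a whole text file into a string.
// Returns an empty string if the file cannot be opened.
std::string readTextFile(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in) return std::string();
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical helper: JIT-compile a PTX file with CU_JIT_OPTIMIZATION_LEVEL
// set to 0. Assumes a CUDA context is already current on the calling thread.
CUresult loadPtxUnoptimized(const char* path, CUmodule* module)
{
    std::string ptx = readTextFile(path);
    if (ptx.empty()) return CUDA_ERROR_FILE_NOT_FOUND;

    CUjit_option options[] = { CU_JIT_OPTIMIZATION_LEVEL };
    // The unsigned int option value is passed by value, cast to void*.
    void* values[] = { (void*)(size_t)0 };

    // c_str() is NULL-terminated, which is what the driver expects for PTX text.
    return cuModuleLoadDataEx(module, ptx.c_str(), 1, options, values);
}
```

The sketch uses the driver API, so it needs to be linked against the driver library (for example with -lcuda on Linux).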

If you want control over what code goes into the image, perhaps you really have to use an assembler. I’m working on one; PathScale also has one, though progress on PathScale’s side seems to have stopped (at least as far as we can see). Perhaps if you buy their Enzo they will tell you more.

Hmmm, what will the input to your assembler be? Will it be PTX, or perhaps some higher-level language? :)

It takes the format of the output of cuobjdump, and it does no optimization.

So far it is only partially functional, because I haven’t implemented the rules for quite a lot of instructions.

It won’t be complete for at least another two months unless someone else is willing to take up my work…

Could you perhaps copy & paste a small example of such a cuobjdump output? It might be interesting to include such examples on your website as well, so people can get an idea of what this is all about ;)

You can try this: cuobjdump -sass cubinFileName_or_executableFileName

Just make sure you’re on toolkit 4.0

Yeah, but the problem is I have no decent kernels yet (or any decent cubins!), so perhaps you can provide a little example?

(I am not even sure what a cubin is… I think I asked a question about that… maybe I forgot the answer ;) :))

I did take a look at cubin.pdf or something like that, with the micro instruction set; looks somewhat interesting! ;)

Also, I am not sure if cubins would be useful for me.

Perhaps they can be loaded and used as an image for cuModuleLoadDataEx?
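On that last question: as far as I know, the driver API documents cuModuleLoad (which takes a file name) and cuModuleLoadData/cuModuleLoadDataEx (which take an in-memory image) as accepting cubin, PTX, or fatbin input. A rough sketch of loading a module file and launching the kernel from this thread follows. The helper name launchAdderFromFile is hypothetical, the entry name _Z6kernelii is taken from adder.ptx above, and the primary-context calls assume a toolkit newer than the 4.0 mentioned here.

```cuda
#include <cuda.h>

// Hypothetical helper: load a cubin or PTX file by name and launch the
// two-int kernel from this thread once, with a single thread.
CUresult launchAdderFromFile(const char* path, int a, int b)
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;
    CUresult r;

    if ((r = cuInit(0)) != CUDA_SUCCESS) return r;
    if ((r = cuDeviceGet(&dev, 0)) != CUDA_SUCCESS) return r;
    // Assumption: the primary-context API is available (newer toolkits).
    if ((r = cuDevicePrimaryCtxRetain(&ctx, dev)) != CUDA_SUCCESS) return r;

    r = cuCtxSetCurrent(ctx);
    if (r == CUDA_SUCCESS) r = cuModuleLoad(&mod, path);
    if (r == CUDA_SUCCESS) {
        // Mangled entry name as it appears in adder.ptx.
        r = cuModuleGetFunction(&fn, mod, "_Z6kernelii");
        if (r == CUDA_SUCCESS) {
            void* args[] = { &a, &b };   // one pointer per kernel parameter
            r = cuLaunchKernel(fn, 1, 1, 1,   // grid dimensions
                               1, 1, 1,       // block dimensions
                               0, NULL,       // no dynamic shared memory, default stream
                               args, NULL);
        }
        if (r == CUDA_SUCCESS) r = cuCtxSynchronize();
        cuModuleUnload(mod);
    }

    cuDevicePrimaryCtxRelease(dev);
    return r;
}
```

A return value of CUDA_SUCCESS only shows that the module loaded and the launch completed; it says nothing about whether the addition itself survived optimization.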