linking hand-coded PTX

I think the title sums it up.

I’ve written some PTX assembly and successfully compiled it with ptxas, but I’m struggling to find a way to integrate it into my program.

Is the only way to do this through the device code repository mechanism? I’m trying to figure it out now, but it’s far from straightforward (tips appreciated!). Surely there’s some more direct way, such as compiling the ptx into a .obj and linking, but I haven’t been able to figure it out.

I really wish I could just pass .ptx files along with .cu files to nvcc. Or better yet, inline the PTX into the .cu.

Given the poor quality of the current compiler, PTX is turning out to be much more valuable than hand assembly should normally be. P.S. I’d really appreciate a “no optimizations” option in ptxas, and even a full gamut of -O levels.

I haven’t spent any time on ptxas yet, but it seems that you should be able to write a PTX function, create a header file for it, and call it from your .cu files, yes? Which compiler stage does the function inlining? This came to mind when someone was asking about calling SAD from their CUDA code. How does nvcc pull in intrinsics like __expf()? Do they have PTX inlining implemented or something? I’ve had a passing interest in this, but all of my CUDA coding so far has been in C, and I’ve only been reading the PTX to verify that the compiler is doing what I want. I haven’t felt the need to start writing in PTX itself.

John Stone

OK, I’ve gotten the hang of the code repository somewhat. Actually, it’s a fairly powerful and simple-to-use (though poorly documented!) feature.

Basically, it works like this:
You can change kernels without recompiling the executable by simply creating or replacing files in a special directory. If your executable is called “./L33tProg.exe,” the runtime will automatically check for kernel implementations in a folder (or tar file!) called “./L33tProg.devcode”. The kernels can be either in ptx or cubin form.

Creating the ptx/cubin files is a bit difficult, though, because there are some restrictions on the contents of the files, and there’s no error reporting: the only sign of a problem is that your kernel appears to run fine but gives broken results. To avoid starting from scratch, you can add the “-dir=$(ProjectName).exe.devcode -ext=all -int=none -arch compute_10 -code compute_10,sm_10,sm_11” compile flags. These will generate the L33tProg.devcode folder that contains all the kernels in your project. The flags also make it so that the executable requires this folder and doesn’t have embedded kernels itself. That’s a debugging trick so you never have doubts about whether the new kernel gets loaded.
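For anyone building from a command line rather than a Visual Studio project, here is a rough dry-run sketch of how those flags might be put together. It only prints the command instead of running it. The project name, the source file name kernel.cu, and the -o output name are placeholders I made up for illustration; the flags themselves are copied from the post, written with plain ASCII hyphens. The directory name follows the flag pattern as posted, so check the nvcc manual for the name the runtime actually looks for.

```shell
#!/bin/sh
# Dry-run sketch: assemble the nvcc command line and print it.
# Nothing is compiled here; copy the printed line once it looks right.

# Hypothetical names; substitute your own project and source file.
PROJECT="L33tProg"
SOURCE="kernel.cu"

# Directory name built from the flag pattern quoted in the post.
# Assumption: verify against the nvcc manual which name the runtime expects.
DEVCODE_DIR="${PROJECT}.exe.devcode"

# Per the post: -dir/-ext write the kernels into the external folder,
# and -int=none keeps them out of the executable itself.
CMD="nvcc -dir=${DEVCODE_DIR} -ext=all -int=none -arch compute_10 -code compute_10,sm_10,sm_11 -o ${PROJECT}.exe ${SOURCE}"

echo "$CMD"
```

Printing first makes it easy to spot a mangled dash or a wrong folder name before nvcc silently builds something you didn't intend.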

However, programming in PTX is turning out to be especially difficult with no debugging. In fact, you’re not even really informed when the kernel totally fails, except that the output data looks a certain way. ptxas doesn’t emit informative syntax errors either, and sometimes just dies with “internal error” and no line number. It would be great if PTX could be compiled to host code and debugged just like .cu files. Ah… I dream of inline PTX. And of a better assembly syntax… maybe something that looks like C but isn’t (so a mov is an assignment, a cvt is a cast, an ALU op is an expression… but you can only do one thing per line). Hell, PTX ain’t real assembly anyway.

p.s. the proper docs are in chapter 6 of C:\CUDA\doc\NVCC_1.0.pdf

I’m giving up hand-coding ptx assembly because bugs in ptxas are making it impossible.

I’d like to submit a repro case that shows two different ways that ptxas craps itself. Who can I send it to?

You can send me a message.