I coded an algorithm (which I had previously written in assembler) with CUDA to see how much performance I would get.
However, with .cu kernels the results are poor (only a small speedup). After looking at the ptx and the cubin I saw:
16 registers used in the cubin, and hundreds of "st.local.u32" and "ld.local.u32" instructions in the ptx.
The ptx reference says that a local read/write costs hundreds of clocks.
However, my assembler implementation uses 30 32-bit registers and needs no other storage, so I think that if I could optimize the ptx accordingly, I'd get much better results despite the lower occupancy.
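Before resorting to hand-written ptx: from what I've read, nvcc can report and influence register allocation. `nvcc -Xptxas -v` (or `--ptxas-options=-v`) prints per-kernel register and local-memory usage, and `-maxrregcount=32` should let ptxas keep 32 values in registers instead of spilling at 16. There is also the `__launch_bounds__` qualifier (assuming my toolkit version supports it); the block size and names below are placeholders, roughly:

```cuda
// Telling the compiler the maximum block size lets it budget more
// registers per thread before it starts spilling to local memory.
// 128 is a placeholder value; the kernel body is elided.
__global__ void __launch_bounds__(128) my_kernel(unsigned int *data)
{
    // ... algorithm body ...
}
```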
My questions are:
- Should I try to write my own ptx that uses 32 registers and no other memory?
- How does it impact speed if I write procedures (call) in ptx instead of inlining them?
- How can I link my own ptx kernel along with the interface .cu code?
(I searched the forum; all I found was nvcc -v, which gives output like "#$ SPACE= " and so on that I don’t know what to do with.)
(Maybe there is a tool that replaces the cubin embedded in my compiled exe with my own cubin file, or anything similar.)
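From the programming guide, I gather the driver API can load a stand-alone ptx or cubin file at runtime, which might sidestep the embedded-cubin problem entirely. Is this the intended route? A minimal sketch of what I mean (the file name "my_kernel.ptx" and the entry name "my_kernel" are placeholders; the entry name has to match the .entry directive in the ptx, so the kernel should probably be declared extern "C" to avoid name mangling):

```cuda
#include <stdio.h>
#include <cuda.h>   // CUDA driver API

int main(void)
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // Load a stand-alone ptx file (JIT-compiled by the driver);
    // a cubin file can be loaded the same way.
    cuModuleLoad(&mod, "my_kernel.ptx");
    cuModuleGetFunction(&fn, mod, "my_kernel");

    // Set the block shape and parameters, then launch the grid.
    // (No parameters are passed in this sketch.)
    cuFuncSetBlockShape(fn, 256, 1, 1);
    cuParamSetSize(fn, 0);
    cuLaunchGrid(fn, 64, 1);

    cuCtxSynchronize();
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```

The downside, if I understand correctly, is that the rest of the host code would have to use the driver API for that kernel instead of the runtime-API <<<...>>> launch syntax.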