Optimizing ptx

Hello,
I coded an algorithm (which I previously wrote in assembler) with CUDA to see how much performance do I get.
However, with .cu kernels I get no good results (just a little speedup). After looking at the ptx and the cubin I saw:
16 registers in cubin, hundreds of “st.local.u32” and “ld.local.u32” in ptx.
The ptx reference says that a local read/write costs hundreds of clocks.
However, my assembler implementation is able to use 30 32bit registers and requires no other storage and I think, if I could optimize the ptx then I’d get much better results despite the lower occupancy.
My questions are:

  1. Should I try to write my own ptx with 32 registers with no other memory usage?
  2. How does it impact speed if I write procedures(call) in ptx instead of inlining?
  3. How can I link my own ptx kernel along with the interface .cu code?
    (I looked on the forum, all I found was nvcc -v and with this argument I get some output "#$ SPACE= " and so on, that I don’t know what to do with.)
    (Maybe there is a tool, which replaces the cubin file in my compiled exe with my own cubin file or anything similar.)

Ptx is not a real assembly language and doesn’t give you control over register usage. Some of the Nvidia guys could plausibly argue that IA-32 isn’t a real assembly language either because of a trace cache in all modern or even a lot of deprecated processors. If you are using local memory, something is probably wrong or at least suboptimal with you CUDA code.

For control over registers, you’ll need to use decuda:

http://www.cs.rug.nl/~wladimir/decuda/

an unofficial, reverse engineered cubin assembler/disassembler. This lets you manipulate the cubin produced by ptxas (or write your own cubin from scratch, without going through nvcc/ptxas).

There is indeed a compiler option to put the .cu code as comments in the ptx. I forgot which it is, a small search should do the trick.

Local memory usage can be a cause of using sin, cos or sincos. The way these are implemented, there is a fast path and a slow path. Depending on your input to the functions on of those paths is chosen. So when using these in your kernel, you get always local memory usage reported (it made me crazy as I could not find the reason)

First of all, to make it clear: I don’t want to write my own ptx to get more control over registers, but to get more control over the local memory usage. I just thought that if I wouldn’t use any local memory inthe ptx and reuse the registers, then ptxas would not think about implementing more registers or local memory.

Well, of course, for me it would be much better to optimize my C code to avoid local memory usage. That’s why I think I’ll describe what’s happening inside.

The thing is - I do not use any floating point operations, everything is unsigned int.

(And I still think it’s worth coding on a GPU,because even with hundreds of local memory accesses I counted them - in the 1500lines ptx there are 244 local memory accesses and many unnecessary code it was faster than on a dualcore CPU.)

Most of the unsigned ints are unsigned int[4], however I figured out that copying by [0]=[0]; [1]=[1]; [2]=[2]; [3]=[3]; is better than *(uint4 )=(uint4 *) because the latter guarantees local memory usage in my case.

Inside my kernel there are two functions called many times one after another (no loops).

If I call only one function once in my kernel, then no local variables are used.

If I call two then the local memory usage stars its linear growth.

InterationKernel() 

{

	extern __constant__ unsigned int d_R[branchCount][3][4];

	extern __device__ unsigned int d_P[blockCount][threadCount][3][4];

        unsigned int X2[4],A[4];

	copy_each_dword(X2,d_P[blockIdx.x][threadIdx.x][0]);

	int j=X2[0] % branchCount;

	function1(X2,X2,d_R[j][0]);

	function2(A,X2,X2);

	copy_each_dword(d_P[blockIdx.x][threadIdx.x][1],A);

}

As you see I’m using constant and device memory.

The functions themselves use the inputs and ~7 auxiliary variables all unsigned int (no arrays) and only ±*__umulhi instructions.

The only compare is “carry=(aux>v)?1:0;” which is not a branch.

If you see anything, that leads to local memory usage, I would like to know it.

What I forgot to say is, what I really need (would save two registers in both of my functions) is an add with carry (addc) and a subtract with borrow function, which I read of in a post from the decuda author. That’s why writing ptx may be even better.

They’re related: automatic (“local”, in the C language sense) variables are first placed in registers, then if there are too many, they spill over into local (in the CUDA sense) memory.

Later, you say:

When nvcc inlines the functions, their auxiliary variables become automatic variables in the calling function. If you call only one of them, you can fit all these automatic variables into registers. But when you call both functions, you add 7 variables, some of which spill over into local memory.

Having said that: note that just because the ptx code shows something as being in local memory doesn’t mean that it’s there in the final cubin file. The ptx assembler can optimize away local memory uses by reusing registers.

Yes, ptxas is aggressive in reusing registers. However, ptx just deals with virtual registers. The programming guide’s description gives the exact opposite view. Just because ptx uses registers, the ptxas tool can still convert those registers to local memory. If ptx puts something in local memory, then it should stay there. Generally, avoid putting arrays in kernels and maximize the use of shared memory.

What does the extern keyword do?

Has anyone used ptx or cudasm to generate real-world kernels, not just small benchmarks, and gotten significant performance improvements, say at least a factor of two over optimized CUDA code?

So it’s even more tricky than I thought. However, I don’t think that in with this code I’ll use decuda/cudasm.

NVCC seems to do everything to keep the register count below 16. Maybe there is a way to force it use 32 registers or set local memory to zero?

Still the question remains: what is theoretically faster in a 1500l code

256 threads, 16 regs, ~200 local memory accesses, no shared memory accesses

or

256 threads, 32 regs, no local memory accesses, no shared memory accesses

In the first case the occupancy would be 67% in the second 33% but the first one would need 200*200 more clocks.

For my 1500 lines*3 clocks(assume:avg) the whole programm would need 1/10 of the clocks. So ~5x speedup. Is this assumption correct?

They mean that the declaration is somewhere else (my main.cu).

So unsigned int[4] or even unsigned int[4][4] is bad in general?

Should I better use uint4 and unt4 a,b,c,d?

Registers are not indexable. So, if you create a local array and access array[i], the compiler will use local memory. This is probably what is leading to your local memory usage. The compiler will happily use 32 or more registers without spilling into local memory.

  1. Has anyone used ptx or cudasm to generate real-world kernels, not just small benchmarks, and gotten significant performance improvements, say at least a factor of two over optimized CUDA code?

  2. If the index into the array is a compile time constant and the array is small, then the nvcc compiler may choose to store each element into a register. If the indices into the array are not compile time constants, then the array will almost certainly be put into local memory. Thus, int4 may be better than int tmp[4]. In general, if you need lots of memory use the shared memory. That is what it is there for. The three clocks average figure seems to be pulled out of thin air. Local memory accesses should have 400 to 600 cycles of memory latency. You only pay the full latency penalty when there is no way to reorder instructions such that the memory accesses are non-blocking. Beyond 192 threads per a multiprocessor, increasing occupancy does not increase performance unless there are stalls due to memory accesses or you are using some of the higher latency instructions that execute on the SFUs.

Thank you, your suggestion not to use arrays was correct. Now I have just a few local memory accesses (but no idea why they are used there). However, the register count exploded (not it’s 41) which ruins the overall performance.
The most strange thing is - if I use shared memory instead of one of the uint4 from my registers the register count doesn’t change but the memory is used.
Here I have to assert that in the ideal implementation no more than 30 registers and no other memory should be used if one reuses them perfectly. Moreover the one uint4 I relocated into shared memory cannot be reused if it is in the register space, that’s why I’m confused that the register count remained high.

One more thing:
passing everything by reference just increased the local memory usage, the register count remained the same.