How to use --ptxas output with respect to optimization

Could someone enlighten me or post an example as to how you might use the --ptxas memory usage output to guide optimization efforts?

I see it lists register usage, shared memory usage, constant memory usage for all of my kernels. Perhaps not coincidentally, my slowest kernel has the highest register count. How do I know how many registers is too many? What is the significance of smem usages of “32+16”, etc.?

Perhaps a related question: some of my kernels are things like:

void foo(const Complex * ptrA, float * ptrB);

foo() is called hundreds of times, and the arguments are always the same each call. Does this mean it’s a good idea to move ptrA and ptrB to device constant memory, and omit the arguments? Would this reduce my register usage (at the expense of more cmem being used)? Could this improve performance?

Thanks!

Hi,

Use the output with the CUDA Occupancy Calculator (included in the SDK under the tools directory). I found the meaning of the smem “32+16” notation in the NVCC manual (included in the Toolkit and usually installed on Linux under /usr/local/cuda/doc).
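To make the calculator's arithmetic concrete, here is a rough host-side sketch (plain C++, which also compiles inside a .cu file) of how register and shared memory usage turn into an occupancy figure. The names SmLimits, residentBlocks and occupancy are made up for illustration, the default limits are the published per-multiprocessor figures for compute capability 1.0/1.1 hardware, and the model ignores the allocation granularity that the real calculator applies, so treat the result as an estimate only.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits. Defaults are the compute capability 1.0/1.1
// figures (an assumption for this sketch); substitute your GPU's values.
struct SmLimits {
    int maxThreads  = 768;    // resident threads per multiprocessor
    int maxBlocks   = 8;      // resident blocks per multiprocessor
    int registers   = 8192;   // 32-bit registers per multiprocessor
    int sharedBytes = 16384;  // shared memory per multiprocessor
    int warpSize    = 32;
};

// Simplified estimate of resident blocks per multiprocessor: the minimum
// over the thread, block, register, and shared memory limits.
// Allocation granularity is deliberately ignored.
int residentBlocks(const SmLimits& sm, int threadsPerBlock,
                   int regsPerThread, int smemPerBlock) {
    assert(threadsPerBlock > 0 && regsPerThread >= 0 && smemPerBlock >= 0);
    int blocks = std::min(sm.maxBlocks, sm.maxThreads / threadsPerBlock);
    if (regsPerThread > 0)
        blocks = std::min(blocks, sm.registers / (regsPerThread * threadsPerBlock));
    if (smemPerBlock > 0)
        blocks = std::min(blocks, sm.sharedBytes / smemPerBlock);
    return blocks;
}

// Occupancy = resident warps / maximum resident warps.
double occupancy(const SmLimits& sm, int threadsPerBlock,
                 int regsPerThread, int smemPerBlock) {
    int warpsPerBlock = (threadsPerBlock + sm.warpSize - 1) / sm.warpSize;
    int maxWarps = sm.maxThreads / sm.warpSize;
    int blocks = residentBlocks(sm, threadsPerBlock, regsPerThread, smemPerBlock);
    return static_cast<double>(blocks * warpsPerBlock) / maxWarps;
}
```

Under these assumed limits, a 256-thread block using 10 registers per thread and 48 bytes of smem fits 3 blocks per multiprocessor (24 of 24 warps). At 32 registers per thread only 1 block fits (8 of 24 warps), which is one way to see when a register count starts to be "too many".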

ptrA and ptrB are not constants. They are decided at run-time by the CUDA Runtime when you call cudaMalloc. I don’t see how you can make them constant.

Rodrigo

Hi Rodrigo-

Thanks for your response. I have been poking around with the occupancy calculator. I’m interested to hear any anecdotes about how the occupancy calculator helped anybody optimize, i.e. “I saw my occupancy was limited to X because of Y, therefore I did Z, and saw performance improvements.”

Re: device constants: You are right that the memory addresses aren’t known at compile time, but you can obtain the memory locations via cudaMalloc, then copy the resulting addresses to the device constants using cudaMemcpyToSymbol. I made this change, and it still produced correct results. The net effect seems to be less smem used but more cmem (which I guess implies that arguments to kernels normally get stored in smem). I’m not sure whether I saw any performance improvement.
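For anyone who wants to try the same thing, here is a minimal sketch of the approach, not the code from my project. The Complex struct, the symbol names g_ptrA and g_ptrB, the kernel body and the sizes are all placeholders I made up for illustration. As far as I know, kernel arguments are passed through shared memory only on compute capability 1.x devices; on later hardware they already go through constant memory, so there may be little to gain from this there.

```cuda
// Build with: nvcc --ptxas-options=-v example.cu
// (the -v output shows the per-kernel register, smem and cmem usage)
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the Complex type used in the thread.
struct Complex { float re, im; };

// Device pointers kept in constant memory instead of being passed as
// kernel arguments. The names g_ptrA and g_ptrB are illustrative.
__constant__ const Complex* g_ptrA;
__constant__ float* g_ptrB;

// Illustrative kernel: writes the squared magnitude of each input element.
__global__ void foo(int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        Complex c = g_ptrA[i];
        g_ptrB[i] = c.re * c.re + c.im * c.im;
    }
}

int main() {
    const int n = 1024;
    Complex* dA = nullptr;
    float* dB = nullptr;
    cudaMalloc(&dA, n * sizeof(Complex));
    cudaMalloc(&dB, n * sizeof(float));
    cudaMemset(dA, 0, n * sizeof(Complex));

    // Copy the pointer values themselves (not the data they point to)
    // into the constant-memory symbols, once, before the repeated launches.
    cudaMemcpyToSymbol(g_ptrA, &dA, sizeof(dA));
    cudaMemcpyToSymbol(g_ptrB, &dB, sizeof(dB));

    // The kernel is launched many times with the same pointers.
    for (int iter = 0; iter < 100; ++iter)
        foo<<<(n + 255) / 256, 256>>>(n);

    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaFree(dA);
    cudaFree(dB);
    return 0;
}
```

Comparing the -v output of this version against one that takes ptrA and ptrB as arguments is the easiest way to see whether registers, smem or cmem actually change for a given kernel and architecture.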

When implementing mathematical morphology operations, I saw my occupancy was limited to less than 20% because of high shared memory usage, therefore I did some code optimization to cut down the shared memory usage, and saw performance improvements.

;)