CUDA Occupancy Calculator accuracy?

I ran the occupancy calculator for two versions of my code. One version uses 15 registers and the other uses 11 registers. In both cases I use a block size of 128 threads. The 15-register version basically stores more locals in registers (e.g. calculations like 2*x, which is used in array indexing and other common expressions, are precalculated and stored in a register). Both cases use 48 words(?) of shared memory according to the cubin output.

According to the spreadsheet, using 4 more registers should have dropped my GPU occupancy from around 80% to 67%.

But the CUDA profiler reports that the 11-register version runs a fair bit slower (2100 microseconds) than the 15-register version (1700 microseconds).

I am happy to see my effort at optimization has paid off (400 microseconds better), but it doesn’t seem to match what the occupancy calculator predicts. Has anyone come across a reasonable explanation?

BTW, shouldn’t the nvcc compiler be doing common subexpression elimination when I use -O3, rather than me doing it by hand?


Higher occupancy doesn’t guarantee better performance…

You can have a poor-performing algorithm that has 100% occupancy, and another, much smarter one that uses a lot of registers to eliminate redundant calculations, ends up running at 33% occupancy, and yet runs twice as fast. The same goes for arithmetic intensity. I have two very different versions of one of my kernels: one hits 268 GFLOPS and was the fastest version I’d been able to make myself. Mark Harris gave me some suggestions and a modification to my code that dropped the arithmetic rate down to 200 GFLOPS, but his version runs up to 7% faster than mine did. :-)



I think nvcc currently ignores the optimization settings when producing cubins. The comment at the top of the ptx file suggests it always uses the same flags. Run it with -v and you’ll see the flags passed on to the secondary compilers.


There are a couple of things to know about occupancy. (I’ll add these to the documentation in the calculator).

1.) Occupancy != Performance. If you are not bandwidth bound, then increasing occupancy won’t necessarily increase performance. If you already have at least one thread block per multiprocessor, and you are bound by computation and not by global memory accesses, then increasing occupancy may have no effect. In fact, making changes just to increase occupancy can have other effects, such as additional instructions, spills to local memory (which is off chip), divergent branches, etc. You need to experiment (as you did) to see how your changes affect your wall clock time.

2.) The “smem=” shared memory usage reported in the cubin file is only the statically allocated shared memory used by the kernel, which includes shared variables that are statically sized, and parameters. It does not include dynamically allocated shared memory (“extern __shared__ float foo[]”), because this is only determined at run time. You need to add the value in the cubin to your known size of dynamically allocated memory before entering it into the calculator. (This isn’t in reference to your post, it’s just something I thought I should mention here.)
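A minimal sketch of the dynamic case, with a hypothetical kernel and sizes: the third execution-configuration parameter supplies the bytes backing `extern __shared__`, and it’s that number plus the cubin’s static “smem=” figure that belongs in the calculator.

```cuda
// Hypothetical kernel using dynamically allocated shared memory.
__global__ void scale(float *data, float factor, int n)
{
    extern __shared__ float buf[];   // size set at launch, not compile time

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        buf[threadIdx.x] = data[i] * factor;
        data[i] = buf[threadIdx.x];
    }
}

// Host side: the third <<<>>> argument is the dynamic byte count.
// Calculator input = this value + the static "smem=" from the cubin.
// scale<<<grid, 128, 128 * sizeof(float)>>>(d_data, 2.0f, n);
```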