I ran the occupancy calculator for two versions of my code. One version uses 15 registers and the other uses 11; in both cases I use a block size of 128 threads. The 15-register version basically keeps more locals in registers (e.g. common expressions like 2*x, which is used in array indexing and elsewhere, are precalculated and stored in a register). Both versions use 48 words(?) of shared memory according to the cubin output.
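For concreteness, the kind of hand-hoisting I mean looks roughly like this (a minimal sketch, not my actual code; the kernel and variable names are made up):

```cuda
// 11-register style: recompute the common expression 2*x at each use
__global__ void kernel_recompute(float *out, const float *in, int n)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < n) {
        out[2 * x]     = in[2 * x] * 0.5f;
        out[2 * x + 1] = in[2 * x + 1] * 0.5f;
    }
}

// 15-register style: precompute 2*x once and hold it in a register
__global__ void kernel_hoisted(float *out, const float *in, int n)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (x < n) {
        int x2 = 2 * x;  // common subexpression kept in a register
        out[x2]     = in[x2] * 0.5f;
        out[x2 + 1] = in[x2 + 1] * 0.5f;
    }
}
```

The hoisted version trades a few extra registers per thread for fewer integer operations per memory access.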
According to the spreadsheet, using 4 more registers should have dropped my GPU occupancy from around 80% to 67%.
But the CUDA profiler reports that the 11-register version actually runs a fair bit slower (2100 microseconds) than the 15-register version (1700 microseconds).
I am happy to see my effort at optimization has paid off (400 microseconds faster), but it doesn't seem to match what the occupancy calculator predicts. Has anyone come across a reasonable explanation?
BTW, shouldn't the nvcc compiler be doing common subexpression elimination at -O3, rather than me doing it by hand?