Occupancy Query Performance not as expected

I have a constant variable which is used by my kernel to perform a computation on an array of input structures.

My structures are 68 bytes, and all fields in the input structure are needed. The kernel first loads all structures into shared memory, where the computations are done, and then writes the updated structures back to global memory. Apart from the initial read and final write, no more memory accesses are needed; all the rest of the work in the kernel is computation in shared memory, with each thread only accessing its own element (ie, no bank conflicts, etc).

The size of my structure is obviously causing problems wrt the amount of shared memory available.

My first solution was:

128 threads per block
8704 + 32 (from the cubin file) bytes of shared memory.
9 registers

Data from the calculator
Active threads per MP: 128
Active warps per MP: 4
Active thread blocks per MP: 1
Occupancy: 13%

I’m obviously being limited here by the amount of shared memory.

The second solution is:

64 threads per block
4352 + 32 bytes of shared memory
9 registers

Data from calculator:
Active threads per MP: 192
Active warps per MP: 6
Active thread blocks per MP: 3
Occupancy: 19%
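For reference, both occupancy figures can be reproduced from the shared-memory limit alone. A minimal host-side sketch, assuming a compute 1.2/1.3 part (16 KB of shared memory and 32 resident warps per MP are my assumptions, since 4/32 ≈ 13% and 6/32 ≈ 19% match the calculator output above; allocation-granularity rounding is ignored because it doesn't change the result here):

```cpp
#include <cassert>

// Reproducing the occupancy calculator's shared-memory limit.
// smemPerMP = 16384 and maxWarpsPerMP = 32 are assumed (compute 1.2/1.3).
struct Occupancy {
    int blocksPerMP;
    int warpsPerMP;
    double fraction;
};

Occupancy sharedMemLimit(int threadsPerBlock, int smemPerBlock,
                         int smemPerMP = 16384, int maxWarpsPerMP = 32)
{
    int blocks = smemPerMP / smemPerBlock;          // blocks that fit in smem
    int warpsPerBlock = (threadsPerBlock + 31) / 32;
    int warps = blocks * warpsPerBlock;
    return { blocks, warps, double(warps) / maxWarpsPerMP };
}
```

Plugging in the two configurations (8704 + 32 bytes at 128 threads, 4352 + 32 bytes at 64 threads) gives 1 block / 4 warps and 3 blocks / 6 warps per MP, matching the calculator.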

Strangely, however, solution 1 performs better than solution 2. On an input of 1,280,000 structures, solution 1 takes ~21ms to complete; solution 2 takes ~24ms.

Could anyone explain to me why I’m seeing these performance differences? Or even, how to optimize my kernel, given the amount of shared memory it needs.

Cheers :)

It’s really not that surprising, I have often had kernels with lower occupancy that perform better than some with higher occupancy. Occupancy only gives the kernel a better chance to avoid idling, and from that standpoint, I think both your occupancies are small. Once you get above 30% or so, occupancy is less of an issue. On the other hand, it is recommended that you maintain at least 128, better 256, threads per block.

What are your Profiler indicators? Do you get any thread serialization? Because, even though each thread accesses its own element, the elements are 68 bytes each. How about coalescing?
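One common fix when 68-byte structs break coalescing on 1.x hardware is to stage them word by word, with consecutive threads reading consecutive 32-bit words instead of each thread striding over a whole struct. A CPU-side sketch of that indexing (the struct layout and names are made up for illustration), showing the round trip is lossless:

```cpp
#include <cassert>
#include <cstdint>

// A made-up 68-byte record standing in for the poster's structure.
struct Record {
    float f[17];                       // 17 * 4 = 68 bytes
};
static_assert(sizeof(Record) == 68, "expected a 68-byte struct");

// Word-wise staging: rather than each thread copying its own whole struct
// (a 68-byte stride, which cannot coalesce on 1.x hardware), the block
// copies the structs as one flat array of 32-bit words, consecutive
// threads touching consecutive words. This host-side loop reproduces the
// per-thread indexing a CUDA block would use.
void stageBlock(const Record *global, Record *shared,
                int blockDim, int nStructs)
{
    const uint32_t *src = reinterpret_cast<const uint32_t *>(global);
    uint32_t *dst = reinterpret_cast<uint32_t *>(shared);
    int totalWords = nStructs * (int)(sizeof(Record) / 4);

    // "thread" tid copies words tid, tid + blockDim, tid + 2*blockDim, ...
    for (int tid = 0; tid < blockDim; ++tid)
        for (int w = tid; w < totalWords; w += blockDim)
            dst[w] = src[w];
}
```

On the device the two loops collapse into one strided loop per thread, followed by a __syncthreads() before any thread touches its struct.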

One cannot really help you optimize your kernel unless you post some code.

You shouldn’t be overly bothered by occupancy, it’s not really a very good performance predictor.

What I’d like to know is how you load the structures? Is it one thread per structure? Sort of like this?

__global__ void kernel(yourStruct *data) {

__shared__ yourStruct shdata[128];

int tid = threadIdx.x + blockIdx.x*blockDim.x;

shdata[threadIdx.x] = data[tid];   // load one struct per thread

//computations on shdata[threadIdx.x]

data[tid] = shdata[threadIdx.x];   // write the result back

}

Yes, my code is structured as above :)

Actually, it is recommended to have 192 threads per multiprocessor. vvolkov (of the fast SGEMM and FFT) has shown some interesting results where it turns out that 64 threads per block is actually the optimum when kernels are compute bound.

Are you referring to section 3.4 of Volkov’s awesome LU QR Cholesky paper?

I always wondered about that. It says that 2 warps are faster than 1, even if there are 2 blocks each with 1 warp.

What it doesn’t answer is if 3 or 4 warps was faster or slower than 2.

But 2 warps in one block is close to optimal throughput (98%).

There’s also some popular voodoo about odd warp counts being less efficient (ie, run 2, 4, or 6 warps, not 1, 3, or 5)

but I don’t know the source of this idea other than Volkov’s conclusive “don’t run 1!” test.

Of course I’m lazy, I should just write my own tests. But maybe Vasily will drop in and fill in some details.

My own tests have independently confirmed “don’t run 1”, but only on compute 1.0 hardware: just about every kernel in HOOMD is significantly slower with a block size of 32. The key is that I’ve only seen this on 1.0 hardware. I’ve got some code that actually runs optimally with 1 warp on GTX 280.

I always take the simple approach: you guys can all talk/argue/speculate/whatever until your fingers are blue, but when it comes down to it there simply is no substitute for experimentation (and this coming from a theoretical physics guy…). Just write up your kernel to run with any block size and benchmark the darn thing. It will take less time than responding to these forum posts, and you will get your answer as to what is the fastest block size for any particular kernel (on the hardware you are benchmarking, at least). I’ve written scripts to do this long ago in HOOMD, and I’ve never noticed any patterns in the output as the kernels have evolved and needed retuning.
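The sweep such a script does can be sketched in a few lines. This is a host-side stand-in, not the actual HOOMD tuner: the launch callback is a placeholder for a real kernel launch plus synchronization.

```cpp
#include <cassert>
#include <chrono>

// Brute-force block-size sweep in the spirit of the advice above: time the
// kernel at every candidate block size and keep the fastest. The Launch
// callback stands in for "launch kernel with this block size and sync".
template <typename Launch>
int fastestBlockSize(Launch launch, int lo = 32, int hi = 512, int step = 32)
{
    int best = lo;
    double bestTime = 1e300;
    for (int bs = lo; bs <= hi; bs += step) {
        auto t0 = std::chrono::steady_clock::now();
        launch(bs);                                // run the kernel at size bs
        double t = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - t0).count();
        if (t < bestTime) { bestTime = t; best = bs; }
    }
    return best;
}
```

In practice you'd launch each size a few hundred times and average, so the timer noise doesn't pick the winner for you.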

Out of curiosity, did you check on 1.1 hardware?

Of course. And since I’m sitting at a compute 1.1 machine now, I can even generate plots (python + matplotlib rocks). Here are timing measurements vs block size for 2 key kernels in HOOMD, timed on a 9800 GTX. (Note, I just noticed the time axis is mislabeled… it is in seconds, not milliseconds.)

In the case of lj: there are 22 regs used and little to no smem. The occupancy calculator predicts the highest occupancy is out near ~300. This matches the measured results: so here is one case where occupancy did seem to track with performance.

For nlist: there are 20 regs used and little to no smem. The occupancy calculator predicts max occupancy at 128, 192 and 384 block sizes. The measured performance in these regions is bad! The best performing block size is 160.

The performance fluctuations are 15% from fastest to slowest, so this is a very significant effect.

Edit: ack, file names were removed after posting. lj is on the left and nlist is on the right.
Attachments: nlist.png, lj.png

For the curious: here is exactly the same benchmark on S1070 (sorry, no pretty graphs this time. The server doesn’t have matplotlib)

lj:

32  0.002436818
64  0.002678256
96  0.002605731
128 0.002797218
160 0.002566049
192 0.002718279
224 0.002561091
256 0.002683243
288 0.002706816
320 0.002765698
352 0.002416811
384 0.002433111
416 0.002559374
448 0.002622363
480 0.002700192
512 0.002794905

Fastest block is 1.1574003925 faster than the slowest

Fastest block size is:  352

This was the one I was referring to. While 32 is not the fastest block size in this run, it loses to 352 only by a hair. For some reason, 32 tends to win a lot more on GTX 280 vs S1070: maybe something to do with the different memory layouts.

For those who still care about occupancy after my debunking: 352 is at the top of the occupancy charts, but 32 is way down at the bottom. Somebody explain that one.

nlist:

32  0.03850846
64  0.03653006
96  0.0371442
128 0.03734375
160 0.03772143
192 0.03745411
224 0.03849534
256 0.0378034
288 0.03847842
320 0.03637109
352 0.03728528
384 0.03741516
416 0.0399726
448 0.03599329
480 0.03841788
512 0.03708578

Fastest block is 1.11055699548 faster than the slowest

Fastest block size is:  448

Nothing new here.

For those keeping score: at least in my limited sample size here, we’ve got 2 points for the even warps and 2 points for the odd warps.

Any more block size myths to debunk? I’ve got lots more kernels I can benchmark with this script, so bring it on.

Thanks, very informative!

I think that is due to the fact that, according to the occupancy calculator, even if you use 1 warp in a block, 2 warps’ worth of registers are allocated.
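That accounting is easy to mimic. A sketch of how I read the 1.x occupancy spreadsheet (the granularity of 2 is my reading of it, so treat the numbers as an assumption):

```cpp
#include <cassert>

// Register accounting as in the 1.x occupancy spreadsheet: warps are
// allocated in groups of warpAllocGranularity (assumed to be 2 here), so
// a 1-warp block is still charged for 2 warps' worth of registers.
int registersPerBlock(int threadsPerBlock, int regsPerThread,
                      int warpAllocGranularity = 2)
{
    int warps = (threadsPerBlock + 31) / 32;
    int allocatedWarps = (warps + warpAllocGranularity - 1)
                         / warpAllocGranularity * warpAllocGranularity;
    return allocatedWarps * 32 * regsPerThread;
}
```

So at the 9 registers quoted at the top of the thread, a 32-thread block is charged the same 576 registers as a 64-thread block.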

And I do heartily agree with MrAnderson: the only way to know what works best is to benchmark. It is not possible to easily change the block size in all of my kernels, but that is always how I try to start out.