I have a constant variable which my kernel uses to perform a computation on an array of input structures.
My structures are 68 bytes, and all fields in the input structure are needed. The kernel first loads all structures into shared memory, where the computations are done, and then writes the updated structures back to global memory. Apart from the initial read and final write, no further memory accesses are needed; all the remaining work in the kernel is computation in shared memory, with each thread only accessing its own element (i.e., no bank conflicts, etc.).
The size of my structure is obviously causing problems with respect to the amount of shared memory available.
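For concreteness, here is a minimal sketch of the pattern described above; the structure fields, the constant, and the computation are hypothetical stand-ins, since the real code isn't shown:

    // Hypothetical 68-byte structure: 17 x 4-byte words (real fields not shown).
    struct Elem { float data[17]; };

    __constant__ float c_param;                   // stand-in for the constant variable

    // Assumes a launch of N/128 blocks of 128 threads; bounds checks omitted.
    __global__ void process(Elem *g_elems)
    {
        __shared__ Elem s_elems[128];             // 128 threads * 68 bytes = 8704 bytes

        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        s_elems[threadIdx.x] = g_elems[gid];      // initial read into shared memory

        // ... all remaining work operates on s_elems[threadIdx.x] using c_param ...
        s_elems[threadIdx.x].data[0] *= c_param;  // stand-in for the real computation

        g_elems[gid] = s_elems[threadIdx.x];      // final write back to global memory
        // No __syncthreads() needed, since each thread touches only its own element.
    }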
My first solution was:
128 threads per block
8704 + 32 bytes of shared memory (from the cubin file)
9 registers
Data from the occupancy calculator:
Active threads per MP: 128
Active warps per MP: 4
Active thread blocks per MP: 1
Occupancy: 13%
I’m obviously being limited here by the amount of shared memory.
The second solution is:
64 threads per block
4352 + 32 bytes of shared memory
9 registers
Data from the occupancy calculator:
Active threads per MP: 192
Active warps per MP: 6
Active thread blocks per MP: 3
Occupancy: 19%
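For reference, the arithmetic behind those calculator numbers, assuming a GT200-class part with 16 KB of shared memory and a 32-warp limit per MP (those assumptions reproduce the 13% and 19% figures):

    Solution 1: floor(16384 / 8736) = 1 block/MP -> 1 x 128 = 128 threads = 4 warps -> 4/32 = ~13%
    Solution 2: floor(16384 / 4384) = 3 blocks/MP -> 3 x 64 = 192 threads = 6 warps -> 6/32 = ~19%

In both cases shared memory, not the 9 registers per thread, is the limiting resource.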
Strangely, however, solution 1 performs better than solution 2: on an input of 1,280,000 structures, solution 1 takes ~21 ms to complete, while solution 2 takes ~24 ms.
Could anyone explain to me why I'm seeing these performance differences? Or, better yet, how to optimize my kernel, given the amount of shared memory it needs?
It’s really not that surprising; I have often had kernels with lower occupancy that perform better than ones with higher occupancy. Occupancy only gives the kernel a better chance to avoid idling, and from that standpoint I think both of your occupancies are small; once you get above 30% or so, occupancy becomes less of an issue. On the other hand, it is recommended that you maintain at least 128, and preferably 256, threads per block.
What do your profiler counters show? Do you get any warp serialization? Even though each thread accesses its own element, the elements are 68 bytes each. How about coalescing?
One cannot really help you optimize your kernel unless you post some code.
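For what it’s worth: with a 68-byte struct, a per-thread struct copy turns into scattered, uncoalesced transactions. One way around that (a sketch only, reusing the hypothetical Elem layout from the earlier sketch) is to stage the copy word by word, with consecutive threads reading consecutive 32-bit words:

    #define BLOCK 128
    #define WORDS_PER_ELEM 17                                // 68 bytes / 4

    struct Elem { float data[WORDS_PER_ELEM]; };

    __global__ void process_coalesced(Elem *g_elems)
    {
        __shared__ Elem s_elems[BLOCK];

        // Consecutive threads read consecutive words, so the block's 128
        // structures come in as coalesced transactions rather than 68-byte strides.
        const float *src = (const float *)(g_elems + blockIdx.x * BLOCK);
        float       *dst = (float *)s_elems;
        for (int i = threadIdx.x; i < BLOCK * WORDS_PER_ELEM; i += blockDim.x)
            dst[i] = src[i];
        __syncthreads();                                     // threads loaded each other's words

        // ... computation on s_elems[threadIdx.x] as before ...

        __syncthreads();
        float *out = (float *)(g_elems + blockIdx.x * BLOCK);
        for (int i = threadIdx.x; i < BLOCK * WORDS_PER_ELEM; i += blockDim.x)
            out[i] = dst[i];                                 // coalesced write-back
    }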
Actually, it is recommended to have at least 192 threads per multiprocessor. vvolkov (of the fast SGEMM and FFT) has shown some interesting results where 64 threads per block turns out to be optimal when kernels are compute-bound.
My own tests have independently confirmed “don’t run 1”, but only on compute 1.0 hardware: just about every kernel in HOOMD is significantly slower with a block size of 32. The key point is that I’ve only seen this on 1.0 hardware; I’ve got some code that actually runs optimally with 1 warp on a GTX 280.
I always take the simple approach: you guys can all talk/argue/speculate/whatever until your fingers are blue, but when it comes down to it there simply is no substitute for experimentation (and this coming from a theoretical physics guy…). Just write up your kernel to run with any block size and benchmark the darn thing. It will take less time than responding to these forum posts, and you will get your answer as to what is the fastest block size for any particular kernel (on the hardware you are benchmarking, at least). I wrote scripts to do this long ago in HOOMD, and I’ve never noticed any patterns in the output as the kernels have evolved and needed retuning.
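To illustrate, a minimal sketch of such a sweep using CUDA events; the kernel, struct, and problem size here are placeholders, and dynamic shared memory is used so the per-block buffer scales with the block size:

    #include <cstdio>

    struct Elem { float data[17]; };              // hypothetical 68-byte structure

    __global__ void process(Elem *g, int n)
    {
        extern __shared__ Elem s[];               // sized at launch: block * sizeof(Elem)
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        if (gid < n) {
            s[threadIdx.x] = g[gid];
            s[threadIdx.x].data[0] += 1.0f;       // stand-in for the real computation
            g[gid] = s[threadIdx.x];
        }
    }

    int main()
    {
        const int N = 1280000;                    // matches the input size in the post
        Elem *d;
        cudaMalloc((void **)&d, N * sizeof(Elem));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        for (int block = 32; block <= 512; block += 32) {
            int grid = (N + block - 1) / block;
            size_t smem = (size_t)block * sizeof(Elem);

            process<<<grid, block, smem>>>(d, N); // warm-up; also catches bad configs
            if (cudaGetLastError() != cudaSuccess) {  // e.g. smem exceeds the per-block limit
                printf("block %3d: launch failed, skipped\n", block);
                continue;
            }

            cudaEventRecord(start, 0);
            for (int rep = 0; rep < 100; ++rep)
                process<<<grid, block, smem>>>(d, N);
            cudaEventRecord(stop, 0);
            cudaEventSynchronize(stop);

            float ms;
            cudaEventElapsedTime(&ms, start, stop);
            printf("block %3d: %.3f ms/launch\n", block, ms / 100.0f);
        }

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d);
        return 0;
    }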
Of course. And since I’m sitting at a compute 1.1 machine now, I can even generate plots (Python + matplotlib rocks). Here are timing measurements vs. block size for two key kernels in HOOMD, timed on a 9800 GTX. (Note: I just noticed the time axis is mislabeled; it is in seconds, not milliseconds.)
In the case of lj: 22 registers are used and little to no smem. The occupancy calculator predicts the highest occupancy out near a block size of ~300, which matches the measurements: so here is one case where occupancy did seem to track with performance.
For nlist: 20 registers are used and little to no smem. The occupancy calculator predicts maximum occupancy at block sizes of 128, 192, and 384. The measured performance in those regions is bad! The best-performing block size is 160.
The performance fluctuations are 15% from fastest to slowest, so this is a very significant effect.
Edit: ack, file names were removed after posting. lj is on the left and nlist is on the right.
This was the one I was referring to. While 32 is not the fastest block size in this run, it loses to 352 only by a hair. For some reason, 32 tends to win a lot more on GTX 280 vs S1070: maybe something to do with the different memory layouts.
For those who still care about occupancy after my debunking: 352 is at the top of the occupancy charts, but 32 is way down at the bottom. Somebody explain that one.
I think that is due to the fact that, according to the occupancy calculator, even if you use one warp in a block, two warps’ worth of registers are allocated.
And I do heartily agree with MrAnderson: the only way to know what works best is to benchmark. Not all of my kernels can easily change block size, but that is always how I try to start out.