CUDA Memory Bank Layout: Interleaving, Addressing, Conflicts

Hi there,

A quick postscript to this topic about accessing global memory. I was finding severe slowdowns for particular matrix sizes (all still multiples of 16, to ensure coalesced global memory reads at all times) in a Cholesky factorization routine, and suspected it had to do with how global memory is partitioned between memory channels on the card. The problem was particularly nasty since it affected different cards in different ways.

My original code accessed “block-columns” (i.e. strips 16 wide), which on reflection should have been particularly susceptible to problems, since for certain widths all the data in a block-column could then map to a single channel. I’ve now rewritten the code to access “block-rows” (i.e. strips 16 high), so the data in a block-row must be shared between channels. (New code here.)
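To make the channel-mapping argument concrete, here is a toy host-side calculation of which partition each row of a 16-wide block-column starts in. The 256-byte partition width and the six partitions (one per 64-bit channel on a 384-bit card) are my assumptions for illustration, not figures established in this thread:

#include <stdio.h>

#define PARTITION_BYTES 256   /* assumed partition (channel) width */
#define NUM_PARTITIONS  6     /* assumed: 384-bit bus = 6 x 64-bit channels */

/* Partition holding the byte at offset addr from the array base. */
static int partition_of(size_t addr)
{
    return (int)((addr / PARTITION_BYTES) % NUM_PARTITIONS);
}

int main(void)
{
    size_t widths[2] = { 12288, 12304 };           /* floats per row */
    for (int w = 0; w < 2; ++w) {
        size_t pitch = widths[w] * sizeof(float);  /* row pitch in bytes */
        printf("width %5zu:", widths[w]);
        for (size_t row = 0; row < 8; ++row)       /* column 0 of rows 0..7 */
            printf(" %d", partition_of(row * pitch));
        printf("\n");
    }
    return 0;
}

/* At width 12288 the pitch is 49152 = 192*256 bytes, and 192 is divisible
   by 6, so every row of a block-column starts in the same partition; at
   width 12304 the rows drift across the partitions instead. */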

It turns out that the severe (i.e. up to a factor of 2 or 3) slowdowns are gone, and an 8800 GTX now almost always beats an 8800 GTS 640MB, as one would hope. The runtime still does not increase quite uniformly with matrix size (e.g. a 12304x12304 matrix takes 7.8s whereas a 12288x12288 matrix takes 8.9s on a GTX, and a 12480x12480 matrix takes a particularly long 10.9s), but other effects like load balancing between blocks might also be having some effect at this level.

So it does seem that if one sees substantial slowdowns for certain array sizes in a program that does a lot of memory accesses, it might be worth either accessing the data more “row-wise” than “column-wise” or else padding the array somewhat (though the latter is tricky because of the card-dependence).
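For what it’s worth, here is a minimal sketch of the padding idea. The partition constants are again assumptions, and card-dependent, which is exactly what makes this tricky:

#include <stddef.h>
#include <cuda_runtime.h>

#define PARTITION_BYTES 256   /* assumed, card-dependent */
#define NUM_PARTITIONS  6     /* assumed, card-dependent */

/* Round the pitch up to a multiple of 64 bytes (for coalescing), then
   add one extra 64-byte step whenever it would otherwise be a multiple
   of the full partition stride. */
static size_t padded_pitch(size_t width_floats)
{
    size_t pitch = ((width_floats * sizeof(float) + 63) / 64) * 64;
    if (pitch % (PARTITION_BYTES * NUM_PARTITIONS) == 0)
        pitch += 64;
    return pitch;
}

/* Allocate a height x width matrix with the padded pitch (in bytes). */
static float *alloc_padded_matrix(size_t height, size_t width_floats,
                                  size_t *pitch_bytes)
{
    void *ptr = NULL;
    *pitch_bytes = padded_pitch(width_floats);
    if (cudaMalloc(&ptr, height * *pitch_bytes) != cudaSuccess)
        return NULL;
    return (float *)ptr;
}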

Best,
Steven.

I wonder how your observation is related to the peaks and valleys at the right side of the attached graph (results of bandwidthtest.exe, device-to-device, shmoo).

Hi there,

Without knowing how cudaMemcpy() works (and other hardware details), it’s hard to say for sure. The way memory is partitioned between channels must have an effect at some level, but perhaps shouldn’t be too important for copying one contiguous region of memory to another. You could try allocating a (multiple of) 256-byte array “in between” the target and destination, which might cause or remove an unfavourable alignment, and see if the peaks move around at all. You might need to use a small increment in array sizes too to see what is going on.
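Roughly the kind of thing I have in mind (the sizes are illustrative, and note the runtime’s allocator may itself align buffers, so the spacer may or may not actually shift the relative offset):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t N = 32 << 20;                 /* 32 MB test buffers */
    for (size_t spacer = 0; spacer <= 1536; spacer += 256) {
        void *src = NULL, *gap = NULL, *dst = NULL;
        cudaMalloc(&src, N);
        if (spacer) cudaMalloc(&gap, spacer);  /* try to shift dst's alignment */
        cudaMalloc(&dst, N);

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        cudaEventRecord(t0, 0);
        cudaMemcpy(dst, src, N, cudaMemcpyDeviceToDevice);
        cudaEventRecord(t1, 0);
        cudaEventSynchronize(t1);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("spacer %4zu bytes: %.3f ms\n", spacer, ms);

        cudaEventDestroy(t0); cudaEventDestroy(t1);
        cudaFree(src); if (gap) cudaFree(gap); cudaFree(dst);
    }
    return 0;
}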

Best,
Steven.

Great research!

An obvious follow-up question: how is it best to divvy these memory partitions between threads? Obviously you’ve learned not to hammer one partition from all the threads, but are there nuances? E.g., if given the option, should a multiprocessor stick to using the same partition over and over? And what about “meta-coalescing”, not just inside warps but inside a block or across blocks? Thing is, a 64-bit GDDR3 partition must transfer data in multiples of 32 bytes but performs better if it transfers the whole 256-byte partition in one session. (EDIT: I was wrong about GDDR4/5.)
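For concreteness, here is a rough sketch of one way to spread concurrently launched blocks across partitions, by renumbering blocks along diagonals of the grid so that consecutive blocks start in different columns (and hence, hopefully, different partitions). The kernel and its names are purely illustrative assumptions, not something established in this thread:

__global__ void copy_diagonal(float *dst, const float *src,
                              int width, int height, int pitch_floats)
{
    /* Illustrative "diagonal" renumbering (assumes a square grid):
       blocks with consecutive launch indices get different starting
       columns rather than stacking on the same one. */
    int by = blockIdx.x;
    int bx = (blockIdx.x + blockIdx.y) % gridDim.x;

    int x = bx * blockDim.x + threadIdx.x;
    int y = by * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        dst[y * pitch_floats + x] = src[y * pitch_floats + x];
}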

I think somewhere in all this is the answer to why my kernel runs much better if I run only one block per multiprocessor.

That might be difficult given that it’s specified neither how warps nor how blocks are scheduled. However, feel free to try and share the results. Any gain is a gain, after all.

Hi there,

I did wonder whether reading float4s might be better, as Alex suggests, but apparently this is not the case. I haven’t tried too hard looking for optimal access patterns; I just wanted to understand and avoid the large slowdowns I was sometimes having. Of course it’d be great if there were some simple guidelines to follow that do lead to improved performance. (As theMarix says, though, it is not clear how much you can do inter-block over and above using all partitions.)
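For reference, this is the sort of float4 read meant here, as a minimal sketch with illustrative names; per the above, it did not help in this case:

__global__ void copy_float4(float4 *dst, const float4 *src, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4)              /* n4 = element count in float4s */
        dst[i] = src[i];     /* one 16-byte load/store per thread */
}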

After the sometimes-serious performance problems I had seen (processing block-columns of matrices is perhaps not too esoteric), I was personally hoping for some guidance in the CUDA 2.0 documentation on global memory access over and above coalescing (much as you get for CPUs with regard to caches, paging and channel interleaving). Maybe NVIDIA might consider this in time for 2.1?!

Thanks,
Steven.