CUDA Memory Bank Layout: Interleaving, Addressing, Conflicts

Hi there,

A quick postscript to this topic about accessing global memory. I was finding severe slowdowns for particular matrix sizes (all still multiples of 16, to ensure coalesced global memory reads at all times) in a Cholesky factorization routine, and suspected it had to do with how global memory is partitioned between memory channels on the card. The problem was particularly nasty since it affected different cards in different ways.

My original code accessed “block-columns” (i.e. strips 16 wide), which on reflection should have been particularly susceptible to problems, since for certain widths all the data in a block-column could then map to a single channel. I’ve now rewritten the code to access “block-rows” (i.e. strips 16 high), so the data in a block-row must be shared between channels. (New code here.)
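To make the channel-mapping argument concrete, here is a toy host-side calculation of which partition each row of a 16-wide block-column starts in. The 256-byte partition width and the six partitions (one per 64-bit channel on a 384-bit card) are my assumptions for illustration, not figures established in this thread:

#include <stdio.h>

#define PARTITION_BYTES 256   /* assumed partition (channel) width */
#define NUM_PARTITIONS  6     /* assumed: 384-bit bus = 6 x 64-bit channels */

/* Partition holding the byte at offset addr from the array base. */
static int partition_of(size_t addr)
{
    return (int)((addr / PARTITION_BYTES) % NUM_PARTITIONS);
}

int main(void)
{
    size_t widths[2] = { 12288, 12304 };           /* floats per row */
    for (int w = 0; w < 2; ++w) {
        size_t pitch = widths[w] * sizeof(float);  /* row pitch in bytes */
        printf("width %5zu:", widths[w]);
        for (size_t row = 0; row < 8; ++row)       /* column 0 of rows 0..7 */
            printf(" %d", partition_of(row * pitch));
        printf("\n");
    }
    return 0;
}

/* At width 12288 the pitch is 49152 = 192*256 bytes, and 192 is divisible
   by 6, so every row of a block-column starts in the same partition; at
   width 12304 the rows drift across the partitions instead. */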

It turns out that the severe (i.e. up to a factor of 2 or 3) slowdowns are gone, and an 8800 GTX now almost always beats an 8800 GTS 640MB, as one would hope. The runtime still does not increase quite uniformly with matrix size (e.g. a 12304x12304 matrix takes 7.8s whereas a 12288x12288 matrix takes 8.9s on a GTX, and a 12480x12480 matrix takes a particularly long 10.9s), but other effects like load balancing between blocks might also be having some effect at this level.

So it does seem that if one sees substantial slowdowns for certain array sizes in a program that does a lot of memory accesses, it might be worth either accessing the data more “row-wise” than “column-wise” or else padding the array somewhat (though the latter is tricky because of the card-dependence).
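For what it’s worth, here is a minimal sketch of the padding idea. The partition constants are again assumptions, and card-dependent, which is exactly what makes this tricky:

#include <stddef.h>
#include <cuda_runtime.h>

#define PARTITION_BYTES 256   /* assumed, card-dependent */
#define NUM_PARTITIONS  6     /* assumed, card-dependent */

/* Round the pitch up to a multiple of 64 bytes (for coalescing), then
   add one extra 64-byte step whenever it would otherwise be a multiple
   of the full partition stride. */
static size_t padded_pitch(size_t width_floats)
{
    size_t pitch = ((width_floats * sizeof(float) + 63) / 64) * 64;
    if (pitch % (PARTITION_BYTES * NUM_PARTITIONS) == 0)
        pitch += 64;
    return pitch;
}

/* Allocate a height x width matrix with the padded pitch (in bytes). */
static float *alloc_padded_matrix(size_t height, size_t width_floats,
                                  size_t *pitch_bytes)
{
    void *ptr = NULL;
    *pitch_bytes = padded_pitch(width_floats);
    if (cudaMalloc(&ptr, height * *pitch_bytes) != cudaSuccess)
        return NULL;
    return (float *)ptr;
}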

Best,
Steven.

I wonder how your observation is related to the peaks and valleys at the right side of the attached graph (results of bandwidthtest.exe, device-to-device, shmoo).

Hi there,

Without knowing how cudaMemcpy() works (and other hardware details), it’s hard to say for sure. The way memory is partitioned between channels must have an effect at some level, but perhaps shouldn’t be too important for copying one contiguous region of memory to another. You could try allocating a (multiple of) 256-byte array “in between” the target and destination, which might cause or remove an unfavourable alignment, and see if the peaks move around at all. You might need to use a small increment in array sizes too to see what is going on.
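Roughly the kind of thing I have in mind (the sizes are illustrative, and note the runtime’s allocator may itself align buffers, so the spacer may or may not actually shift the relative offset):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t N = 32 << 20;                 /* 32 MB test buffers */
    for (size_t spacer = 0; spacer <= 1536; spacer += 256) {
        void *src = NULL, *gap = NULL, *dst = NULL;
        cudaMalloc(&src, N);
        if (spacer) cudaMalloc(&gap, spacer);  /* try to shift dst's alignment */
        cudaMalloc(&dst, N);

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        cudaEventRecord(t0, 0);
        cudaMemcpy(dst, src, N, cudaMemcpyDeviceToDevice);
        cudaEventRecord(t1, 0);
        cudaEventSynchronize(t1);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        printf("spacer %4zu bytes: %.3f ms\n", spacer, ms);

        cudaEventDestroy(t0); cudaEventDestroy(t1);
        cudaFree(src); if (gap) cudaFree(gap); cudaFree(dst);
    }
    return 0;
}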

Best,
Steven.

Great research!

An obvious follow-up question: how is it best to divvy these memory partitions between threads? Obviously you’ve learned not to hammer one partition from all the threads, but are there nuances? E.g., if given the option, should a multiprocessor stick to using the same partition over and over? And what about “meta-coalescing”, not just inside warps but inside a block or across blocks? Thing is, a 64-bit GDDR3 partition must transfer data in multiples of 32 bytes but performs better if it transfers the whole 256-byte partition in one session. (EDIT: I was wrong about GDDR4/5.)
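For concreteness, here is a rough sketch of one way to spread concurrently launched blocks across partitions, by renumbering blocks along diagonals of the grid so that consecutive blocks start in different columns (and hence, hopefully, different partitions). The kernel and its names are purely illustrative assumptions, not something established in this thread:

__global__ void copy_diagonal(float *dst, const float *src,
                              int width, int height, int pitch_floats)
{
    /* Illustrative "diagonal" renumbering (assumes a square grid):
       blocks with consecutive launch indices get different starting
       columns rather than stacking on the same one. */
    int by = blockIdx.x;
    int bx = (blockIdx.x + blockIdx.y) % gridDim.x;

    int x = bx * blockDim.x + threadIdx.x;
    int y = by * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        dst[y * pitch_floats + x] = src[y * pitch_floats + x];
}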

I think somewhere in all this is the answer to why my kernel runs much better if I run only one block per multiprocessor.

That might be difficult given that it’s specified neither how warps nor how blocks are scheduled. However, feel free to try and share the results. Any gain is a gain, after all.

Hi there,

I did wonder whether reading float4s might be better, as Alex suggests, but apparently this is not the case. I haven’t tried too hard looking for optimal access patterns; I just wanted to understand and avoid the large slowdowns I was sometimes having. Of course it’d be great if there were some simple guidelines to follow that do lead to improved performance. (As theMarix says, though, it is not clear how much you can do inter-block over and above using all partitions.)
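For reference, this is the sort of float4 read meant here, as a minimal sketch with illustrative names; per the above, it did not help in this case:

__global__ void copy_float4(float4 *dst, const float4 *src, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4)              /* n4 = element count in float4s */
        dst[i] = src[i];     /* one 16-byte load/store per thread */
}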

After the sometimes-serious performance problems I had seen (processing block-columns of matrices is perhaps not too esoteric), I was personally hoping for some guidance in the CUDA 2.0 documentation on global memory access over and above coalescing (much as you get for CPUs with regard to caches, paging and channel interleaving). Maybe NVIDIA might consider this in time for 2.1?!

Thanks,
Steven.