CUDA Memory Bank Layout: Interleaving, Addressing, Conflicts

Hi there,

I’m getting strange results with a simple STREAM-benchmark copy.

So how are the DRAM memory banks of a G8800 GTX arranged?

In detail:

How many banks?
How large?
What is the interleaving factor?
How can I avoid bank conflicts?
How can I probe my hardware to see bank conflicts?
Are there any address patterns in the memory pointers which I can rely on?
(e.g. the AMD Opteron selects its cache bank using the 4 least-significant bits)

Just to be clear, I’m not talking about shared memory, but about global memory (DRAM).

Thanks!

Johannes

Is this with regard to host->device copies or device memory bandwidth when used by a kernel?

If it is the latter, the only ways to optimize device memory access are to use coalesced reads (see the programming guide) or to read through a texture with good spatial locality within each warp. Either of these two methods can net you ~70 GiB/s of memory bandwidth on an 8800 GTX without needing any of the details on the banks etc…
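For completeness, “coalesced” here just means that thread i of a half-warp touches element i of an aligned, contiguous segment. A minimal sketch of a copy kernel in that style (the names are illustrative, not from any particular code):

```
// Minimal coalesced copy: consecutive threads read/write consecutive
// 4-byte words, and each block starts at an aligned offset, so every
// half-warp issues one coalesced memory transaction.
__global__ void copy_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}
```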

No, it’s only regarding on-device performance.

I get these 70 GB/s; however, if I enlarge my vectors from 4 MB to 8 MB and larger, I get breakdowns with certain thread/block combinations. I’m speaking of 1024 to 4096 blocks with 256 threads each, so occupancy is 100%.
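To be concrete, the kind of sweep I mean looks roughly like this (a simplified sketch with a grid-stride copy kernel, not my actual benchmark code):

```
// Simplified sketch (not the original benchmark): copy a fixed-size
// vector with a varying number of 256-thread blocks and report the
// effective bandwidth of each configuration.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copy_kernel(const float *in, float *out, int n)
{
    // grid-stride loop, so any block count covers the whole vector
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        out[i] = in[i];
}

int main()
{
    const int n = 8 * 1024 * 1024 / 4;           // 8 MB of floats
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    for (int blocks = 1024; blocks <= 4096; blocks *= 2) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        copy_kernel<<<blocks, 256>>>(in, out, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        // factor 2: one read and one write per element
        double gbps = 2.0 * n * sizeof(float) / (ms * 1.0e6);
        printf("%4d blocks x 256 threads: %.1f GB/s\n", blocks, gbps);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```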

I’m familiar with this behaviour from other architectures where bank conflicts occur, which motivated my question.

Regards,

Johannes

Interesting. I don’t think I’ve ever observed a block size dependence on the effective memory bandwidth. But then, I’ve never gone looking for it, either. Maybe someone knows more than me.

I’ve wondered about this too; 768 MB is kind of a strange amount of memory. It’s divided into (a multiple of) three banks in some way, which rules out using the lower bits, I think.

As far as I have found on the net (for what it’s worth), there is 512 MB on a 256-bit bus and 256 MB on a 128-bit bus.

Found this on the internet:

8800 GTX will have a 384-bit bus (6 x 64-bit channels)

Anandtech G80 details

But unfortunately:
We would love to delve further into the details of G80’s new memory interface, but NVIDIA isn’t discussing the details of this aspect of their hardware.

Johannes

Hmm, 64 bit = 16 floats = half warp. Weren’t the coalescing rules per half-warp?

Huh?

64 bit = 2 floats in my book,

as 64 bit / 8 = 8 bytes

and 4 bytes for one float.

So over the full 384-bit bus we would have 12 floats.

ahum… bits != bytes :">

So I guess it will be 2 * how much faster memory is compared to shared ???

The programming guide states that you have 200-300 clock cycles of latency until the memory fetch starts.

Shared memory takes one clock cycle, like registers, if you have no bank conflicts.

If you compare the memory bandwidth to the clock rate of the shaders, you get down to 16 floats per clock cycle.

However, you must fully utilize the memory interface for that, which I have never achieved.

I meant to write shader instead of shared… I really should learn to type in the morning…

Ok. Still I somewhat answered the question:

If you compare the memory bandwidth to the clock rate of the shaders, you get down to 16 floats per clock cycle.

Perhaps I should add: … per clock cycle of the shader.

So one float per multiprocessor per clock cycle.

With an Ultra you get slightly more: 17.28 floats per shader clock cycle.
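For anyone who wants to check that arithmetic, here is a small sketch using the published specs I’m assuming (86.4 GB/s and a 1.35 GHz shader clock for the GTX; roughly 103.7 GB/s and 1.5 GHz for the Ultra):

```
// Back-of-the-envelope check of the floats-per-shader-clock figures,
// using the assumed 8800 GTX / Ultra specs given above.
#include <cstdio>

int main()
{
    const double gtx_bw    = 86.4e9;    // memory bandwidth, bytes/s
    const double gtx_clk   = 1.35e9;    // shader clock, Hz
    const double ultra_bw  = 103.7e9;
    const double ultra_clk = 1.5e9;

    printf("GTX:   %.2f floats per shader clock\n", gtx_bw / gtx_clk / 4.0);
    printf("Ultra: %.2f floats per shader clock\n", ultra_bw / ultra_clk / 4.0);
    // prints 16.00 and 17.28 respectively
    return 0;
}
```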

What I’ve left out of the picture here is the amount of time the data has to wait in order to synchronize between the memory and shader clocks.

Please check; typos happen all the time.

Well, 16 floats means 1 float for each thread of a half-warp (like I thought before). Now, I am not exactly sure whether you need to coalesce per warp or per half-warp, but at least that could indeed be the answer.

Maybe with the next generation of hardware we will have more data points to figure out the underlying reasons; for now I try to keep things coalesced ;)

Maybe, but unfortunately we are talking about all 16 multiprocessors…

But back to my original question.

Do I have to interleave my data so that I don’t hit the same bank over and over again?

Still, I haven’t managed to come up with a proper test case to decide whether that’s the case or not.
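The kind of probe I have in mind would look roughly like this (a sketch only; the slice size, block count and stride range are arbitrary choices, and the partition idea is just a hypothesis):

```
// Hypothetical probe: every block streams through its own 64 KB slice,
// and the distance between slice start addresses (the "stride") is swept.
// If global memory is split into P partitions interleaved every K bytes,
// bandwidth should collapse whenever the stride is a multiple of P * K,
// because all blocks then hammer the same partition at the same time.
#include <cstdio>
#include <cuda_runtime.h>

#define SLICE_FLOATS (64 * 1024 / 4)   // each block streams a 64 KB slice

__global__ void strided_copy(const float *in, float *out, int stride_floats)
{
    const float *src = in  + (size_t)blockIdx.x * stride_floats;
    float       *dst = out + (size_t)blockIdx.x * stride_floats;
    for (int i = threadIdx.x; i < SLICE_FLOATS; i += blockDim.x)
        dst[i] = src[i];               // coalesced within each block
}

int main()
{
    const int blocks = 64, threads = 256;
    const int step_floats = 16 * 1024 / 4;     // sweep stride in 16 KB steps
    const int max_stride  = 512 * 1024 / 4;    // up to 512 KB between slices
    float *in, *out;
    cudaMalloc(&in,  (size_t)blocks * max_stride * sizeof(float));
    cudaMalloc(&out, (size_t)blocks * max_stride * sizeof(float));

    for (int stride = SLICE_FLOATS; stride <= max_stride; stride += step_floats) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        strided_copy<<<blocks, threads>>>(in, out, stride);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double gbps = 2.0 * blocks * SLICE_FLOATS * sizeof(float) / (ms * 1.0e6);
        printf("stride %7d bytes: %6.1f GB/s\n", stride * 4, gbps);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

If there is partition-style interleaving, I’d expect periodic dips in the printed bandwidth as the stride passes through the “bad” values.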

Hi there,

On the topic of global memory access: while developing a Cholesky matrix factorization routine for my 640 MB 8800 GTS last year, I noticed a strange factor-of-two slowdown that occurred whenever the matrix’s leading dimension (in floats) was a multiple of 20*16. The only factor of 5 I know of on this card is the number of memory partitions (it has 5 64-bit partitions, each 128 MB, making a 320-bit memory interface in total to 640 MB), so this led me to think that it must be global memory access issues that are causing this problem. The key kernel in the code divides the matrix up into 16 by 16 tiles and each block processes one column of tiles using 256 threads.

“Padding” the leading dimension of the global memory array for the matrix by another sixteen floats makes the problem disappear for matrices whose size is a multiple of 20*16, but induces it for those whose size%(20*16) = 19*16.

The difficulty with such padding is that different CUDA-capable cards have different numbers of memory partitions, all the way from 1 to 6, so the correct pad would be card-specific! Basically one might have to pad out to a row length that is not a multiple of 2,3,4,5 or 6 times 64 floats (in particular the latter ones for the higher-end cards).

If anybody wants to try the code on their card and report the “kernel time” for various #DESIRED_MAT_SIZE’s differing by multiples of 16 (in particular 12288 and 12304 for an 8800 GTX), it’d be very interesting to see the results. I wouldn’t be surprised if slowdowns occur on “new” 8800 GT/GTS’s and 9800’s at multiples of 16*16, on 8800 GTX’s/Ultras at multiples of 24*16, and on 128-bit lower-end cards at multiples of 8*16.

The code is available via a link from:

Cholesky factorization in CUDA

(For cards/machines without much memory, suitable sizes might be in the vicinity of 6000. Even then you may need to remove stack size limits, e.g. “ulimit -s unlimited” on Linux, since a large matrix gets made on the CPU stack.)

Some times from my card are:

8800GTS 640MB

5760 (=16*360): 2.1s
6000 (=16*375): 1.4s
6080 (=16*380): 2.5s

8000 (=16*500): 5.7s
8016 (=16*501): 3.2s

12160 (=16*760): 20.9s
12288 (=16*768): 11.2s
12304 (=16*769): 11.3s

From thinking about this slowdown, and from the coalescing and alignment requirements and the non-coalesced performance-hit information in the programming guide (which, note, is independent of the card), my guess is that the memory is interleaved in units of 256 bytes between partitions (and is perhaps passed around in 32-byte chunks). The slowdown here would then occur because four consecutive columns would be stored in the same partition. (I intend to rewrite the code some time to access the matrix row-wise to see if the problem goes away!)
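To make the guess concrete, here is a toy model of it (the mapping partition = (address / 256) % P and all the numbers in it are my own assumptions, not anything documented):

```
// Toy model of the guessed layout: partition(addr) = (addr / 256) % P,
// i.e. 256-byte units interleaved across P partitions.  For a column-major
// matrix of floats with leading dimension ld, print which partition the
// first 64-byte segment of each 16-column tile falls into at a fixed row.
#include <cstdio>

int partition(size_t byte_addr, int num_partitions)
{
    return (int)((byte_addr / 256) % num_partitions);
}

int main()
{
    const int P = 5;                 // e.g. 640 MB 8800 GTS: 5 partitions
    const int row = 64;              // some row all blocks happen to be at
    const int lds[2] = { 20 * 16, 20 * 16 + 16 };   // unpadded vs padded

    for (int k = 0; k < 2; ++k) {
        printf("ld = %4d floats:", lds[k]);
        for (int tile_col = 0; tile_col < 8; ++tile_col) {
            size_t col  = (size_t)tile_col * 16;    // first column of the tile
            size_t addr = (col * lds[k] + row) * sizeof(float);
            printf(" %d", partition(addr, P));
        }
        printf("\n");
    }
    return 0;
}
// With ld = 320 every tile column maps to the same partition;
// with ld = 336 they spread across all five.
```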

This is just speculation though, and if mysterious slowdowns such as this one are indeed due to the global memory layout, I think it might be really helpful to programmers if Nvidia could provide sufficient details in the performance section of the programming guide to enable one to avoid/correct for such issues.

Thanks,
Steven.

Thanks,

great input. I’ll try your code on my 8800 GTX and a Tesla with 1.5 GB of memory.
I’m looking forward to the results.

Johannes

Dear Johannes,

Hi there. Have you by any chance been able to investigate whether there is a periodic slowdown on your cards as well as on my 640 MB 8800 GTS? It’d be great to know what the situation is. (If you are having any problems running the code, please let me know!)

Thanks a lot,
Steven.

I would, but currently I have no spare time due to the deployment of our Windows HPC 2008 cluster. I’ll be coming back to CUDA around April 28.

I’ll keep you posted as soon as possible.

Johannes

Hi there,

I’ve been kindly lent an 8800 GTX, so I have been able to test my code on this card too. As perhaps anticipated from the discussion in my previous post, it exhibits a slowdown for matrix sizes that are multiples of 24*16 floats. The slowdowns can be even more extreme than those on the 640 MB 8800 GTS, up to a factor of 3 it seems.

For matrix-type problems, the simple advice seems to be to avoid accessing matrices tile-by-tile columnwise if possible. This applies even if all global memory accesses are coalesced. Otherwise, “pad” the matrix in global memory to a multiple of 16 floats (to maintain coalescing) that is not also a multiple of 2, 3, 4, 5, or 6 times 16 floats. This should avoid card-specific slowdowns. (With future hardware perhaps showing similar behaviour but having different-width memory buses, it might be worth even just picking a prime multiple of 16 floats.)
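A sketch of how one might implement such a padding rule (the helper and the exact divisibility test are my own, not part of the linked code):

```
// Sketch of the padding rule above: round the leading dimension up to a
// multiple of 16 floats whose factor of 16 is not divisible by 2, 3 or 5
// (which also rules out 4 and 6).  Such an ld is a multiple of 16 floats
// but never a multiple of 2..6 times 16 floats, so it should dodge the
// partition-aligned case on 128- to 384-bit cards.
#include <cstdio>

int pad_leading_dim(int n_floats)
{
    int ld = (n_floats + 15) / 16 * 16;          // round up to a multiple of 16
    while ((ld / 16) % 2 == 0 || (ld / 16) % 3 == 0 || (ld / 16) % 5 == 0)
        ld += 16;                                // skip the "bad" multiples
    return ld;
}

int main()
{
    const int sizes[] = { 5760, 8000, 12160, 12288 };
    for (int i = 0; i < 4; ++i)
        printf("n = %5d  ->  padded ld = %5d floats\n",
               sizes[i], pad_leading_dim(sizes[i]));
    return 0;
}
```

For example, 5760 gets padded to 5776 and 12288 to 12304, which matches the sizes that run fast in the timings below.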

I’m sure similar advice will apply to other problems that happen to have similar global memory access patterns. Basically, it seems one should imagine memory being interleaved in 256-byte units between partitions and try to avoid accessing only memory that falls into the same partition.

Assuming global memory layout is indeed the issue, perhaps Nvidia wouldn’t mind documenting this to avoid any guesswork?

This seems able to make a major difference to performance for a not-unheard-of memory access pattern, e.g. making a GPU either faster or slower than a quad-core CPU implementation, or having a 640 MB 8800 GTS outperform an 8800 GTX by a factor of 2, for my Cholesky code.

Some times are:

8800 GTX

5760 (=16*360 = 16*24*15): 2.2s
5776 (=16*361): 0.9s

11712 (=16*12*61): 13.6s
12160 (=16*760 = 16*8*95): 12.1s

12272 (=16*767): 8.4s
12288 (=16*768 = 16*24*32): 24.7s (cf. the 11.2s 640 MB 8800 GTS time…)
12304 (=16*769): 8.4s

A 12288*12288 matrix “padded out” to 12288*12304 takes only 8.4s.

Note also some “ringing” slowdowns at other sub-multiple matrix sizes.

Best,
Steven.