CUDA Memory Bank Layout: Interleaving, Addressing, Conflicts

Hi there,

I’m getting strange results with a simple STREAM-benchmark copy.

So how are the DRAM memory banks of a G8800 GTX arranged?

In detail:

How many banks?
How large?
What is the interleaving factor?
How can I avoid bank conflicts?
How can I probe my hardware to see bank conflicts?
Are there any address patterns in the memory pointers which I can rely on?
(e.g. the AMD Opteron selects its cache bank using the 4 least-significant bits)

Just to be clear, I’m not talking about shared memory, but about global memory (DRAM).

Thanks!

Johannes

Is this with regard to host->device copies or device memory bandwidth when used by a kernel?

If it is the latter, the only ways to optimize device memory access are to use coalesced reads (see the programming guide) or to read through a texture with good spatial locality within each warp. Either of these two methods can net you ~70 GiB/s of memory bandwidth on an 8800 GTX without needing any of the details on the banks etc…
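For completeness, “coalesced” here just means that thread i of a half-warp touches element i of an aligned, contiguous segment. A minimal sketch of a copy kernel in that style (the names are illustrative, not from any particular code):

```
// Minimal coalesced copy: consecutive threads read/write consecutive
// 4-byte words, and each block starts at an aligned offset, so every
// half-warp issues one coalesced memory transaction.
__global__ void copy_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}
```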

No, it’s only regarding on-device performance.

I get these 70 GB/s; however, if I enlarge my vectors from 4 MB to 8 MB and larger, I get breakdowns with certain thread/block combinations. I’m speaking of 1024 to 4096 blocks with 256 threads each, so occupancy is 100%.
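To be concrete, the kind of sweep I mean looks roughly like this (a simplified sketch with a grid-stride copy kernel, not my actual benchmark code):

```
// Simplified sketch (not the original benchmark): copy a fixed-size
// vector with a varying number of 256-thread blocks and report the
// effective bandwidth of each configuration.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copy_kernel(const float *in, float *out, int n)
{
    // grid-stride loop, so any block count covers the whole vector
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        out[i] = in[i];
}

int main()
{
    const int n = 8 * 1024 * 1024 / 4;           // 8 MB of floats
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    for (int blocks = 1024; blocks <= 4096; blocks *= 2) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        copy_kernel<<<blocks, 256>>>(in, out, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        // factor 2: one read and one write per element
        double gbps = 2.0 * n * sizeof(float) / (ms * 1.0e6);
        printf("%4d blocks x 256 threads: %.1f GB/s\n", blocks, gbps);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```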

I’m familiar with this behaviour from other architectures where bank conflicts occur, which motivated my question.

Regards,

Johannes

Interesting. I don’t think I’ve ever observed a block size dependence on the effective memory bandwidth. But then, I’ve never gone looking for it, either. Maybe someone knows more than me.

I’ve wondered about this too; 768 MB is kind of a strange amount of memory. It’s divided into (a multiple of) three banks in some way, which rules out using the lower bits, I think.

As far as I have found on the net (for what it’s worth), there is 512 MB on a 256-bit bus and 256 MB on a 128-bit bus.

Found this on the internet:

8800 GTX will have a 384-bit bus (6 x 64-bit channels)

Anandtech G80 details

But unfortunately:
We would love to delve further into the details of G80’s new memory interface, but NVIDIA isn’t discussing the details of this aspect of their hardware.

Johannes

Hmm, 64 bit = 16 floats = half warp. Weren’t the coalescing rules per half-warp?

Huh?

64 bit = 2 floats in my book,

as 64 bit / 8 = 8 bytes

and 4 bytes for one float.

So over the full 384-bit bus we would have 12 floats.

ahum… bits != bytes :">

So I guess it will be 2 * how much faster memory is compared to shared ???

The programming guide states that you have 200-300 clock cycles of latency until the memory fetch starts.

Shared memory takes one clock cycle, like registers, if you have no bank conflicts.

If you compare the memory bandwidth to the clock rate of the shaders, you get down to 16 floats per clock cycle.

However, you must fully utilize the memory interface for that, which I have never achieved.

I meant to write shader instead of shared… I really should learn to type in the morning…

Ok. Still I somewhat answered the question:

If you compare the memory bandwidth to the clock rate of the shaders, you get down to 16 floats per clock cycle.

Perhaps I should add: … per clock cycle of the shader.

So one float per multiprocessor per clock cycle.

With an Ultra you get slightly more: 17.28 floats per shader clock cycle.
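For anyone who wants to check that arithmetic, here is a small sketch using the published specs I’m assuming (86.4 GB/s and a 1.35 GHz shader clock for the GTX; roughly 103.7 GB/s and 1.5 GHz for the Ultra):

```
// Back-of-the-envelope check of the floats-per-shader-clock figures,
// using the assumed 8800 GTX / Ultra specs given above.
#include <cstdio>

int main()
{
    const double gtx_bw    = 86.4e9;    // memory bandwidth, bytes/s
    const double gtx_clk   = 1.35e9;    // shader clock, Hz
    const double ultra_bw  = 103.7e9;
    const double ultra_clk = 1.5e9;

    printf("GTX:   %.2f floats per shader clock\n", gtx_bw / gtx_clk / 4.0);
    printf("Ultra: %.2f floats per shader clock\n", ultra_bw / ultra_clk / 4.0);
    // prints 16.00 and 17.28 respectively
    return 0;
}
```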

What I’ve left out of the picture here is the amount of time the data has to wait in order to synchronize between the memory and shader clocks.

Please check; typos happen all the time.

Well, 16 floats means 1 float for each thread of a half-warp (like I thought before). Now, I am not exactly sure whether you need to coalesce per warp or per half-warp, but at least that could indeed be the answer.

Maybe with the next generation of hardware we will have more data points to figure out the underlying reasons; for now I try to keep things coalesced ;)

Maybe, but unfortunately we are talking about all 16 multiprocessors…

But back to my original question.

Do I have to interleave my data so that I don’t hit the same bank over and over again?

Still, I haven’t managed to come up with a proper test case to decide whether that’s the case or not.
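The kind of probe I have in mind would look roughly like this (a sketch only; the slice size, block count and stride range are arbitrary choices, and the partition idea is just a hypothesis):

```
// Hypothetical probe: every block streams through its own 64 KB slice,
// and the distance between slice start addresses (the "stride") is swept.
// If global memory is split into P partitions interleaved every K bytes,
// bandwidth should collapse whenever the stride is a multiple of P * K,
// because all blocks then hammer the same partition at the same time.
#include <cstdio>
#include <cuda_runtime.h>

#define SLICE_FLOATS (64 * 1024 / 4)   // each block streams a 64 KB slice

__global__ void strided_copy(const float *in, float *out, int stride_floats)
{
    const float *src = in  + (size_t)blockIdx.x * stride_floats;
    float       *dst = out + (size_t)blockIdx.x * stride_floats;
    for (int i = threadIdx.x; i < SLICE_FLOATS; i += blockDim.x)
        dst[i] = src[i];               // coalesced within each block
}

int main()
{
    const int blocks = 64, threads = 256;
    const int step_floats = 16 * 1024 / 4;     // sweep stride in 16 KB steps
    const int max_stride  = 512 * 1024 / 4;    // up to 512 KB between slices
    float *in, *out;
    cudaMalloc(&in,  (size_t)blocks * max_stride * sizeof(float));
    cudaMalloc(&out, (size_t)blocks * max_stride * sizeof(float));

    for (int stride = SLICE_FLOATS; stride <= max_stride; stride += step_floats) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        strided_copy<<<blocks, threads>>>(in, out, stride);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double gbps = 2.0 * blocks * SLICE_FLOATS * sizeof(float) / (ms * 1.0e6);
        printf("stride %7d bytes: %6.1f GB/s\n", stride * 4, gbps);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

If there is partition-style interleaving, I’d expect periodic dips in the printed bandwidth as the stride passes through the “bad” values.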

Hi there,

On the topic of global memory access: while developing a Cholesky matrix factorization routine for my 640 MB 8800 GTS last year, I noticed a strange factor-of-two slowdown that occurred whenever the matrix’s leading dimension (in floats) was a multiple of 20*16. The only factor of 5 I know of on this card is the number of memory partitions (it has 5 64-bit partitions, each 128 MB, making a 320-bit memory interface in total to 640 MB), so this led me to think that it must be global memory access issues that are causing this problem. The key kernel in the code divides the matrix up into 16 by 16 tiles and each block processes one column of tiles using 256 threads.

“Padding” the leading dimension of the global memory array for the matrix by another sixteen floats makes the problem disappear for matrices whose size is a multiple of 20*16, but induces it for those whose size%(20*16) = 19*16.

The difficulty with such padding is that different CUDA-capable cards have different numbers of memory partitions, all the way from 1 to 6, so the correct pad would be card-specific! Basically one might have to pad out to a row length that is not a multiple of 2,3,4,5 or 6 times 64 floats (in particular the latter ones for the higher-end cards).

If anybody wants to try the code on their card and report the “kernel time” for various #DESIRED_MAT_SIZE’s differing by multiples of 16 (in particular 12288 and 12304 for an 8800 GTX), it’d be very interesting to see the results. I wouldn’t be surprised if slowdowns occur on “new” 8800 GT/GTS’s and 9800’s at multiples of 16*16, on 8800 GTX’s/Ultras at multiples of 24*16, and on 128-bit lower-end cards at multiples of 8*16.

The code is available via a link from:

Cholesky factorization in CUDA

(For cards/machines without much memory, suitable sizes might be in the vicinity of 6000. Even then you may need to remove stack size limits, e.g. “ulimit -s unlimited” on Linux, since a large matrix gets made on the CPU stack.)

Some times from my card are:

8800GTS 640MB

5760 (=16*360): 2.1s
6000 (=16*375): 1.4s
6080 (=16*380): 2.5s

8000 (=16*500): 5.7s
8016 (=16*501): 3.2s

12160 (=16*760): 20.9s
12288 (=16*768): 11.2s
12304 (=16*769): 11.3s

From thinking about this slowdown, and from the coalescing and alignment requirements and the non-coalesced performance-hit information in the programming guide (which, note, is independent of the card), my guess is that the memory is interleaved in units of 256 bytes between partitions (and is perhaps passed around in 32-byte chunks). The slowdown here would then occur because four consecutive columns would be stored in the same partition. (I intend to rewrite the code some time to access the matrix row-wise to see if the problem goes away!)
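To make the guess concrete, here is a toy model of it (the mapping partition = (address / 256) % P and all the numbers in it are my own assumptions, not anything documented):

```
// Toy model of the guessed layout: partition(addr) = (addr / 256) % P,
// i.e. 256-byte units interleaved across P partitions.  For a column-major
// matrix of floats with leading dimension ld, print which partition the
// first 64-byte segment of each 16-column tile falls into at a fixed row.
#include <cstdio>

int partition(size_t byte_addr, int num_partitions)
{
    return (int)((byte_addr / 256) % num_partitions);
}

int main()
{
    const int P = 5;                 // e.g. 640 MB 8800 GTS: 5 partitions
    const int row = 64;              // some row all blocks happen to be at
    const int lds[2] = { 20 * 16, 20 * 16 + 16 };   // unpadded vs padded

    for (int k = 0; k < 2; ++k) {
        printf("ld = %4d floats:", lds[k]);
        for (int tile_col = 0; tile_col < 8; ++tile_col) {
            size_t col  = (size_t)tile_col * 16;    // first column of the tile
            size_t addr = (col * lds[k] + row) * sizeof(float);
            printf(" %d", partition(addr, P));
        }
        printf("\n");
    }
    return 0;
}
// With ld = 320 every tile column maps to the same partition;
// with ld = 336 they spread across all five.
```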

This is just speculation though, and if mysterious slowdowns such as this one are indeed due to the global memory layout, I think it might be really helpful to programmers if Nvidia could provide sufficient details in the performance section of the programming guide to enable one to avoid/correct for such issues.

Thanks,
Steven.

Thanks,

great input. I’ll try your code on my 8800 GTX and a Tesla with 1.5 GB of memory.
I’m looking forward to the results.

Johannes

Dear Johannes,

Hi there. Have you by any chance been able to investigate whether there is a periodic slowdown on your cards as well as on my 640 MB 8800 GTS? It’d be great to know what the situation is. (If you are having any problems running the code, please let me know!)

Thanks a lot,
Steven.

I would, but currently I have no spare time due to the deployment of our Windows HPC 2008 cluster. I’ll be coming back to CUDA around April 28.

I’ll keep you posted as soon as possible.

Johannes

Hi there,

I’ve been kindly lent an 8800 GTX, so I have been able to test my code on this card too. As perhaps anticipated from the discussion in my previous post, it exhibits a slowdown for matrix sizes that are multiples of 24*16 floats. The slowdowns can be even more extreme than those on the 640 MB 8800 GTS, up to a factor of 3 it seems.

For matrix-type problems, the simple advice seems to be to avoid accessing matrices tile-by-tile columnwise if possible. This applies even if all global memory accesses are coalesced. Otherwise, “pad” the matrix in global memory to a multiple of 16 floats (to maintain coalescing) that is not also a multiple of 2, 3, 4, 5, or 6 times 16 floats. This should avoid card-specific slowdowns. (With future hardware perhaps showing similar behaviour but having different-width memory buses, it might be worth even just picking a prime multiple of 16 floats.)
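A sketch of how one might implement such a padding rule (the helper and the exact divisibility test are my own, not part of the linked code):

```
// Sketch of the padding rule above: round the leading dimension up to a
// multiple of 16 floats whose factor of 16 is not divisible by 2, 3 or 5
// (which also rules out 4 and 6).  Such an ld is a multiple of 16 floats
// but never a multiple of 2..6 times 16 floats, so it should dodge the
// partition-aligned case on 128- to 384-bit cards.
#include <cstdio>

int pad_leading_dim(int n_floats)
{
    int ld = (n_floats + 15) / 16 * 16;          // round up to a multiple of 16
    while ((ld / 16) % 2 == 0 || (ld / 16) % 3 == 0 || (ld / 16) % 5 == 0)
        ld += 16;                                // skip the "bad" multiples
    return ld;
}

int main()
{
    const int sizes[] = { 5760, 8000, 12160, 12288 };
    for (int i = 0; i < 4; ++i)
        printf("n = %5d  ->  padded ld = %5d floats\n",
               sizes[i], pad_leading_dim(sizes[i]));
    return 0;
}
```

For example, 5760 gets padded to 5776 and 12288 to 12304, which matches the sizes that run fast in the timings below.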

I’m sure similar advice will apply to other problems that happen to have similar global memory access patterns. Basically, it seems one should imagine memory being interleaved in 256-byte units between partitions and try to avoid accessing only memory that falls into the same partition.

Assuming global memory layout is indeed the issue, perhaps Nvidia wouldn’t mind documenting this to avoid any guesswork?

This seems able to make a major difference to performance for a not-unheard-of memory access pattern, e.g. making a GPU either faster or slower than a quad-core CPU implementation, or having a 640 MB 8800 GTS outperform an 8800 GTX by a factor of 2, for my Cholesky code.

Some times are:

8800 GTX

5760 (=16*360 = 16*24*15): 2.2s
5776 (=16*361): 0.9s

11712 (=16*12*61): 13.6s
12160 (=16*760 = 16*8*95): 12.1s

12272 (=16*767): 8.4s
12288 (=16*768 = 16*24*32): 24.7s (cf. the 11.2s 640 MB 8800 GTS time…)
12304 (=16*769): 8.4s

A 12288*12288 matrix “padded out” to 12288*12304 takes only 8.4s.

Note also some “ringing” slowdowns at other sub-multiple matrix sizes.

Best,
Steven.