Hi there,
On the topic of global memory access: while developing a Cholesky matrix factorization routine for my 640MB 8800GTS last year, I noticed a strange factor-of-two slowdown that occurred whenever the matrix’s leading dimension (in floats) was a multiple of 20*16. The only factor of 5 I know of on this card is the number of memory partitions (it has 5 64-bit partitions of 128MB each, making a 320-bit memory interface to 640MB in total), which led me to think that global memory access issues must be causing the problem. The key kernel in the code divides the matrix up into 16-by-16 tiles, and each block processes one column of tiles using 256 threads.
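(For anyone wondering about the access pattern, it looks roughly like the sketch below. This is a simplified illustration only, not the actual kernel from the code; the names tile_column_sketch, ld and n are made up for the example.)

// Sketch of the access pattern only -- not the real kernel.  A is
// column-major with leading dimension ld (in floats); each block walks
// down one 16-column-wide strip of tiles using 256 threads.
__global__ void tile_column_sketch(const float *A, int ld, int n)
{
    __shared__ float tile[16][16];

    int tx   = threadIdx.x % 16;     // row inside the tile
    int ty   = threadIdx.x / 16;     // column inside the tile
    int col0 = 16 * blockIdx.x;      // first matrix column for this block

    for (int row0 = 0; row0 + 16 <= n; row0 += 16) {
        // 16 threads read one contiguous 64-byte column segment; the 16
        // segments of a tile are ld floats apart, which is why the leading
        // dimension matters so much.
        tile[ty][tx] = A[(col0 + ty) * ld + row0 + tx];
        __syncthreads();
        /* ... factorization work on the tile would go here ... */
        __syncthreads();
    }
}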
“Padding” the leading dimension of the global memory array for the matrix by another sixteen floats makes the problem disappear for matrices that are multiples of 20*16, but induces it for those whose size % (20*16) = 19*16.
The difficulty with such padding is that different CUDA-capable cards have different numbers of memory partitions, anywhere from 1 to 6, so the correct pad would be card-specific! Basically one might have to pad out to a row length that is not a multiple of 2, 3, 4, 5 or 6 times 64 floats (in particular the larger multiples for the higher-end cards).
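For what it’s worth, a card-independent pad following that rule could look something like this (my own made-up helper, not something from the actual code; it keeps the leading dimension a multiple of 16 so the 16x16 tiling still works, and skips multiples of 2..6 times 64 floats):

/* Sketch only: pick a leading dimension (in floats) for an n-column
 * matrix that stays a multiple of 16 but avoids multiples of 2,3,4,5
 * and 6 times 64 floats (64 floats = 256 bytes). */
static int pick_ld(int n)
{
    int ld = (n + 15) / 16 * 16;        /* round n up to a multiple of 16 */
    for (;;) {
        int bad = 0;
        for (int k = 2; k <= 6; ++k)
            if (ld % (k * 64) == 0)
                bad = 1;
        if (!bad)
            return ld;
        ld += 16;                        /* otherwise pad by another tile */
    }
}

For example this would turn 8000 into 8016 and leave 6000 alone, which at least matches the fast/slow pairs in the times below.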
If anybody wants to try the code on their card and report the “kernel time” for various #DESIRED_MAT_SIZE’s differing by multiples of 16 (in particular 12288 and 12304 for an 8800GTX), it’d be very interesting to see the results. I wouldn’t be surprised if slowdowns occur on the “new” 8800GT/GTS’s and 9800’s at multiples of 16*16, on 8800GTX’s/Ultras at multiples of 24*16, and on the 128-bit lower-end cards at multiples of 8*16.
The code is available via a link from:
Cholesky factorization in CUDA
(For cards/machines without too much memory, suitable sizes might be in the vicinity of 6000. Even then you may need to remove stack size limits (“ulimit -s unlimited” on Linux), since a large matrix gets made on the CPU stack.)
Some times from my card are:
8800GTS 640MB
5760 (=16*360): 2.1s
6000 (=16*375): 1.4s
6080 (=16*380): 2.5s
8000 (=16*500): 5.7s
8016 (=16*501): 3.2s
12160 (=16*760): 20.9s
12288 (=16*768): 11.2s
12304 (=16*769): 11.3s
From thinking about this slowdown, and from the coalescing and alignment requirements and the non-coalesced performance hit figures in the programming guide (which, note, are the same regardless of card), my guess is that memory is interleaved between the partitions in units of 256 bytes (and is perhaps passed around in 32-byte chunks). The slowdown would then occur because four consecutive columns end up stored in the same partition. (I intend to rewrite the code at some point to access the matrix row-wise and see whether the problem goes away!)
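To make that concrete, the mapping I am imagining is just the following (and I stress this is entirely assumed, not from any documentation):

#include <stdio.h>

/* Assumed mapping: global memory interleaved between the partitions in
 * 256-byte units -- pure speculation on my part. */
static int partition_of(size_t byte_addr, int num_partitions)
{
    return (int)((byte_addr / 256) % num_partitions);
}

int main(void)
{
    const int parts = 5;              /* 8800GTS 640MB: 5 64-bit partitions */
    const int lds[] = { 320, 336 };   /* 20*16 floats vs. padded 21*16      */

    for (int i = 0; i < 2; ++i) {
        printf("ld = %3d floats, column starts hit partitions:", lds[i]);
        for (int col = 0; col < 10; ++col)
            printf(" %d", partition_of((size_t)col * lds[i] * sizeof(float), parts));
        printf("\n");
    }
    return 0;
}

Under this model an ld of 320 floats (= 5*256 bytes) puts the start of every column, and hence all 16 column segments a block reads for a given tile, in the same partition, whereas padding to 336 floats spreads consecutive columns across the partitions, which would be consistent with the slowdowns above.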
This is just speculation though, and if mysterious slowdowns such as this one are indeed due to the global memory layout, I think it would be really helpful to programmers if NVIDIA could provide sufficient details in the performance section of the programming guide to enable one to avoid/correct for such issues.
Thanks,
Steven.