I have an index-based application that I am trying to accelerate on a GTX480. The problem is that I first need to read a dozen index entries from global memory. I have organized my index to align my accesses on 128-bit boundaries, and I have used coalescing where possible, but there is a certain unpredictable, irregular memory access pattern that is inherent in my application. I have found that my initial index access time on the GTX480 is identical to the time it takes my host computer to perform the same SDRAM accesses, so I have concluded that I am memory-latency bound.
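For context, here is the kind of gather I mean: each thread fetches one 16-byte index entry with a single 128-bit (int4) load, so each access stays a single aligned transaction even though the locations themselves are scattered. This is just a sketch; the names (gather_entries, slots) are made up, not from my real code.

```cuda
// Hypothetical gather kernel: one aligned 128-bit load per thread.
__global__ void gather_entries(const int4 *index, const int *slots,
                               int4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // slots[i] is the irregular, data-dependent location; the
        // int4 type keeps each read a single 16-byte transaction.
        out[i] = index[slots[i]];
    }
}
```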
My question is: "Is there any way to exploit the six 64-bit memory controllers on the GTX480 to accelerate global memory I/O?"
These controllers are touted with pride in the specs, but I haven't found any discussion of their advantage over a single controller. If my indexes were spread across the entire global memory, would the controllers be able to perform SDRAM row and column activations in parallel? If so, how do I direct cudaMalloc() to allocate global memory in a particular SDRAM bank?
Any insight into how the memory controllers are mapped to the global memory and how they operate in general would be greatly appreciated!
Well, you do have global memory partitions that can be subject to "partition camping" (analogous to bank conflicts in shared memory). So for some problems, at certain problem sizes, you can see a ~50% drop in bandwidth because many SMs are reading/writing the same memory partition.
So if the GTX480 has a 384-bit interface, that means you have 384/64 = 6 partitions, which means that, for example, bytes 0-255 and 1536-1791 are in partition 1, etc. (if my 1:20 pm math is correct…)
Are you saying that the 6 memory controllers each serve one of 6 partitions, which are interleaved every 256 bytes?
Are you saying that when the 32 threads of a warp each attempt to read a different location in global memory, and say 13 of them hit addresses within the same partition, excessive "partition camping" results compared to a balanced load with 5 or 6 threads reading locations in each partition?
Fermi GPUs are reportedly [topic=“171575”]immune to partition camping[/topic].
Tim: it seems you have way too few concurrent memory accesses to saturate GPU memory, so you are indeed latency bound.
GPU memory latency is much worse than CPU memory latency.
Rather than a dozen pending transactions per memory partition, you need a few thousand pending transactions to hide the latency. Those transactions originate from thousands of threads.
If you do not have this amount of parallelism in your application, there is no hope of accelerating it with a GPU…