Global memory access performance with random access

I’m trying to measure the rate of random memory access on GPUs. Here’s how I’m measuring it:

  1. The host creates an array, A, with 256 million integers and copies it to the device. Each integer in A is a random number between 0 and (256M - 1).
  2. The host creates another array, B, with 16K integers and copies it to the device. Each integer in B is also a random number between 0 and (256M - 1).
  3. The host starts a timer and launches a kernel. The kernel executes one thread for each element in B and performs 10 dependent, random memory accesses for it. The following code gets executed for an element in B (a fuller kernel sketch follows this list):
    for (int d = 0; d < 10; d++) {
        B[i] = A[B[i]];
    }
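
For reference, here is a minimal sketch of a full kernel matching that description. This is my reconstruction, not the original code: the name chaseKernel, the DEPTH constant, and keeping the chained index in a register (instead of writing B[i] back every iteration) are my choices.

    // Minimal sketch of the pointer-chasing kernel described above (my
    // reconstruction, not the poster's code). Each thread performs DEPTH
    // dependent random loads from A, starting at its element of B.
    #include <cuda_runtime.h>

    #define DEPTH 10

    __global__ void chaseKernel(const int *A, int *B, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        int v = B[i];
        for (int d = 0; d < DEPTH; d++) {
            // Each load depends on the previous result, so the loads within a
            // thread cannot overlap; latency is hidden only across warps.
            v = A[v];
        }
        B[i] = v;   // write back so the compiler cannot optimize the loads away
    }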
    

When the kernel finishes executing, the host stops the timer. The random access rate is (16K * 10) / timer_seconds. The approximate values for some GPUs are as follows. A GTX 690 gets 950 million random accesses per second, and a Tesla K20C gets 1000 million.
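
For concreteness, here is a hedged sketch of host-side timing matching that description. This is my reconstruction (cudaEvent timing and a 256-thread block size are my choices), and it assumes d_A and d_B are the device copies of A and B from steps 1 and 2, with chaseKernel as sketched above.

    // Hedged timing sketch (mine, not the poster's code). d_A and d_B are
    // assumed to be the device copies of A and B.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int nB      = 16 * 1024;                  // elements in B
    int threads = 256;                        // block size (my choice)
    int blocks  = (nB + threads - 1) / threads;

    cudaEventRecord(start);
    chaseKernel<<<blocks, threads>>>(d_A, d_B, nB);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double rate = (nB * 10.0) / (ms * 1e-3);  // random accesses per second
    printf("%.0f random accesses per second\n", rate);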

A GTX 690 has 192 GB/s memory bandwidth per GPU. Assuming that global memory loads are done at 32-byte granularity, I’m only seeing 0.95 billion accesses/s × 32 bytes = 30.4 GB/s of memory bandwidth.
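
To spell out that arithmetic (and the alternative granularities discussed in the replies below), here is the same calculation for several possible bytes-moved-per-access values, using the measured ~0.95 billion accesses/s on the GTX 690; this is just my worked example.

    /* Implied raw memory traffic for different transaction-size assumptions,
       at the measured ~0.95e9 random accesses per second. */
    #include <stdio.h>

    int main(void)
    {
        double accesses_per_s     = 0.95e9;
        double bytes_per_access[] = { 4.0, 16.0, 32.0, 128.0 };
        /*   4 B = the int actually used
            16 B = a 128-bit transaction
            32 B = the 32-byte granularity assumed in the question
           128 B = a full 128-byte cache line */
        for (int i = 0; i < 4; i++)
            printf("%6.0f bytes/access -> %6.1f GB/s of traffic\n",
                   bytes_per_access[i],
                   accesses_per_s * bytes_per_access[i] / 1e9);
        return 0;
    }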

Here are some questions:

  1. Is the assumption that global memory loads are done at 32-byte granularity correct? I read at some places that loads are done at 128-byte granularity unless a compile-time option is provided. But that was not very clear.
  2. Can I do some optimization to increase the random access rate?
  3. Why is my observed memory bandwidth so small compared to the best-case (sequential) memory bandwidth?

Original StackOverflow post: http://stackoverflow.com/questions/25776307/global-memory-access-performance-with-random-access

Ah, you have hit my most favorite topic, “random access memory speeds”! (The forum seems totally broken at the moment; I can only see one line while writing this comment, some buttons are gone, and the page is not centered anymore. I also don’t like the black-to-white switching when going to a new page. Hopefully the forum improves in future... anyway.)

I am a little surprised that only 16K elements are “seeked”, and then only 10 times each, so that is only about 160,000 memory look-ups in total. Seems a bit low. Why so low?

1 second has 1,000 milliseconds, which is 1,000,000 microseconds, which is 1,000,000,000 nanoseconds.

Main RAM access latency is probably somewhere around 100 nanoseconds (for typical PC RAM), and maybe 8 to 800 times as slow on a GPU; I am not sure about that.

But a couple of million memory look-ups would, code-wise at least, seem a bit better for testing bandwidth / look-ups per second accurately. At the rates reported above, 163,840 dependent look-ups finish in well under a millisecond, which leaves little room for accurate timing.

Now to the bandwidth issue. One possibility is that the rate is limited by instructions per second: a memory look-up is also an instruction, and if the device cannot issue enough instructions, the memory look-ups will be slow.

However, I think you already spotted the most likely problem. If memory look-ups are done at 128 bytes per look-up, that would divide your expected result by 128 bytes / 4 bytes = 32. (I can’t remember whether it was 128 bytes or 128 bits; I think bits is more likely, so you may be wrong about the byte thingy. If it is 128 bits, that is 16 bytes, so you divide by 16 / 4 = 4 instead; your expectation must be adjusted by dividing by 4.)

Now back to what you are seeing: 192 GB/s peak, you say, and you are seeing 30 GB/s. That is roughly in line with the divide-by-4 (128-bit) case. If it were the 128-byte case it would be much worse: 192 / 32 = 6 GB/s.

Correct prediction: 192 / 4 = 48 GB/s.

Not sure why it falls short of that by about 18 GB/s; it could be additional overhead from the instruction sequences, or maybe it is the dual look-up.

First it looks up B[i] and then it looks up A[B[i]], so that is a dual look-up, I think. There is also some writing involved, which could cause some further stalls perhaps. All in all, not too bad.
192 / 8 = 24 GB/s for the dual-look-up case; what you are seeing is a bit better than that, so probably some caching effects are in play.

The actual useful memory bandwidth is closer to 3.5 GB/s. I’d advise re-reading the manuals to be sure about how it looks up memory; I think you may be way off here.
Also, even if you were correct, it would be better to make an array whose elements are 128 bytes (1024 bits) each instead of 4 bytes (32 bits). Then all memory look-ups would be “aligned at 128-byte boundaries”, if that’s what you’re on about ;)
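
For what it’s worth, here is a rough sketch of what such 128-byte elements could look like; this is my own illustration, not code from the post, and the names Elem and chase128 are made up.

    // Illustration of an array with 128-byte elements. sizeof(Elem) == 128,
    // and cudaMalloc returns pointers aligned to at least 256 bytes, so every
    // element starts on a 128-byte boundary.
    struct Elem {
        int payload[32];              // 32 x 4 bytes = 128 bytes
    };

    __global__ void chase128(const Elem *A, int *B, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        int v = B[i];
        for (int d = 0; d < 10; d++)
            v = A[v].payload[0];      // each hop reads from a 128-byte-aligned element
        B[i] = v;
    }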

Also, I wrote my own RAM test for CUDA/GPUs. It’s probably not a truly generic test, but it’s probably good enough to give some indication. It tests blocks of 8000 elements per thread and does some cycling between blocks to mimic caching behaviour and locality and such, so it is not fully random, but it is most realistic for what I would have used the GPU for if the GPU didn’t suck so much :)

http://www.skybuck.org/CUDA/RAMTest/

Give some versions a try; it would be interesting to know the results ;) So post them here and then we can have a look at those too ;)

You are measuring latency, not bandwidth.

I think the topic is “global” “random” “memory” “access” “performance”. These terms could be interpreted differently. I, at least, interpret “global” + “random” as meaning non-linear and non-local, so no (or very limited) caching effects, though with the described test there may still be some caching possible; I’m not sure. “Memory” means main RAM, not shared memory or anything like that.
“Access” + “performance” basically comes down to “seek speed”, which basically means the maximum number of random integers seeked per second.

Now the question is: what is the limiting factor for “global random access memory”? My assumption is that latency is the bottleneck. However, this is not necessarily true; it could also be bottlenecked by instructions per second or by bandwidth. Hence my request for a profiler run or other bottleneck information that the GPU may be able to provide. Which raises an interesting question for the original poster: have you examined your code with the profiler? Was it any use?
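
If it helps, something along these lines could answer that (the metric names are nvprof’s on Kepler-era GPUs; the executable name is made up):

    # Hypothetical profiler invocation; the binary name is illustrative only.
    nvprof --metrics gld_efficiency,gld_throughput,achieved_occupancy ./random_access_test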

As a quick understanding check, if memory reads cannot be coalesced by a half-warp of threads, doesn’t that serialize them? I think the lack of coalesced memory access would be the biggest bottleneck here.
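
For illustration, here is roughly what that difference looks like in code (my own sketch, not from the thread):

    // Within one warp, the first load below is coalesced into a few wide
    // transactions; the second scatters across memory and can cost one
    // transaction per thread in the worst case.
    __global__ void accessPatterns(const int *A, const int *B, int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        int sequential = A[i];       // neighbouring threads read neighbouring words
        int scattered  = A[B[i]];    // neighbouring threads read arbitrary words
        out[i] = sequential + scattered;
    }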