I’m trying to measure the rate of random memory access on GPUs. Here’s how I’m measuring it:
- The host creates an array, A, with 256 million integers and copies it to the device. Each integer in A is a random number between 0 and (256M - 1).
- The host creates another array, B, with 16K integers and copies it to the device. Each integer in B is also a random number between 0 and (256M - 1).
- The host starts a timer and launches a kernel. The kernel runs one thread per element of B, and each thread performs 10 dependent, random memory accesses. The following loop is executed for each element i of B (a fuller kernel sketch follows the list):
    for (int d = 0; d < 10; d++) { B[i] = A[B[i]]; }
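Here is a minimal CUDA sketch of that kernel. The kernel name chaseKernel, the bounds check, and the flat 1D indexing are my own choices for illustration, not code from the original program:

    // Sketch of the pointer-chase kernel (name and indexing scheme are assumptions).
    __global__ void chaseKernel(int *A, int *B, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            for (int d = 0; d < 10; d++)
                B[i] = A[B[i]];   // each load depends on the result of the previous one
        }
    }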
When the kernel finishes executing, the host stops the timer. The random access rate is (16K * 10) / timer_seconds. Approximate results: a GTX 690 gets about 950 million random accesses per second, and a Tesla K20C about 1000 million.
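For completeness, here is a sketch of the host-side timing and rate calculation using CUDA events; the device pointer names dA/dB, the 256-thread block size, and event-based timing are assumptions, not necessarily what my actual code does:

    // Host-side timing sketch (variable names and block size are assumptions).
    const int NB = 16 * 1024;                       // number of elements in B
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    chaseKernel<<<(NB + 255) / 256, 256>>>(dA, dB, NB);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);         // elapsed time in milliseconds
    double rate = (NB * 10.0) / (ms / 1000.0);      // random accesses per second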
A GTX 690 has 192 GB/s of memory bandwidth per GPU. Assuming global memory loads are done at 32-byte granularity, I'm only seeing 0.95 billion accesses/s * 32 bytes = 30.4 GB/s of memory bandwidth.
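Spelling that arithmetic out (the 32-byte transaction size is the assumption questioned below):

    double accesses_per_sec = 0.95e9;       // measured rate on the GTX 690
    double bytes_per_access = 32.0;         // assumed minimum transaction size
    double effective_bw = accesses_per_sec * bytes_per_access;   // 30.4e9 B/s ~ 30.4 GB/s
    double fraction_of_peak = effective_bw / 192e9;              // roughly 0.16 of peak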
Here are some questions:
- Is the assumption that global memory loads are done at 32-byte granularity correct? I've read in some places that loads are done at 128-byte granularity unless a compile-time option is used, but that wasn't entirely clear.
- Can I do some optimization to increase the random access rate?
- Why is my observed memory bandwidth so small compared to the best-case (sequential) memory bandwidth?
Original StackOverflow post: http://stackoverflow.com/questions/25776307/global-memory-access-performance-with-random-access