Benchmarking Different Memory Access Patterns

Curious to see how the different memory subsystems on the T10P worked, I took a memory benchmark that MisterAnderson42 posted over in the public CUDA forums and made several changes. Files are attached, and any bugs you find are probably mine, not his. :)

The benchmark tests three kinds of access patterns: random, broadcast, and linear. The source array is 256 ints. In “Random” mode, the threads in a block read the source array through a random permutation of its indices. In “Broadcast” mode, all threads in a warp read the same array entry. In “Linear” mode, thread i reads element i. Bandwidth is computed for each pattern using const, shared, texture, and global memory as the source, with 10 kernel calls, 256 threads * 500 blocks, and 10000 reads per thread.
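For readers who don't want to open the attachment, the three index patterns boil down to something like the sketch below. This is not the attached code, just an illustration for a global-memory source; the names src, perm, and sink are mine.

__global__ void read_patterns(const int *src, const int *perm, int *sink)
{
    int tid = threadIdx.x;            // 256 threads per block, so tid runs 0..255
    int sum = 0;

    sum += src[tid];                  // linear: thread i reads element i (fully coalesced)
    sum += src[blockIdx.x % 256];     // broadcast: every thread in the warp reads the same entry
    sum += src[perm[tid]];            // random: indices come from a precomputed permutation of 0..255

    // write something out so the reads cannot be optimized away
    sink[blockIdx.x * blockDim.x + tid] = sum;
}

The real benchmark repeats this sort of read 10000 times per thread and has separate variants with the source array in constant, shared, and texture memory.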

(Being too lazy to do the data analysis inside the CUDA program, I wrote a Python script to parse the profiler output and make a table.)

Here’s an output session showing the results for our 8800 GTX and T10P in the same computer:

[volsung@grad08 t10p]$ nvcc -o read_test2 read_test2.cu

[volsung@grad08 t10p]$ CUDA_PROFILE=1 ./read_test2 0

Selecting Device 0: GeForce 8800 GTX

[volsung@grad08 t10p]$ python read_test2_summary.py cuda_profile.log

(All values in GB/sec)

-------------------------------------------------
|   memtype |    random | broadcast |    linear |
-------------------------------------------------
|     const |      38.2 |     295.0 |      38.6 |
|    global |       6.3 |       5.0 |      55.5 |
|     shmem |     113.5 |     246.9 |     247.7 |
|       tex |      35.0 |      72.9 |      72.8 |
-------------------------------------------------

Total Read: 51 GB

[volsung@grad08 t10p]$ CUDA_PROFILE=1 ./read_test2 1

Selecting Device 1: GT200

[volsung@grad08 t10p]$ python read_test2_summary.py cuda_profile.log

(All values in GB/sec)

-------------------------------------------------
|   memtype |    random | broadcast |    linear |
-------------------------------------------------
|     const |      45.2 |     326.8 |      45.2 |
|    global |       5.5 |      68.6 |      48.0 |
|     shmem |     130.1 |     272.4 |     276.0 |
|       tex |      25.9 |     123.7 |     123.9 |
-------------------------------------------------

Total Read: 51 GB

Nothing here is unexpected based on the Programming Guide, but it is impressive to see the new memory transaction hardware tear through the global broadcast case. The random case still does poorly, since the permutation is over 256 elements and memory transactions come in groups of 32 elements (128 bytes). I expected it to do a little better than it did on the T10P, since on average each memory transaction should be able to service more than one thread, by a birthday-paradox sort of argument. Perhaps I should read section 5.1.2.1 more closely…
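For what it's worth, here is the back-of-the-envelope version of that birthday-paradox argument. This is my own estimate, not part of the benchmark, and it assumes the compute 1.2+ rule of one transaction per distinct 128-byte segment touched by a half-warp:

#include <math.h>
#include <stdio.h>

// A 256-int source array spans 8 segments of 128 bytes (32 ints each).
// If each of the 16 threads in a half-warp picks a segment uniformly at
// random, the expected number of distinct segments touched -- and hence
// transactions issued -- is E = S * (1 - (1 - 1/S)^T).
int main(void)
{
    const double S = 8.0;   // 128-byte segments covering the source array
    const double T = 16.0;  // threads per half-warp
    double expected = S * (1.0 - pow(1.0 - 1.0 / S, T));
    printf("expected transactions per half-warp: %.2f (vs. 1 for linear reads)\n", expected);
    // prints about 7.06, i.e. roughly 7x the traffic of the linear case
    return 0;
}

So even with coalescing, a random permutation over only 256 ints still touches nearly every 128-byte segment per half-warp, which is roughly consistent with the random numbers above sitting far below the linear ones.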

[BTW, NVIDIA, please, please pester your forum admins to permit .cu file attachments! .py would be nice too.]
read_test2_summary.py.txt (2.01 KB)
read_test2.cu.txt (5.7 KB)

Without having read your source code yet (I will try to on Monday), I see that you get fewer GB/s on the T10P than on the 8800 GTX when accessing global memory in a linear fashion. That seems strange. It also seems like you should always use a texture, since a linear access there does 123.9 GB/s vs 48.0 GB/s for the global memory case…

An algorithm of mine that accesses global memory linearly does perform a bit faster on the T10P than on the 8800 GTX (when not stressing the hardware at all, that is; once I throw enough blocks at the hardware I expect up to 10x more performance, but the controlling code is not written yet), so it looks like something might be wrong with that number. The bandwidth test also gives me somewhat higher device-to-device numbers on the T10P than on the 8800 GTX.

But this makes me want to go to work today to try texture memory instead of global :)

Actually, I’m pretty sure that is the texture cache working its magic. The source array is only 4096 bytes, and so should fit neatly into the texture cache.
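If you want to give it a try, the classic texture-reference path is only a few lines. A minimal sketch, assuming a 256-int device array (the names srcTex, d_src, and readKernel are mine, not from read_test2.cu):

#include <cuda_runtime.h>

// 1D texture reference over int data (must live at file scope with this API)
texture<int, 1, cudaReadModeElementType> srcTex;

__global__ void readKernel(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(srcTex, i % 256);  // cached fetch from the 256-int source
}

void runTextureRead(int *d_src, int *d_out, int n)
{
    // Bind the texture to the 4096-byte source array; subsequent reads go
    // through the texture cache instead of uncached global loads.
    cudaBindTexture(NULL, srcTex, d_src, 256 * sizeof(int));
    readKernel<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaUnbindTexture(srcTex);
}

The catch, of course, is that the cache only helps while the working set stays small; with a source array much larger than the texture cache I would expect the numbers to fall back toward the global ones.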

But yeah, I don’t quite understand why the global memory bandwidth in the linear read is lower for the T10P in this benchmark. bandwidthTest device-to-device shows 84.5 GB/sec for the T10P and 71.1 GB/sec for the 8800 GTX.

Here is my output with -arch=sm_13 (I get slightly faster values when I pass that option):

(All values in GB/sec)

-------------------------------------------------
|   memtype |    random | broadcast |    linear |
-------------------------------------------------
|     const |      77.3 |     559.1 |      77.3 |
|    global |       5.0 |      61.5 |      42.8 |
|     shmem |     228.4 |     472.5 |     475.0 |
|       tex |      38.8 |     184.3 |     183.5 |
-------------------------------------------------

Total Read: 51 GB

And on the 8800 GTX (without sm_13):

(All values in GB/sec)

-------------------------------------------------
|   memtype |    random | broadcast |    linear |
-------------------------------------------------
|     const |      38.1 |     292.4 |      38.3 |
|    global |       6.3 |       5.0 |      54.7 |
|     shmem |     113.7 |     247.7 |     246.2 |
|       tex |      35.0 |      72.8 |      72.7 |
-------------------------------------------------

Total Read: 51 GB

There are quite big differences between my output and yours. Here is my deviceQuery output:

There are 2 devices supporting CUDA

Device 0: "GeForce 8800 GTX"
  Major revision number:                         1
  Minor revision number:                         0
  Total amount of global memory:                 804585472 bytes
  Number of multiprocessors:                     16
  Number of cores:                               128
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.35 GHz
  Concurrent copy and execution:                 No

Device 1: "GT200"
  Major revision number:                         1
  Minor revision number:                         3
  Total amount of global memory:                 4294705152 bytes
  Number of multiprocessors:                     30
  Number of cores:                               240
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.51 GHz
  Concurrent copy and execution:                 Yes

Test PASSED

I have the other T10P model:

Device 1: "GT200"
  Major revision number:                         1
  Minor revision number:                         3
  Total amount of global memory:                 1073479680 bytes
  Number of multiprocessors:                     24
  Number of cores:                               192
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.08 GHz
  Concurrent copy and execution:                 Yes

With “only” 192 stream processors and a 1.08 GHz clock, it should run the shared and const memory cases at a bit more than half the speed of your T10P ((192 × 1.08) / (240 × 1.51) ≈ 0.57, which lines up with the tables above).


Just for the fun of it: for a simple axpy (not CUBLAS, my own trivial kernel) and 1D input arrays of more than 1M elements (128 threads per block, with as many blocks as that works out to), I'm seeing 114 GB/s on the T10P. I also have the “slower” version. This is 1.5x faster than on my 8800 GTX, and for this simple test (no striding etc. going on) there is no need to use shared memory. Back in the day we called this “streaming bandwidth”…

In single precision this works out to 19 GFLOP/s (axpy does 2 flops for every 12 bytes moved: two 4-byte reads and one 4-byte write per element), and to 9.5 GFLOP/s in double precision; the GB/s rate is obviously the same.

I’m curious as to what you achieve.

y := y + alpha*x, len(x) = len(y) = 1050625

kernel:

__global__ void saxpy(float *y, float *x, float alpha, int N)
{
  int idx = blockDim.x*blockIdx.x + threadIdx.x;
  // guard the last, partially filled block, since N need not be a multiple of blockDim.x
  if (idx < N)
    y[idx] += alpha*x[idx];
}

invocation:

dim3 grid;
dim3 block;
block.x = 128;
int N = y->N;
grid.x = (int)ceil(N/(double)block.x);
saxpy<<<grid,block>>>(y->data, x->data, alpha, N);