Memory-intensive benchmarks? Benchmarks that would benefit from higher bandwidth

I’d like to know whether there are any benchmarks that are memory bandwidth limited.
It seems that many CUDA benchmarks are compute bound.
Could anyone share some memory-intensive benchmarks?

Thanks ~

I wonder if the GPUmemtest tool could be used or modified as one. It is almost by definition bandwidth limited.

The definition itself sounds great!

Hope others reply also :)

All the level 1 and level 2 BLAS functions are bandwidth limited (e.g. SAXPY or SGEMV).

Sparse matrix-vector multiplication (SpMV) is also bandwidth limited (see here).
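
To make the claim above concrete, here is a rough back-of-the-envelope sketch of arithmetic intensity (FLOPs per byte of memory traffic) for these kernels. It assumes 4-byte floats, 4-byte column indices, a CSR layout for SpMV, a simplified SGEMV (y = A x on a square matrix), and no cache reuse at all. The function names are made up for illustration and are not part of any BLAS library.

```python
# Rough arithmetic-intensity estimates (FLOPs per byte of memory traffic).
# Assumptions: 4-byte floats, 4-byte indices, nothing served from cache.

FLOAT_BYTES = 4
INDEX_BYTES = 4


def saxpy_intensity():
    # y[i] = a * x[i] + y[i]: one multiply and one add per element;
    # traffic is read x[i], read y[i], write y[i].
    flops = 2
    bytes_moved = 3 * FLOAT_BYTES
    return flops / bytes_moved


def sgemv_intensity(n):
    # Simplified y = A x for an n x n matrix: one multiply and one add
    # per matrix entry; traffic is the matrix plus reading x and writing y.
    flops = 2 * n * n
    bytes_moved = (n * n + 2 * n) * FLOAT_BYTES
    return flops / bytes_moved


def csr_spmv_intensity():
    # Per nonzero in CSR: one multiply and one add; traffic is the value,
    # its column index, and the gathered x entry (row pointers ignored).
    flops = 2
    bytes_moved = 2 * FLOAT_BYTES + INDEX_BYTES
    return flops / bytes_moved


print("SAXPY    FLOP/byte:", saxpy_intensity())
print("SGEMV    FLOP/byte:", sgemv_intensity(4096))
print("CSR SpMV FLOP/byte:", csr_spmv_intensity())
```

Under these assumptions all three come out well below 1 FLOP per byte, so a kernel like this is expected to be limited by memory bandwidth on any device that can perform several FLOPs in the time it takes to move one byte.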

In my own experience, and that of the people I have spoken with, a majority of kernels may well be bandwidth limited. In my own quantum Monte Carlo code, a good example is 3D B-spline evaluation, which interpolates a 3D function with piecewise cubic polynomials. Each evaluation reads 64 coefficients from global memory and needs only about 128 FLOPs. All reads and writes are coalesced, but it’s still quite bandwidth limited. If you would like to take a look, see the einspline library (for the latest version, check out from the Subversion repository). Another example is a rank-1 update to a matrix inverse.
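
As a rough illustration of why the B-spline numbers above point to a bandwidth limit, here is a small roofline-style estimate. It assumes single-precision coefficients (4 bytes each), which the post does not state, and the device figures (1000 GFLOP/s peak, 100 GB/s bandwidth) are purely hypothetical placeholders, not measurements of any real GPU.

```python
# Roofline-style estimate for the 3D B-spline evaluation described above.

def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


def attainable_gflops(intensity, peak_gflops, bandwidth_gb_s):
    """Simple roofline bound: limited by either compute or memory traffic."""
    return min(peak_gflops, intensity * bandwidth_gb_s)


COEFFS_PER_EVAL = 64   # from the post
FLOPS_PER_EVAL = 128   # from the post (approximate)
BYTES_PER_COEFF = 4    # ASSUMPTION: single-precision coefficients

intensity = arithmetic_intensity(
    FLOPS_PER_EVAL, COEFFS_PER_EVAL * BYTES_PER_COEFF
)

# HYPOTHETICAL device numbers, chosen only to illustrate the calculation.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GB_S = 100.0

bound = attainable_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GB_S)

print("intensity (FLOP/byte):", intensity)
print("attainable GFLOP/s on the hypothetical device:", bound)
```

With these assumed numbers the kernel can reach only a small fraction of peak compute regardless of how well the arithmetic is tuned, which is consistent with the observation that coalescing alone does not remove the bandwidth limit.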


Ken Esler

Agreed. Every single kernel in my entire MD application is memory bandwidth bound. There is a good reason for the massive amount of advice on these forums suggesting that memory access patterns be optimized first.