How to calculate memory bandwidth from device properties?

I would like to automatically switch to the GPU with the highest memory bandwidth in my system (because our algorithms scale with memory bandwidth).

How can I calculate the memory bandwidth from the device properties (returned by the respective CUDA API function)? I did not find a field for it in the returned structure.

The memoryBusWidth (in bits) and memoryClockRate (in kHz) fields should give you an estimate of peak theoretical bandwidth (multiply the two together), which could be used for comparison.

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g1bf9d625a931d657e08db2b4391170f0

You could also do something similar to what is in the bandwidthTest sample code, for a specific measurement, which could be used for comparison.

This is incorrect advice: bandwidthTest only measures PCI bandwidth or device-to-device bandwidth.

It does not measure the GPU RAM bandwidth itself.

Device-to-device bandwidth as measured by bandwidthTest on a single device is an extremely good proxy for comparing the memory bandwidth of different GPUs.

I’m not referring to the PCI bandwidth measurement.

If device A has higher memory bandwidth than device B, then the device-to-device bandwidthTest measurement on device A will be higher than the device-to-device bandwidthTest measurement on device B.

Thanks for all the responses!

I completely forgot about the ‘bandwidthTest’ sample from the CUDA SDK; I will take the code from there …

Ok, now I see what you are getting at.

At first I assumed he was swapping cards, but now that I read it again, he has multiple GPUs in the same system.

Then again, how does device-to-device memory transfer work? If via PCI Express, then still shitty.

But I suppose there is a special SLI connector between different cards? In that case maybe device-to-device will not be bottlenecked by PCI Express.

Also, if you think about it… the device-to-device test still doesn’t make any sense.

The bandwidth will be bottlenecked by the slowest card.

How to tell which card is the slow one?

It will still require testing each card separately.

However, you seem to mean a device-to-device transfer on a single card.

I still think your theory is incorrect.

The bandwidth you are seeing, even for device-to-device on a single card, is the PCI Express bandwidth and not the GPU <-> GPU RAM bandwidth.

No kernel is ever executed for the bandwidth test.

It will only show a difference if one of the cards is slower than PCI express bandwidth.

For now I assume both cards are faster than PCI express bandwidth… so both will be bottlenecked the same way for PCI express bandwidth… at least that’s my expected outcome of this test for his system… ;)

Don’t bother with it… there is no usable code to copy from it… there is not even a kernel… if there were I would have copied it myself for my own bandwidth test… GPU RAM bandwidth test that is…

The NVIDIA bandwidth tool is basically useless for testing GPU RAM performance.

There are plenty of properties that you can use to estimate bandwidth to/from and on the device.

Here are some of the relevant properties:

Device bandwidth is: bus width * clock rate * f(ram type)

I concur with both recommendations given by txbob. These methods will not necessarily give the actual memory bandwidth, but will return results that should be proportional to the actual bandwidth, which should be all that is needed to sort the GPUs in the system by performance. Doing the actual measurements has the advantage of incorporating the performance impact of ECC (or any future changes in memory technology that may invalidate a straightforward comparison by the product of memory clock and memory interface width). Of course, memory performance is also a function of the access pattern, so it might make sense to use an actual relevant kernel from the application instead for ranking purposes.

As for the bandwidthTest application that ships with CUDA, I believe it measures the performance of cudaMemcpy (…, cudaMemcpyDeviceToDevice), which does map to a kernel running on the GPU; this can easily be confirmed with the CUDA profiler.

@njuffa, txbob: I agree with your remarks/recommendations.

All I need is a value that is proportional to the memory bandwidth of the respective GPU, so that I can switch to the GPU which is the ‘strongest’ for our algorithms (image processing stuff like optical flow, which in 90% of the cases scales with memory bandwidth).
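A minimal sketch of such a selection, assuming the clock and bus width of each device have already been read from its device properties. The MemoryInfo struct and the function name are hypothetical, and the score is only the raw product of the two fields, so it is proportional to the bandwidth rather than equal to it.

```cpp
#include <vector>

// Hypothetical per-device record, filled from the device properties.
struct MemoryInfo {
    int device_id;         // CUDA device ordinal
    int memory_clock_khz;  // memoryClockRate field (kHz)
    int bus_width_bits;    // memoryBusWidth field (bits)
};

// Returns the device_id with the largest clock * bus-width product,
// or -1 if the list is empty. On a tie the first device wins.
int pick_highest_bandwidth_device(const std::vector<MemoryInfo>& devices) {
    int best_id = -1;
    double best_score = -1.0;
    for (const MemoryInfo& d : devices) {
        const double score =
            static_cast<double>(d.memory_clock_khz) * d.bus_width_bits;
        if (score > best_score) {
            best_score = score;
            best_id = d.device_id;
        }
    }
    return best_id;
}
```

The returned id could then be passed to cudaSetDevice. If the cards use different memory types or ECC settings, a measured score may rank them more reliably than this product.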

Currently we have a heuristic to switch to the GPU with the highest number of CUDA cores (with an architecture-dependent weighting factor for the cores, e.g. a Maxwell-generation core is assumed to be 40% more ‘powerful’ than a Kepler-core).

But I noticed that this is the wrong strategy in some cases. E.g. I have a GTX 960 and a GTX 770, and this heuristic switches to the GTX 960, which has half the memory bandwidth of the GTX 770 … so I have to change the strategy.

Both your previous and your planned approach make assumptions about the major bottleneck (FLOPS before, now memory bandwidth). Running a relevant actual kernel from the app for the ranking would have the advantage that the ranking remains valid should the bottleneck shift across a large spectrum of GPUs. Of course there may not be such a single representative kernel, or, if it exists, it may be too cumbersome to use for a quick check at app startup.

cudaMemcpyDeviceToDevice is not part of the driver API. I’d recommend the driver API for more control.

There is, however, cuMemcpyDtoD.

I am not aware that these functions use kernels… that’s kinda odd… if calling a CUDA API function… would suddenly execute its own kernels ;)

Perhaps this might lead to some conflicts or other confusion in the future… at least the kernel code is hidden from the end user, I’d think? Or the kernel code wasn’t usable or something.

Furthermore, the profiler has very rarely worked on my system/GT 520… the only time it worked was with CUDA 4.0 or so… currently CUDA 6.5 has problems with deep learning networks code or something.
Also, the installation into Visual Studio didn’t go through completely, not sure why… maybe out of disk space at the time, or maybe not.

I’ll wait for CUDA 7 to sort things out.