How to calculate memory bandwidth from device properties?

HannesF99 · June 15, 2015, 3:43pm

I would like to automatically switch to the GPU with the highest memory bandwidth in my system (because our algorithms scale with memory bandwidth).

How can I calculate the memory bandwidth from the device properties (returned by the respective CUDA API function)? I did not find any such field in the returned structure.

Robert_Crovella · June 15, 2015, 3:49pm

The memoryBusWidth and memoryClockRate parameters should give you an estimate of peak theoretical performance (multiply the two together), which could be used for comparison.
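
A minimal sketch of that multiplication, assuming the inputs come from the `memoryClockRate` (in kHz) and `memoryBusWidth` (in bits) fields of `cudaDeviceProp`. The function name `peak_bandwidth_gbs` and the default of 2 transfers per clock (double-data-rate memory) are assumptions for illustration, not something stated in this thread. For ranking GPUs that use the same memory type, the plain product would serve equally well.

```cpp
#include <cassert>

// Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes).
// clock_khz           : memory clock in kHz (cudaDeviceProp::memoryClockRate)
// bus_width_bits      : interface width in bits (cudaDeviceProp::memoryBusWidth)
// transfers_per_clock : assumed 2 for double-data-rate memory (hypothetical default)
double peak_bandwidth_gbs(int clock_khz, int bus_width_bits,
                          double transfers_per_clock = 2.0) {
    assert(clock_khz > 0 && bus_width_bits > 0);
    const double bytes_per_transfer = bus_width_bits / 8.0;
    // kHz * bytes gives kB/s; dividing by 1e6 converts kB/s to GB/s.
    return transfers_per_clock * clock_khz * bytes_per_transfer / 1.0e6;
}
```

The result is an upper bound; achieved bandwidth in a real kernel will typically be lower.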

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g1bf9d625a931d657e08db2b4391170f0

You could also do something similar to what is in the bandwidthTest sample code, for a specific measurement, which could be used for comparison.

Skybuck · June 16, 2015, 1:54am

This is incorrect advice: bandwidthTest only measures PCI bandwidth or device-to-device bandwidth.

It does not measure the GPU RAM bandwidth itself.

Robert_Crovella · June 16, 2015, 2:58am

Device to device bandwidth as measured by bandwidth test on a single device is an extremely good proxy for comparison of the memory bandwidth of different GPUs.

I’m not referring to the PCI bandwidth measurement.

If device A has higher memory bandwidth than device B, then the device to device bandwidthTest measurement on device A will be higher than the device to device bandwidthTest measurement on device B.
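
As a rough sketch of how such a measurement turns into a number: time a batch of on-device copies (for instance with CUDA events around repeated `cudaMemcpy` calls using `cudaMemcpyDeviceToDevice`) and divide the traffic by the elapsed time. The helper name `d2d_bandwidth_gbs` is hypothetical, and counting each byte twice (one read plus one write) is one common convention assumed here.

```cpp
#include <cassert>
#include <cstddef>

// Effective bandwidth in GB/s for a timed batch of on-device copies.
// A device-to-device copy reads and writes every byte, so the memory
// traffic is counted as 2 * bytes_copied per iteration (assumed convention).
double d2d_bandwidth_gbs(std::size_t bytes_copied, int iterations,
                         double elapsed_ms) {
    assert(iterations > 0 && elapsed_ms > 0.0);
    const double total_bytes =
        2.0 * static_cast<double>(bytes_copied) * iterations;
    const double seconds = elapsed_ms / 1.0e3;
    return total_bytes / 1.0e9 / seconds;
}
```

Only the relative values across GPUs matter for the comparison, so the factor of 2 does not affect the ranking.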

HannesF99 · June 16, 2015, 8:34am

Thanks for all the responses!

I completely forgot about the ‘bandwidthTest’ sample from the CUDA SDK. I will take the code from there …

Skybuck · June 17, 2015, 12:54am

Ok, now I see what you are getting at.

First I was assuming he was swapping cards, but now that I read it again he has multiple GPUs in the same system.

Then again, how does device to device memory transfer work? If via PCI Express, then still shitty.

But I suppose there is a special SLI connector between different cards? In that case maybe device to device will not be bottlenecked by PCI Express.

Also if you think about it… device 2 device test still doesn’t make any sense.

The bandwidth will be bottlenecked by the slowest card.

How to tell which card is the slow one?

It will still require individual testing of each card separately.

However you seem to indicate device 2 device transfer on a single card.

I still think your theory is incorrect.

The bandwidth you are seeing even on device 2 device in single card is the PCI express bandwidth and not GPU <-> GPU ram bandwidth.

No kernel is ever executed for the bandwidth test.

It will only show a difference if one of the cards is slower than PCI express bandwidth.

For now I assume both cards are faster than PCI express bandwidth… so both will be bottlenecked the same way for PCI express bandwidth… at least that’s my expected outcome of this test for his system… ;)

Skybuck · June 17, 2015, 1:05am

Don’t bother with it… there is no usable code to copy from it… there is not even a kernel… if there were I would have copied it myself for my own bandwidth test… GPU RAM bandwidth test that is…

The nvidia bandwidth tool is basically useless for testing GPU RAM performance.

allanmac · June 17, 2015, 1:05am

There are plenty of properties that you can use to estimate bandwidth to/from and on the device.

Here are some of the relevant properties:

Device bandwidth is: bus width * clock rate * f(ram type)
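
Written out, with $w$ the bus width in bits, $f_{\text{mem}}$ the memory clock in Hz, and $m$ standing in for f(ram type), read here as the number of transfers per clock (an assumed interpretation of that factor, e.g. $m = 2$ for double-data-rate memory):

```latex
B_{\text{peak}} = \frac{w}{8} \cdot f_{\text{mem}} \cdot m \quad \text{[bytes/s]}
```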

njuffa · June 17, 2015, 6:36am

I concur with both recommendations given by txbob. These methods will not necessarily give the actual memory bandwidth, but will return results that should be proportional to the actual bandwidth, which should be all that is needed to sort the GPUs in the system by performance. Doing the actual measurements has the advantage of incorporating the performance impact of ECC (or any future changes in memory technology that may invalidate a straightforward comparison by product of memory clock and memory interface width). Of course, memory performance is also a function of the access pattern, so it might make sense to use an actual relevant kernel from the application instead for ranking purposes.

As for the bandwidthTest application that ships with CUDA, I believe it measures the performance of cudaMemcpy (…, cudaMemcpyDeviceToDevice), which does map to a kernel running on the GPU; this can easily be confirmed with the CUDA profiler.

HannesF99 · June 17, 2015, 8:27am

@njuffa, txbob: I agree with your remarks/recommendations.

All I need is a value which is proportional to the memory bandwidth of the respective GPU, so that I can switch to the GPU which is the ‘strongest’ for our algorithms (image processing stuff like optical flow, which in 90% of the cases scales with memory bandwidth).

Currently we have a heuristic to switch to the GPU with the highest number of CUDA cores (with an architecture-dependent weighting factor for the cores, e.g. a Maxwell-generation core is assumed to be 40% more ‘powerful’ than a Kepler core).

But I noticed that it is the wrong strategy for some cases. E.g. I have a GTX 960 and a GTX 770, and by this heuristic it switches to the GTX 960, which has half the memory bandwidth of the GTX 770 … so I have to change the strategy.
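
A possible replacement heuristic, sketched under the assumption that a value merely proportional to bandwidth is enough: score each GPU by bus width times memory clock and take the maximum. The `DeviceInfo` struct and `pick_highest_bandwidth` function are hypothetical names; in a real program the fields would be filled from `cudaGetDeviceProperties` for each device.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-device record; the two numeric fields would be copied from
// cudaDeviceProp::memoryBusWidth and cudaDeviceProp::memoryClockRate.
struct DeviceInfo {
    int id;
    int bus_width_bits;
    int mem_clock_khz;
};

// Returns the id of the device with the largest bus width * clock product,
// or -1 if the list is empty. The product is only a relative score.
int pick_highest_bandwidth(const std::vector<DeviceInfo>& devices) {
    int best_id = -1;
    long long best_score = -1;
    for (const DeviceInfo& d : devices) {
        assert(d.bus_width_bits > 0 && d.mem_clock_khz > 0);
        const long long score =
            static_cast<long long>(d.bus_width_bits) * d.mem_clock_khz;
        if (score > best_score) {
            best_score = score;
            best_id = d.id;
        }
    }
    return best_id;
}
```

With two cards at the same memory clock, one with a 128-bit and one with a 256-bit interface, the function returns the id of the 256-bit card, which matches the factor-of-two difference described above. Ties go to the device listed first.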

njuffa · June 17, 2015, 4:56pm

Both your previous and your planned approach make assumptions about the major bottleneck (FLOPS before, now memory bandwidth). Running a relevant actual kernel from the app for the ranking would have the advantage that the ranking remains valid should the bottleneck shift across a large spectrum of GPUs. Of course there may not be such a single representative kernel, or if it exists, it may be too cumbersome to use for a quick check at app startup.
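
If such a kernel does exist, the ranking step itself is small. The sketch below assumes the same representative kernel has already been timed on every device (for example with CUDA events) and only picks the winner; the function name `fastest_device` and the input layout are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// elapsed_ms[i] is assumed to hold the measured run time, in milliseconds,
// of the same representative kernel on device i.
// Returns the index of the fastest device, or -1 if nothing was measured.
int fastest_device(const std::vector<double>& elapsed_ms) {
    int best = -1;
    for (std::size_t i = 0; i < elapsed_ms.size(); ++i) {
        assert(elapsed_ms[i] > 0.0);
        if (best < 0 || elapsed_ms[i] < elapsed_ms[best]) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

Because the timing reflects whatever the real bottleneck is on each card, this ranking needs no assumption about FLOPS versus bandwidth. It is advisable to discard a warm-up run before timing.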

Skybuck · June 20, 2015, 1:42am

cudaMemcpyDeviceToDevice is not a driver API. I’d recommend the driver API for more control.

There is however: cuMemcpyDtoD.

I am not aware that these functions use kernels… that’s kinda odd… if calling a CUDA API… would suddenly execute its own kernels ;)

Perhaps this might lead to some conflicts or other confusions in future… at least the kernel code is hidden from the end user I’d think? Or the kernel code wasn’t usable or something.

Furthermore, the profiler has very rarely worked on my system/GT 520… the only time it worked was with CUDA 4.0 or so… currently CUDA 6.5 has problems with deep learning networks code or something.
Also, installation into Visual Studio didn’t go completely, not sure why… maybe out of disk space at the time or so, or maybe not.

I’ll wait for CUDA 7 to sort things out.
