We recently purchased some Quadro 4000 GPUs at our company and, while doing some initial tests, found a rather disturbing result. Running the bandwidthTest program that comes with the CUDA toolkit, we obtain a device-to-device bandwidth of approximately 45 GB/s, while the peak bandwidth in the specification is 89.6 GB/s. That seems a big difference. Since the applications we have to run on that GPU are mainly limited by the rate at which global memory is accessed, this has raised some concerns among us.
My question is: is that an expected result? If not, does anyone have experience with optimizing that data transfer rate to bring it closer to the peak bandwidth value?
We are using RHEL 5.5 and version 3.2 of the CUDA toolkit for Linux.
I am attaching the source for a little test app, dcopy.cu, that measures device memory throughput by copying doubles (i.e. 8-byte quantities). If I recall correctly the Quadro 4000 is a Fermi-based card, i.e. compute capability 2.0. Based on that, simply build the app with
nvcc -arch=sm_20 -o dcopy dcopy.cu
The app can then be invoked as follows:
dcopy -n<vector length>
Note: there is no space between the switch name and the switch value. To approximate peak throughput, make the vector long. I would suggest -n20000000, but I do not know how much memory the Quadro 4000 has, so if that is too large, choose smaller values until the allocation fits. The timing methodology of this app follows STREAM, in that the copy kernel is executed multiple times and the minimum time from among all kernel invocations is reported.
Attachment: dcopy.cu (5.48 KB)
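To illustrate the STREAM-style reporting described above, here is a minimal Python sketch. It is not taken from dcopy.cu. It assumes the traffic per invocation is counted as one 8-byte read plus one 8-byte write per element, as STREAM does for its copy test. The function name and the sample timings are invented for illustration.

```python
def copy_bandwidth_gbs(n_elems, times_s, bytes_per_elem=8):
    """Best-case copy throughput in GB/s (1 GB = 10**9 bytes).

    n_elems  -- number of elements copied per kernel invocation
    times_s  -- elapsed time in seconds of each invocation
    Assumption: every element is read once and written once, so one
    invocation moves 2 * n_elems * bytes_per_elem bytes.
    """
    if not times_s:
        raise ValueError("need at least one timing")
    best = min(times_s)  # STREAM-style: report the fastest run
    bytes_moved = 2 * n_elems * bytes_per_elem
    return bytes_moved / best / 1e9


# Invented timings (seconds) for three runs with -n20000000.
timings = [5.0e-3, 4.0e-3, 4.2e-3]
print(f"{copy_bandwidth_gbs(20_000_000, timings):.1f} GB/s")
```

Taking the minimum rather than the average filters out warm-up runs and other one-off disturbances, which is why the first, slower invocation does not drag the reported number down.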
By any chance, did you install any of the Quadro 4000 cards in a Dell Precision T7400? My question might be off topic. I am asking because I want to upgrade my graphics card and I have read in the Dell documentation that there is a limit of 1.5 GB on video memory. I am not sure how this impacts the performance of the card, but it might be the issue.
Thank you, and please post back about your findings. I mean, were you able to get the peak bandwidth of 89.6 GB/s? I am also using RHEL 5.5 and I hope to use this card for some CUDA-based applications.
The SDK app bandwidthTest uses cudaMemcpy(…,cudaMemcpyDeviceToDevice) to determine the memory throughput on the device. Checking with the driver team I found that there is a known issue with suboptimal throughput achieved by this API call. I would therefore suggest cross-checking the device memory throughput using the little test app I posted, using a long test vector, e.g. -n20000000. The results should tell us whether the performance is within the expected range.
Thank you for your advice. At first I was hoping the person who opened the thread would respond. In any case, I want to report that the Quadro 4000 (PNY) works on my workstation (Dell Precision T7400, RHEL 5.6 64-bit). There are some glitches, but at least your code reports a bandwidth of 79.6 GB/s as opposed to 47.7 GB/s from “bandwidthTest”. For what it’s worth, the card seems to use PCIe 16x Gen 1 as opposed to PCIe 16x Gen 2. I don’t understand the cause of this yet, but I will report back.
However, that is still about 11% below what is reported in the specification. This is not acceptable, and I believe NVIDIA should release the tools they used to measure the bandwidth. This would help the community tremendously and spare us further headaches regarding these numbers. I should probably open a new thread about this issue. Is anybody from NVIDIA reading our posts?
The stated bandwidth of 89.6 GB/sec is the theoretical maximum, 1400 MHz x (256 / 8) bytes x 2 = 89.6 GB/sec. I do not have a Quadro 4000 at my disposal, but based on my measurements on a C2050 that I have here, you should expect roughly 75% of that as DCOPY bandwidth with ECC enabled, and 85% of that with ECC disabled. This corresponds to a DCOPY throughput of 67.2 GB/sec (with ECC on) and 76.2 GB/sec (with ECC off), respectively, for the Quadro 4000. Based on the reported throughput of 79.6 GB/sec, the performance of your card is as expected.
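For anyone who wants to redo this arithmetic for another card, here is a small Python sketch of the same calculation. The function and parameter names are illustrative only. The 75% and 85% factors are the rough C2050-based estimates quoted above, not specification values.

```python
def peak_bandwidth_gbs(mem_clock_mhz, bus_width_bits, transfers_per_clock=2):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return mem_clock_mhz * 1e6 * bytes_per_transfer * transfers_per_clock / 1e9


# Quadro 4000 figures from the post: 1400 MHz, 256-bit bus, factor of 2.
peak = peak_bandwidth_gbs(1400, 256)
print(f"theoretical peak : {peak:.1f} GB/s")
print(f"75% of peak      : {0.75 * peak:.1f} GB/s  (rough estimate, ECC on)")
print(f"85% of peak      : {0.85 * peak:.1f} GB/s  (rough estimate, ECC off)")
print(f"measured 79.6 GB/s is {79.6 / peak:.1%} of peak")
```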
Thank you njuffa. Now it is clear. It would have saved me some headaches if these details were reported in the specs sheet. I understand the desire to report the best numbers but 89.6 GB/s is not what one gets in reality and this should be stated clearly. Anyhow, it is good we got an understanding of what is going on.
Hi,
Sorry for not having answered earlier, but we had some problems with the machine that hosts the Quadro GPUs, which we only resolved today. Thanks, njuffa, for the clarification and for the piece of code to test the bandwidth. Like Pingo, we are also obtaining 79.6 GB/s with the Quadro 4000 GPU. So it seems that everything is working as expected.
Hi,
In the attachment please find a plot of read / write / read-and-write throughput of my GTX 480, cuda 3.2 RC, driver 260.19.06.
The green arrow shows the theoretical peak. The red arrow shows the value reported by cuda SDK bandwidthtest.
The data come from njuffa’s test for n = 50 000 000 (read and write) and from slightly modified versions of it for read-only and write-only kernels.
Read-only was achieved by reading a double into a volatile shared double tab[1].
Njuffa’s program was modified to test all (!) block sizes from 1 to 1024.
It is very instructive to see all the results!
My conclusions:
bandwidthTest is a [censored out] and NVIDIA should improve it considerably or remove it from the SDK. I know many people dismayed by the
device-to-device transfer rates it reports, especially on Fermi-based cards, and all of them, all of us, felt kind of deceived, if not by NVIDIA, then
at least by the card manufacturers. If the specification says “177.4 GB/s”, why do we get 118 GB/s?
Throughput depends on whether you read, write, or do both. Thus, it depends on how read and write instructions are “interleaved”.
There is no universal gauge to measure the bandwidth.
It’s funny that you can achieve peak write performance with just 1 warp per block (I guess the GPU launches 8 such blocks per multiprocessor).
One “critical number of warps per block” is 16, the other is 24. However, 24 may have something to do with the gridsize/blocksize algorithm:
if there are fewer than 24 warps per block, the number of blocks is fixed at 65520, and beyond that it starts to decrease.
Njuffa is not quite right: the block size of 12*32 = 384 threads (12 warps) is not particularly distinguished from the others, though on average it performs quite well.
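The 24-warp threshold mentioned above is consistent with a launch configuration that uses one thread per element, capped at 65520 blocks. The Python sketch below assumes exactly that rule. The test program's source is not reproduced here, so `grid_blocks` is a guess that happens to match the observed behaviour for n = 50 000 000, not the actual code.

```python
MAX_BLOCKS = 65520  # block-count cap quoted in the post above
WARP_SIZE = 32


def grid_blocks(n_elems, threads_per_block, max_blocks=MAX_BLOCKS):
    """Blocks launched under the assumed rule: one thread per element, capped."""
    needed = -(-n_elems // threads_per_block)  # ceiling division
    return min(needed, max_blocks)


n = 50_000_000
for warps in (22, 23, 24, 25):
    print(f"{warps} warps/block -> {grid_blocks(n, warps * WARP_SIZE)} blocks")
```

Under this assumption the cap is hit for every block size below 24 warps, so each thread has to process more than one element there, while from 24 warps upward the grid shrinks and each thread handles at most one element.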
One more remark for NVIDIA: bandwidthTest measures the bandwidth in MiB (2^20 bytes) but incorrectly calls the units MB, whereas the documentation uses “true” MB (10^6 bytes).
This changes the result by a few per cent. Frustrating if you try to use it as a reference value in tests.
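To put a number on “a few per cent”, here is a short Python sketch of the conversion. It takes the premise of this post at face value, namely that the tool divides by 2^20 bytes while labelling the result MB. The helper name is made up.

```python
MIB = 2**20  # bytes in one mebibyte
MB = 10**6   # bytes in one "true" megabyte


def mib_per_s_to_mb_per_s(rate_mib_s):
    """Convert a rate expressed in MiB/s into MB/s."""
    return rate_mib_s * MIB / MB


print(f"relative difference: {MIB / MB - 1:.2%}")
print(f"1000 MiB/s = {mib_per_s_to_mb_per_s(1000):.3f} MB/s")
```

The same factor compounds at larger prefixes: a GiB (2^30 bytes) is about 7.4% larger than a GB (10^9 bytes), so it matters which divisor a tool uses at each step.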