Upper limit for memory bandwidth on the device?

NCC-1701D · July 3, 2009, 7:43am

Are the bandwidths (GB/s) obtained with a device → device memcpy the maximum that can be attained?

I ask because:
On the Quadro FX 5600, the specs report a memory bandwidth of 76.8 GB/s, but the maximum I have seen so far using memcpy is ~66 GB/s.
The same holds for the Tesla C1060: the reported memory bandwidth is 102 GB/s, and the maximum observed using memcpy is ~77 GB/s.

That’s a considerable discrepancy: the spec figures are about 16% (Quadro) and 32% (C1060) higher than what I measured.
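The two percentages depend on which figure is taken as the baseline. A small Python sketch (the function name `gap_percent` is made up for illustration) computes the gap both ways from the numbers quoted above:

```python
def gap_percent(spec_gbps, measured_gbps):
    """Return (gap relative to measured, gap relative to spec), in percent."""
    diff = spec_gbps - measured_gbps
    return 100.0 * diff / measured_gbps, 100.0 * diff / spec_gbps

# Spec and measured values as quoted in this post.
for name, spec, measured in [("Quadro", 76.8, 66.0), ("C1060", 102.0, 77.0)]:
    over_measured, under_spec = gap_percent(spec, measured)
    print(f"{name}: spec is {over_measured:.1f}% above measured; "
          f"measured is {under_spec:.1f}% below spec")
```

So 16% and 32% are the gaps relative to the measured values; relative to the spec sheet, the copies reach roughly 86% and 75% of peak.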

So, is the reported bandwidth purely theoretical (calculated from the memory speed and so on) and not really achievable, or not?
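For reference, a spec-sheet number of this kind is normally derived from the memory clock and bus width alone. The sketch below assumes double-data-rate memory (two transfers per clock) and uses clock and bus-width figures that are not stated in this thread (800 MHz on a 384-bit bus for the Quadro, 800 MHz on a 512-bit bus for the C1060); treat them as assumptions to check against the data sheets.

```python
def peak_bandwidth_gbps(mem_clock_mhz, bus_width_bits, transfers_per_clock=2):
    """Theoretical peak in GB/s (1 GB = 1e9 bytes), with no overhead modelled."""
    bytes_per_transfer = bus_width_bits / 8
    return mem_clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9

print(peak_bandwidth_gbps(800, 384))  # Quadro (assumed clock and bus width)
print(peak_bandwidth_gbps(800, 512))  # Tesla C1060 (assumed clock and bus width)
```

With those inputs the formula gives about 76.8 and 102.4 GB/s, in line with the quoted specs, which suggests the spec value is a raw ceiling that leaves out any overhead.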

any thoughts ?? (from NVIDIA ??)

NCC-1701D · July 6, 2009, 5:35am

no ideas from anybody on this ???

cbuchner1 · July 6, 2009, 9:35am

Someone previously posted a CUDA kernel that performs the same task as cudaMemcpy(), but a few percent faster (depending on block size).

But I do not have the link to the thread.

Nico · July 6, 2009, 10:29am

I believe that was the OP himself ;)

N.

cbuchner1 · July 6, 2009, 4:23pm

I am referring to an older thread that actually offered a CUDA-kernel-based replacement for device-to-device cudaMemcpy().

But I can’t seem to find it now.

Nico · July 6, 2009, 5:19pm

I thought you were referring to this thread which was also started by the OP.

N.

tmurray · July 6, 2009, 6:19pm

There’s overhead for signaling and the like that isn’t included in the bandwidth measurement. That’s just measuring what kind of bandwidth you can actually see, not necessarily what’s actually happening on a hardware level.

nbell · July 6, 2009, 8:48pm

I’ve measured ~130 GB/s of memory (read) bandwidth using a reduction kernel similar to one in the SDK. The fastest write bandwidth I’ve observed came from a simple kernel that fills all array elements with a constant (like std::fill() in C++), and that hit ~70 GB/s. These measurements were collected on a GTX 280 with a theoretical peak bandwidth of 141.7 GB/s.

Given these figures, your memcpy() measurements seem reasonable for a kernel that reads and writes the same amount of data.
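When comparing a copy against read-only or write-only kernels, it helps to be explicit about which bytes are counted. A minimal sketch, assuming the device-to-device figure counts every byte twice (once read, once written), which appears to be the SDK bandwidthTest convention but is worth verifying; the function name and the sample timing are invented for illustration:

```python
def copy_bandwidth_gbps(bytes_copied, seconds, count_both_directions=True):
    """Effective bandwidth of a device-to-device copy in GB/s (1 GB = 1e9 bytes)."""
    # A copy reads every byte once and writes it once.
    traffic = bytes_copied * (2 if count_both_directions else 1)
    return traffic / seconds / 1e9

# Hypothetical timing: 256 MiB copied in 7.0 ms.
n = 256 * 1024 * 1024
print(copy_bandwidth_gbps(n, 7.0e-3))         # read + write traffic
print(copy_bandwidth_gbps(n, 7.0e-3, False))  # payload only
```

The two conventions differ by a factor of two, so a copy figure is only comparable to a read-only or write-only figure when both count total memory traffic.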

NCC-1701D · July 7, 2009, 4:47am

Are you referring to this one - http://forums.nvidia.com/index.php?showtopic=85562

by alex dubinsky ??

I tested alex’s version of the memcpy on different platforms (Tesla C870, Tesla C1060…). The improvements are not uniformly 20%, and they are only prominent for odd data sizes…

I had also observed memcpy() being slower for particular odd data sizes (very prominently on the Quadro FX 5600). I posted this here -

This is what I think Nico was referring to…

In that case, as Nico pointed out, this was not reproducible on different cards…

But for me, on the Quadro FX 5600, the slowdown shows up at a data size of 402653184 bytes (384 MB, and its multiples) with that particular 3-way split kernel, i.e. a 3-way out-of-place split of an array containing 50,331,648 floats. There I observed the speed of the memcpy dropping by around 20% to 58 GB/sec (from its usual peak of 65 GB/sec on the Quadro). I was not able to get any confirmation of this, and no further information as to why this was happening.

NCC-1701D · July 7, 2009, 4:53am

@nbell…

I have a question about this: you mentioned that you measured the memory bandwidth for reads only, using a reduction-type kernel?

So this kernel was just reading in data (into shared memory?), and you timed that by timing the kernel execution?

In that case there wouldn’t be any check that the kernel actually read in the data as required. Is that the case, or do you time the kernel differently?

It would be really great if you could provide the code for this test (in some sort of stripped-down form…).

thanks in advance

nbell · July 7, 2009, 5:02pm

The measured time was for the whole reduction operation, which includes:

1) reading the data
2) writing out per-block sums to global memory
3) copying per-block sums from global memory to host memory
4) summing the per-block sums in host memory

so the actual reading of data in step 1) is actually slightly faster than my measurement suggests (this is probably negligible with large arrays).

The kernel in question is here:

http://code.google.com/p/thrust/source/bro…a/reduce.inl#87

It’s similar to the reduce5() function used in the reduction example in the CUDA SDK. If you run that example with larger input sizes (e.g. 100M integers) then you should be able to see ~130 GB/s on a GTX 280.
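To turn a timing of that reduction into a bandwidth figure, divide the bytes read by the elapsed time. A small sketch, assuming 4-byte integers and ignoring the comparatively tiny per-block-sum traffic; the function names and the 3.1 ms sample timing are illustrative only:

```python
def read_bandwidth_gbps(num_elements, seconds, bytes_per_element=4):
    """Read bandwidth in GB/s (1 GB = 1e9 bytes) for one pass over the input."""
    return num_elements * bytes_per_element / seconds / 1e9

def expected_time_ms(num_elements, bandwidth_gbps, bytes_per_element=4):
    """Time in milliseconds needed to read the input at a given bandwidth."""
    return num_elements * bytes_per_element / (bandwidth_gbps * 1e9) * 1e3

print(expected_time_ms(100_000_000, 130.0))      # 100M ints at ~130 GB/s
print(read_bandwidth_gbps(100_000_000, 3.1e-3))  # hypothetical 3.1 ms measurement
```

At ~130 GB/s, 100M integers (400 MB) take only about 3 ms, so timer resolution and launch overhead matter; averaging over several runs helps.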

cbuchner1 · July 7, 2009, 6:11pm

Yes, just your clickable link is broken. Correct link is HERE http://forums.nvidia.com/index.php?showtopic=85562

NCC-1701D · July 8, 2009, 4:18am

Thanks a lot… I ran these tests to measure the bandwidth on my Quadro, and they seem to match my earlier measurements (max of ~66 GB/sec) :D, which still leaves the specification value about 16% higher. So I guess it’s fair to assume that this is the maximum bandwidth that can be practically attained (to date)? I could then use it as the benchmark when comparing the performance of other data-access-dominated operations (instead of comparing against the value in the specifications).

Any ideas about the kink in the performance of the device-to-device memcpy? The bandwidth drops to ~58 GB/s (it is otherwise around ~66 GB/sec) on the Quadro for a data size of 50,331,648 floats (and its multiples). Does this have to do with some specifics of the memcpy implementation?

thanks again…

NCC-1701D · July 8, 2009, 4:19am

thanks for that…

i have edited the link in my previous post also ;)
