Upper limit for memory bandwidth on the device?

Are the bandwidths (GB/s) obtained with a device-to-device memcpy the maximum that can be attained?

On the Quadro FX 5600, the specs report a memory bandwidth of 76.8 GB/s, and the maximum I have seen so far using memcpy is ~66 GB/s!
The same holds for the Tesla C1060: the reported memory bandwidth is 102 GB/s and the maximum observed using memcpy is ~77 GB/s.

That is a considerable discrepancy: the spec value is about 16% (Quadro) and 32% (C1060) higher than what I measure.

So, is the reported bandwidth purely theoretical (calculated from the memory speed and so on…) and not really achievable, or is it?

Any thoughts? (From NVIDIA, perhaps?)

No ideas from anybody on this?

Someone previously posted a CUDA kernel that performs the same task as cudaMemcpy(), but a few percent faster (depending on block size).

But I do not have the link to the thread.

I believe that was the OP himself ;)


I am referring to an older thread that offered a CUDA-kernel-based replacement for the device-to-device cudaMemcpy().

But I can’t seem to find it now.

I thought you were referring to this thread which was also started by the OP.


There’s overhead for signaling and the like that isn’t included in the bandwidth measurement. That’s just measuring what kind of bandwidth you can actually see, not necessarily what’s actually happening on a hardware level.

I’ve measured ~130 GB/s of memory (read) bandwidth using a reduction kernel similar to one in the SDK. The fastest write bandwidth I’ve observed was with a simple kernel that fills all array elements with a constant (like std::fill() in C++), which hit ~70 GB/s. These measurements were collected on a GTX 280 with a theoretical peak bandwidth of 141.7 GB/s.

Given these figures, your memcpy() measurements seem reasonable for a kernel that reads and writes the same amount of data.

Are you referring to this one, by alex dubinsky? http://forums.nvidia.com/index.php?showtopic=85562

I have tested alex’s version of the memcpy on different platforms (Tesla C870, Tesla C1060…). The improvements are not uniformly 20%, and they are prominent only in the case of odd data sizes…

I had also observed memcpy() being slower for particular odd data sizes (very prominently on the Quadro FX 5600). I have posted about it here -

This is what Nico was referring to…

In that case, as Nico pointed out, it was not reproducible on different cards…

But for me, on the Quadro FX 5600, it shows up for a data size of 402653184 bytes (384 MB, and its multiples). That is the size from that particular 3-way split kernel, i.e. a 3-way out-of-place split of an array containing 50,331,648 floats. There I observed the speed of the memcpy dropping by around 20% to 58 GB/s (from its usual peak of 65 GB/s on the Quadro). I was not able to get any confirmation of this, and no further information as to why it was happening.


I have a question regarding this: you mentioned that you measured the memory bandwidth for reads only, using a reduction-type kernel?

So this kernel was just reading in data (into shared memory?) and you were timing that by timing the kernel execution, right?

In that case there wouldn’t be any check that the kernel actually read in the data as required. Is this the case, or do you time the kernel differently?

It would be really great if you could provide the code for this test (in some sort of stripped-down form…).

thanks in advance

The measured time was for the whole reduction operation, which includes:

  1. reading the data

  2. writing out per-block sums to global memory

  3. copying per-block sums from global memory to host memory

  4. summing the per-block sums in host memory

So the actual reading of data in step 1) is slightly faster than my measurement suggests (the difference is probably negligible with large arrays).
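The four steps can be sketched on the host alone. This is only a CPU stand-in to show the structure, not the kernel from this post: each "block" here is just a contiguous chunk of the input, and the names and chunk size are made up.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Steps 1 and 2 (stand-in for the GPU kernel): every "block" reads one
// chunk of the input and writes a single partial sum. chunk must be > 0.
std::vector<long long> block_sums(const std::vector<int>& data, std::size_t chunk) {
    std::vector<long long> sums;
    for (std::size_t start = 0; start < data.size(); start += chunk) {
        std::size_t end = std::min(start + chunk, data.size());
        auto first = data.begin() + static_cast<std::ptrdiff_t>(start);
        auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
        sums.push_back(std::accumulate(first, last, 0LL));
    }
    return sums;
}

// Steps 3 and 4: the per-block sums (already "copied" to the host in this
// emulation) are added up to give the final result.
long long final_sum(const std::vector<long long>& sums) {
    return std::accumulate(sums.begin(), sums.end(), 0LL);
}
```

With 1000 input elements and a chunk of 256 there are only 4 partial sums, so steps 2 to 4 touch a tiny fraction of the data that step 1 reads. The final sum also doubles as a check that every element really was read, since a wrong total would show up immediately.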

The kernel in question is here:


It’s similar to the reduce5() function used in the reduction example in the CUDA SDK. If you run that example with larger input sizes (e.g. 100M integers) then you should be able to see ~130 GB/s on a GTX 280.

Yes, but your clickable link is broken. The correct link is: http://forums.nvidia.com/index.php?showtopic=85562

Thanks a lot… I ran these tests to measure the bandwidth on my Quadro and they match my earlier measurements (max of ~66 GB/s) :D, with the value in the specifications still about 16% higher. So I guess it is fair to assume that this is the maximum bandwidth that can be attained in practice (to date), and to use it as the benchmark when comparing the performance of other data-access-dominated operations (instead of comparing against the value in the specifications)?

Also, any ideas about the kink in the performance of the device-to-device memcpy? The bandwidth drops to ~58 GB/s (it is otherwise around 66 GB/s) on the Quadro for a data size of 50,331,648 floats (and its multiples). Does this have to do with some specifics of the memcpy implementation?

Thanks again…

Thanks for that…

I have edited the link in my previous post as well ;)