Peak bandwidth to memory on Tesla M2050

I am trying to get close to the peak bandwidth to memory on the Tesla M2050 using the following kernel:

__global__ void 

__launch_bounds__(NUMTHREADS, NUMBLOCKS/14)

copystride(double *list, int n, double *copy){

  int tid = threadIdx.x + blockIdx.x*blockDim.x;

  int stride = blockDim.x*gridDim.x;

  for(int i=tid; i < n; i = i + stride)

    copy[i] = list[i]; 

  1. The kernel is called with NUMTHREADS=1024 and NUMBLOCKS=14.

  2. list and copy are allocated using cudaMalloc(). So they are 256 byte aligned (see “NVIDIA CUDA Best Practices”).

  3. n is a large multiple of NUMTHREADS*NUMBLOCKS close to 100 million.

  4. it tid%32==0, we are talking about the first thread of a warp. The memory touched by that thread is 256 byte aligned.

  5. Each warp reads 256 bytes from list and writes 256 bytes to copy. Since a Fermi cache line is 128 bytes, that is two cache lines read and two cache lines written.

The peak bandwidth I realize with this kernel is only 90 GB/s against the theoretical maximum of around 150 GB/s.

“NVIDIA CUDA Best Practices” (Section realizes bandwidth of 120 GB/s for GTX 280 against the theoretical peak bandwidth of 140 GB/s. The program used in that section gives only 78 GB/s on M2050.

What should one do to get closer to the peak bandwidth of Tesla M2050? What should one do to get past the realized bandwidth of GTX280?

I am using cudaEventRecord() with appropriate synchronization for timing. I think the numbers might improve if I used clock() from CUDA runtime inside the kernel, but not by much.

To get optimal memory throughput for a Fermi-class device, one wants a fair number of thread blocks going per multi-processor, and then use a lot of thread blocks. A good heuristic for the size of the thread block is 384 threads, this allows 6 thread blocks to run concurrently on each multi-processor (which can support up to 1536 threads). For grid size, it is best to use as many blocks as are needed so each thread handles a single array element. If the array is larger, just use 65520 thread blocks (this guarantees that the number of thread blocks equally divides by the number of multiprocessors for many common GPU configurations, in particular multiprocessor counts of 14, 15, and 16). Putting these pieces together, one winds up with code like this (here, NUM is the number of array elements to be copied)

__global__ void STREAM_Copy (double* src, double* dst, int len)


    int stride = gridDim.x * blockDim.x;

    int tid = blockDim.x * blockIdx.x + threadIdx.x;

    for (int i = tid; i < len; i += stride) {

        dst[i] = src[i];



dim3 dimBlock(384);

int threadBlocks = (NUM + (dimBlock.x - 1)) / dimBlock.x;

if (threadBlocks > 65520) threadBlocks = 65520;

dim3 dimGrid(threadBlocks);

STREAM_Copy<<<dimGrid,dimBlock>>>(d_src, d_dst, NUM);

njuffa: Thanks for replying. I tried what you suggested.

The kernel was slightly modified:

__global__ void 

__launch_bounds__(NUMTHREADS, 1536/NUMTHREADS)

copystride(double *list, int n, double *copy){

  int tid = threadIdx.x + blockIdx.x*blockDim.x;

  int stride = blockDim.x*gridDim.x;

  for(int i=tid; i < n; i = i + stride)

    copy[i] = list[i]; 


First I set up the number of threads per block and the number of blocks as follows (notice the use of flag):

int n=100*1000*1000;

  int nb;

  if(flag == 0)

    nb = NUMBLOCKS;


    int nb = n/NUMTHREADS;

    if(NUMTHREADS*nb < n)


    if(nb > 65520)

      nb = 65520;


Next I called the copying kernel.

copystride<<<nb, NUMTHREADS>>>(dlist, n, dcopy);

The time for this kernel was saved in telapsed (time in milliseconds). The bandwidth was printed out as follows:

cout<<"Bandwidth to memory = "<<2*n*8.0/(telapsed/1000)/1e9<<" GB/s"<<endl;

The program was saved in and compiled using the following command.

nvcc --ptxas-options=-v -arch=sm_20  -DNUMTHREADS=384 -DNUMBLOCKS=56 -o copy.exe

If flag==0 the number of blocks is 56 which keeps all the SMs busy. There are 4 thread blocks per SM and there are 14 SMs. If flag==1 the number of blocks is 65520.

In both cases the bandwidth was the same. It was equal to 88 GB/s.

  1. I am not sure how having 65520 blocks instead of 56 can help. It may make for more portable code, but 56 is enough to keep all the SMs busy. Having too many blocks may create overhead when the register file has to be changed to bring in a new block.

  2. Why is the program still well short of peak bandwidth?

As I stated, a high number of thread blocks (and thus as little work as possible per thread) is necessary to reach the best possible throughput on a Fermi-class GPU. I will prepare a simple copy benchmark and post it here, then we can compare notes for an apples-to-apples comparison.

What happens if you create more outstanding memory requests through partial unrolling of the loop, e.g. by preceeding it by [font=“Courier New”]#pragma unroll 4[/font]?

Njuffa: Thanks for offering to prepare a simple copy benchmark. I would very much like to understand how the M2050 memory controllers work.

Tera: I thought about unrolling the loop explicitly. That would save some branch overhead, but in the presence of so much thread level parallelism and the very large latency of 400 to 800 cycles to DRAM, I would expect the branch overhead to have minimal impact. It may be worth a try.

I am not certain how the M2050 executes the instruction stream. Does it have a branch prediction unit for unrolling loops dynamically? Does it do dependency analysis? In the copy program the inner loop is a load and a store and the different iterations are independent. An x86 core would detect the independence dynamically and keep several loads (10 cache line loads to be precise on Nehalem) in flight at the same time. On a GPU, I think that would not make so much sense because of the emphasis on thread level parallelism. Even if it did, having to keep the threads in a warp on the same instruction may make branch prediction and out of order execution impractical to implement.

On x86, the stores would all have to be streaming stores (MOVNTPD) for optimal bandwidth. My impression is that all stores on GPU are streaming stores, but I could be wrong. Are the Fermi M2050 L1 and L2 caches write-back or write-through?

Please find my little test app in the attached file I hope I didn’t screw up anything as I was programming this on the double. I did a quick test on one Windows and one Linux system. Simply build with: nvcc -arch=sm_20 -o dcopy

Below is the output when running with an M2050 (ECC on) on a 64-bit Linux system. With ECC turned off, one would see somewhat higher throughput.

~/[…] $ dcopy -n5000000

dcopy: operating on vectors of 5000000 doubles (= 4.000e+07 bytes)

dcopy: using 384 threads per block, 13021 blocks

dcopy: mintime = 0.719 msec throughput = 111.29 GB/sec

~/[…] $ dcopy -n10000000

dcopy: operating on vectors of 10000000 doubles (= 8.000e+07 bytes)

dcopy: using 384 threads per block, 26042 blocks

dcopy: mintime = 1.380 msec throughput = 115.94 GB/sec

~/[…] $ dcopy -n20000000

dcopy: operating on vectors of 20000000 doubles (= 1.600e+08 bytes)

dcopy: using 384 threads per block, 52084 blocks

dcopy: mintime = 2.734 msec throughput = 117.04 GB/sec

~/[…] $ dcopy -n30000000

dcopy: operating on vectors of 30000000 doubles (= 2.400e+08 bytes)

dcopy: using 384 threads per block, 65520 blocks

dcopy: mintime = 4.213 msec throughput = 113.94 GB/sec (5.48 KB)

Unrolling in this case is not about the branch overhead, but about creating more outstanding memory requests.

The compiler will reorder reads before writes inside the loop, so that you can have (in this example) four independent reads on the fly before a dependent write is encountered.

As far as I know, Nvidia GPUs don’t do out-of-order execution (so far?).

njuffa: Let me first note that all the runs described in this post are on M2050 on Amazon EC2. M2050 has 14 SMs.

First I will give the timing results for my kernel, which follows:

__global__ void 

__launch_bounds__(NUMTHREADS, 1536/NUMTHREADS)

copystride(double *list, int n, double *copy){

  int tid = threadIdx.x + blockIdx.x*blockDim.x;

  int stride = blockDim.x*gridDim.x;

  for(int i=tid; i < n; i = i + stride)

    copy[i] = list[i]; 


This kernel is practically identical to the kernel you use. For timing I used cuda events (recommended by the Best Practices, Section 2.1).

The value of n used was

n=384*56*4000;//n is 86016000

This makes for perfect load balancing across the threads.

With 384 threads per block and 56 thread blocks, I measured a bandwidth of 89 GB/s.

With 384 threads per block and 56000 thread blocks, I measured a bandwidth of 100 GB/s. Both these numbers very consistent.

56 and 56000 are natural numbers for the number of thread blocks because M2050 has 14 SMs each of which can hold 4 thread blocks (14 X 4 = 56) and the chosen value of n gives perfect load balancing.

I accept your point that having a large number of thread blocks makes for better bandwidth. tera: “#pragma unroll 4” made no difference. I did not check the PTX to see if the compiler actually unrolled. My guess is it did not. When I unrolled explicitly, things got slightly slower.

Next I compiled your

nvcc --ptxas-options=-v -arch=sm_20 -o dcopy.exe

When I ran it on the M2050, I go the following output:

[root@ip-10-17-129-108 bq-gpu]# dcopy.exe

dcopy: operating on vectors of 10000000 doubles (= 8.000e+07 bytes)

dcopy: using 384 threads per block, 26042 blocks

dcopy: mintime = 1.735 msec  throughput = 92.22 GB/sec

On other runs the throughput reported was as low as 82 GB/s.

I noticed that you were using a CPU timer in your second() function. So following Best Practices, Section 2.1, I inserted cudaThreadsSynchronize() as the very first line of your second() (for the unix version only). I got the following output.

[root@ip-10-17-129-108 bq-gpu]# dcopy.exe

dcopy: operating on vectors of 10000000 doubles (= 8.000e+07 bytes)

dcopy: using 384 threads per block, 26042 blocks

dcopy: mintime = 2.571 msec  throughput = 62.23 GB/sec

On other runs the reported throughput was as high as 71 GB/s.

  1. I think using a power of 10 for the size of the vector is not good for load balance.

  2. The peak bandwidth of M2050 is calculated using its clock speed of 1.15 GHz, two transfers per cycle and 512 bits in the width of the memory interface (Best Practices, Section 2.2.1). It works out to about 150 GB/s.

  3. One would expect better bandwidth from a program that does more reading than writing. So I wrote a kernel which adds all the numbers in a list. With 56 blocks I got a bandwidth of 95 GB/s. With 56000 blocks I got a bandwidth of 108 GB/s. Note that with 56000 blocks, the kernel writes 56000*384 doubles to global memory so that each thread may save its result. That is a lot of writing.

  4. My guess is one may approach the peak bandwidth if the program only reads to a register but never writes. I tried to write some inline assembly using PTX to do that, but could not get it to work. My impression is nvcc has only a limited implementation of inline assembly.

  5. I would like to understand why having 56000 blocks instead of 56 helps. With 56 blocks, each thread block can be scheduled on an SM and the number of warps is (384/32)*56=672. Perhaps that is not enough to hide the latency to DRAM. I tried 560 blocks. Surely 6720 warps is enough to hide latency to DRAM. Yet the performance was only 90 GB/s and not 100 GB/s that I got with 56000 warps. What is going on? Why does using so many blocks help? I would expect that every a time block finishes running and another blocks is brought in, there is overhead to set up the registers of the new block. Why does that overhead not hurt? How much is that overhead? The NVIDIA documentation is completely silent on this issue.

How does one turn ECC (error correction for memory) on and off?

nvidia-smi --gpu= --ecc-config={0|1}

where an ECC configuration of 1 corresponds to “ON” and an ECC configuration of 0 corresponds to “OFF”. Note that setting the ECC configuration with nvidia-smi merely stages the setting; the selected ECC setting takes effect at the next reboot.

nvidia-smi --gpu= -r

reports the ECC configuration currently in effect, as well as the staged configuration that will take effect with the next reboot.

With regard to the dcopy code, please note that cudaThreadSynchronize() is called as part of CHECK_LAUNCH_ERROR() which is invoked right after the kernel and before the call to second(). There is thus no need to add additional calls to cudaThreadSynchronize().

It is correct that selecting vector lengths that exactly balance the load across SMs can achieve slightly higher throughput. I purposefully didn’t pick those values to avoid the impression of cherry-picking the most favorable results. Instead I selected various large numbers to demonstrate what range of throughputs could be expected for large copies on an M2050. I forgot to mention that I compiled the code with CUDA 3.2 (final), and ran on matching drivers. While I don’t believe this matters much (if at all), I thought I should mention it for completeness.