How to get peak rate with simple opeartion Question about performance optimization

We know that device to device band width of Quadro FX 5600 is 80GBs/ peak.

In my opinion the transfer operation includes two operations

  • read from global source memory to register (or local shared)

  • write value from register to the global destination memory

So actually some simple operation like add a constant to the source memory and write result to the global memory should yields the same bandwidth . However when i try a simple kernel

template<class T>

__global__ void cuvppMul_C1( T* g_idata, T* g_odata, const T s) 


  const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

  g_odata[idx] = g_idata[idx] * s;


and measure the bandwidth, with len = 1 << 22

nIters = 10000, num_threads = 128

dim3  threads(num_threads, 1, 1);

dim3  grid(len/num_threads, 1, 1);

cuvppMul_C1<T><<< grid, threads >>>(d_idata, d_odata, 4.5f);


CUT_SAFE_CALL( cutResetTimer( timer)); 

 // execute the kernel

for (int i=0; i < nIters; ++i){

        CUT_SAFE_CALL( cutStartTimer( timer));

        cuvppMul_C1<T><<< grid, threads >>>(d_idata, d_odata, 4.5f);


        CUT_SAFE_CALL( cutStopTimer( timer));


 runTime = cutGetAverageTimerValue(timer);

 printf("Average time: %f ms\n", runTime);

 printf("Bandwidth:    %f GB/s\n\n", (len * sizeof(T)) / (runTime * 1.0e6));

I can only get 30GB. Why i can not get the peak rate. How can i reach the peak rate.

Any idea is appreciated

Have you checked the resulting PTX code? Consider you’re actually gaining almost half the nominal Quadro bandwidth, and with such a simple kernel to run, I think that you should check the assembly code to see if NVCC hadn’t pushed out dirty or unbalanced code.

Nevertheless, I’ll try to benchmark your scheme without using templates (even if templates are specialized at compile time…for what I know).

Sorry if I missed it, but what is T in your test? If T is a 32-bit wide variable (int or float) you should easily hit 70GB/s.

If T is a 128-bit wide type such as float4 or uint4, then 30GB/s is about all you can achieve, despite full coalescing. I’ve gotten better performance by reading float4’s from textures.

T is float and int or like you say 32 but wide. But the peak rate is 30GB not 70G.

I don’t think template make any different

Ok, now that I tried running your code, I see the problem. You are only counting lensizeof(T) bytes of throughput, but your kernel performs lensizeof(T) reads and len*sizeof(T) writes = twice as many bytes transferred.

For reference, here is a full file that can be compiled directly by “nvcc -o test” for anyone who wants to repeat the test. It has been modified to use event timers and to print the throughput in binary GiB/s instead of the scientific notation GB/s also.

Edit: I get

Average time: 0.449416 ms

Bandwidth: 69.534752 GiB/s

when I run this code on my 8800 GTX.

#include <stdio.h>

#  define CUDA_SAFE_CALL( call) do {                                         \

    cudaError err = call;                                                    \

    if( cudaSuccess != err) {                                                \

        fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",        \

                __FILE__, __LINE__, cudaGetErrorString( err) );              \

    exit(EXIT_FAILURE);                                                      \

    } } while (0)

#ifdef NDEBUG

#define CUT_CHECK_ERROR(errorMessage)


 #  define CUT_CHECK_ERROR(errorMessage) do {                                 \

    cudaThreadSynchronize();                                                \

    cudaError_t err = cudaGetLastError();                                    \

    if( cudaSuccess != err) {                                                \

        fprintf(stderr, "Cuda error: %s in file '%s' in line %i : %s.\n",    \

                errorMessage, __FILE__, __LINE__, cudaGetErrorString( err) );\

        exit(EXIT_FAILURE);                                                  \

    } } while (0)


template <class T>

__global__ void cuvppMul_C1( T* g_idata, T* g_odata, const T s)


 const unsigned int idx = threadIdx.x + blockIdx.x * blockDim.x;

 g_odata[idx] = g_idata[idx] * s;


int main()


    int len = 1 << 22;

    int num_threads = 128;

    int nIters = 10000;

   float *d_idata, *d_odata;

    CUDA_SAFE_CALL( cudaMalloc((void**)&d_idata, sizeof(float)*len) );

    CUDA_SAFE_CALL( cudaMalloc((void**)&d_odata, sizeof(float)*len) );

   dim3  threads(num_threads, 1, 1);

    dim3  grid(len/num_threads, 1, 1);

    cuvppMul_C1<float><<< grid, threads >>>(d_idata, d_odata, 4.5f);

   cudaEvent_t start, end;

    CUDA_SAFE_CALL( cudaEventCreate(&start) );

    CUDA_SAFE_CALL( cudaEventCreate(&end) );

    CUDA_SAFE_CALL( cudaEventRecord(start, 0) );

    // execute the kernel

    for (int i=0; i < nIters; ++i)


        cuvppMul_C1<float><<< grid, threads >>>(d_idata, d_odata, 4.5f);


    CUDA_SAFE_CALL( cudaEventRecord(end, 0) );

    CUDA_SAFE_CALL( cudaEventSynchronize(end) );

   float runTime;

    CUDA_SAFE_CALL( cudaEventElapsedTime(&runTime, start, end) );

    runTime /= float(nIters);

    printf("Average time: %f ms\n", runTime);

    printf("Bandwidth:    %f GiB/s\n\n", (2 * len * sizeof(float)) / (runTime * 1.0e-3 * 1024*1024*1024));


You are right.

I don’t understand the underlying mechanism of CUDA, but from my point of view, why can’t we make the read/write operation parallelized-able like piple-line operations, so that the read process of one block occurs the same time with write operation of previous block that can double speed and reduce latency between read and write operation between blocks

The interleaving you describe does occur. Multiple reads can be interleaved and multiple writes can be interleaved too. It is just that the physical connection to the memory can only handle so many bits at a time, so setting the clock rate sets the maximum the 80GiB/s theoretical peak. A near peak 70GiB/s is achievable via writing only, reading only, or by reading and writing, but a physical trace on the board can only carry one bit at any given instant so you cannot read and write simultaneously.

You may see your throughput increased reading 64-bit words (int2, float2, …).


I did a couple of tests with various word sizes, but saw no benefit to larger data items,

and in fact I saw a performance drop for float4 when doing copies. I believe I heard others mention the performance drop for float4 before so I’ve avoided using it for global memory operations in all of my kernels. I timed a “copy” kernel based on the one posted earlier in this thread (eliminating the scaling operation).


Average time: 0.443144 ms

Bandwidth: 70.518830 GiB/s


Average time: 0.890649 ms

Bandwidth: 70.173532 GiB/s


Average time: 2.633065 ms

Bandwidth: 47.473201 GiB/s

My results agree with John’s. I actually doubled the performance of some very simple kernels with coalesced float4 read/writes to use textured float4 reads instead. I was quite surprised at this. Seeing the float2 results here makes me wonder if I could get even more performance by using shared memory staging in the block to turn my float4 writes into float2 writes… hmm…

I think the situation with using float4 in textures vs global mem may be a bit different. If I’m not mistaken, the texture unit probably works natively with float4 values. When working with textures, the float4 may be a big win. With the global memory copy test I ran it sure wasn’t however. I’m swamped finishing some work presently otherwise I’d want to go measure the texture performance for the various types. If someone else has a chance to compare the float4 tex ops vs global mem ops, I’d be curious to know the answer. One other interesting tidbit that occurs to me on this topic: I saw the specs given for the new 512MB G9x based 8800GTS on one of the tech sites, and it indicated that the memory system on the new card only gives around 60ish GB/sec memory bandwidth, but a texture fill rate that’s higher than any of the existing cards. With the next gen hardware, it may be that the texture unit is a much bigger win than it has been up to this point with the 8800GTX and other first gen CUDA-capable cards. I guess we’ll know more in a few more days when the official G9x based 8800GTS info is announced by NVIDIA.



Here is the full gammut of tests. I also added a write only test (which just writes 0.0 using the specified type) and a read only test that reads a value into shared memory and then adds 1.0 to it (the compiler doesn’t optimize away the read in that case)

float copies
Average time: 0.448031 ms
Bandwidth: 69.749637 GiB/s

float2 copies
Average time: 0.952708 ms
Bandwidth: 65.602443 GiB/s

float4 copies
Average time: 2.650514 ms
Bandwidth: 47.160666 GiB/s

float copies with tex1DFetch reads
Average time: 0.478583 ms
Bandwidth: 65.296948 GiB/s

float2 copies with tex1DFetch reads
Average time: 0.935270 ms
Bandwidth: 66.825600 GiB/s

float4 copies with tex1DFetch reads
Average time: 1.870630 ms
Bandwidth: 66.822419 GiB/s

float write only
Average time: 0.315825 ms
Bandwidth: 49.473655 GiB/s

float2 write only
Average time: 0.621241 ms
Bandwidth: 50.302537 GiB/s

float4 write only
Average time: 1.232398 ms
Bandwidth: 50.714140 GiB/s

float read only
Average time: 0.405090 ms
Bandwidth: 38.571712 GiB/s

float2 read only
Average time: 0.549309 ms
Bandwidth: 56.889636 GiB/s

float4 read only
Average time: 2.121910 ms
Bandwidth: 29.454591 GiB/s

float read only with tex1DFetch reads
Average time: 0.351818 ms
Bandwidth: 44.412217 GiB/s

float2 read only with tex1DFetch reads
Average time: 0.480735 ms
Bandwidth: 65.004628 GiB/s

float 4 read only with tex1DFetch reads
Average time: 0.899641 ms
Bandwidth: 69.472176 GiB/s

I find it a little strange that it only pushes 50GiB/s with the write only tests. The results mainly speak for themselves, and confirm my earlier post about gaining performance using textures for float4 reads instead of coalesced global reads. Though, the write only tests seem to indicate that I have little to gain transforming my writes from float4 to float2… float4 writes are no slower here.

Interesting, thanks for the numbers. Did you write that up as a single test kernel, or did you just re-run after changing the code each time? If you’ve got source, I’d like to play around with your version some more on a couple different boards here early next week.


Interesting, I just found this relevant discussion in the programming guide for 1.1 (I don’t recall it in 1.0… maybe I missed it). Seems we’ve been experimenting to determine documented performance values.

“Coalesced 64-bit accesses deliver a little lower bandwidth than coalesced 32-bit accesses and coalesced 128-bit accesses deliver a noticeably lower bandwidth than coalesced 32-bit accesses. But, while bandwidth for non-coalesced accesses is around an order of magnitude lower than for coalesced accesses when these accesses are 32-bit, it is only around four times lower when they are 64-bit and around two times when they are 128-bit.”

Still, the texture tests done are useful showing that a simple tex1Dfetch on 128-bit types can win the day, and the new read/write only tests are illuminating too, in the right context. Though I doubt any “real” kernel would ever be read only… maybe a really big reduction leading to 1 value written per block is essentially read only.

I’ve attached the source for the full bw test. It’s a messy conglomeration of templates and preprocessor macros, but it gets the job done. I know it will probably be one of the first programs I run on any new card, and hopefully others find it useful as memory bandwidth is so important to achieving full performance in CUDA. (1.32 KB)

I hate to dig up an old thread, but I have some new info regarding compute 1.1 parts.

Here are the results from running bw_test on my 8800 GTS 512MB (G92).

copy_gmem<float> - Bandwidth:    50.855227 GiB/s

copy_gmem<float2> - Bandwidth:    50.187194 GiB/s

copy_gmem<float4> - Bandwidth:    49.017876 GiB/s

copy_tex<float> - Bandwidth:    46.296185 GiB/s

copy_tex<float2> - Bandwidth:    49.164148 GiB/s

copy_tex<float4> - Bandwidth:    48.585336 GiB/s

write_only<float> - Bandwidth:    38.938256 GiB/s

write_only<float2> - Bandwidth:    39.408719 GiB/s

write_only<float4> - Bandwidth:    39.637897 GiB/s

read_only_gmem<float> - Bandwidth:    47.793617 GiB/s

read_only_gmem<float2> - Bandwidth:    43.624361 GiB/s

read_only_gmem<float4> - Bandwidth:    33.624261 GiB/s

read_only_tex<float> - Bandwidth:    50.290276 GiB/s

read_only_tex<float2> - Bandwidth:    54.349253 GiB/s

read_only_tex<float4> - Bandwidth:    53.733277 GiB/s

Note how the float4 coalesced reads are at full bandwidth, unlike with the 8000 GTX benchmarks above. This means that you don’t need to use the tex1Dfetch float4 read workaround if you are targeting sm11 hardware, but using the texture doesn’t hurt either.

I’m not sure i understand what you said or there’s something wrong with reported results.

The read only tests are a little weird as the device doesn’t seem capable of performing them at full speed. Or the benchmark is flawed.

My new results are in reference to comparing the copy_gmem performance between the 8800 GTX (G80) and 8800 GTS (G92). The G92 runs at a full 50 GiB/s for float, float2, and float4 while the G80 card is slow for float4.

Thank you, these performance are important to archive the best performance. However i’m still wonder, what cause the differences in the performance with different input types: hardware capability, driver or cuda API ? What is the best way to optimize a program and that still archive peak performance with the next hardware generation, new drivers or new APIs ?