throughput of integer add

Hi everyone,

I am testing some of the integer operations on a GTX580 card (Fermi, sm_20), and found that the 32-bit integer add has only half of the throughput provided in CUDA C Programming Guide version 4.0 (p. 98), which is 16 op/(clockSM), instead of 32 op/(clockSM). Could anybody confirm or explain this for me? Thanks!

Regards
Benetion

BTW. I believe what I found is consistent with the test shown on this page: http://www.beyond3d.com/content/reviews/55/11,
that is, the scalar integer operations have only half of the throughput of single-precision flops. Does anybody know why this is? Thanks.

No one cares? …

It should be 32 ops / cc / SM.

Are you actually measuring a multiply-add instruction ? Because that differs from floating point in that it’s actually only 16 ops / CC / SM (floating point can do that in one CC i.e. fused multiply-add).

Could you elaborate the “should” a little bit? Thanks.

I am measuring add only, which can be seen from corresponding ptx:

source: a = a + a; …

ptx: add.s32 …

I have tested mad as well, which is 32 op/cc/sm (counting mad as 2 op according to industrial standards).

Ok, so your testing shows that both add and mad give you 16 op/cc/sm ? That’s very odd, could you share your code ?

By “should” I meant that the documentation says so.

Are you launching with a single warp?

Sorry, I was counting one mad as one op; if it is counted as two, then it is 32 op/cc/sm as documented in the cuda c programming guide.

The code is adopted from Simon Green posted:

http://forums.nvidia.com/index.php?showtopic=45752&st=0&p=250179&#entry250179

I am using <<<32,512>>> (100% occupancy). I have also tried <<<128,64>>>, which gives the same throughput (i.e., 16 add.s32/cc/sm).

Thanks, this is a good question.

Peak numbers from NVIDIA are given based on mad, where each mad is considered 2 ops. This is pretty standard throughout the industry.

I understand that; it was a mistake to count it as one (now corrected in my previous post). I have no question on the mad performance. My only question here is on the performance of integer add.

I was just using mad as the “benchmark” of my code, which is consistent with the cuda c programming guide.

Please, could anybody confirm or explain to me that why the throughput of integer add is only half of what is listed in cuda c programming guide (v. 4.0, p.98 Table 5-1, sm_20). What I found is only 16 add.s32/cc/sm (on GTX580). Thanks!

Looks like you are right, and the programming guide is wrong. These are my results on a GTX 470 (compute capability 2.0):

complexity    float (GFLOPS/s)    int (GFLOPS/s)

logical operation    1               -                 509

add                  1             507                 272

multiply             1             507                 270

multiply-add         2             961                 541

And on a GTX 460 (compute capability 2.1):

complexity    float (GFLOPS/s)    int (GFLOPS/s)

logical operation    1               -                 312

add                  1             312                 312

multiply             1             312                 156

multiply-add         2             534                 312

This is the code I used, modified from the code Simon Green posted. You can modify the operation (add, fmad) on line 64.

Disclaimer: I only tried this quickly, so please verify yourself!

/*

 * Copyright 1993-2007 NVIDIA Corporation.  All rights reserved.

 *

 * NOTICE TO USER:

 *

 * This source code is subject to NVIDIA ownership rights under U.S. and

 * international Copyright laws.  Users and possessors of this source code

 * are hereby granted a nonexclusive, royalty-free license to use this code

 * in individual and commercial software.

 *

 * NVIDIA MAKES NO REPRESENTATION ABOUT THE SUITABILITY OF THIS SOURCE

 * CODE FOR ANY PURPOSE.  IT IS PROVIDED "AS IS" WITHOUT EXPRESS OR

 * IMPLIED WARRANTY OF ANY KIND.  NVIDIA DISCLAIMS ALL WARRANTIES WITH

 * REGARD TO THIS SOURCE CODE, INCLUDING ALL IMPLIED WARRANTIES OF

 * MERCHANTABILITY, NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.

 * IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL,

 * OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS

 * OF USE, DATA OR PROFITS,  WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE

 * OR OTHER TORTIOUS ACTION,  ARISING OUT OF OR IN CONNECTION WITH THE USE

 * OR PERFORMANCE OF THIS SOURCE CODE.

 *

 * U.S. Government End Users.   This source code is a "commercial item" as

 * that term is defined at  48 C.F.R. 2.101 (OCT 1995), consisting  of

 * "commercial computer  software"  and "commercial computer software

 * documentation" as such terms are  used in 48 C.F.R. 12.212 (SEPT 1995)

 * and is provided to the U.S. Government only as a commercial end item.

 * Consistent with 48 C.F.R.12.212 and 48 C.F.R. 227.7202-1 through

 * 227.7202-4 (JUNE 1995), all U.S. Government End Users acquire the

 * source code with only those rights set forth herein.

 *

 * Any use of this source code in individual and commercial software must

 * include, in the user documentation and internal comments to the code,

 * the above Disclaimer and U.S. Government End Users Notice.

 */

/*

    This sample is intended to measure the peak computation rate of the GPU in GFLOPs

    (giga floating point operations per second).

It executes a large number of multiply-add operations, writing the results to

    shared memory. The loop is unrolled for maximum performance.

Depending on the compiler and hardware it might not take advantage of all the

    computational resources of the GPU, so treat the results produced by this code

    with some caution.

*/

#include <stdlib.h>

#include <stdio.h>

#include <string.h>

#include <math.h>

#include <typeinfo>

#include <cutil.h>

#define NUM_SMS (14*8)

#define NUM_THREADS_PER_SM (1024)

#define NUM_THREADS_PER_BLOCK (512)

#define NUM_BLOCKS ((NUM_THREADS_PER_SM / NUM_THREADS_PER_BLOCK) * NUM_SMS)

#define NUM_ITERATIONS 10

//#define OPERATION(a,b) (b & a)

//#define OPERATION(a,b) (b + a)

//#define OPERATION(a,b) (b * a)

#define OPERATION(a,b) (b * a + b)

#define COMLEXITY     2 //=1 for ADD, =2 for FMAD

// 128 MAD instructions

#define FUNCTION(a, b) \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

      a = OPERATION(a,b); \

      b = OPERATION(b,a); \

template <typename T>

__global__ void gflops()

{

	__shared__ T result[NUM_THREADS_PER_BLOCK];

	result[threadIdx.x] = threadIdx.x;

	

    T a = result[threadIdx.x];  // this ensures the mads don't get compiled out

    T b = (T)7;

	

	#pragma unroll 2

    for (int i = 0; i < NUM_ITERATIONS; i++)

    {

        FUNCTION(a, b);

        FUNCTION(a, b);

        FUNCTION(a, b);

        FUNCTION(a, b);

        FUNCTION(a, b);

        FUNCTION(a, b);

        FUNCTION(a, b);

        FUNCTION(a, b);

        FUNCTION(a, b);

        FUNCTION(a, b);

        FUNCTION(a, b);

        FUNCTION(a, b);

        FUNCTION(a, b);

        FUNCTION(a, b);

        FUNCTION(a, b);

        FUNCTION(a, b);

    }

    result[threadIdx.x] = a + b;

}

template <typename T>

void Run(void)

{

    // warmup

    gflops<T><<<NUM_BLOCKS, NUM_THREADS_PER_BLOCK>>>();

    CUDA_SAFE_CALL( cudaThreadSynchronize() );

// execute kernel

    unsigned int timer = 0;

    CUT_SAFE_CALL( cutCreateTimer( &timer));

    CUT_SAFE_CALL( cutStartTimer( timer));

gflops<T><<<NUM_BLOCKS, NUM_THREADS_PER_BLOCK>>>();

CUDA_SAFE_CALL( cudaThreadSynchronize() );

    CUT_SAFE_CALL( cutStopTimer( timer));

    float time = cutGetTimerValue( timer);

// output results

    printf( "Time: %f (ms)\n", time);

    const double flops = (double)128 * COMLEXITY * 16 * NUM_ITERATIONS * NUM_BLOCKS * NUM_THREADS_PER_BLOCK;

    printf("Gflops <%s>: %f\n\n", typeid(T).name(), (flops / (time / 1000.0f)) / 1e9 );

CUT_SAFE_CALL( cutDeleteTimer( timer));

}

int main(int argc, char** argv) 

{

	CUT_DEVICE_INIT(argc, argv);

	//make sure kernel has plenty of shared memory

	cudaFuncSetCacheConfig(gflops<int>,   cudaFuncCachePreferShared);

	cudaFuncSetCacheConfig(gflops<float>, cudaFuncCachePreferShared);

	

	Run<float>();

	Run<int>();

		

	CUT_EXIT(argc, argv);

}

Beautiful. Thanks, Gert-Jan.

If you replace & (bitwise) by &&(logical), the throughput is also only half of the peak int mad throuput.

Today I looked into the problem more closely, and I think (pretty sure) I found the answer to why the integer-add performance is only half.

After inspecting the GPU assembly (using cuobjdumb -sass), I noticed that the floating point additions get translated to “FADD.FTZ R3, R3, R4”, but the integer additions get translated to “VADD.ACC R2, R2, R3, R3”. I’m not sure what this instruction exactly is, but I guess it means something like “vector add accumulate”, with four operands. So most likely, this instruction will do two additions in stead of only one. The number of “FADD” and “VADD” instructions in each kernel is the same, resulting in half the performance of the integer addition kernel.

Now we know the problem, it’s time to fix it. As shown in my results a couple of posts back, performance was as expected (by the programming guide) on my GTX 460 machine. This machine still uses CUDA 3.2, and is 32-bit, but copying the binary to the GTX-470 machine (CUDA 4.0 and 64-bit) showed that this binary gives the full (not half) integer-add performance also on the GTX-470 machine. So the problem could well be in nvcc-4.0.

Installing the CUDA toolkit 3.2 (with nvcc-3.2) on the GTX-470 machine did indeed solve the problem, and I now have full integer-add performance back! However, I need to specify the target architecture (e.g. sm_20) to get full integer-add performance, if I specify it wrongly, (like sm_11), I get half performance. So I guess something (also) goes wrong in the just-in-time compilation by the driver, but I have not figured that out yet.

I don’t know if this ‘bug’ (strange compiler behavior) has already been reported to NVIDIA, but I guess they read their own forums.

New results on GTX-470 machine:

complexity    float (GFLOPS/s)    int (GFLOPS/s)

logical operation    1               -                 515

add                  1             520                 508

multiply             1             508                 271

multiply-add         2             958                 541

Excellent.

What do you think of the && operation? Thanks.

The && operation gets translated to two instructions for integers: “ICMP.EQ R2, RZ, 0x1, R4” and “LOP.AND R3, R2, R3”. For floats the && operation gets translated to three instructions: “I2F.F32.S32 R3, R3;”, “FCMP.EQ.FTZ R3, RZ, c [0x10] [0x0], R3” and “LOP.AND R4, R3, R4”. So performance for integer-&& is only half compared to integer-add, only one third for float, see also results below:

complexity    float (GFLOPS/s)    int (GFLOPS/s)

add                  1             520                 508

&&                   1             172                 259

The funny thing is, I could not compile the example code with the &&-operator using nvcc-3.2, it gave me this: “nvopencc INTERNAL ERROR ### ran out of registers in predicate”. Luckily NVIDIA fixed this in nvcc-4.0.

VADD is a three-input addition instruction. From your analysis it appears that with the CUDA 4.0 toolchain this instruction is used in situations where its use lowers rather than improves performance. Since you appear to have a working repro case, it would be helpful if you could file a bug against the compiler, attaching the repro code and your analysis including the JIT observations. Thank you for your help.

I just send in the bug report and mentioned this topic, let’s see what happens.