I am testing some of the integer operations on a GTX 580 card (Fermi, sm_20) and found that the 32-bit integer add has only half of the throughput given in the CUDA C Programming Guide version 4.0 (p. 98): 16 ops/clock/SM instead of 32 ops/clock/SM. Could anybody confirm or explain this for me? Thanks!
Regards
Benetion
BTW, I believe what I found is consistent with the test shown on this page: Beyond3D - NVIDIA Fermi GPU and Architecture Analysis; that is, scalar integer operations have only half the throughput of single-precision flops. Does anybody know why this is? Thanks.
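To put numbers on it (assuming a stock GTX 580 with 16 SMs and a 1544 MHz shader clock; adjust if your card is clocked differently): 32 ops/clock/SM would be a peak of 32 × 16 × 1.544 GHz ≈ 790 Gop/s for integer add, while the 16 ops/clock/SM I measure corresponds to about 395 Gop/s.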
Are you actually measuring a multiply-add instruction? That differs from floating point: integer multiply-add runs at only 16 ops/clock/SM, whereas floating point can do it in one clock as a fused multiply-add.
I understand that; it was a mistake to count it as one operation (now corrected in my previous post). I have no question about the mad performance. My only question here is about the performance of integer add.
I was just using mad as the “benchmark” of my code, which is consistent with the CUDA C Programming Guide.
Please, could anybody confirm or explain why the throughput of integer add is only half of what is listed in the CUDA C Programming Guide (v4.0, p. 98, Table 5-1, sm_20)? What I measure is only 16 add.s32/clock/SM (on a GTX 580). Thanks!
This is the code I used, modified from the code Simon Green posted. You can change the operation being measured (add, mad, etc.) by editing the OPERATION macro.
Disclaimer: I only tried this quickly, so please verify yourself!
/*
* Copyright 1993-2007 NVIDIA Corporation. All rights reserved.
*
* NOTICE TO USER:
*
* This source code is subject to NVIDIA ownership rights under U.S. and
* international Copyright laws. Users and possessors of this source code
* are hereby granted a nonexclusive, royalty-free license to use this code
* in individual and commercial software.
*
* NVIDIA MAKES NO REPRESENTATION ABOUT THE SUITABILITY OF THIS SOURCE
* CODE FOR ANY PURPOSE. IT IS PROVIDED "AS IS" WITHOUT EXPRESS OR
* IMPLIED WARRANTY OF ANY KIND. NVIDIA DISCLAIMS ALL WARRANTIES WITH
* REGARD TO THIS SOURCE CODE, INCLUDING ALL IMPLIED WARRANTIES OF
* MERCHANTABILITY, NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.
* IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL,
* OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS
* OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE
* OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE
* OR PERFORMANCE OF THIS SOURCE CODE.
*
* U.S. Government End Users. This source code is a "commercial item" as
* that term is defined at 48 C.F.R. 2.101 (OCT 1995), consisting of
* "commercial computer software" and "commercial computer software
* documentation" as such terms are used in 48 C.F.R. 12.212 (SEPT 1995)
* and is provided to the U.S. Government only as a commercial end item.
* Consistent with 48 C.F.R.12.212 and 48 C.F.R. 227.7202-1 through
* 227.7202-4 (JUNE 1995), all U.S. Government End Users acquire the
* source code with only those rights set forth herein.
*
* Any use of this source code in individual and commercial software must
* include, in the user documentation and internal comments to the code,
* the above Disclaimer and U.S. Government End Users Notice.
*/
/*
This sample is intended to measure the peak computation rate of the GPU in GFLOPs
(giga floating point operations per second).
It executes a large number of multiply-add operations, writing the results to
shared memory. The loop is unrolled for maximum performance.
Depending on the compiler and hardware it might not take advantage of all the
computational resources of the GPU, so treat the results produced by this code
with some caution.
*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <typeinfo>
#include <cutil.h>
#define NUM_SMS (14*8)
#define NUM_THREADS_PER_SM (1024)
#define NUM_THREADS_PER_BLOCK (512)
#define NUM_BLOCKS ((NUM_THREADS_PER_SM / NUM_THREADS_PER_BLOCK) * NUM_SMS)
#define NUM_ITERATIONS 10
//#define OPERATION(a,b) (b & a)
//#define OPERATION(a,b) (b + a)
//#define OPERATION(a,b) (b * a)
#define OPERATION(a,b) (b * a + b)
#define COMPLEXITY 2 // ops counted per OPERATION: 1 for add, 2 for mad
// 128 MAD instructions
#define FUNCTION(a, b) \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a); \
a = OPERATION(a,b); \
b = OPERATION(b,a);
template <typename T>
__global__ void gflops()
{
__shared__ T result[NUM_THREADS_PER_BLOCK];
result[threadIdx.x] = threadIdx.x;
T a = result[threadIdx.x]; // this ensures the mads don't get compiled out
T b = (T)7;
#pragma unroll 2
for (int i = 0; i < NUM_ITERATIONS; i++)
{
FUNCTION(a, b);
FUNCTION(a, b);
FUNCTION(a, b);
FUNCTION(a, b);
FUNCTION(a, b);
FUNCTION(a, b);
FUNCTION(a, b);
FUNCTION(a, b);
FUNCTION(a, b);
FUNCTION(a, b);
FUNCTION(a, b);
FUNCTION(a, b);
FUNCTION(a, b);
FUNCTION(a, b);
FUNCTION(a, b);
FUNCTION(a, b);
}
result[threadIdx.x] = a + b;
}
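// Note: each thread executes NUM_ITERATIONS * 16 * 128 = 20480 OPERATIONs;
// the flops formula in Run() below multiplies this by COMPLEXITY (ops per
// OPERATION) and by the total number of threads.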
template <typename T>
void Run(void)
{
// warmup
gflops<T><<<NUM_BLOCKS, NUM_THREADS_PER_BLOCK>>>();
CUDA_SAFE_CALL( cudaThreadSynchronize() );
// execute kernel
unsigned int timer = 0;
CUT_SAFE_CALL( cutCreateTimer( &timer));
CUT_SAFE_CALL( cutStartTimer( timer));
gflops<T><<<NUM_BLOCKS, NUM_THREADS_PER_BLOCK>>>();
CUDA_SAFE_CALL( cudaThreadSynchronize() );
CUT_SAFE_CALL( cutStopTimer( timer));
float time = cutGetTimerValue( timer);
// output results
printf( "Time: %f (ms)\n", time);
const double flops = (double)128 * COMPLEXITY * 16 * NUM_ITERATIONS * NUM_BLOCKS * NUM_THREADS_PER_BLOCK;
printf("Gflops <%s>: %f\n\n", typeid(T).name(), (flops / (time / 1000.0f)) / 1e9 );
CUT_SAFE_CALL( cutDeleteTimer( timer));
}
int main(int argc, char** argv)
{
CUT_DEVICE_INIT(argc, argv);
//make sure kernel has plenty of shared memory
cudaFuncSetCacheConfig(gflops<int>, cudaFuncCachePreferShared);
cudaFuncSetCacheConfig(gflops<float>, cudaFuncCachePreferShared);
Run<float>();
Run<int>();
CUT_EXIT(argc, argv);
}
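In case anyone wants to try it: I built it roughly like this on 64-bit Linux (the SDK paths and the cutil library name are specific to my setup, since cutil comes from the GPU Computing SDK rather than the toolkit, so adjust as needed):
nvcc -O2 -arch=sm_20 -I$HOME/NVIDIA_GPU_Computing_SDK/C/common/inc -L$HOME/NVIDIA_GPU_Computing_SDK/C/lib gflops.cu -lcutil_x86_64 -o gflops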
Today I looked into the problem more closely, and I am now fairly sure I found the answer to why the integer-add performance is only half.
After inspecting the GPU assembly (using cuobjdump -sass), I noticed that the floating-point additions are translated to “FADD.FTZ R3, R3, R4”, but the integer additions are translated to “VADD.ACC R2, R2, R3, R3”. I am not sure what this instruction does exactly, but I guess it means something like “vector add accumulate”, with four operands, so most likely it can do two additions instead of only one. The number of FADD and VADD instructions in each kernel is the same, which leaves the integer-addition kernel at half the throughput.
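If you want to check the SASS yourself, something along these lines works (the file names are just an example):
nvcc -arch=sm_20 -cubin gflops.cu -o gflops.cubin
cuobjdump -sass gflops.cubin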
Now that we know the problem, it is time to fix it. As shown in my results a couple of posts back, performance was as expected (by the programming guide) on my GTX 460 machine. That machine still uses CUDA 3.2 and is 32-bit, but copying the binary to the GTX 470 machine (CUDA 4.0 and 64-bit) showed that this binary gives the full (not half) integer-add performance on the GTX 470 machine as well. So the problem could well be in nvcc-4.0.
Installing the CUDA toolkit 3.2 (with nvcc-3.2) on the GTX 470 machine did indeed solve the problem, and I now have full integer-add performance back! However, I need to specify the target architecture (e.g. sm_20) to get it; if I specify it incorrectly (like sm_11), I get half performance. So I guess something (also) goes wrong in the just-in-time compilation by the driver, but I have not figured that out yet.
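To make that concrete, with nvcc-3.2 the difference is between something like these two invocations (the second embeds PTX that the driver then JIT-compiles for the Fermi card):
nvcc -arch=sm_20 gflops.cu ... (full integer-add throughput on the GTX 470)
nvcc -arch=sm_11 gflops.cu ... (driver JIT-compiles from PTX: half throughput)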
I don’t know if this ‘bug’ (strange compiler behavior) has already been reported to NVIDIA, but I guess they read their own forums.
The && operation is translated to two instructions for integers: “ICMP.EQ R2, RZ, 0x1, R4” and “LOP.AND R3, R2, R3”. For floats, && is translated to three instructions: “I2F.F32.S32 R3, R3”, “FCMP.EQ.FTZ R3, RZ, c [0x10] [0x0], R3”, and “LOP.AND R4, R3, R4”. So integer-&& throughput is only half that of integer add, and float-&& is only one third; see also the results below.
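For reference, this is the variant I mean, as one more candidate line next to the other OPERATION defines in the code above:
//#define OPERATION(a,b) (b && a) // logical AND: 2 SASS instructions per op for int, 3 for float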
The funny thing is, I could not compile the example code with the && operator using nvcc-3.2; it gave me “nvopencc INTERNAL ERROR ### ran out of registers in predicate”. Luckily NVIDIA fixed this in nvcc-4.0.
VADD is a three-input addition instruction (a single VADD computes d = a + b + c, which would otherwise take two IADDs). From your analysis it appears that with the CUDA 4.0 toolchain this instruction is used in situations where its use lowers rather than improves performance. Since you appear to have a working repro case, it would be helpful if you could file a bug against the compiler, attaching the repro code and your analysis, including the JIT observations. Thank you for your help.