The throughput of 32-bit integer add instructions is not reaching the theoretical maximum of 160 per SM

I am verifying the throughput of the 32-bit integer addition instruction by inserting 1000 add.s32 instructions into the PTX file (see the attachment) for a compute capability 3.0 GPU (GTX 650Ti, which has 4 SMX units). The CUDA Programming Guide lists the throughput as 160 integer additions per clock cycle per SMX.
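For reference, the shape of the measured kernel is roughly as follows. This is only an illustrative sketch with made-up names, not the attached vectorAddDrv code; the actual experiment inserts the 1000 add.s32 instructions directly into the PTX, which prevents the compiler from folding them away.

```cuda
// Rough sketch of the experiment's intent (hypothetical names, NOT the attached
// code): perform many 32-bit integer additions per thread on top of a plain
// vector add. Note that the compiler may collapse a trivial loop like this into
// a single multiply-add, which is presumably why the original experiment edits
// the generated PTX by hand instead.
__global__ void add_throughput_kernel(const int *a, const int *b, int *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int x = a[i];
        int y = b[i];
        for (int k = 0; k < 1000; ++k)   // intended to become 1000 add.s32 instructions
            x = x + y;
        c[i] = x;
    }
}
```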

Following are the test conditions and observed results:

Block size = 128 threads
Number of blocks = 391
Total number of threads = 50,048
Total number of additions (add.s32) performed = 50,048 × 1000
CUDA occupancy = 100%

The measured timing difference before and after adding the 1000 add.s32 instructions is 114 microseconds.

Number of cycles taken = 114 microseconds × 941 MHz, where 941 MHz is the clock frequency of the GTX 650Ti GPU.
Throughput per cycle (for the entire GPU) = (50,048 × 1000 additions) / (114 × 941 cycles) ≈ 466 additions per cycle
Throughput per cycle for each SMX = 466 / 4 ≈ 117 additions per cycle per SMX

So I am achieving a throughput of only about 117 additions per cycle per SMX instead of the 160 additions specified in the CUDA Programming Guide, even at 100% occupancy.

What should be done in order to achieve a throughput of 160? Any help will be greatly appreciated. Thanks.

I am attaching the PTX file, CPP file and Makefile for your reference.

To build and run the experiment in your setup, copy these files into the CUDA sample code path NVIDIA_CUDA-5.5_Samples\0_Simple\vectorAddDrv in your CUDA installation, and build using the copied Makefile.
Sample_code.zip (8.23 KB)

Hi, I tried measuring the throughput of 64-bit addition and ran into a similar issue. Hopefully someone from NVIDIA can sort this out.

After a quick look through the code, it is not clear to me how you are performing the time measurement itself. Are you recording the runtime of the entire executable, using OS-provided time functions inside the program, using CUDA Events, or calling the clock() function on the device?

@seibert, I am measuring the time using nvprof. In particular, I take only the time taken by the vecAdd_kernel function.

Try doing the timing with CUDA events.
I have found that nvprof slows down the profiled kernel a bit…
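Something along these lines, for example (a minimal runtime-API sketch; the attached sample uses the driver API, and the kernel arguments here are placeholders):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
vecAdd_kernel<<<391, 128>>>(d_A, d_B, d_C, N);   // grid/block sizes from the post; arguments are placeholders
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                      // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);          // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```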

You may want to put a long, fixed-iteration loop into the kernel to substantially increase the kernel runtime, e.g. to seconds. This ensures that the time spent outside the kernel (launch overhead, timing overhead, etc.) is negligible. With these changes in place you can use the usual CPU-side timing or CUDA events, but you could even consider using clock() within the kernel.
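A rough sketch of that idea (hypothetical names and iteration count; a plain C-level add loop like this may be collapsed by the compiler, so it only illustrates the timing structure, not the PTX-level experiment itself):

```cuda
// Stretch the kernel runtime with a long, fixed-count loop and optionally read
// the per-SM cycle counter via clock64() (the 64-bit variant of clock()).
__global__ void add_loop_kernel(int *out, int b, long long *cycles)
{
    long long t0 = clock64();
    int x = threadIdx.x;
    for (int iter = 0; iter < 1000000; ++iter)   // long, fixed iteration count
        x = x + b;                               // operation under test
    long long t1 = clock64();

    out[blockIdx.x * blockDim.x + threadIdx.x] = x;  // keep the result live
    if (threadIdx.x == 0 && blockIdx.x == 0)
        *cycles = t1 - t0;                           // cycles observed by one thread
}
```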