The throughput of 32-bit integer add instructions is not reaching the theoretical maximum of 160 per SM

I am verifying the throughput of the 32-bit integer addition instruction by inserting 1000 add.s32 instructions into the PTX file (see the attachment) for a compute capability 3.0 GPU (GTX 650Ti, which has 4 SMX units). The CUDA Programming Guide lists the throughput as 160 integer additions per clock cycle per SMX.
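For reference, the shape of the measured kernel is roughly as follows. This is only an illustrative sketch with made-up names, not the attached vectorAddDrv code; the actual experiment inserts the 1000 add.s32 instructions directly into the PTX, which prevents the compiler from folding them away.

```cuda
// Rough sketch of the experiment's intent (hypothetical names, NOT the attached
// code): perform many 32-bit integer additions per thread on top of a plain
// vector add. Note that the compiler may collapse a trivial loop like this into
// a single multiply-add, which is presumably why the original experiment edits
// the generated PTX by hand instead.
__global__ void add_throughput_kernel(const int *a, const int *b, int *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int x = a[i];
        int y = b[i];
        for (int k = 0; k < 1000; ++k)   // intended to become 1000 add.s32 instructions
            x = x + y;
        c[i] = x;
    }
}
```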

Following are the test conditions and observed results:

Block size = 128 threads
Number of blocks = 391
Total number of threads = 50,048
Total number of additions (add.s32) performed = 50,048 × 1000
CUDA occupancy = 100%

The measured timing difference before and after adding the 1000 add.s32 instructions is 114 microseconds.

Number of cycles taken = 114 microseconds × 941 MHz, where 941 MHz is the clock frequency of the GTX 650Ti GPU.
Throughput per cycle (for the entire GPU) = (50,048 × 1000 additions) / (114 × 941 cycles) ≈ 466 additions per cycle
Throughput per cycle for each SMX = 466 / 4 ≈ 117 additions per cycle per SMX

So I am achieving a throughput of only about 117 additions per cycle per SMX instead of the 160 additions specified in the CUDA Programming Guide, even at 100% occupancy.

What should be done in order to achieve a throughput of 160? Any help will be greatly appreciated. Thanks.

I am attaching the PTX file, CPP file and Makefile for your reference.

To build and run the experiment in your setup, copy these files into the CUDA sample code path NVIDIA_CUDA-5.5_Samples\0_Simple\vectorAddDrv in your CUDA installation, and build using the copied Makefile.
Sample_code.zip (8.23 KB)

Hi, I tried measuring the throughput of 64-bit addition and ran into a similar issue. Hopefully someone from NVIDIA can sort this out.

After a quick look through the code, it is not clear to me how you are performing the time measurement itself. Are you recording the runtime of the entire executable, using OS-provided time functions inside the program, using CUDA Events, or calling the clock() function on the device?

@seibert, I am measuring the time using nvprof. In particular, I take only the time taken by the vecAdd_kernel function.

Try doing the timing with CUDA events.
I have found that nvprof slows down the profiled kernel a bit…
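Something along these lines, for example (a minimal runtime-API sketch; the attached sample uses the driver API, and the kernel arguments here are placeholders):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
vecAdd_kernel<<<391, 128>>>(d_A, d_B, d_C, N);   // grid/block sizes from the post; arguments are placeholders
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                      // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);          // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```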

You may want to put a long, fixed-iteration loop into the kernel to substantially increase the kernel runtime, e.g. to seconds. This ensures that the time spent outside the kernel (launch overhead, timing overhead, etc.) is negligible. With these changes in place you can use the usual CPU-side timing or CUDA events, but you could even consider using clock() within the kernel.
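A rough sketch of that idea (hypothetical names and iteration count; a plain C-level add loop like this may be collapsed by the compiler, so it only illustrates the timing structure, not the PTX-level experiment itself):

```cuda
// Stretch the kernel runtime with a long, fixed-count loop and optionally read
// the per-SM cycle counter via clock64() (the 64-bit variant of clock()).
__global__ void add_loop_kernel(int *out, int b, long long *cycles)
{
    long long t0 = clock64();
    int x = threadIdx.x;
    for (int iter = 0; iter < 1000000; ++iter)   // long, fixed iteration count
        x = x + b;                               // operation under test
    long long t1 = clock64();

    out[blockIdx.x * blockDim.x + threadIdx.x] = x;  // keep the result live
    if (threadIdx.x == 0 && blockIdx.x == 0)
        *cycles = t1 - t0;                           // cycles observed by one thread
}
```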