I am verifying the throughput of 32-bit integer addition by inserting 1000 add.s32 instructions into the PTX file (see the attachment) on a compute capability 3.0 GPU (GTX 650 Ti, which has 4 SMX units). The CUDA Programming Guide lists the throughput as 160 32-bit integer additions per clock cycle per SMX.
The test conditions and observed results are as follows:
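For readers without the attachment, here is a minimal sketch of how a block of 1000 add.s32 lines could be generated for pasting into a PTX kernel. The register names %r1 and %r2 and the helper name make_add_block are hypothetical, and the sketch emits a single dependent chain; the instruction sequence in the attached PTX file may be laid out differently.

```python
def make_add_block(count=1000, dst="%r1", src="%r2"):
    """Return `count` PTX lines of the form 'add.s32 dst, dst, src;'.

    Register names are placeholders (assumption); they must match
    .reg .s32 declarations in the kernel the lines are pasted into.
    """
    line = f"add.s32 {dst}, {dst}, {src};"
    return "\n".join("\t" + line for _ in range(count))


ptx_block = make_add_block()
lines = ptx_block.splitlines()
print(lines[0].strip())  # first generated instruction
print(len(lines))        # number of generated instructions
```

The generated text is only the instruction body; the surrounding .entry, register declarations, and the final store that keeps the result live still have to come from the existing PTX file.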
Block size = 128 threads
Number of blocks = 391
Total number of threads = 391 * 128 = 50048 threads
Total number of additions (add.s32) performed = 50048 * 1000 = 50,048,000
CUDA Occupancy = 100%
The measured difference in execution time before and after adding the 1000 add.s32 instructions is 114 microseconds.
Number of cycles taken = 114 microseconds * 941 MHz = 107,274 cycles, where 941 MHz is the clock frequency of the GTX 650 Ti GPU.
Throughput per cycle (for the entire GPU) = (50,048 * 1000 additions) / (114 * 941 cycles) = 466 additions per cycle
Throughput per cycle for each SMX = 466 / 4 = approximately 117 additions per cycle for each SMX
So I am achieving a throughput of only 117 additions per cycle per SMX instead of the 160 specified in the CUDA Programming Guide, even with 100% occupancy.
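The arithmetic above can be reproduced with a short script. This is only a sketch of the calculation using the numbers from this post; the function name throughput_per_smx is made up for illustration, and it assumes the GPU runs at a constant 941 MHz for the whole measurement.

```python
def throughput_per_smx(threads, adds_per_thread, elapsed_us, clock_mhz, num_smx):
    """Convert a measured time difference into additions per cycle."""
    total_adds = threads * adds_per_thread
    cycles = elapsed_us * clock_mhz   # microseconds * MHz = clock cycles
    per_gpu = total_adds / cycles     # additions per cycle, whole GPU
    per_smx = per_gpu / num_smx       # additions per cycle, one SMX
    return total_adds, cycles, per_gpu, per_smx


total_adds, cycles, per_gpu, per_smx = throughput_per_smx(
    threads=391 * 128,     # 391 blocks of 128 threads
    adds_per_thread=1000,  # add.s32 instructions inserted per thread
    elapsed_us=114,        # measured time difference
    clock_mhz=941,         # clock used in this post (assumed constant)
    num_smx=4,
)
print(f"total additions: {total_adds}, cycles: {cycles}")
print(f"per GPU: {per_gpu:.1f} adds/cycle, per SMX: {per_smx:.1f} adds/cycle")
print(f"fraction of the documented 160 adds/cycle/SMX: {per_smx / 160:.0%}")
```

Without intermediate rounding the per-SMX figure comes out slightly below 117, which does not change the conclusion.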
What should be done in order to achieve a throughput of 160? Any help will be greatly appreciated. Thanks.
I am attaching the PTX file, CPP file, and Makefile for your reference.
To build and run the experiment in your setup, copy these files into the CUDA sample folder NVIDIA_CUDA-5.5_Samples\0_Simple\vectorAddDrv in your CUDA installation, and build using the copied Makefile.
Sample_code.zip (8.23 KB)