GT200 performance questions: is it possible to achieve IPC > 1?

I hope someone with deeper knowledge of the GT200 architecture can help me with these questions:

  • Can more than one instruction be dispatched in a single cycle? I know the Programming Guide says “the SIMT unit selects a warp that is ready to execute and issues the next instruction”, which implies that no more than a single instruction can be dispatched. Is there any combination of two or more instruction types that can be dispatched in the same cycle? Basically, is it possible to get an IPC higher than 1 on an SP core in an ideal scenario, i.e., full utilization with many threads, full memory (and pipeline) latency hiding, etc.?

  • There are conflicting references to the texture cache size (6 to 8 KB in the Programming Guide, 16 KB in the CUDA 2.1 FAQ). Any idea which one is right? Also, it does not seem to have multiple banks or multiple ports, right? What is the cache line width?

thanks

Dual issue of instructions is possible on GT200; for example, you can execute a MAD and a MUL within the same four-cycle window.

This Real World Tech article on the GT200 indicates that it is possible to overlap some instructions.

http://www.realworldtech.com/page.cfm?Arti…8195242&p=9

Each instruction takes at least 4 cycles of the “fast” clock, but an instruction can be issued every two cycles, so if you have instructions that can use both the 32-bit ALU/FPU and the FMUL/SFU, you can have two instructions executing at the same time.
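A minimal sketch of the kind of instruction mix that could dual-issue (a hypothetical kernel; whether the compiler actually emits a MAD plus an independent MUL, and whether the scheduler pairs them, is not guaranteed):

    // Hypothetical kernel: each iteration has a multiply-add (a MAD
    // candidate for the SP/ALU pipe) and an independent multiply (a MUL
    // candidate for the SFU's FMUL pipe), giving the scheduler a chance
    // to keep both units busy at once.
    __global__ void dual_issue_sketch(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float a = in[i];
        float b = 1.5f;

        for (int k = 0; k < 256; ++k) {
            a = a * b + 0.25f;   // multiply-add: SP/ALU pipe
            b = b * 1.0001f;     // independent multiply: FMUL/SFU pipe
        }
        out[i] = a + b;
    }

Whether the two actually pair off is up to the hardware scheduler.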

Thank you, Tim and bitsurge, for your replies. The article was very helpful.

I’m mainly concerned about integer operations. In a benchmark, I have a tight loop executing a number of integer arithmetic, logic, and comparison operations. Comparing the actual instruction count of the loop (in the CUDA binary) against the execution time gives me an IPC of around 1.05 (assuming all cores are fully utilized), and I’ve been scratching my head ever since over how this can happen!

Sounds like I got my answer. Parallel execution by the branch and ALU units would make my IPC calculation come out right…
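For reference, here is the back-of-the-envelope calculation I was doing, written out (the function and its inputs are placeholders, and it assumes every SP is busy for the whole run):

    // Rough IPC-per-SP estimate; all inputs are hypothetical.
    double ipc_per_sp(double instr_per_thread, double total_threads,
                      double num_sps, double shader_clock_hz,
                      double elapsed_seconds)
    {
        double total_instructions = instr_per_thread * total_threads;
        double total_sp_cycles    = num_sps * shader_clock_hz * elapsed_seconds;
        return total_instructions / total_sp_cycles;
    }

An answer above 1 only makes sense if some instructions retire on a unit other than the SP ALU, which is exactly the dual-issue explanation.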

Note first that I am not an expert in this low-level GPU stuff, but I will add that PTX is not the final device code. If you really want to analyze something as low level as IPC, you should be working from the cubin (look up decuda).
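If it helps, the workflow is roughly the following (the exact decuda invocation may differ depending on the version you grab):

    nvcc -cubin -o kernel.cubin kernel.cu    # compile to a cubin
    decuda kernel.cubin                      # disassemble the native GT200 code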

Thanks, MisterAnderson42. My reference to “PTX instructions” was incorrect; I meant the count of GPU machine code retrieved from the decuda-ed binary, which, as you suggested, is not the right term. As far as I’m aware, there seems to be a lack of a good term for GPU-native instructions… I often find myself referring to both stages of code as PTX instructions, especially because every native instruction seems to have an equivalent PTX instruction (although not the other way around)…

BTW, I’d appreciate it if someone could point me to more detailed info on the texture caches. The Real World Tech article helped a bit, but not much. Why is the 24 KB partitioned into 3×8 KB? Is each partition dedicated to one SM, or can it serve three different requests initiated from threads of a single SM? Compared to shared memory, I see good performance with texture memory and am curious why. It looks to me as though either its cache line is quite wide (more than 64 bytes) or its simpler address calculation results in comparable, and sometimes better, performance…
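To illustrate the kind of access I’m comparing, here is a sketch using the 1D texture-fetch path (hypothetical names; the linear buffer must be bound to the texture reference on the host first):

    // Texture reference for a linear buffer of floats.
    texture<float, 1, cudaReadModeElementType> texRef;

    // Gather through the texture cache; unlike a plain global-memory
    // gather, the usual coalescing constraints do not apply.
    __global__ void gather_tex(float *out, const int *idx, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = tex1Dfetch(texRef, idx[i]);
    }

    // Host side (sketch):
    // cudaBindTexture(0, texRef, d_data, n * sizeof(float));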