GT200 performance questions: is it possible to achieve IPC > 1?

I hope someone with deeper knowledge of the GT200 architecture can help me with these questions:

  • Can more than one instruction be dispatched in a given cycle? I know the Programming Guide says “the SIMT unit selects a warp that is ready to execute
    and issues the next instruction”, implying that no more than a single instruction can be dispatched. Is there any combination of two or more instruction types that can be dispatched in the same cycle? Basically, is it possible to get an IPC higher than 1 on an SP core in an ideal scenario, i.e., full utilization with many threads, full memory (and pipeline) latency hiding, etc.?

  • There are different figures for the texture cache size (6 to 8 KB in the Programming Guide, 16 KB in the CUDA 2.1 FAQ). Any idea which one is right? Also, it does not seem to have multiple banks or multiple ports, right? What is the cache line width?


Dual issue of instructions is possible on GT200; for example, you can execute a MAD and a MUL within the same four-cycle window.

This Real World Tech article on the GT200 indicates that it is possible to overlap some instructions.

Each instruction takes at least 4 cycles of the “fast” clock, but an instruction can be issued every two cycles, so if you have instructions that can use the 32-bit ALU/FPU and the FMUL/SFU, you can have two instructions executing at the same time.
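The dual-issue arithmetic can be sketched numerically. This is just a back-of-the-envelope check, assuming the commonly cited GTX 280 figures (240 SPs at a 1.296 GHz shader clock, with a MAD counted as 2 flops and a co-issued MUL as 1):

```python
# Back-of-the-envelope peak throughput for GT200 dual issue.
# Assumed figures (GTX 280): 240 SPs, 1.296 GHz shader clock.
sps = 240
shader_clock_hz = 1.296e9

# Single issue: one MAD (multiply-add = 2 flops) per SP per clock.
single_issue_flops = sps * shader_clock_hz * 2

# Dual issue: a MAD on the ALU plus a MUL on the FMUL/SFU unit,
# i.e. 3 flops per SP per clock, which matches the advertised peak.
dual_issue_flops = sps * shader_clock_hz * 3

print(f"single issue: {single_issue_flops / 1e9:.0f} GFLOP/s")  # 622 GFLOP/s
print(f"dual issue:   {dual_issue_flops / 1e9:.0f} GFLOP/s")    # 933 GFLOP/s
```

The dual-issue figure lining up with the advertised ~933 GFLOP/s peak is exactly the “MAD + MUL in the same window” effect described above.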

Thank you, Tim and bitsurge for your replies. The article was very helpful.

I’m mainly concerned about integer operations. In a benchmark, I have a tight loop executing a number of integer, logic, and comparison ops. Comparing the actual PTX instruction count of the loop (in the CUDA binary) against the execution time gives me an IPC of around 1.05 (assuming all cores are fully utilized). And I’ve been scratching my head ever since on how this can happen!
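For reference, this kind of IPC estimate can be sketched as below. All the input numbers here are made up for illustration; the real ones would come from the disassembled loop body and a measured kernel time:

```python
# Hypothetical IPC estimate for a tight loop on GT200.
# All inputs below are made-up placeholders, not measurements.
native_instructions = 100      # instructions in the loop body (from disassembly)
iterations = 100_000           # loop trip count per thread
threads = 30 * 1024            # enough threads to saturate 30 SMs
sps = 240                      # scalar cores on GT200
shader_clock_hz = 1.296e9
elapsed_s = 0.94               # measured kernel time (made up)

total_instructions = native_instructions * iterations * threads
cycles = elapsed_s * shader_clock_hz
ipc_per_sp = total_instructions / (cycles * sps)
print(f"IPC per SP: {ipc_per_sp:.2f}")  # IPC per SP: 1.05
```

With numbers chosen like these, the estimate comes out slightly above 1, which is only explainable if some second unit (e.g. the branch hardware) retires work in parallel with the ALU.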

Sounds like I got my answer. Parallel execution by the branch and ALU units would make my IPC calculation come out right…

Note first that I am not an expert in this low level GPU stuff, but I will add that PTX is not the final device code. If you really want to analyze something as low level as IPC, you should be working from the cubin (look up decuda).

Thanks, MisterAnderson42. My reference to “PTX instructions” was incorrect: I meant the GPU machine code recovered from the decuda-ed binary, which, as you suggested, is not the right term. As far as I’m aware, there seems to be a lack of a good term for GPU native instructions… I often find myself referring to both stages of code as PTX instructions, especially because every native instruction seems to have an equivalent PTX instruction (although not the other way around)…

BTW, I would appreciate it if someone could refer me to more detailed info on the texture caches. The Real World Tech article helped a bit, but not much. Why is the 24 KB partitioned into 3×8 KB? Is each partition dedicated to one SM, or can it serve three different requests initiated from threads of a “single” SM? Compared to shared memory, I see good performance with texture memory and am curious why this happens. It looks to me like either its cache line is quite wide (more than 64 bytes) or its simpler address calculation results in comparable, and sometimes better, performance…
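One way to see why a wide cache line would matter: a toy model counting how many distinct cache lines a half-warp’s reads touch at different strides. The line widths here are assumptions for illustration only; the actual GT200 texture line width is exactly the open question above:

```python
# Toy model: how many distinct cache lines do 16 threads (a half-warp)
# touch when each reads one 4-byte element at a given stride?
# The line widths tried below are assumptions, not GT200 specs.
def lines_touched(n_threads, elem_bytes, stride_elems, line_bytes):
    addrs = [t * stride_elems * elem_bytes for t in range(n_threads)]
    return len({a // line_bytes for a in addrs})

for line_bytes in (64, 256):
    for stride in (1, 4):
        n = lines_touched(16, 4, stride, line_bytes)
        print(f"line={line_bytes}B stride={stride}: {n} line(s)")
```

With 64-byte lines, a stride-4 half-warp spills across 4 lines, while a 256-byte line still covers it in one fetch; a wide line would therefore hide moderately irregular access patterns, which is consistent with the good texture performance observed above.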