Tesla K40 questions

if a thread runs approx. 1500 PTX instructions, how long would it take a K40 to go from 0 to max 512 bits? I have calculated it, but I want someone to show their math so I can tell whether I am right or wrong.

-Tim

Tim, can you please clarify the question? I am not clear on the relationship between an instruction count of 1500 PTX instructions per thread and "0 to max 512 bit". What do you mean by going from 0 to max 512 bit?

Run a thread that contains 1500 instructions a total of max-512-bit times. For example, take a 512-bit buffer that starts at 0 and ++ the buffer until it is all FF’s (the max 512-bit value).

-Tim

Sorry, this description still isn’t clear at all. Is the question “How many 512-bit integer additions can the K40 perform per second?”

As for counting PTX instructions, it is pretty meaningless. PTX is a virtual assembly language which serves as the intermediate representation in the CUDA compiler and is compiled by the compiler’s PTXAS component into machine code (SASS). Compiler optimizations could eliminate a given instruction, and even if a PTX instruction is retained it may turn into one SASS instruction, tens of SASS instructions, or even a subroutine call. Many PTX instructions are emulated, as there is no direct machine code equivalent. For example, 64-bit signed integer division, which is a single PTX instruction, is a subroutine comprising about 70 instructions at the SASS level (if I recall correctly).

You can inspect the machine code by running cuobjdump --dump-sass on the executable produced by nvcc. Execution speed will not be determined just by SASS instruction count, but also by instruction type, among other things. Different types of instructions have different throughputs, and the relative throughputs can (and do) differ by GPU architecture. So an integer add might have higher throughput than an integer shift or an integer multiply.
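For example, a trivial kernel like the sketch below (file and kernel names are just placeholders) can be compiled and disassembled to see how a single source-level operation maps to SASS:

    // sketch.cu -- minimal example for inspecting the generated machine code
    // build:        nvcc -arch=sm_35 -cubin -o sketch.cubin sketch.cu
    // disassemble:  cuobjdump --dump-sass sketch.cubin
    __global__ void add64_kernel (unsigned long long *c,
                                  const unsigned long long *a,
                                  const unsigned long long *b)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // one 64-bit add at source level becomes a pair of 32-bit
        // add / add-with-carry instructions at the SASS level
        c[i] = a[i] + b[i];
    }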

The CUDA C Programming Guide lists throughput numbers for various basic instruction types, as far as I recall. However, these are maximum throughputs, and real-life instruction throughput may be reduced by factors such as decode bottlenecks, execution pipe choices during op-steering, register bank conflicts etc.

Lastly, you would never want to derive any performance metrics by running just a single thread on a GPU. GPUs are designed as throughput machines in which thousands of threads run concurrently, with zero-overhead context switching between threads, and in this way covering basic latencies (such as operand and instruction fetch).

I would contend that for many real-life use cases, instruction counting is a very poor predictor of app-level performance, and it is best to measure the performance using the actual code.
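If you do want to time GPU code, CUDA events are usually more convenient than host-side timers, since they measure elapsed time on the GPU itself. A minimal sketch (the kernel here is just a placeholder doing arbitrary work):

    // minimal sketch of timing a kernel with CUDA events; the kernel is a placeholder
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void busy_kernel (unsigned int *out)
    {
        unsigned int x = threadIdx.x;
        for (int i = 0; i < 1000; i++) x = x * 3u + 7u;   // arbitrary busy work
        out[threadIdx.x] = x;
    }

    int main (void)
    {
        unsigned int *d_out;
        cudaMalloc ((void **)&d_out, 256 * sizeof (unsigned int));
        cudaEvent_t start, stop;
        cudaEventCreate (&start);
        cudaEventCreate (&stop);
        cudaEventRecord (start, 0);
        busy_kernel<<<1, 256>>> (d_out);
        cudaEventRecord (stop, 0);
        cudaEventSynchronize (stop);
        float ms = 0.0f;
        cudaEventElapsedTime (&ms, start, stop);   // elapsed GPU time in milliseconds
        printf ("kernel time: %f ms\n", ms);
        cudaEventDestroy (start);
        cudaEventDestroy (stop);
        cudaFree (d_out);
        return 0;
    }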

I am attempting to brute-force some encryption. I have narrowed the code down to about 1500 PTX instructions, composed of shifts and adds. I understand what you are saying, though. I need to run the code in one thread, use the Windows high-precision timer, and test its performance.

-Tim

Tim,

If you make the following assumptions:

  • Assume the GPU runs at 1 GHz.
  • Assume you launch 1 thread that adds 1 to a 512-bit unsigned integer (custom class) until the 512-bit integer overflows:
      uint512_t i = 1;
      while (i != 0) ++i;   // 2^512 - 1 increments before the value wraps back to 0
  • Assume that the add can be done every cycle (which is not possible even for 32-bit integers, as there is a dependency between successive instructions).

The rough estimate would be 2^512 adds / (2^30 adds per second) = 2^482 seconds.

I would describe this as taking infinite time. Given the above assumptions, an integer width of 46-47 bits would take about a full day (one day is 86,400 seconds ≈ 2^16.4 seconds, and 2^30 adds/sec × 2^16.4 sec ≈ 2^46.4 adds).
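For concreteness, the increment in the loop above could be implemented limb by limb; a minimal sketch, assuming uint512_t is simply sixteen 32-bit words:

    // minimal sketch: 512-bit increment over sixteen 32-bit limbs (illustrative only)
    struct uint512_t
    {
        unsigned int limb[16];   // limb[0] is the least significant word
    };

    __host__ __device__ bool increment (uint512_t &x)
    {
        for (int i = 0; i < 16; i++) {
            if (++x.limb[i] != 0)   // no carry out of this limb; the value is still nonzero
                return true;
        }
        return false;               // carried out of the top limb: wrapped around to zero
    }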

The statement about the number of PTX instructions is still unclear.

I am assuming here that 512-bit integer adds are coded at PTX level, not via C macros as in this answer of mine on Stackoverflow:

http://stackoverflow.com/questions/12448549/is-inline-ptx-assembly-code-powerful
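As an illustration of the add-with-carry chain involved, here is a minimal sketch of the 128-bit case along the lines of the linked answer (extending the chain to sixteen limbs gives the 512-bit version):

    // minimal sketch: 128-bit add (res = a + b) via a chain of 32-bit
    // add-with-carry PTX instructions; a 512-bit add extends the chain to 16 limbs
    __device__ uint4 add_uint128 (uint4 a, uint4 b)
    {
        uint4 res;
        asm ("add.cc.u32  %0, %4,  %8;\n\t"   // least significant limb, sets the carry flag
             "addc.cc.u32 %1, %5,  %9;\n\t"   // propagate the carry
             "addc.cc.u32 %2, %6, %10;\n\t"
             "addc.u32    %3, %7, %11;\n\t"   // most significant limb, carry-out dropped
             : "=r"(res.x), "=r"(res.y), "=r"(res.z), "=r"(res.w)
             : "r"(a.x), "r"(a.y), "r"(a.z), "r"(a.w),
               "r"(b.x), "r"(b.y), "r"(b.z), "r"(b.w));
        return res;
    }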

Then 512-bit integer add throughput is dominated by the throughput of 32-bit add-with-carry instructions. I am not entirely sure that add-with-carry has the same throughput as simple integer add on Kepler. Assuming it does, the following speed-of-light computation should apply:

With 2880 CUDA cores running at 745 MHz (the K40 base clock), the K40 has a throughput of 715.2e9 32-bit integer adds per second. A 512-bit addition requires sixteen 32-bit addition instructions, so you would be doing at most 44.7e9 (~2^35) 512-bit additions per second, or 3.86e15 (~2^52) additions per day.

As Greg points out, at that rate counting to 2^512 - 1 would take much longer than the earth will continue to exist.