Tesla K40 questions

if a thread runs approx. 1500 PTX instructions, how long would it take a K40 to go from 0 to max 512 bits? I have calculated it, but I want someone to show their math so I can tell whether I am right or wrong.

-Tim

Tim, can you please clarify the question? I am not clear on the relationship between an instruction count of 1500 PTX instructions per thread and "0 to max 512 bit". What do you mean by going from 0 to max 512 bit?

Run a thread that contains 1500 instructions a total of max-512-bit times. For example, take a 512-bit buffer that starts at 0 and ++ the buffer until it is all FF’s (the max 512-bit value).

-Tim

Sorry, this description still isn’t clear at all. Is the question “How many 512-bit integer additions can the K40 perform per second?”

As for counting PTX instructions, it is pretty meaningless. PTX is a virtual assembly language which serves as the intermediate representation in the CUDA compiler and is compiled by the compiler’s PTXAS component into machine code (SASS). Compiler optimizations could eliminate a given instruction, and even if a PTX instruction is retained it may turn into one SASS instruction, tens of SASS instructions, or even a subroutine call. Many PTX instructions are emulated, as there is no direct machine code equivalent. For example, 64-bit signed integer division, which is a single PTX instruction, is a subroutine comprising about 70 instructions at the SASS level (if I recall correctly).

You can inspect the machine code by running cuobjdump --dump-sass on the executable produced by nvcc. Execution speed will not be determined just by SASS instruction count, but also by instruction type, among other things. Different types of instructions have different throughputs, and the relative throughputs can (and do) differ by GPU architecture. So an integer add might have higher throughput than an integer shift or an integer multiply.
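For example, a trivial kernel like the sketch below (file and kernel names are just placeholders) can be compiled and disassembled to see how a single source-level operation maps to SASS:

    // sketch.cu -- minimal example for inspecting the generated machine code
    // build:        nvcc -arch=sm_35 -cubin -o sketch.cubin sketch.cu
    // disassemble:  cuobjdump --dump-sass sketch.cubin
    __global__ void add64_kernel (unsigned long long *c,
                                  const unsigned long long *a,
                                  const unsigned long long *b)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // one 64-bit add at source level becomes a pair of 32-bit
        // add / add-with-carry instructions at the SASS level
        c[i] = a[i] + b[i];
    }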

The CUDA C Programming Guide lists throughput numbers for various basic instruction types, as far as I recall. However, these are maximum throughputs, and real-life instruction throughput may be reduced by factors such as decode bottlenecks, execution pipe choices during op-steering, register bank conflicts etc.

Lastly, you would never want to derive any performance metrics by running just a single thread on a GPU. GPUs are designed as throughput machines in which thousands of threads run concurrently, with zero-overhead context switching between threads, and in this way covering basic latencies (such as operand and instruction fetch).

I would contend that for many real-life use cases, instruction counting is a very poor predictor of app-level performance, and it is best to measure the performance using the actual code.
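If you do want to time GPU code, CUDA events are usually more convenient than host-side timers, since they measure elapsed time on the GPU itself. A minimal sketch (the kernel here is just a placeholder doing arbitrary work):

    // minimal sketch of timing a kernel with CUDA events; the kernel is a placeholder
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void busy_kernel (unsigned int *out)
    {
        unsigned int x = threadIdx.x;
        for (int i = 0; i < 1000; i++) x = x * 3u + 7u;   // arbitrary busy work
        out[threadIdx.x] = x;
    }

    int main (void)
    {
        unsigned int *d_out;
        cudaMalloc ((void **)&d_out, 256 * sizeof (unsigned int));
        cudaEvent_t start, stop;
        cudaEventCreate (&start);
        cudaEventCreate (&stop);
        cudaEventRecord (start, 0);
        busy_kernel<<<1, 256>>> (d_out);
        cudaEventRecord (stop, 0);
        cudaEventSynchronize (stop);
        float ms = 0.0f;
        cudaEventElapsedTime (&ms, start, stop);   // elapsed GPU time in milliseconds
        printf ("kernel time: %f ms\n", ms);
        cudaEventDestroy (start);
        cudaEventDestroy (stop);
        cudaFree (d_out);
        return 0;
    }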

I am attempting to brute-force some encryption. I have narrowed the code down to about 1500 PTX instructions, composed of shifts and adds. I understand what you are saying, though. I need to run the code in one thread, use the Windows high-precision timer, and test its performance.

-Tim

Tim,

If you make the following assumptions:

  • Assume the GPU runs at 1 GHz.
  • Assume you launch 1 thread that adds 1 to a 512-bit unsigned integer (custom class) until the 512-bit integer overflows:
      uint512_t i = 1;
      while (i != 0) ++i;   // 2^512 - 1 increments before the value wraps back to 0
  • Assume that the add can be done every cycle (which is not possible even for 32-bit integers, as there is a dependency between successive instructions).

The rough estimate would be 2^512 adds / (2^30 adds per second) = 2^482 seconds.

I would describe this as taking infinite time. Given the above assumptions, an integer width of 46-47 bits would take about a full day (one day is 86,400 seconds ≈ 2^16.4 seconds, and 2^30 adds/sec × 2^16.4 sec ≈ 2^46.4 adds).
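For concreteness, the increment in the loop above could be implemented limb by limb; a minimal sketch, assuming uint512_t is simply sixteen 32-bit words:

    // minimal sketch: 512-bit increment over sixteen 32-bit limbs (illustrative only)
    struct uint512_t
    {
        unsigned int limb[16];   // limb[0] is the least significant word
    };

    __host__ __device__ bool increment (uint512_t &x)
    {
        for (int i = 0; i < 16; i++) {
            if (++x.limb[i] != 0)   // no carry out of this limb; the value is still nonzero
                return true;
        }
        return false;               // carried out of the top limb: wrapped around to zero
    }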

The statement about the number of PTX instructions is still unclear.

I am assuming here that 512-bit integer adds are coded at PTX level, not via C macros as in this answer of mine on Stackoverflow:

http://stackoverflow.com/questions/12448549/is-inline-ptx-assembly-code-powerful
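As an illustration of the add-with-carry chain involved, here is a minimal sketch of the 128-bit case along the lines of the linked answer (extending the chain to sixteen limbs gives the 512-bit version):

    // minimal sketch: 128-bit add (res = a + b) via a chain of 32-bit
    // add-with-carry PTX instructions; a 512-bit add extends the chain to 16 limbs
    __device__ uint4 add_uint128 (uint4 a, uint4 b)
    {
        uint4 res;
        asm ("add.cc.u32  %0, %4,  %8;\n\t"   // least significant limb, sets the carry flag
             "addc.cc.u32 %1, %5,  %9;\n\t"   // propagate the carry
             "addc.cc.u32 %2, %6, %10;\n\t"
             "addc.u32    %3, %7, %11;\n\t"   // most significant limb, carry-out dropped
             : "=r"(res.x), "=r"(res.y), "=r"(res.z), "=r"(res.w)
             : "r"(a.x), "r"(a.y), "r"(a.z), "r"(a.w),
               "r"(b.x), "r"(b.y), "r"(b.z), "r"(b.w));
        return res;
    }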

Then 512-bit integer add throughput is dominated by the throughput of 32-bit add-with-carry instructions. I am not entirely sure that add-with-carry has the same throughput as simple integer add on Kepler. Assuming it does, the following speed-of-light computation should apply:

With 2880 CUDA cores running at 745 MHz (the K40 base clock), the K40 has a throughput of 715.2e9 32-bit integer adds per second. A 512-bit addition requires sixteen 32-bit addition instructions, so you would be doing at most 44.7e9 (~2^35) 512-bit additions per second, or 3.86e15 (~2^52) additions per day.

As Greg points out, at that rate counting to 2^512 - 1 would take much longer than the earth will continue to exist.