Crowdsourcing request: help me time the PTX ISA.

I’ve written a simple benchmarking application that can be compiled and run under VS2013. It tests the execution time of PTX instructions. However, I’ve only written tests for bfind, brev, and popc.

I need help! There are a lot of instructions, and a lot of tests to write.

If you’re familiar with inline PTX in C++, the job is quite easy.

Each instruction can have multiple tests that are each written as individual functions in a .h file.

A template, test_template.h, is included and contains instructions for modifying the file and writing the tests.

Once the .h file exists, simply #include it at the right spot in the .cu file.
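To give a feel for the work, here’s a rough sketch of the shape a test function takes (hypothetical names and conventions — test_template.h in the repo has the real ones): read %clock, run the instruction under test via inline PTX, read %clock again.

// Rough sketch only (hypothetical names) -- follow test_template.h for the
// repo's actual conventions. Times one bfind.u32 between two %clock reads.
__device__ unsigned int test_bfind_u32_register(unsigned int input, unsigned int *result)
{
    unsigned int start, stop, r;
    asm volatile("mov.u32 %0, %%clock;" : "=r"(start) :: "memory");
    asm volatile("bfind.u32 %0, %1;" : "=r"(r) : "r"(input));
    asm volatile("mov.u32 %0, %%clock;" : "=r"(stop) :: "memory");
    *result = r;         // store the result so the bfind can't be optimized away
    return stop - start; // elapsed cycles
}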

If you wish to take on an instruction, simply reply with the instruction you’ll be taking on. Then get to work writing your .h contribution, and publish the result here. I’ll add it to the repository after looking it over.

If I can get some help, we can have a list of instruction cycle times in short order. Otherwise… well, honestly I just don’t have the time to do it myself.

Here’s the GitHub repository: GitHub - cwm9cwm9/CUDA_PTX_ISA_Latency_Test: Benchmarks CUDA PTX instructions

If you’re curious, here’s the first result. There are two interesting things in these results. First, for some very strange reason, performing a bfind or popc on a CONSTANT costs 18 extra cycles if the input constant is 0x00080000U or larger. Put that same value into a register, and the time falls back to 6 cycles.

Bizarre.

Second, doing a bfind on a .u32 register takes just 6 cycles… considering this instruction is not native and is being converted to multiple SASS instructions, you might think that doing a .u64 bfind would take three, maybe six, times as long, right? Nope! It can take 60 TIMES AS LONG! That’s crazy! The PTX compiler must be branching out to some pretty inefficient external code somewhere… Lesson? Roll your own .u64 bfind if you have to.
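If you do end up rolling your own, here’s an untested sketch of the obvious construction (the function name is mine, not from the repo): split the value and use two 32-bit bfinds.

// Untested sketch: a 64-bit bfind built from two bfind.u32 instructions.
// bfind returns the position of the most significant set bit, or 0xffffffff
// when the input is zero.
__device__ unsigned int my_bfind_u64(unsigned long long x)
{
    unsigned int hi = (unsigned int)(x >> 32);
    unsigned int lo = (unsigned int)(x & 0xffffffffU);
    unsigned int phi, plo;
    asm("bfind.u32 %0, %1;" : "=r"(phi) : "r"(hi));
    asm("bfind.u32 %0, %1;" : "=r"(plo) : "r"(lo));
    // The high word wins if it has any set bit; otherwise fall back to the
    // low word, which yields 0xffffffff for x == 0, matching bfind.u64.
    return (phi != 0xffffffffU) ? phi + 32 : plo;
}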

– selected results –

000006 cycles: bfind_u32_const_0x00000001U = 0
000024 cycles: bfind_u32_const_0x00080000U = 19

000006 cycles: bfind_s32_const_0x00000001U = 0
000024 cycles: bfind_s32_const_0x00080000U = 19

000006 cycles: bfind_u32_register_0x00000001U = 0

000364 cycles: bfind_u64_const_0x0000000000000001U = 0
000006 cycles: bfind_u64_const_0x0000000100000000U = 4294967295

000364 cycles: bfind_s64_const_0x0000000000000001U = 0
000006 cycles: bfind_s64_const_0x0000000100000000U = 4294967295

000364 cycles: bfind_u64_register_0x0000000000000001U = 0
000340 cycles: bfind_u64_register_0x0000000100000000U = 32

000364 cycles: bfind_s64_register_0x0000000080000000U = 27
000340 cycles: bfind_s64_register_0x0000000100000000U = 32

000340 cycles: bfind_s64_register_0x8000000000000000U = 63

000006 cycles: brev_b32_const_0x00000000U = 0
000024 cycles: brev_b32_const_0x00000001U = 2147483648

000006 cycles: brev_b32_register_0x00000001U = 2147483648

000378 cycles: brev_b64_const_0x0000000000000001U = 9223372036854775808

000378 cycles: brev_b64_register_0x0000000000000001U = 9223372036854775808

000006 cycles: popc_b32_const_0x00000001U = 1
000024 cycles: popc_b32_const_0x00080000U = 1

000006 cycles: popc_b32_register_0x00000001U = 1

000072 cycles: popc_b64_const_0x0000000000000001U = 1

000072 cycles: popc_b64_register_0x0000000000000001U = 1

Keep in mind that PTX is a virtual ISA that also does double duty as a compiler intermediate format. The PTX code is compiled to machine code (also known as SASS) by the ptxas component of the CUDA compiler. ptxas is an optimizing compiler, not just an assembler as the name may suggest.

You would want to examine all code at the SASS, rather than the PTX, level. If you do that, you will find that numerous PTX instructions do not map 1:1 to machine instructions, but are in fact emulated, sometimes by short in-line sequences, sometimes by called subroutines that can be fairly lengthy (try 64-bit signed division as an example). Since the GPUs are 32-bit processors, all 64-bit integer operations are emulated, with the exception of conversions between 64-bit integers and floating-point data types, which are handled by the double precision unit.
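(If you haven’t already: the generated machine code can be inspected with the cuobjdump tool that ships with the CUDA toolkit, e.g. cuobjdump --dump-sass my_app on the compiled executable, or with nvdisasm on a cubin.)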

Because there are so many emulated PTX instructions, usually only those that the compiler generates are well tuned. In addition, the specification of PTX instructions sometimes requires more elaborate emulation than one might naively expect.

If you code at the PTX level, you may therefore occasionally encounter slow emulations. If performance of some emulated instructions is important for your use case, and you see room for improvement in the emulation, I would suggest filing enhancement requests. You can do that via the bug reporting form, by prefixing the subject line with “RFE:”.

If you’re interested in this kind of thing on the Maxwell arch, my assembler, maxas, makes measuring this much simpler.

--:-:-:-:1      CS2R clock1, SR_CLOCKLO;
--:-:1:-:2      FLO.U32 result, a;
01:-:-:-:6      CS2R clock2, SR_CLOCKLO;
--:-:-:-:1      IADD clock1, clock2, -clock1;
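(For anyone who doesn’t read maxas control notation: if I have the fields right, the prefix on each line is wait-barrier mask : read barrier : write barrier : yield flag : stall count. Here the FLO sets write barrier 1, and the 01 mask on the second CS2R waits on that barrier before sampling the clock.)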

So with the above you can easily see the latency of FLO (find leading one) is about 12 clocks in low occupancy conditions. For higher occupancy it might take as much as 26 clocks. That variable latency is the reason you need to use the hardware barrier resources to synchronize the result.

Another thing you can do is put a bunch of them in a big loop and measure the throughput. In this case it’s a quarter that of the CUDA cores.
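In CUDA C terms, the loop idea looks roughly like this (a sketch with my own names, and with the usual caveat that ptxas may rearrange things on you):

// Rough throughput sketch (hypothetical names; ptxas may rearrange this).
// Four independent bfind chains hide the ~12-clock latency, so the loop
// should be throughput-bound rather than latency-bound.
__global__ void flo_throughput(unsigned int *out, unsigned int seed)
{
    unsigned int a = seed, b = seed + 1, c = seed + 2, d = seed + 3;
    unsigned int start = (unsigned int)clock();
    #pragma unroll
    for (int i = 0; i < 256; ++i) {
        asm volatile("bfind.u32 %0, %0;" : "+r"(a));
        asm volatile("bfind.u32 %0, %0;" : "+r"(b));
        asm volatile("bfind.u32 %0, %0;" : "+r"(c));
        asm volatile("bfind.u32 %0, %0;" : "+r"(d));
    }
    unsigned int stop = (unsigned int)clock();
    out[0] = stop - start;   // clocks per bfind ~ (stop - start) / 1024
    out[1] = a ^ b ^ c ^ d;  // keep results live so nothing is optimized away
}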

Lastly, you can stack a few more between the clock calls and measure the instruction queue depth. You’ll need to remove the sync flags for this. It takes about 40 FLO instructions before more clocks transpire than the number of intervening FLO instructions. With a latency of 12 and a throughput of 1/4, about 8 of those instructions will have completed by then (the first finishes at clock 12, then one more every 4 clocks: 1 + (40 - 12)/4 = 8). That leaves 40 - 8 = 32 still in flight, so the queue depth is probably around 32.

Most programs aren’t going to be using this instruction to such an extreme, and typically the latency of it can be completely hidden by other instructions and/or warps. But this kind of information can still be pretty useful if you’re manually scheduling your instructions (or writing your own scheduler).

I realize that PTX is not the lowest level available, but it’s what I use to inline in C++, and it’s well documented by NVIDIA. I’m guessing the hardwired instruction set will change as GPU generations pass, while the goal is to keep PTX stable.

Some additional things to be aware of when “timing” PTX like this:

  1. Results may change based on which device you compile for.
  2. Results may change based on which CUDA version you use.
  3. Results may change based on the driver version and the GPU used, in the case of JIT compilation.
  4. Since ptxas is an optimizing compiler, a sequence of PTX instructions may be optimized differently than the SASS representation of a single PTX instruction measured with your “one-off” method would suggest. Inferring the behavior of a sequence of PTX instructions from individual PTX measurements therefore carries some inaccuracy.

So although PTX, as the intermediate level, does indeed have some “stability” to it, the actual measurement can ultimately only be done via SASS, and as a result the variability creeps back in through the back door.

Probably you are aware of all this. Just over-communicating.

I would think C or C++ would be a much more productive language than PTX, so it’s interesting to hear that folks are making extensive use of PTX as a primary source language.

It seems like you could automate the task of header file creation using e.g. a perl script. Creating the description file to feed into the script would still take some work. It might even be possible to partially automate that by clever processing/parsing of section 8.7 of the PTX ISA document.

Actually, I’ve been told that Pascal will be largely binary compatible with Maxwell. It’s Volta where the arch will change more significantly.

As for programming in PTX, that’s what I first tried doing. The language is extremely awkward to use in practice, and the SASS that comes out of it often looks nothing like what went in. So if anything, it’s more frustrating than working in CUDA C. I now just use PTX for inline functionality that’s missing from the C API (or too much of a pain to find). The PTX ISA is extremely well documented and fairly easy to use in a limited way.

However, some kernels may run frequently and be computationally dense enough to warrant spending the time to hand-assemble them. And that’s even if you know you’ll have to redo it each time a new arch comes out. Having access to low-level hardware details can also help you write better high-level code. It would be nice if NVIDIA provided more support for this (and I’m talking about: “Here’s our ISA, use at your own risk. Official support will mainly be offered on the higher-level APIs”).

For hand-tuning of code, or doing things the compiler can’t/won’t, certainly SASS makes sense. (And I would argue that PTX does not, for all the reasons discussed here.) SASS coding is a valuable approach for a library that will get used a lot. But probably we wouldn’t even be having this discussion except for maxas. And I agree that SASS is poorly documented, if you can even apply the word “documentation” to it.

For Kepler-architecture GPUs, such as SM35/37, you are welcome to try AsKepler, an online compiler. Details at this link: Kepler Assembler - CUDA Programming and Performance - NVIDIA Developer Forums

Hey, thanks for the mention of maxas. I’ve been wondering if there was anything out there equivalent to AsKepler (or also a perl program on GitHub called KeplerAs).
Is Maxwell SASS much different from Kepler SASS in its binary form, by the way?
And out of curiosity, do I need to figure out the instruction latencies in either form of SASS code and provide that information myself so that the extra control info in the binary can be generated, or do the assemblers do that automatically? I understand that the hardware dispatcher depends on this information to be sure it doesn’t issue an instruction before all its operands are ready.