I’ve written a simple benchmarking application that compiles and runs under VS2013. It measures the execution time of PTX instructions. So far, however, I’ve only written tests for bfind, brev, and popc.
I need help! There are a lot of instructions, and a lot of tests to write.
If you’re familiar with inline PTX in C++, the job is quite easy.
Each instruction can have multiple tests that are each written as individual functions in a .h file.
A template, test_template.h, is included and contains instructions for modifying the file and writing the tests.
Once the .h file exists, simply #include it at the right spot in the .cu file.
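To give a flavor of what a test involves, here is a minimal standalone sketch. It does not follow test_template.h (use the template for real contributions). The names time_popc_b32_register and run_popc_b32_register are made up for illustration, and timing with clock64() is an assumption about one reasonable way to measure, not necessarily how the repository does it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (not from the repository): times a single popc.b32
// on a register operand using the per-SM cycle counter.
__global__ void time_popc_b32_register(unsigned int input,
                                       unsigned int *result,
                                       long long *cycles)
{
    unsigned int out;
    long long start = clock64();
    // "r" binds a 32-bit register; volatile asks the compiler not to
    // delete or move the statement while generating PTX.
    asm volatile("popc.b32 %0, %1;" : "=r"(out) : "r"(input));
    long long stop = clock64();
    *result = out;
    *cycles = stop - start;  // includes the cost of reading the counter
}

// Hypothetical host helper: runs the kernel on a single thread and
// returns the popc result; the cycle count goes to *cycles_out.
unsigned int run_popc_b32_register(unsigned int input, long long *cycles_out)
{
    unsigned int *d_result = nullptr;
    long long *d_cycles = nullptr;
    unsigned int h_result = 0;
    long long h_cycles = 0;

    cudaMalloc(&d_result, sizeof(unsigned int));
    cudaMalloc(&d_cycles, sizeof(long long));
    time_popc_b32_register<<<1, 1>>>(input, d_result, d_cycles);
    cudaMemcpy(&h_result, d_result, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_cycles, d_cycles, sizeof(long long), cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    cudaFree(d_cycles);

    if (cycles_out) *cycles_out = h_cycles;
    return h_result;
}

int main()
{
    long long cycles = 0;
    unsigned int r = run_popc_b32_register(0x00080000U, &cycles);
    printf("%06lld cycles: popc_b32_register_0x00080000U = %u\n", cycles, r);
    return 0;
}
```

Treat the cycle count from a sketch like this as rough: it includes counter overhead, and the final SASS may still be scheduled differently from the source order. A careful latency test would likely time a chain of dependent instructions instead.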
If you wish to take on an instruction, simply reply with the instruction you’ll be taking on. Then get to work writing your .h contribution, and publish the result here. I’ll add it to the repository after looking it over.
If I can get some help, we can have a list of instruction cycle times in short order. Otherwise… well, honestly I just don’t have the time to do it myself.
Here’s the GitHub repository: cwm9cwm9/CUDA_PTX_ISA_Latency_Test (benchmarks CUDA PTX instructions)
If you’re curious, here’s the first result. Two things are interesting in these results. First, for some very strange reason, performing a bfind or popc on a CONSTANT costs 18 extra cycles (24 instead of 6) if the input constant is 0x00080000U or larger. Put that same value into a register, and the time falls back to 6 cycles.
Bizarre.
Second, doing a bfind on a .u32 register takes just 6 cycles… considering this instruction is not native and is being converted to multiple SASS instructions, you might think that doing a .u64 bfind would take three, maybe six times as long, right? Nope! It can take 60 TIMES AS LONG! That’s crazy! The PTX compiler must be branching out to some pretty inefficient external code somewhere… Lesson? Roll your own .u64 bfind if you have to.
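One way to roll your own might look like the sketch below: split the 64-bit value into two 32-bit halves and run bfind.u32 on whichever half holds the highest set bit. The names bfind_u64_split, bfind_u64_kernel, and run_bfind_u64_split are hypothetical, this covers only the unsigned case, and I have not benchmarked it, so whether it actually beats the built-in .u64 version is an assumption to verify.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: 64-bit bfind built from bfind.u32.
// Returns the bit position of the most significant set bit,
// or 0xFFFFFFFF when x == 0 (same convention as PTX bfind).
__device__ unsigned int bfind_u64_split(unsigned long long x)
{
    unsigned int hi = (unsigned int)(x >> 32);
    unsigned int lo = (unsigned int)x;
    unsigned int pos_hi, pos_lo;
    asm("bfind.u32 %0, %1;" : "=r"(pos_hi) : "r"(hi));
    asm("bfind.u32 %0, %1;" : "=r"(pos_lo) : "r"(lo));
    // If the high half has any bit set, the answer lies there (offset by 32).
    // Otherwise use the low half; for x == 0 this passes through 0xFFFFFFFF.
    return (hi != 0) ? pos_hi + 32 : pos_lo;
}

__global__ void bfind_u64_kernel(unsigned long long x, unsigned int *result)
{
    *result = bfind_u64_split(x);
}

// Hypothetical host helper: evaluates bfind_u64_split on the device.
unsigned int run_bfind_u64_split(unsigned long long x)
{
    unsigned int *d_result = nullptr;
    unsigned int h_result = 0;
    cudaMalloc(&d_result, sizeof(unsigned int));
    bfind_u64_kernel<<<1, 1>>>(x, d_result);
    cudaMemcpy(&h_result, d_result, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_result);
    return h_result;
}

int main()
{
    printf("bfind_u64_split(0x0000000100000000) = %u\n",
           run_bfind_u64_split(0x0000000100000000ULL));
    printf("bfind_u64_split(0x8000000000000000) = %u\n",
           run_bfind_u64_split(0x8000000000000000ULL));
    return 0;
}
```

Both 32-bit bfinds are issued unconditionally and the result is picked with a select, which avoids a branch. Timing it with the benchmark is the only way to know how it compares.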
– selected results –
000006 cycles: bfind_u32_const_0x00000001U = 0
000024 cycles: bfind_u32_const_0x00080000U = 19
000006 cycles: bfind_s32_const_0x00000001U = 0
000024 cycles: bfind_s32_const_0x00080000U = 19
000006 cycles: bfind_u32_register_0x00000001U = 0
000364 cycles: bfind_u64_const_0x0000000000000001U = 0
000006 cycles: bfind_u64_const_0x0000000100000000U = 4294967295
000364 cycles: bfind_s64_const_0x0000000000000001U = 0
000006 cycles: bfind_s64_const_0x0000000100000000U = 4294967295
000364 cycles: bfind_u64_register_0x0000000000000001U = 0
000340 cycles: bfind_u64_register_0x0000000100000000U = 32
000364 cycles: bfind_s64_register_0x0000000080000000U = 27
000340 cycles: bfind_s64_register_0x0000000100000000U = 32
000340 cycles: bfind_s64_register_0x8000000000000000U = 63
000006 cycles: brev_b32_const_0x00000000U = 0
000024 cycles: brev_b32_const_0x00000001U = 2147483648
000006 cycles: brev_b32_register_0x00000001U = 2147483648
000378 cycles: brev_b64_const_0x0000000000000001U = 9223372036854775808
000378 cycles: brev_b64_register_0x0000000000000001U = 9223372036854775808
000006 cycles: popc_b32_const_0x00000001U = 1
000024 cycles: popc_b32_const_0x00080000U = 1
000006 cycles: popc_b32_register_0x00000001U = 1
000072 cycles: popc_b64_const_0x0000000000000001U = 1
000072 cycles: popc_b64_register_0x0000000000000001U = 1