I benchmarked the DP4A and VMIN4 operations on the int8_t datatype (with int32_t accumulate) using the CUTLASS library, so as to take full advantage of its tiled structures. The DP4A and VMIN4 instructions both operate on two vectors, each containing four 8-bit signed integers.
The number of SASS instructions could indeed be the cause, thanks for pointing this out, I will check that.
If the 8-bit integers are represented in one's or two's complement, taking the minimum is easy: if the most significant (sign) bits differ, the result is immediately clear; if they are equal, scan the least significant 7 bits of the two integers from high to low for the first position where they differ and return the corresponding integer. 8-bit integer multiplication is less straightforward?
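Concretely, for two's complement that bit-level comparison could look like this (a plain C++ illustration of the idea, not the code I benchmarked):

```cpp
#include <cassert>
#include <cstdint>

// minimum of two two's-complement 8-bit integers by bit inspection:
// if the sign bits differ, the operand with its sign bit set is smaller;
// if they are equal, the low 7 bits compare like unsigned magnitudes
static int8_t min_s8_bitwise (int8_t a, int8_t b)
{
    uint8_t ua = (uint8_t)a, ub = (uint8_t)b;
    if ((ua ^ ub) & 0x80) {            // sign bits differ
        return (ua & 0x80) ? a : b;    // the negative operand is the minimum
    }
    return ((ua & 0x7f) < (ub & 0x7f)) ? a : b; // same sign: low 7 bits decide
}
```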
I do not benchmark the ReferenceGemm_kernel; that code is only run to verify that the output of the CutlassSgemmNN kernel is correct (it should equal the output of the ReferenceGemm_kernel). I benchmark the CutlassSgemmNN code.
I’ve checked just now. vmin4 seems to compile to a SASS sequence (not terribly long, but it’s not a single instruction). dp4a compiles to a single instruction (on cc6.1). There would certainly be a performance difference favoring dp4a because of this, but I’m not sure it accounts for what you are showing, which is well over an order of magnitude.
I would suggest you dispense with this line of reasoning and instead ask what operations (i.e. instructions) actually get performed at the device code level, and what the throughput of those instructions is. You might think that byte-wise multiplication (dp4a) is “hard”; however, on cc6.1 there is dedicated hardware and instruction support for it. On an ops/s basis on cc6.1, it actually has substantially higher throughput than any comparable math operation (e.g. 32-bit integer multiply, 16-bit floating-point multiply, 32-bit floating-point multiply, 64-bit floating-point multiply).
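To make that concrete, here is a scalar sketch of what `__dp4a(a, b, c)` computes for packed signed bytes (illustration only; on cc6.1 the whole thing is a single SASS instruction):

```cpp
#include <cassert>
#include <cstdint>

// scalar equivalent of DP4A: 4-way dot product of packed signed bytes,
// accumulated into a 32-bit integer: c + sum over i of a.byte[i] * b.byte[i]
static int32_t dp4a_s8 (uint32_t a, uint32_t b, int32_t c)
{
    for (int i = 0; i < 4; i++) {
        int8_t ab = (int8_t)(a >> (8 * i));  // sign-extend byte i of a
        int8_t bb = (int8_t)(b >> (8 * i));  // sign-extend byte i of b
        c += (int32_t)ab * (int32_t)bb;
    }
    return c;
}
```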
Would you know of a method to correctly measure the throughput of these two instructions? I think I already came relatively close with the modified CUTLASS code? Both kernels operate on the same input matrices.
Actually I should probably amend my previous statement. When I first looked at vmin4, I was using the intrinsic. The compiled sequence didn’t seem that long to me (looking again now, it seems to be ~10 instructions). However when I switched to the inline assembly version indicated in your link, the sequence is a bit longer. I don’t know for sure, but I think it’s possible that it could give rise to an order of magnitude difference between dp4a and vmin4 performance. Again, the dp4a compiles to basically a single instruction. It looks like the inline assembly vmin4 is compiling to ~20 instructions.
In any event, without getting to a calibrated measurement, I’m pretty confident in the idea that I would expect dp4a to be faster than vmin4, based on studying the SASS.
I don’t know what modifications you made to the cutlass code, nor do I know what test case you ran, so I can’t comment further.
(my initial read was not very careful - even at 10 instructions to 1, its reasonable to conclude that there might be an order of magnitude difference)
Let me state it again: dp4a has the highest math operation throughput in ops/s of anything on a cc6.1 architecture. In a nutshell, it runs at 4x the FP32 rate (again, when viewed as ops/s): each DP4A instruction performs 4 multiplies and 4 adds, versus the 2 ops of an FP32 FMA, at a comparable issue rate. I don’t think you’re going to find any other math operation of any type that will be faster.
Thanks! I analyzed the SASS output for different SM architectures (let’s have a look at SM_30 and SM_61). ONLY for SM_30 does the code compile to a single VMNMX instruction (very efficient, I suppose); for later architectures (such as SM_50 and SM_61) it compiles to a long sequence of IMNMX instructions. Is this because hardware with CC >= 5.0 dropped support for the VMNMX device instruction? If that were the case, it would be impossible to run SM_30 VMIN4 code on SM_50 and later, right?
CUDA (it is very simple code):
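A minimal kernel along these lines (sketch; this assumes the vmin4.s32.s32.s32.add inline-PTX form from the thread linked earlier, the exact listing may have differed):

```cuda
#include <cstdint>

// byte-wise signed min of a[i] and b[i]; the .add variant sums the four
// byte results into the accumulator operand (here initialized to 0)
__global__ void vmin4_kernel (const uint32_t *a, const uint32_t *b,
                              uint32_t *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        uint32_t t = 0;
        asm ("vmin4.s32.s32.s32.add %0,%1,%2,%0;"
             : "+r"(t) : "r"(a[i]), "r"(b[i]));
        out[i] = t;
    }
}
```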
SM_30 output (! pay attention to the single VMNMX instruction):
SM_61 output (! much longer sequence of instructions for VMIN4):
Is this because the compiler doesn’t optimize VMIN4 correctly for SM_61?
For the first question, to be a bit more general, I would say that various SIMD video PTX instructions may compile to a single SASS instruction on some architectures, whereas they compile to instruction sequences on others. This is documented here (specifically, the line “All other SIMD video instructions” covers the case discussed here; the value of 32 listed for the Kepler compute capability indicates that there is a single SASS instruction for it, as well as a functional unit in the SM designated to handle that instruction).
For the second question, compute_30 (i.e. PTX) code could be run on a future architecture but sm_30 (i.e. SASS) code cannot. compute_30 PTX may contain a single SIMD video opcode, but when JIT-compiled to SASS for another architecture, it will translate into a SASS sequence, not a single instruction.
No, the difference between the compiler emitting a single instruction in the sm_30 case and an instruction sequence in the sm_61 case is for the reason already covered above: there is a difference in hardware support between these two architectures.
If you are interested in the vmin4 case, you might want to experiment with the intrinsic, rather than using inline assembly. According to my testing, the intrinsics generate less SASS code. I’m not sure why that would be the case. It’s also possible that I may have made some mistake in my observations.
The __vmins4 intrinsic generates less SASS code because it does not perform the addition, I guess? To get the same operation done as the vmin4.s32.s32.s32.add PTX, one needs to perform __vmins4 followed by something that sums the four resulting 8-bit integers into a single 32-bit integer. I think I will test performance using those two setups (the current vmin4.s32.s32.s32.add PTX setup versus the SIMD intrinsics).
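In scalar form, the combined operation would amount to something like this (my sketch of what vmin4.s32.s32.s32.add computes, just to make the comparison explicit):

```cpp
#include <cassert>
#include <cstdint>

// what vmin4.s32.s32.s32.add computes: byte-wise signed min of a and b,
// with the four (sign-extended) byte results summed into the accumulator c
static int32_t vmin4_add_s8 (uint32_t a, uint32_t b, int32_t c)
{
    for (int i = 0; i < 4; i++) {
        int8_t ab = (int8_t)(a >> (8 * i));  // sign-extend byte i of a
        int8_t bb = (int8_t)(b >> (8 * i));  // sign-extend byte i of b
        c += (ab < bb) ? ab : bb;            // signed per-byte minimum
    }
    return c;
}
```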
Back with news! I updated my VMIN4 code to use the __vmins4 intrinsic and ran the benchmark again, and I got a 2x performance improvement for every matrix size on a GTX 1060!! Nice!!
See the lines of code here:
This does, however, mean that the vmin4.s32.s32.s32.add PTX code is not compiled to device code efficiently. See here the difference with using intrinsics:
From memory (I created the intrinsics, ca. 2013): The PTX instructions typically support more functionality than the intrinsics. The intrinsics should always be preferred over inline PTX, unless that extra functionality is needed.
On sm_30, VMNMX is the instruction underlying the PTX instructions vmin and vmax. The hardware has just one instruction which distinguishes the two variants via a predicate (TRUE/FALSE), I think. The same approach is used for IMNMX.
Only the folks at NVIDIA know for sure what their hardware supports. As I recall, after sm_30 pretty much all of these byte-wise SIMD instructions were ripped out except for sum-of-absolute-differences. IIRC, on sm_30, the SIMD instructions had 1/4 the throughput of regular integer instructions, so the simple ones among them could be emulated almost as fast as using the native implementation. Since almost no software used them, they became an unnecessary burden on the hardware. GPU architecture takes a quite RISC-y approach to building processors.
The implementation of most of the SIMD intrinsics is very bit-twiddly. For __vmins4() it would be something similar to the code below. Before the SIMD intrinsics became part of CUDA, they were posted by NVIDIA as a header file implementation under a BSD license; maybe you can still find that file on the internet. Obviously the implementations could have changed since then, but it would give you an idea of how this stuff works under the hood.
#define UINT32_H4 0x80808080U // byte-wise sign bits (MSBs)
#define UINT32_L4 0x01010101U // byte-wise LSBs
/* extract sign bits and convert them into a boolean, byte-wise */
static __host__ __device__ uint32_t sign_to_bool4 (uint32_t a)
{
return (a & UINT32_H4) >> 7;
}
/* extend sign bits into mask, byte-wise */
static __host__ __device__ uint32_t sign_to_mask4 (uint32_t a)
{
#if (__CUDA_ARCH__ >= 200)
asm ("prmt.b32 %0,%0,0,0xba98;" : "+r"(a)); // convert MSBs to masks
#else
a = a & UINT32_H4; // isolate sign bits
a = a + a - (a >> 7); // extend them to full byte to create mask
#endif
return a;
}
/* extend boolean into mask, byte-wise */
static __host__ __device__ uint32_t bool_to_mask4 (uint32_t a)
{
return (a << 8) - a;
}
static __host__ __device__ uint32_t vhaddu4 (uint32_t a, uint32_t b)
{
/* Peter L. Montgomery's observation (newsgroup comp.arch, 2000/02/11,
https://groups.google.com/d/msg/comp.arch/gXFuGZtZKag/_5yrz2zDbe4J):
(A+B)/2 = (A AND B) + (A XOR B)/2.
*/
return (a & b) + (((a ^ b) >> 1) & ~UINT32_H4);
}
static __host__ __device__ uint32_t vsetles4 (uint32_t a, uint32_t b)
{
#if (__CUDA_ARCH__ >= 300) && (__CUDA_ARCH__ < 500)
uint32_t t = 0;
asm ("vset4.s32.s32.le %0,%1,%2,%0;" : "+r"(t) : "r"(a), "r"(b));
return t;
#else
return sign_to_bool4 (vhaddu4 (a, ~b) ^ (a ^ ~b));
#endif
}
static __host__ __device__ uint32_t vcmples4 (uint32_t a, uint32_t b)
{
#if (__CUDA_ARCH__ >= 300) && (__CUDA_ARCH__ < 500)
return bool_to_mask4 (vsetles4 (a, b));
#else
return sign_to_mask4 (vhaddu4 (a, ~b) ^ (a ^ ~b));
#endif
}
static __host__ __device__ uint32_t vmins4 (uint32_t a, uint32_t b)
{
#if (__CUDA_ARCH__ >= 300) && (__CUDA_ARCH__ < 500)
uint32_t t = 0;
asm ("vmin4.s32.s32.s32 %0,%1,%2,%0;" : "+r"(t) : "r"(a), "r"(b));
return t;
#else
uint32_t t = vcmples4 (a, b);
return (a & t) | (b & ~t);
#endif
}
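A trivial scalar reference (a hypothetical helper of mine, not part of the original header file) is handy for sanity-checking any of the emulation paths above:

```cpp
#include <cassert>
#include <cstdint>

/* scalar reference: byte-wise signed minimum, packed back into one word */
static uint32_t vmins4_ref (uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 4; i++) {
        int8_t ab = (int8_t)(a >> (8 * i));  /* sign-extend byte i of a */
        int8_t bb = (int8_t)(b >> (8 * i));  /* sign-extend byte i of b */
        r |= (uint32_t)(uint8_t)((ab < bb) ? ab : bb) << (8 * i);
    }
    return r;
}
```

Comparing vmins4_ref(a, b) against vmins4(a, b) over random inputs should give bit-identical results on both the inline-PTX path and the bit-twiddled fallback path.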