Is there anything like DPP instructions other than shfl.sync?

AMD has DPP instructions which can do cross-lane shuffling in one instruction.
I know the CUDA PTX ISA has shfl.sync, but from my experiments shfl.sync is pretty slow.
I wonder if there is a faster choice than shfl.sync.

The SHFL.IDX instruction of NVIDIA GPUs can do cross-lane shuffling in one instruction. It can be conveniently accessed via CUDA intrinsics; there is no need to drop down to the PTX level.
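For illustration, here is a minimal sketch of using the __shfl_sync() intrinsic (the kernel and names are my own, not from the documentation), which the compiler lowers to a single SHFL.IDX instruction:

```
// Minimal sketch: a full warp reversal via __shfl_sync(), lowered by the
// compiler to a single SHFL.IDX instruction.
__global__ void warp_reverse(const float *in, float *out)
{
    int lane = threadIdx.x & 31;               // lane index within the warp
    float v = in[threadIdx.x];
    // 0xffffffff: all 32 lanes of the warp participate
    float r = __shfl_sync(0xffffffffu, v, 31 - lane);
    out[threadIdx.x] = r;
}
```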

Per the documentation, the throughput of SHFL (of any flavor) is 32 per clock cycle per multiprocessor, which is 1/2 or 1/4 of the throughput of simple FP32 arithmetic instructions, depending on GPU architecture (some architectures have a throughput of 64 for FP32 arithmetic, others a throughput of 128).

That seems pretty fast to me. What’s the throughput of AMD’s DPP instructions compared to that of simple FP32 arithmetic instructions on AMD GPUs?

[Later:]

I have now taken notice of the footnote in the documentation that says SHFL throughput for compute capability 7.5 (Turing) is only 16 per cycle per multiprocessor. Bummer. If you are currently using a GPU with CC 7.5 (sm_75), you may want to try one with a different architecture.


Some interesting DPP info here, about halfway down the page:

“Spending additional instruction to move data (even with the swizzle and permutes) is unnecessary—now, most of the vector instructions can do cross-lane reading at full throughput.”


Thanks for the pointer. I am starting to get the impression that AMD and NVIDIA follow different design philosophies, with NVIDIA putting emphasis on simplicity of the compute units and then just plunking down as many units as the power envelope will allow. So far NVIDIA’s approach seems to work well for performance/watt improvements from generation to generation.

Yes, it is good to see the diversity. While I very much appreciate the CUDA ecosystem and its comprehensive documentation, it’s refreshing to see that AMD provides comprehensive instruction set manuals.

My impression is that we’ll see pigs fly before NVIDIA does the same. NVIDIA’s model of software developer interaction seems to be along the lines of Apple’s: Provide many high-level high-performance frameworks for programmers to use, but hide hardware details. The financial pay-off appears to validate that approach.

To what end, though? I’m guessing that the effort required to maintain the docs as each generational change occurs is not deemed worthwhile, given the numbers that would actually use them. That said, they have to document for internal consumption anyway…

I can’t help but wonder how much of Scott Gray’s life would have been available for other pursuits had he not had to reverse-engineer things in order to work around less-than-stellar vendor-supplied tools.

We shouldn’t have to depend on third parties for basic performance metrics (the “Dissecting Volta/Turing” papers). The Best Practices Guide states: “If all threads of a warp access the same location, then constant memory can be as fast as a register access.” (Best Practices Guide :: CUDA Toolkit Documentation)

This may be the case, but Figure 3.9 in the “Dissecting the NVidia Turing T4 GPU via Microbenchmarking” paper ([1903.07486v1] Dissecting the NVidia Turing T4 GPU via Microbenchmarking) shows, according to their measurements, a best-case latency of around 30 cycles, more on a par with shared memory. Table 3.1 corroborates this across all recent generations.

NVIDIA should be supplying this stuff.

One could speculate why it might be beneficial not to publicly disclose hardware details, including a detailed ISA specification: (1) hide ideas that might benefit the competition; (2) retain greater flexibility in making ISA changes; (3) paper over hardware bugs (compare the Pentium FDIV bug).

I am sure we could come up with more potential reasons with a bit of brainstorming. The reality is that once a multitude of usable high-level frameworks is offered, this will suffice for 99.9% of programmers. There would have to be very good (exceptional) reasons to spend extra effort to satisfy the desires of the remaining 0.1%.

From personal practical experience, that earlier statement about uniform access to constant memory being roughly equivalent to register access does apply. Within generous bounds, latency is not a concern in GPU programming. Many basic latencies in GPUs, from arithmetic instructions to register access to caches, are longer than in CPUs (for which latency is a greater concern). In general, modern programming, and GPU programming in particular, is focused on throughput. Exceptions (e.g. high-frequency trading) confirm the rule :-)

Easily observable changes in recent ISAs (and observable changes in code-generation strategy) suggest, however, that in newer architectures (including Turing) the use of constant memory may no longer be as attractive as before. For one, FP32 arithmetic instructions are now capable of holding a complete (instead of a truncated) immediate constant FP32 operand, eliminating the need to pull it from constant memory. Also, there is an increase in the use of move-immediate instructions to load constant data, rather than using a constant-memory bank.
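One way to observe this for oneself (a sketch; the kernel is my own invention): compile a kernel containing FP32 literals for different -arch targets and inspect the generated SASS with cuobjdump --dump-sass, checking whether the literals show up as inline immediates or as constant-bank references of the form c[bank][offset]:

```
// A kernel with full-precision FP32 literal operands. Depending on the
// target architecture, the generated FFMA either encodes each literal as
// an inline immediate or reads it from a constant bank, c[bank][offset].
__global__ void fma_literal(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = 3.14159265f * x[i] + 0.5f;
}
```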

So far, I have chalked this up to the need for improved energy efficiency rather than the need to boost performance. But it is certainly possible that in the latest architectures the use of constant memory is no longer as competitive with pulling data from registers, in terms of throughput, as it was in earlier architectures.

However, to first order this is not something I need to worry about when writing CUDA software, as long as the ISA designers and compiler writers Do The Right Thing.

Agreed. I fully accept, as one who enjoys knowing what goes on under the hood and appreciates being able to try things out, that I stand well outside the demographic. :-)

As someone who helped build microprocessors for a living before getting back into software, I would certainly appreciate detailed and authoritative information about GPU microarchitectures and ISAs becoming available. Realistically and based on past observation, I just don’t see that happening.
