Is there anything like DPP instructions other than shfl.sync?

AMD has DPP instructions which can do cross-lane shuffling in one instruction.
I know the CUDA PTX ISA has shfl.sync, but from my experiments shfl.sync is pretty slow.
I wonder if there is a faster choice than shfl.sync.

The SHFL.IDX instruction of NVIDIA GPUs can do cross-lane shuffling in one instruction. It can be conveniently accessed via CUDA intrinsics; there is no need to drop down to the PTX level.
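For illustration, here is a minimal sketch of using the __shfl_sync() intrinsic (the kernel and names are my own, not from the documentation), which the compiler lowers to a single SHFL.IDX instruction:

```
// Minimal sketch: a full warp reversal via __shfl_sync(), lowered by the
// compiler to a single SHFL.IDX instruction.
__global__ void warp_reverse(const float *in, float *out)
{
    int lane = threadIdx.x & 31;               // lane index within the warp
    float v = in[threadIdx.x];
    // 0xffffffff: all 32 lanes of the warp participate
    float r = __shfl_sync(0xffffffffu, v, 31 - lane);
    out[threadIdx.x] = r;
}
```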

Per the documentation, the throughput of SHFL (of any flavor) is 32 per clock cycle per multiprocessor, which is 1/2 or 1/4 of the throughput of simple FP32 arithmetic instructions, depending on GPU architecture (some architectures have a throughput of 64 for FP32 arithmetic, others a throughput of 128).

That seems pretty fast to me. What’s the throughput of AMD’s DPP instructions compared to that of simple FP32 arithmetic instructions on AMD GPUs?

[Later:]

I have now taken notice of the footnote in the documentation that says SHFL throughput for compute capability 7.5 (Turing) is only 16 per cycle per multiprocessor. Bummer. If you are currently using a GPU with CC 7.5 (sm_75), you may want to try one with a different architecture.


Some interesting DPP info here, about halfway down the page:

“Spending additional instruction to move data (even with the swizzle and permutes) is unnecessary—now, most of the vector instructions can do cross-lane reading at full throughput.”


Thanks for the pointer. I am starting to get the impression that AMD and NVIDIA follow different design philosophies, with NVIDIA putting emphasis on simplicity of the compute units and then just plunking down as many units as the power envelope will allow. So far NVIDIA’s approach seems to work well for performance/watt improvements from generation to generation.

Yes, it is good to see the diversity. While I very much appreciate the CUDA ecosystem and its comprehensive documentation, it’s refreshing to see that AMD provides comprehensive instruction set manuals.

My impression is that we’ll see pigs fly before NVIDIA does the same. NVIDIA’s model of software developer interaction seems to be along the lines of Apple’s: Provide many high-level high-performance frameworks for programmers to use, but hide hardware details. The financial pay-off appears to validate that approach.

To what end, though? I’m guessing that the effort required to maintain the docs as each generational change occurs is not deemed worthwhile, given the numbers that would actually use them. That said, they have to document for internal consumption anyway…

I can’t help but wonder how much of Scott Gray’s life would have been available for other pursuits had he not had to reverse-engineer things in order to work around less-than-stellar vendor-supplied tools.

We shouldn’t have to depend on third parties for basic performance metrics (the “Dissecting Volta/Turing” papers). The Best Practices Guide states: “If all threads of a warp access the same location, then constant memory can be as fast as a register access.” (Best Practices Guide :: CUDA Toolkit Documentation)

This may be the case, but Figure 3.9 in the “Dissecting the NVidia Turing T4 GPU via Microbenchmarking” paper ([1903.07486v1] Dissecting the NVidia Turing T4 GPU via Microbenchmarking) shows, according to their measurements, a best-case latency of around 30 cycles, more on a par with shared memory. Table 3.1 corroborates this across all recent generations.

NVIDIA should be supplying this stuff.

One could speculate why it might be beneficial not to publicly disclose hardware details, including a detailed ISA specification: (1) hide ideas that might benefit the competition; (2) retain greater flexibility in making ISA changes; (3) paper over hardware bugs (compare the Pentium FDIV bug).

I am sure we could come up with more potential reasons with a bit of brainstorming. The reality is that once a multitude of usable high-level frameworks is offered, this will suffice for 99.9% of programmers. There would have to be very good (exceptional) reasons to spend extra effort to satisfy the desires of the remaining 0.1%.

From personal practical experience, that earlier statement about uniform access to constant memory being roughly equivalent to register access does apply. Within generous bounds, latency is not a concern in GPU programming. Many basic latencies in GPUs, from arithmetic instructions to register access to caches, are longer than in CPUs (for which latency is a greater concern). In general, modern programming, and GPU programming in particular, is focused on throughput. Exceptions (e.g. high-frequency trading) confirm the rule :-)

Easily observable changes in recent ISAs (and observable changes in code-generation strategy) suggest, however, that in newer architectures (including Turing) the use of constant memory may no longer be as attractive as before. For one, FP32 arithmetic instructions are now capable of holding a complete (instead of a truncated) immediate constant FP32 operand, eliminating the need to pull it from constant memory. Also, there is an increase in the use of move-immediate instructions to load constant data, rather than using a constant-memory bank.
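One way to observe this for oneself (a sketch; the kernel is my own invention): compile a kernel containing FP32 literals for different -arch targets and inspect the generated SASS with cuobjdump --dump-sass, checking whether the literals show up as inline immediates or as constant-bank references of the form c[bank][offset]:

```
// A kernel with full-precision FP32 literal operands. Depending on the
// target architecture, the generated FFMA either encodes each literal as
// an inline immediate or reads it from a constant bank, c[bank][offset].
__global__ void fma_literal(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = 3.14159265f * x[i] + 0.5f;
}
```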

So far, I have chalked this up to the need for improved energy efficiency rather than the need to boost performance. But it is certainly possible that in the latest architectures the use of constant memory is no longer as competitive with pulling data from registers, in terms of throughput, as it was in earlier architectures.

However, to first order this is not something I need to worry about when writing CUDA software, as long as the ISA designers and compiler writers Do The Right Thing.

Agreed. I fully accept, as one who enjoys knowing what goes on under the hood and appreciates being able to try things out, that I stand well outside the demographic. :-)

As someone who helped build microprocessors for a living before getting back into software, I would certainly appreciate detailed and authoritative information about GPU microarchitectures and ISAs becoming available. Realistically and based on past observation, I just don’t see that happening.
