Simultaneous FP32 and INT32 operations code sample

Does anyone have sample code showing the use of simultaneous FP32 and INT32 operations on Volta or Ampere? Is there documentation on how to use this feature in CUDA?

This simultaneous execution is a function of how hardware dispatches INT32 and FP32 operations to execution resources. This is not something that is visible at the software level. You can use CUDA profiling tools to get information about the utilization of execution pipes.

So, is there no way to indicate concurrency? Do we have to rely on the best effort of the dispatcher?

Speaking as someone who used to work on CPU designs: Hardware solutions are pretty much always preferable to software control. This is true, for example, for branch prediction, memory prefetching, and op steering.

Can one find rare cases where “perfect” software control beats hardware control mechanisms? Yes. But those software solutions tend to be very brittle, and relatively minor changes to the overall architecture tend to render them largely ineffective and sometimes downright counter-productive in terms of performance.

If there is a mix of FP32 and INT32 operations, Ampere will exploit it. Its basic execution model is that there are two pipes, one of which can do FP32 operations, and the other can do either FP32 operations or INT32 operations. So an FP32 operation can execute concurrently with another FP32 operation, or with an INT32 operation. Note that this execution model remains weighted in favor of FP32 operations, but it is a good fit for much of the software commonly run on GPUs, which (rule of thumb) very roughly averages a 2/3 FP32 to 1/3 INT32 mix.
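A minimal sketch of the kind of kernel that gives the dispatcher this opportunity (a hypothetical example of my own, not from NVIDIA documentation): each thread carries two independent dependency chains, one FP32 and one INT32, and the hardware is free to issue them to the two pipes in the same cycles. Nothing in the source asks for concurrency.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with two independent per-thread dependency
// chains: FP32 fused multiply-adds and INT32 LCG updates. The
// hardware dispatcher, not the programmer, decides to overlap them
// across the FP32 pipe and the shared FP32/INT32 pipe.
__global__ void mixed_pipes(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float f = in[idx];
    int   h = idx;

    #pragma unroll
    for (int i = 0; i < 8; ++i) {
        f = fmaf(f, 1.5f, 0.25f);       // FP32 work
        h = h * 1664525 + 1013904223;   // INT32 work (LCG step)
    }
    // Combine both chains so the compiler cannot eliminate either one.
    out[idx] = f + (float)(h & 0xFF);
}

int main()
{
    int devices = 0;
    if (cudaGetDeviceCount(&devices) != cudaSuccess || devices == 0) {
        printf("ok (no CUDA device, kernel not launched)\n");
        return 0;
    }
    const int n = 256;
    float *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    mixed_pipes<<<1, n>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("ok out[0]=%f\n", out[0]);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Whether the two chains actually overlap is something you would verify after the fact, e.g. by looking at the per-pipe utilization figures a profiler such as Nsight Compute reports for this kernel.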

I am curious: Is there a particular use case that you believe explicit software control for op-steering would handle in superior fashion to Ampere’s hardware dispatcher? If so, is that based on profiler data, or conjecture on theoretical grounds or statistical arguments?

I get it; my mistake was keeping a general mindset, thinking about concurrency OpenMP-style. But here we are talking about kernels whose threads are concurrent by design. Thanks for the insight! I understand now why a developer has little room to improve on the hardware behavior in this case.

Given the current hardware preference for FP32 operations (i.e. INT32 cannot execute concurrently with INT32), for some workloads it might be helpful for performance to shift some work from integer processing to FP32 processing, which is sometimes possible where simple arithmetic is involved. I have not actually tried that.
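As an illustration of that idea (my own sketch, untested on hardware, with a made-up function name and constants): FP32 represents every integer of magnitude below 2^24 exactly, so simple add/multiply arithmetic on small integers can be carried out on the FP32 pipe without rounding error.

```cuda
#include <cassert>
#include <cmath>
#include <cstdio>

// Hypothetical helper: computes x * 3 + 7 using FP32 arithmetic
// instead of INT32 arithmetic. Exact as long as |x * 3 + 7| < 2^24,
// because FP32 represents every integer in that range without
// rounding. Marked __host__ __device__ so the equivalence can be
// checked on the CPU as well.
__host__ __device__ int scale_via_fp32(int x)
{
    return (int)fmaf((float)x, 3.0f, 7.0f);
}

int main()
{
    // Host-side check that the FP32 version matches plain integer
    // math over a range where every intermediate fits in 24 bits.
    for (int x = -100000; x <= 100000; ++x)
        assert(scale_via_fp32(x) == x * 3 + 7);
    printf("ok\n");
    return 0;
}
```

Whether such a substitution is actually a net win on a given workload would have to be confirmed with profiler data, per the caveats below.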

However, I would personally consider this a kind of ninja-level programming that tends to obfuscate code and makes code maintenance more difficult, and therefore should be attempted only after “normal” optimization techniques suggested by profiler data have been exhausted. With today’s GPU hardware, the first order of business in software optimization is typically optimizing data movement. FLOPS and IOPS are frequently “too cheap to meter”.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.