Is it possible to have the FP unit and INT unit in the same core work in parallel?

Suppose a single block contains 256 threads, half of the warps in the block do integer arithmetic, and the other half do floating-point arithmetic. Can this be faster than a version that only uses int?

In other words, can CUDA achieve instruction-level parallelism between int instructions and float instructions?
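To make the scenario concrete, here is a minimal sketch of the kind of kernel I have in mind (the names and the arithmetic are just filler work): a 256-thread block, i.e. 8 warps, where the first 4 warps do only integer math and the last 4 do only floating-point math.

```cuda
// Hypothetical illustration only: a 256-thread block (8 warps) where warps 0..3
// do pure integer arithmetic and warps 4..7 do pure floating-point arithmetic.
__global__ void mixedWarps(unsigned int *iout, float *fout, int iters)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / 32;          // warp index within the block

    if (warp < 4) {                       // warps 0..3: INT32 work
        unsigned int v = tid;
        for (int i = 0; i < iters; ++i)
            v = v * 3u + 7u;              // integer multiply-add
        iout[tid] = v;
    } else {                              // warps 4..7: FP32 work
        float v = (float)tid;
        for (int i = 0; i < iters; ++i)
            v = v * 1.5f + 0.25f;         // floating-point multiply-add
        fout[tid] = v;
    }
}
// e.g. mixedWarps<<<numBlocks, 256>>>(d_iout, d_fout, 10000);
```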

According to the Turing whitepaper, page 11, “The Turing SM supports concurrent execution of FP32 and INT32 operations”, which means the INT32 cores can now run in parallel with the FP32 cores without blocking each other. That implies architectures before Turing are not likely to have this property.

Thanks xgr_1986! Your answer is really clear and helpful!

With respect to 32-bit integer arithmetic, all current GPUs have dedicated integer add units.
Kepler, Volta, and Turing have dedicated integer multiply units.

Programming Guide :: CUDA Toolkit Documentation

If memory serves me well, Kepler already had the ability to dual-issue FP32 and INT32.
I think even Tesla and Fermi may have been able to dual-issue FP32 and integer addition. Integer multiplication and FP32 could not be dual-issued because integer multiplication used the floating-point multiplier network. This probably is also the reason why Maxwell and Pascal lost the ability to dual-issue integer and floating-point multiplication.

I might be misremembering some facts. Let’s see if someone else has better memory, or time and hardware at hand to check.

EDIT: As usual, Robert was faster.

I was really confused by you guys’ answers…

edimetia3d asked whether using half int and half float in a block would be faster than using all int, or whether the FP/INT units can run concurrently in the SAME core. (Well, the content of the post does not quite match the title…) I took that to mean: will the instruction throughput of int operations be affected by float operations? So what exactly changed in Turing’s new integer path? Could anyone explain this paragraph from the Turing whitepaper in more detail?

This is NVIDIA being intentionally vague, and thus more technical marketing than information useful to programmers. E.g. Intel documents (or at least used to document, they have become more secretive as well when it comes to micro-architectural details) how each instruction maps to the available five or six issue ports, and what functional units are provided at each port.

Why are vendors being secretive about details of their processor implementations? They think it hurts their competitive position and they do not want the competition to know what their “secret sauce” is. With the effective end of Moore’s Law, the competitive battleground over the next few years will be microarchitecture, so I would expect vendors to become more guarded than ever.

In the absence of detailed information, all we can tell from the language above is that there is some amount of dual-issue capability between integer and floating-point instructions on Turing, but we don’t know what restrictions apply. So without targeted experiments, the answers to the OP questions would appear to be

  1. Maybe
  2. On some GPU architectures, with unknown restrictions

In practical terms, I would suggest simply trying to exploit the potential parallelism (beyond the mix of integer and floating-point instructions that falls out of the compiler naturally) for improved performance. If you achieve success, report back here to save fellow programmers some work :-)
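For anyone who wants to run that experiment, below is a minimal, self-contained sketch of one way to do it with CUDA events. The kernel names, grid/block sizes, and filler arithmetic are all placeholders, and the outcome will of course depend on the GPU architecture; this is a starting point, not a definitive benchmark.

```cuda
// Hypothetical benchmark: all-INT32 kernel vs. a kernel in which half the warps
// switch to FP32 work. Workload sizes and the filler arithmetic are arbitrary.
#include <cstdio>

__global__ void allIntKernel(unsigned int *out, int iters)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int v = tid;
    for (int i = 0; i < iters; ++i)
        v = v * 3u + 7u;                      // INT32 only
    out[tid] = v;
}

__global__ void mixedKernel(unsigned int *iout, float *fout, int iters)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x >> 5) & 1) {             // odd warps: FP32 work
        float v = (float)tid;
        for (int i = 0; i < iters; ++i)
            v = v * 1.5f + 0.25f;
        fout[tid] = v;
    } else {                                  // even warps: INT32 work
        unsigned int v = tid;
        for (int i = 0; i < iters; ++i)
            v = v * 3u + 7u;
        iout[tid] = v;
    }
}

int main()
{
    const int blocks = 4096, threads = 256, iters = 10000;
    unsigned int *iout = 0;
    float *fout = 0;
    cudaMalloc((void **)&iout, (size_t)blocks * threads * sizeof(unsigned int));
    cudaMalloc((void **)&fout, (size_t)blocks * threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    allIntKernel<<<blocks, threads>>>(iout, iters);       // warm-up
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    allIntKernel<<<blocks, threads>>>(iout, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msInt = 0.0f;
    cudaEventElapsedTime(&msInt, start, stop);

    cudaEventRecord(start);
    mixedKernel<<<blocks, threads>>>(iout, fout, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msMix = 0.0f;
    cudaEventElapsedTime(&msMix, start, stop);

    printf("all INT32: %.3f ms   half INT32 / half FP32: %.3f ms\n", msInt, msMix);

    cudaFree(iout);
    cudaFree(fout);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```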

AFAIK the only architecture for which mixed INT/FP capability has been clearly marketed is Turing.

The obvious advantage is the ability to perform addressing (integer) and compute (floating-point) on “arrived data” at the same time.

So, for example, while we’re processing data from a previous load instruction, we could theoretically compute the address for the next load instruction concurrently.
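To make that concrete, here is a hand-written sketch of the overlap (the names, the stride pattern, and the unroll count are made up for illustration; in practice the compiler typically interleaves independent integer and floating-point instructions like this on its own):

```cuda
// Hypothetical sketch of overlapping address computation (INT32) with work on
// previously loaded data (FP32), i.e. simple software pipelining.
__global__ void pipelinedGather(const float *in, float *out, int n, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int   idx = tid;                 // INT32 pipeline: current index
    float cur = in[idx];             // first load
    float acc = 0.0f;

    for (int i = 0; i < 4; ++i) {
        int   nextIdx = (idx + stride) % n;  // INT32: address for the next load,
                                             // independent of the FP32 work below
        float next    = in[nextIdx];         // issue the next load early
        acc += cur * 1.5f + 0.25f;           // FP32: process the previously loaded value
        cur  = next;
        idx  = nextIdx;
    }
    out[tid] = acc;
}
// e.g. pipelinedGather<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 97);
```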

Having dedicated INT32/FP32 units is not the same as using them both at the same time, which is what the OP was asking about.

From the Turing whitepaper [1]:

"A third factor to consider for Turing is the introduction of integer execution units that can
execute in parallel with the FP32 CUDA cores. Analyzing a breadth of shaders from current
games, we fou
nd that for every 100 FP32 pipeline instructions there are about 35 additional
instructions that run on the integer pipeline. In a single
-pipeline architecture, these are
instructions that would have had to run serially and take cycles on the CUDA cores, b
ut in the
Turing architecture they can now run concurrently. In the timeline above, the integer pipeline
is assumed to be active for about 35% of the shading time. "

[1] https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf

That is certainly true. I don’t see where anyone in this thread has disputed that or stated otherwise.


Well at least you haven’t ;-)

Let’s not waste time nit-picking semantics.

It’s possible I may be confused about what exactly is being asked. But I certainly don’t get or agree with that statement as it stands.

NVIDIA certainly marketed simultaneous use of INT32 and FP32 cores in Volta.

https://devblogs.nvidia.com/inside-volta/

"Unlike Pascal GPUs, which could not execute FP32 and INT32 instructions simultaneously, the Volta GV100 SM includes separate FP32 and INT32 cores, allowing simultaneous execution of FP32 and INT32 operations at full throughput, while also increasing instruction issue throughput. "

And while I haven’t looked up a chapter-and-verse equivalent quotation for Kepler, I’d be surprised if you cannot use the int32 and fp32 hardware at the same time on Kepler. Again, I really don’t even know what is meant by that statement.

As I said, I don’t think there is any major disagreement among posters in this thread; it is mostly quibbling about terminology. This thread and any potential confusion likely wouldn’t exist if NVIDIA provided clear and detailed information about how instructions execute on their various processor architectures, and about what exactly the limitations are on any dual-issue capabilities that exist.

There is a difference between issuing and executing instructions. It is certainly possible to have a processor that has separate execution units for floating-point and integer instructions, thus allowing for their concurrent execution, but that can issue only one instruction (of either kind) per cycle. In fact this was the situation with x86 processors like the Intel 486.

Dual issue of instructions may be inhibited due to resource conflicts other than access to an execution pipe, e.g. register file access or result bus conflict. What restrictions exist in this regard on Volta and Turing is anybody’s guess.