Tuning for Turing, 32-bit integer math, 16 or 32 wide SIMDs?


Are the 32-bit integer SIMD units on Turing 16 or 32 wide?


According to the docs, Turing's Compute Units have 64 cores each, organized as
four 16-wide SIMD units. Unfortunately, my code can only run 2 Workgroups per
Compute Unit efficiently, unlike on Pascal, where it ran 4 Workgroups per
Compute Unit efficiently. So my guess is that Turing's integer units are 32
cores wide, not 16, and therefore there are only two per Compute Unit.
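
To spell out the arithmetic behind that guess, here is a minimal sketch in C.
The helper name simd_units_per_cu is made up for illustration, and it assumes
the SIMD units split the cores of a Compute Unit evenly. It only restates the
two hypotheses; it does not measure anything on the hardware.

```c
/* Hypothetical helper: number of SIMD units per Compute Unit for a given
   core count and an assumed SIMD width. Returns -1 if the width is not
   positive or does not divide the core count evenly. */
int simd_units_per_cu(int cores_per_cu, int simd_width)
{
    if (simd_width <= 0 || cores_per_cu % simd_width != 0)
        return -1;
    return cores_per_cu / simd_width;
}
```

With 64 cores per Compute Unit, simd_units_per_cu(64, 16) gives 4 and
simd_units_per_cu(64, 32) gives 2. The second case matches the 2 Workgroups
per Compute Unit I see on Turing, which is why I suspect 32 wide integer units.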

Can anyone confirm or refute my guess?

The code I run (an OpenCL chess engine):

I run 64 GPU threads per Workgroup (Block in CUDA terms).