Can the 4 warp schedulers feed the fp32 and fp64 cores concurrently? For example, suppose the instructions c = a + b (32-bit variables) and f = d + e (64-bit variables) both need to be issued. Can the 4 warp schedulers issue both calculations without much performance cost?

If the card has one third as many fp64 cores as fp32 cores, can mixed-precision code still gain some performance?

If the answer is yes, which of the approaches below performs best?

- instruction-level parallelism (3x 32-bit calculations followed by 1x 64-bit calculation, or fp64 first and 3x fp32 last)
- warp-level parallelism (the first 3x threads in a group do 32-bit, the last 1x threads do 64-bit)
- thread-group-level parallelism (for example, one whole thread group does 64-bit calculations and the other 4x groups do 32-bit)
- kernel-level parallelism (3 kernels 32-bit, 1 kernel 64-bit, 3 kernels 32-bit, 1 kernel 64-bit, all in different CUDA streams)
- dynamic parallelism (all threads do 32-bit, but some threads also spawn child kernels that do 64-bit)

If there is no single answer for all scenarios, can this configuration (3x 32-bit cores + 1x 64-bit cores) get a performance boost of 15% or more compared to a pure fp32 version?
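
To make the 15% target concrete, here is a rough back-of-the-envelope model of the best case. It is only a sketch under strong assumptions: the fp32 cores stay fully busy, the fp64 cores can be fed at the same time (which is exactly the open question), each fp64 operation counts as one unit of useful work, and memory, latency and issue-slot limits are ignored. The function names and parameters are hypothetical, chosen just for illustration.

```python
from fractions import Fraction


def ideal_speedup(fp32_cores, fp64_cores):
    """Best-case throughput ratio versus a pure fp32 run.

    Assumes both core types are fully busy every cycle (an assumption,
    not a measured property of any particular GPU).
    """
    if fp32_cores <= 0 or fp64_cores < 0:
        raise ValueError("need fp32_cores > 0 and fp64_cores >= 0")
    return Fraction(fp32_cores + fp64_cores, fp32_cores)


def min_fp64_utilization(fp32_cores, fp64_cores, target_gain):
    """Fraction of cycles the fp64 cores must be busy to reach target_gain.

    target_gain is the extra throughput wanted on top of a saturated
    fp32 pipeline, e.g. Fraction(15, 100) for 15%.
    """
    if fp32_cores <= 0 or fp64_cores <= 0:
        raise ValueError("need positive core counts")
    return target_gain * Fraction(fp32_cores, fp64_cores)


if __name__ == "__main__":
    # 3 fp32 cores per fp64 core, as in the question.
    print(ideal_speedup(3, 1))                            # 4/3
    print(min_fp64_utilization(3, 1, Fraction(15, 100)))  # 9/20
```

Under these assumptions the ceiling for a 3:1 card is about 33% more operations per cycle, and a 15% gain would need the fp64 cores to be busy roughly 45% of the time without slowing the fp32 work. Whether the schedulers can actually sustain that is what I would like to know.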

Thank you for your time