@Greg Thank you for your explanation!
I see. Is such information documented anywhere publicly, or does it only come out through answers on this forum, SO, etc. by NVIDIA employees? I found this Turing whitepaper, but it does not seem to describe the TU117 chip used by the Quadro T2000, and certainly not in such detail.
Oh, I understand. Hm, I tried to reorder the PTX by hand, interleaving the DFMA and LDG instructions near the end (the PTX is what gets stored when using OpenCL's feature to save the program as a binary file, and it can be loaded in a subsequent run), but the runtime did not change. Unfortunately I cannot analyze that in ncu using pOCL, because there the stored binary file seems to really be in binary format instead of PTX. However, I was able to retrieve the SASS code by calling:
nvcc -arch=sm_75 residualFixedKernelVersions.ptx -dlink
cuobjdump -sass a_dlink.o
It reveals that the compiler seems to ignore my reordering attempts, still listing all the DFMA instructions at the end (I had also read something similar somewhere before, but wanted to try it myself). So I guess there is no feasible way to manually reorder the SASS instructions (as already thoroughly discussed here). Maybe, based on one of your(?) answers on SO, I might be able to insert dependent instructions that prevent ptxas from reordering things. Maybe I'll have a look into that… And I'll try to get access to a cluster GPU to try out the kernels!
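In case it is useful to anyone, here is a rough sketch of how the "all DFMA at the end" observation could be checked automatically instead of by reading the dump. It assumes the usual cuobjdump -sass text layout (an address comment such as /*0010*/ followed by the opcode). The function names and the SAMPLE listing are made up for illustration and are not taken from my actual kernel.

```python
import re

# One instruction line of a `cuobjdump -sass` listing looks roughly like
#   /*0090*/    @P0 LDG.E.64.SYS R4, [R2] ;
# Capture the opcode (with its dot-suffixes), skipping an optional predicate.
# Encoding-only lines such as "/* 0x000f... */" do not match.
SASS_LINE = re.compile(
    r"^\s*/\*[0-9a-fA-F]{4,}\*/\s+(?:@!?U?P\w+\s+)?([A-Z][A-Z0-9_.]*)"
)

def opcode_sequence(sass_text, watched=("DFMA", "LDG")):
    """Return the watched base opcodes in the order they appear."""
    seq = []
    for line in sass_text.splitlines():
        m = SASS_LINE.match(line)
        if m:
            base = m.group(1).split(".")[0]  # LDG.E.64.SYS -> LDG
            if base in watched:
                seq.append(base)
    return seq

def switches(seq):
    """Count how often consecutive watched opcodes differ.

    1 means "all loads, then all DFMAs" (no interleaving);
    larger values mean the two kinds are actually interleaved.
    """
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)

# Hypothetical listing, only to show the expected input shape.
SAMPLE = """
        /*0000*/                   MOV R1, c[0x0][0x28] ;
        /*0010*/                   LDG.E.64.SYS R4, [R2] ;
        /*0020*/                   LDG.E.64.SYS R6, [R2+0x8] ;
        /*0030*/                   DFMA R8, R4, R6, R8 ;
        /*0040*/                   DFMA R8, R4, R4, R8 ;
        /*0050*/                   EXIT ;
"""

sample_seq = opcode_sequence(SAMPLE)
sample_switches = switches(sample_seq)
```

To use it on a real dump, redirect the cuobjdump output from above into a text file and pass the file contents to opcode_sequence. Comparing the switches value before and after editing the PTX shows whether ptxas kept the manual interleaving.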
Again, thank you very much!