Thanks, njuffa, for the helpful feedback. I will read the Best Practices Guide in more detail. The last time I read the guide carefully was about 5 years ago …
By disabling the offending if() block we found earlier, I was able to run the PC sampling profiler again. I now see some new findings and would like some help interpreting the assembly code.
Over the last week, I implemented a new RNG (xorshift128+) in the hope of getting better speed. I now see a different pattern in the PC sampling profiler output. Memory dependency, which previously accounted for only 2-3% of the latency, is back on the scene, even though the overall running speed is pretty decent (24k photon/ms on a 980Ti, higher than it was before).
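For readers unfamiliar with xorshift128+: the generator keeps two 64-bit words of state and produces each output with a few shifts, XORs, and one addition. Below is a minimal C sketch, not the MCX implementation. The struct and function names are hypothetical, the shift constants (23, 18, 5) are one published parameter set and may differ from what MCX uses, and the float conversion is just one common way to map the output to [0, 1).

```c
#include <stdint.h>

/* Hypothetical state holder; the two words must not both be zero. */
typedef struct {
    uint64_t s[2];
} xs128p_state;

/* Advance the state and return the next 64-bit output.
   Shift constants 23/18/5 are an assumption (one published variant). */
uint64_t xs128p_next(xs128p_state *st)
{
    uint64_t s1 = st->s[0];
    const uint64_t s0 = st->s[1];
    const uint64_t result = s0 + s1;   /* the "+" in xorshift128+ */

    st->s[0] = s0;
    s1 ^= s1 << 23;
    st->s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
    return result;
}

/* Map the top 24 bits of the output to a float in [0, 1). */
float xs128p_uniform(xs128p_state *st)
{
    return (float)(xs128p_next(st) >> 40) * (1.0f / 16777216.0f);
}
```

In a GPU kernel each thread would presumably carry its own state, seeded with distinct nonzero values, so that the whole generator lives in registers.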
I notice that almost 100% of this memory dependency comes from a single line of code (line #622),
if(idx1d!=idx1dold && idx1dold>0 && mediaidold){
which now accounts for 1/3 of the total run time. In the assembly, almost 100% of the memory dependency comes from the single instruction below:
I2I.S32.S16 R57, R6;
I am attaching a screenshot of the PC sampling benchmark output. The hotspots in both the source code (top-left) and the assembly code (top-right) are highlighted.
The variable mediaidold is a char (the label of the medium), read from the global memory array media on line #609. I suspect the I2I.S32.S16 instruction is retrieving the value of mediaidold. Is there a document where I can read more about the assembly instructions?
PS: I just changed line #609 from mediaidold=media[idx1d]; to mediaidold=mediaid; and MCX got a nice 40% speed improvement (jumping from 24k photon/ms to 34k photon/ms on Maxwell)! I guess this confirms my suspicion.
On the other hand, the improvement on Fermi and Kepler was not as exciting, only about 10%. For comparison, my 980Ti is about 10x faster than one core of a 590 (Fermi). I wish I could see what is happening on the older GPU architectures. Unfortunately, the PC sampling profiler only runs on Maxwell.
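To make the change to line #609 concrete, here is a small host-side C sketch of the two variants. It is not the MCX kernel. The function names, the walk_result struct, and the labelsum placeholder are hypothetical, and it assumes that mediaid already holds media[idx1d] when the assignment is reached. Under that assumption both variants compute the same thing, but the second reuses a value already in a register instead of issuing another read of the media array.

```c
#include <stddef.h>

/* Hypothetical result of walking a sequence of voxel indices. */
typedef struct {
    int  crossings;  /* how often the line #622 condition fired */
    long labelsum;   /* placeholder for work that uses the current label */
} walk_result;

/* Variant 1: re-read the label from the array (mediaidold = media[idx1d]). */
walk_result walk_reload(const unsigned char *media, const int *path, size_t n)
{
    walk_result r = {0, 0};
    int idx1dold = 0;
    unsigned char mediaidold = 0;
    for (size_t i = 0; i < n; i++) {
        int idx1d = path[i];
        unsigned char mediaid = media[idx1d];  /* first read */
        r.labelsum += mediaid;
        if (idx1d != idx1dold && idx1dold > 0 && mediaidold)
            r.crossings++;
        idx1dold = idx1d;
        mediaidold = media[idx1d];             /* second read of the same element */
    }
    return r;
}

/* Variant 2: reuse the value that was already loaded (mediaidold = mediaid). */
walk_result walk_cached(const unsigned char *media, const int *path, size_t n)
{
    walk_result r = {0, 0};
    int idx1dold = 0;
    unsigned char mediaidold = 0;
    for (size_t i = 0; i < n; i++) {
        int idx1d = path[i];
        unsigned char mediaid = media[idx1d];  /* only read */
        r.labelsum += mediaid;
        if (idx1d != idx1dold && idx1dold > 0 && mediaidold)
            r.crossings++;
        idx1dold = idx1d;
        mediaidold = mediaid;                  /* no extra memory access */
    }
    return r;
}
```

On a CPU the compiler may well merge the two reads by itself. Whether a GPU compiler can do the same depends on the surrounding kernel code, which may be why the explicit change made such a difference on Maxwell.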