Results change depending on block size: how do threads really behave?

I have run into a situation where emulation mode gives exactly the same results as the CPU code, but device code gives different results, which vary with the block size. I assume I have some sort of race condition, or that my loops get misinterpreted when compiled for the device. I have tried all the tricks I have found in other threads (-Xopencc, -O0, etc.), so the only explanation left is some hidden incorrect dependency between indices or something similar.
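For context, the kind of race I suspect is something like the following. This is a hypothetical sketch, not my actual kernel: a shared-memory reduction that happens to "work" in emulation mode (where threads run one after another on the CPU) but gives block-size-dependent results on the device if a barrier is missing.

```cuda
// Hypothetical kernel illustrating the class of bug I suspect.
// In emulation mode threads execute sequentially, so a missing
// __syncthreads() goes unnoticed; on the device, threads in
// different warps race and results change with block size.
__global__ void sumReduce(const float *in, float *out)
{
    __shared__ float buf[256];          // assumes blockDim.x <= 256
    unsigned int tid = threadIdx.x;

    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                    // omit this and device results
                                        // start to depend on block size

    // Tree reduction over shared memory
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();                // a barrier is needed after
                                        // every step as well
    }

    if (tid == 0)
        out[blockIdx.x] = buf[0];
}
```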

Is there a way to see what the threads are really doing during execution, as understandable assembly or in some similar form? I am currently most interested in the execution order and inter-thread dependencies, since I assume that is where the problem lies.
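The closest thing I know of is dumping the intermediate PTX assembly with nvcc (the file name below is just a placeholder), but as I understand it PTX is still an intermediate representation and may not reflect the actual scheduling on the hardware:

```shell
# Dump the intermediate PTX assembly of a kernel for inspection
nvcc -ptx mykernel.cu -o mykernel.ptx
```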

Of course, in the future I would like to see a colourful graphical debugger showing a simulated flow of data (hint for the developers…), but for the time being I am afraid the method is a bit more laborious. I just can't find it in the manuals.