Does anyone optimize by modifying PTX or cubin code?

I am curious whether anybody does this. It should be possible with both CUDA and OpenCL, right? Specifically, I am looking for a way to tell whether my global memory loads/stores are coalesced just by looking at the PTX, so that I can then change my high-level code to fix the problem. Would I be wasting my time with PTX for a task like this? Does anybody really know how to mess with PTX or cubin code effectively, or is it still incredibly difficult to do? Let me know; I would appreciate any elaboration :)

Look at http://forums.nvidia.com/index.php?showtopic=159033
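
On the coalescing part of the question: whether a load or store is coalesced depends on which addresses the threads of a warp touch together, so the thing to check (in the high-level code, or in the address arithmetic feeding an `ld.global`/`st.global` in the PTX) is how the index varies with `threadIdx.x`. Below is a rough host-side C++ sketch of that reasoning, not a description of what any particular GPU does. It assumes a 32-thread warp, aligned 128-byte memory segments, and an access of the form `data[threadIdx.x * stride]`; the helper name `segmentsTouched` and its parameters are made up for illustration, and real hardware rules differ between compute capabilities.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Simplified model (an assumption, not a hardware spec): one warp of 32
// threads, memory served in aligned 128-byte segments, and an array base
// that is itself aligned to a segment boundary.
// Thread t accesses element index t * stride, as in data[threadIdx.x * stride].
// Returns how many distinct segments a single warp-wide access touches.
std::size_t segmentsTouched(std::size_t stride,
                            std::size_t elemBytes,
                            std::size_t warpSize = 32,
                            std::size_t segmentBytes = 128) {
    assert(elemBytes > 0 && segmentBytes > 0);
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < warpSize; ++t) {
        // Byte offset of this thread's element from the aligned base.
        std::size_t byteOffset = t * stride * elemBytes;
        segments.insert(byteOffset / segmentBytes);
    }
    return segments.size();
}
```

Under these assumptions, 4-byte elements with `stride = 1` land in a single segment (the coalesced case), while `stride = 32` spreads the warp over 32 separate segments. The same per-thread index arithmetic is visible in the high-level code, so reading PTX is one way to confirm it rather than the only way.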
