Counting flops: what's in and what's out?

Hi. I want to count the floating-point operations in my CUDA program, without going down to the machine code if possible. Whenever I try, even with the most basic count that discards special operations such as square root, I get more than what the Nsight analysis tool reports. I think I can trust the Nsight counter; however, I need to know what to count in order to estimate the flop/byte ratio of the solver.

On this topic, I would also like to know how the instructions-per-byte figure reported by the Visual Profiler relates to flop/byte. Of course, a floating-point operation counts as an instruction. My question is how this ratio should be split between flops and non-flops, and what exactly is included in the non-flops (branches, conditionals, jumps, … every single instruction?). The trivial answer is perhaps that all of it should be flops, but any interesting problem will have some index arithmetic to deal with, so I suppose the question can be reformulated this way: how much index arithmetic can a “flop” carry? Or, how much index arithmetic per byte is tolerable for every flop per byte?
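To make the kind of hand count I mean concrete, here is a minimal sketch for a hypothetical SAXPY-style kernel body. The kernel, the function names `hand_count` and `ratios`, and the guess of three non-flop instructions per thread (index multiply-add, bounds compare, branch) are all assumptions for illustration, not profiler output. It also shows that the total changes depending on whether a fused multiply-add is tallied as one operation or two, which may or may not be related to the mismatch I see.

```python
# Hand count for a hypothetical SAXPY-style kernel body:
#     i = blockIdx.x * blockDim.x + threadIdx.x
#     if (i < n) y[i] = a * x[i] + y[i];
# Every per-thread number here is an illustrative assumption, not profiler output.

BYTES_PER_FLOAT = 4


def hand_count(n_threads, fma_as_two_flops=True, index_instr_per_thread=3):
    """Return a dict of flops, non-flop instructions and global-memory bytes."""
    # One multiply and one add per element: counted as 2 flops, or as 1 if a
    # fused multiply-add is tallied as a single operation.
    flops_per_thread = 2 if fma_as_two_flops else 1
    # Global traffic per thread: load x[i], load y[i], store y[i].
    bytes_per_thread = 3 * BYTES_PER_FLOAT
    return {
        "flops": n_threads * flops_per_thread,
        "non_flops": n_threads * index_instr_per_thread,  # assumed, not measured
        "bytes": n_threads * bytes_per_thread,
    }


def ratios(count):
    """Flop/byte and non-flop instructions per flop for a hand count."""
    return {
        "flop_per_byte": count["flops"] / count["bytes"],
        "non_flop_per_flop": count["non_flops"] / count["flops"],
    }


if __name__ == "__main__":
    for as_two in (True, False):
        c = hand_count(1_000_000, fma_as_two_flops=as_two)
        print("FMA counted as two flops:", as_two, ratios(c))
```

With these assumed numbers the flop/byte ratio is 2/12 when the multiply and add are counted separately and 1/12 when they are counted as one fused operation, while the index arithmetic stays the same in both cases.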

How is instructions per byte computed? Is it by counting the number of instructions and the total memory requests (to which memories, exactly?) in the generated machine code and taking their ratio? This method looks like it could be well off the “effective” instructions per byte, say if some threads are restricted to doing some extra computation or extra memory reads. An automatic instructions-per-byte figure, if computed the way I am assuming, can be quite misleading.
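A toy model of this worry, assuming the profiler simply divides total instructions by total bytes over all threads (that is my guess, not something I have confirmed). The per-thread counts are made up, and `aggregate_ratio` and `per_thread_ratios` are hypothetical helper names.

```python
# Toy model: one aggregate ratio versus the ratio of a typical thread.
# The per-thread counts are made up for illustration.


def aggregate_ratio(instr_per_thread, bytes_per_thread):
    """Total instructions divided by total bytes over all threads."""
    return sum(instr_per_thread) / sum(bytes_per_thread)


def per_thread_ratios(instr_per_thread, bytes_per_thread):
    """Instructions per byte computed separately for each thread."""
    return [i / b for i, b in zip(instr_per_thread, bytes_per_thread)]


if __name__ == "__main__":
    # 31 threads execute 10 instructions each; one thread executes 100
    # (e.g. extra boundary work). Every thread moves 8 bytes.
    instr = [10] * 31 + [100]
    nbytes = [8] * 32
    print("aggregate:", aggregate_ratio(instr, nbytes))
    print("distinct per-thread:", sorted(set(per_thread_ratios(instr, nbytes))))
```

Here the aggregate comes out at 410/256, about 1.6, although 31 of the 32 threads sit at 1.25 and a single thread sits at 12.5, so the single number describes none of them well.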

The ideal balance of instructions per byte is different for each machine; how is it computed? If I am not mistaken, the ratio tells us how many instructions must be executed to match exactly one byte arriving from memory. Which memory is that? Global memory is a safe assumption, but what about shared memory? If a program does most of its computation out of shared memory or registers, the ideal balance should be much lower, and false conclusions will then be drawn from the reported ratio…
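As I understand it, the ideal balance would be the peak instruction rate divided by the peak bandwidth of whichever memory is meant. A minimal sketch under that assumption follows; the peak figures are invented placeholders, not the specifications of any real GPU, and `ideal_balance` is a hypothetical helper.

```python
# Ideal instructions per byte = peak instruction rate / peak memory bandwidth.
# All peak figures are invented placeholders, not real hardware specs.


def ideal_balance(peak_instr_per_s, peak_bytes_per_s):
    """Instructions that must be issued per byte to keep both units busy."""
    return peak_instr_per_s / peak_bytes_per_s


PEAK_INSTR = 1.0e12   # assumed peak instructions per second
GLOBAL_BW = 1.5e11    # assumed global-memory bandwidth, bytes per second
SHARED_BW = 1.5e12    # assumed shared-memory bandwidth, bytes per second

if __name__ == "__main__":
    print("vs global memory:", ideal_balance(PEAK_INSTR, GLOBAL_BW))
    print("vs shared memory:", ideal_balance(PEAK_INSTR, SHARED_BW))
```

With these placeholders the balance against global memory is about 6.7 instructions per byte, but only about 0.67 against the faster shared memory, which is why it matters which memory the reported ratio refers to.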

I seriously think that I am missing critical pieces of information, so any reply to clarify things is welcome.