Does tex2D always return 4-component vector data?

In the program, texL is bound to a 2D array in which each element is a single float.
In the PTX code, loading one element from the texture, texL(ti, tj), compiles to
tex.2d.v4.f32.f32 {$f10, $f11, $f12, $f13},texL, {$f6, $f7, $f8, $f9}
and only $f10 holds the data needed; $f11 through $f13 are unused.

Does this mean a texture fetch is most efficient when loading 4-component vector data at a time, and that the fetch above wastes 3/4 of that efficiency?

No, if a 4-tuple is not being fetched, the registers are available for other uses. The compiler should generate code accordingly.
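For reference, a scalar fetch like the one in the question can be sketched as below. This is a minimal sketch, not the original poster's code: it assumes the texture object API (cudaTextureObject_t) rather than a texture reference like the texL in the question, and roundTripThroughTexture and readTexels are hypothetical names. tex2D<float> returns a single float, so even if the PTX shows a 4-wide tex.2d.v4 instruction, only the first destination register is actually consumed.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread fetches one float texel. tex2D<float> yields a scalar, so the
// other three components of the underlying 4-wide fetch are never used.
__global__ void readTexels(cudaTextureObject_t texL, float* out, int width, int height)
{
    int ti = blockIdx.x * blockDim.x + threadIdx.x;
    int tj = blockIdx.y * blockDim.y + threadIdx.y;
    if (ti < width && tj < height) {
        // With unnormalized coordinates, +0.5f addresses the texel center.
        out[tj * width + ti] = tex2D<float>(texL, ti + 0.5f, tj + 0.5f);
    }
}

// Hypothetical helper: uploads `host` into a 2D CUDA array, reads every
// element back through the texture, and returns what the kernel fetched.
// Error checking is omitted to keep the sketch short.
std::vector<float> roundTripThroughTexture(const std::vector<float>& host, int width, int height)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t array = nullptr;
    cudaMallocArray(&array, &desc, width, height);
    cudaMemcpy2DToArray(array, 0, 0, host.data(), width * sizeof(float),
                        width * sizeof(float), height, cudaMemcpyHostToDevice);

    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = array;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode = cudaFilterModePoint;       // no interpolation
    texDesc.readMode = cudaReadModeElementType;     // return raw floats
    texDesc.normalizedCoords = 0;

    cudaTextureObject_t texL = 0;
    cudaCreateTextureObject(&texL, &resDesc, &texDesc, nullptr);

    float* dOut = nullptr;
    cudaMalloc(&dOut, host.size() * sizeof(float));
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    readTexels<<<grid, block>>>(texL, dOut, width, height);

    std::vector<float> result(host.size());
    cudaMemcpy(result.data(), dOut, host.size() * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dOut);
    cudaDestroyTextureObject(texL);
    cudaFreeArray(array);
    return result;
}

int main()
{
    const int width = 4, height = 3;
    std::vector<float> in(width * height);
    for (int k = 0; k < width * height; ++k) in[k] = static_cast<float>(k);

    std::vector<float> out = roundTripThroughTexture(in, width, height);
    std::printf("%s\n", out == in ? "texture fetch matches input" : "mismatch");
    return 0;
}
```

Compiling this with nvcc -ptx and inspecting the result is one way to check how the fetch is emitted for a given toolkit version.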

But I mean, does each fetch have to use 4 registers, even if three of them carry nothing useful?

Also, accessing texture memory bound to a 2D array seems to involve several extra instructions in the PTX code for computing the address. Is there a way to get rid of those?

The compiler's register optimization step will automatically get rid of them.

So they will be kept in the PTX code, but removed in the binary executable, is that correct?

I ask because I want to estimate the performance bottleneck, and I'm not sure whether every instruction in the PTX code should be counted as one instruction in real execution.


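No. PTX is an intermediate representation: conversion to hardware code and further optimization happen later, when the PTX is translated (by ptxas, or by the driver at run time), so PTX instructions do not map one-to-one onto what the hardware executes.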

I wouldn’t worry too much about the PTX code - don’t forget premature optimization is the root of all evil.

You should take a look at decuda (written by “wumpus” in the forums):

This tool allows you to disassemble the cubin output of ptxas and see what actual instructions and register allocations are being used in the binary uploaded to the card.