Cost of a tex1Dfetch miss?

Hi

What is the cost of a tex1Dfetch miss in the texture cache? Is a miss as slow as an access to global memory, or is it slower for some reason?

My second, related question is about compiler and device behavior. I have the following snippet:

– code
float4 data1 = tex1Dfetch(texture, index);
float4 data2 = tex1Dfetch(texture, index+1);

// du and dv do not depend on data1 or data2
du = Some_Heavy_Calculations_Including_sincos_and_exp;
dv = Some_Heavy_Calculations_Including_sincos_and_exp;

result = (data1 * du) + (data2 * dv);
– end of code

Does the compiler reorder those two tex1Dfetch calls so that they occur AFTER du and dv are calculated? (Neither calculation relies on data1 or data2.)
From what I see in the decuda output, the compiler does move those fetches, to minimize register usage :/

If the device works as I think it does (the request is sent to the texture unit and execution CONTINUES until the fetched data is needed; if the data is not ready at that point the thread goes to sleep, otherwise it carries on), then this reordering is not the best option, because the memory latency is no longer hidden by the other calculations.
Am I right, or does the device hide memory latency only by switching to another warp/block, with tex*Dfetch always blocking the thread until the data is ready, even if it is used several instructions later?
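
For anyone who wants to check the instruction order on their own toolchain, here is a minimal self-contained sketch of the snippet above. It is an assumption-laden stand-in, not the original kernel: it uses the texture object API (newer than the CUDA version in this thread), and `heavy`, `blend` and `run_blend` are hypothetical names, with `heavy` only imitating the unspecified sincos/exp math. Compiling it and dumping the result (with decuda, or `cuobjdump -sass` on newer toolkits) should show where the two fetches end up relative to the math.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the heavy sincos/exp work; the real formula is not given.
__device__ float heavy(float x)
{
    float s, c;
    sincosf(x, &s, &c);
    return expf(-x * x) * (s + c);
}

__global__ void blend(cudaTextureObject_t tex, float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Fetches come first in source order, as in the post.
    float4 d1 = tex1Dfetch<float4>(tex, i);
    float4 d2 = tex1Dfetch<float4>(tex, i + 1);

    // Independent of d1 and d2, so the compiler is free to move it.
    float du = heavy(0.001f * i);
    float dv = heavy(0.002f * i);

    out[i] = make_float4(d1.x * du + d2.x * dv,
                         d1.y * du + d2.y * dv,
                         d1.z * du + d2.z * dv,
                         d1.w * du + d2.w * dv);
}

// Host helper: builds a texture of n + 1 texels, runs the kernel on n
// outputs, and returns out[probe].x. Error checking is omitted for brevity.
float run_blend(int n, int probe)
{
    std::vector<float4> h(n + 1);
    for (int k = 0; k <= n; ++k)
        h[k] = make_float4(1.0f * k, 2.0f * k, 3.0f * k, 4.0f * k);

    float4 *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, (n + 1) * sizeof(float4));
    cudaMalloc(&dOut, n * sizeof(float4));
    cudaMemcpy(dIn, h.data(), (n + 1) * sizeof(float4), cudaMemcpyHostToDevice);

    // Bind the linear buffer to a texture object.
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeLinear;
    res.res.linear.devPtr = dIn;
    res.res.linear.desc = cudaCreateChannelDesc<float4>();
    res.res.linear.sizeInBytes = (n + 1) * sizeof(float4);

    cudaTextureDesc td = {};
    td.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    blend<<<(n + 127) / 128, 128>>>(tex, dOut, n);

    // This blocking copy also waits for the kernel to finish.
    float4 r;
    cudaMemcpy(&r, dOut + probe, sizeof(float4), cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFree(dIn);
    cudaFree(dOut);
    return r.x;
}
```

The host helper exists only so the kernel can be built and run; the part worth inspecting is the disassembly of `blend`.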

Thanks for any answers. :)

Making data1 and data2 volatile would be worth a try. It prevents the compiler from delaying the texture load.

Christian

volatile float4 data1 = tex1Dfetch(PolysTex, idx);

1>Performing Custom Build Step
1>KDKernel.cu
1>KDKernel.cu(618): error: no operator "=" matches these operands
1> operand types are: volatile float4 = float4
1>1 error detected in the compilation of "KDKernel.cpp1.ii".

(CUDA 2.1)
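
One possible way around the compile error, as a sketch only: float4 is a plain struct with no assignment operator that accepts a volatile left-hand side, but each float component can be copied into a volatile scalar. Whether this actually keeps the load early is an assumption that has to be checked in the disassembly. `through_volatile`, `copy_kernel` and `run_volatile_copy` are hypothetical names, and the input is read from global memory here only to keep the example short; in the kernel from this thread the argument would be the tex1Dfetch result.

```cuda
#include <cuda_runtime.h>

// Copies a float4 through four volatile scalars. Scalars accept the
// volatile qualifier, so no "volatile float4 = float4" operator is needed.
__device__ float4 through_volatile(float4 v)
{
    volatile float x = v.x;
    volatile float y = v.y;
    volatile float z = v.z;
    volatile float w = v.w;
    return make_float4(x, y, z, w);
}

__global__ void copy_kernel(const float4 *in, float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // In the thread's kernel this would wrap tex1Dfetch(PolysTex, idx).
        out[i] = through_volatile(in[i]);
    }
}

// Host helper: sends one float4 through the kernel and returns the sum of
// its components. Error checking is omitted for brevity.
float run_volatile_copy()
{
    float4 h = make_float4(1.0f, 2.0f, 3.0f, 4.0f);
    float4 *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, sizeof(float4));
    cudaMalloc(&dOut, sizeof(float4));
    cudaMemcpy(dIn, &h, sizeof(float4), cudaMemcpyHostToDevice);

    copy_kernel<<<1, 32>>>(dIn, dOut, 1);

    // This blocking copy also waits for the kernel to finish.
    float4 r;
    cudaMemcpy(&r, dOut, sizeof(float4), cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    return r.x + r.y + r.z + r.w;
}
```

The values are unchanged by the round trip; the only intended effect is on how the compiler schedules the load.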

Is there a memory barrier for the compiler (similar to __syncthreads(), but one that only affects the generated code and does not emit any instructions)?