What is the cost of a tex1Dfetch miss in the texture cache? Is it as slow as an access to global memory, or is it even slower for some reason?
A second, related question is about compiler and device behavior. I have the following snippet:
float4 data1 = tex1Dfetch(texture, index);
float4 data2 = tex1Dfetch(texture, index+1);
float du = Some_Heavy_Calculation_Including_sincos_and_exp();
float dv = Some_Heavy_Calculation_Including_sincos_and_exp();
float4 result = (data1 * du) + (data2 * dv);
– end of code
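For context, here is a hedged sketch of what a complete kernel around that snippet might look like. The texture reference name, the kernel signature, and the sincos/exp "heavy" math are my own placeholders, not the original code; the two back-to-back fetches and the final blend match the snippet above. Note that float4 has no built-in operator*, so the blend is written out componentwise.

```cuda
#include <cuda_runtime.h>

// Placeholder 1D texture reference bound to float4 data (old texture-reference API).
texture<float4, 1, cudaReadModeElementType> tex;

__global__ void blend_kernel(float4 *out, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= n - 1) return;

    // The two fetches in question, issued back-to-back in the source order.
    float4 data1 = tex1Dfetch(tex, index);
    float4 data2 = tex1Dfetch(tex, index + 1);

    // Placeholder for the heavy sincos/exp work; note it does not
    // depend on data1 or data2, so in principle it could overlap the fetches.
    float s, c;
    sincosf(0.5f * index, &s, &c);
    float du = s * expf(-0.1f * index);
    float dv = c * expf(-0.2f * index);

    // Componentwise equivalent of result = (data1 * du) + (data2 * dv).
    out[index] = make_float4(data1.x * du + data2.x * dv,
                             data1.y * du + data2.y * dv,
                             data1.z * du + data2.z * dv,
                             data1.w * du + data2.w * dv);
}
```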
Does the compiler reorder those two tex1Dfetch calls so that they occur AFTER du and dv are computed? (Neither calculation depends on data1 or data2.)
From what I can see in the decuda output, the compiler does reorder those fetches to minimize register usage.
If the device works the way I think it does (the request is issued to the texture unit and execution CONTINUES until the point where the fetched data is actually needed; if the data is not ready at that point, the thread goes to sleep, otherwise it continues), then this reordering is not optimal, since the memory-access latency is not well hidden by the other calculations.
Am I right here, or does the device hide memory latency by switching to another warp/block, with tex*Dfetch always blocking the thread until the data is ready, even if the result is only used several instructions later?
Thanks for any answers. :)