```
#define FL4(f) make_float4(f, f, f, f)

float4 S = FL4(r->vec.x) * normal[eid]
         + FL4(r->vec.y) * normal[eid + 1]
         + FL4(r->vec.z) * normal[eid + 2];
float4 T = normal[eid + 3]
         - (FL4(r->p0.x) * normal[eid]
          + FL4(r->p0.y) * normal[eid + 1]
          + FL4(r->p0.z) * normal[eid + 2]);
```
`normal[i]` is a read-only `float4` array in global memory. As you can see, I need to read three normals to compute `S` and four normals to compute `T` (three of which overlap with the reads for `S`). In total, these two lines read 64 bytes, and they are responsible for nearly 90% of the memory latency of my code.
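For what it's worth, the 64-byte figure can be double-checked with a small host-side sketch (the struct below is a plain C++ stand-in for CUDA's `float4`, not the real vector type):

```cpp
#include <cstddef>

// Host-side stand-in for CUDA's float4 vector type (4 x 4-byte floats).
struct f4 { float x, y, z, w; };

// Bytes of global memory touched by the S/T computation:
// S reads rows eid..eid+2, T reads rows eid..eid+3, so the
// union is 4 distinct float4 rows.
std::size_t unique_bytes_read() {
    const std::size_t unique_rows = 4;  // eid, eid+1, eid+2, eid+3
    return unique_rows * sizeof(f4);    // 4 * 16 = 64 bytes
}
```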
I previously thought that each global memory read in CUDA fetches a full 128-byte cache line, so reading 1 `float` versus 4 `float4`s should cost the same. However, the memory latency I observe from these two lines is dramatically higher than what I would expect for reading a single `float`.
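My current guess (an assumption on my part, not something I have verified) is that recent GPUs service global loads in 32-byte sectors of the 128-byte line, so each thread pays per sector it touches. A rough back-of-the-envelope traffic model under that assumption:

```cpp
#include <cstddef>

// Rough per-warp traffic model for fully uncoalesced access, assuming
// global loads are serviced in 32-byte sectors (an assumption about
// recent NVIDIA hardware). With random, 32-byte-aligned addresses,
// each thread touches ceil(bytes / 32) distinct sectors.
std::size_t warp_bytes_moved(std::size_t bytes_per_thread) {
    const std::size_t warp_size = 32;
    const std::size_t sector = 32;
    std::size_t sectors_per_thread = (bytes_per_thread + sector - 1) / sector;
    return warp_size * sectors_per_thread * sector;
}
```

Under this model, a scattered 4-byte `float` still moves one full 32-byte sector per thread (1024 bytes per warp for 128 useful bytes), while the 64 bytes of four `float4` rows need two sectors per thread (2048 bytes per warp), i.e. roughly twice the transactions, so the two cases would not cost the same even though both fit within a 128-byte line.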
Below is the output from `nvvp`. I would like to hear your thoughts on strategies to cut the memory-read cost of these two lines. One thing I should mention is that this code implements a Monte Carlo algorithm, so there is very little coalescing between threads due to the random nature of the execution.