I recently converted an OpenCL kernel to CUDA and profiled it with `nvvp`. The profiler showed that the two lines of code below take a heavy toll on the speed of my code. Here are the two lines:

```
#define FL4(f) make_float4(f,f,f,f)
float4 S = FL4(r->vec.x)*normal[eid]+FL4(r->vec.y)*normal[eid+1]+FL4(r->vec.z)*normal[eid+2];
float4 T = normal[eid+3] - (FL4(r->p0.x)*normal[eid]+FL4(r->p0.y)*normal[eid+1]+FL4(r->p0.z)*normal[eid+2]);
```

where each `normal[i]` is a read-only `float4` in global memory. As you can see, I need to read three `normal` values to compute `S` and four `normal` values (three of them overlap with the previous line) to compute `T`. So a total of 64 bytes (four `float4` values) is needed for these two lines, and they account for nearly 90% of the memory latency of my code.
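To make the access pattern concrete, here is a minimal, self-contained sketch of the two lines as a kernel. It is not the actual kernel: the `Ray` struct, the `computeST` kernel, the `eids` array and the `runSingleRay` host helper are all hypothetical names introduced for illustration. It fetches the four `float4` values once into registers via `__ldg` (the read-only data cache load, available on compute capability 3.5 and later) and reuses them for both `S` and `T`. The compiler may already perform this reuse on its own, so this is a possible starting point rather than a guaranteed speedup.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the ray type in the question (only p0 and vec are used).
struct Ray { float3 p0; float3 vec; };

// CUDA does not define arithmetic operators for float4, so the sketch supplies the three it needs.
__host__ __device__ inline float4 operator*(float s, float4 v) {
    return make_float4(s * v.x, s * v.y, s * v.z, s * v.w);
}
__host__ __device__ inline float4 operator+(float4 a, float4 b) {
    return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}
__host__ __device__ inline float4 operator-(float4 a, float4 b) {
    return make_float4(a.x - b.x, a.y - b.y, a.z - b.z, a.w - b.w);
}

// One thread per ray: fetch the four float4 values once, then reuse them for S and T.
__global__ void computeST(const float4* __restrict__ normal, const Ray* __restrict__ rays,
                          const int* __restrict__ eids, float4* S, float4* T, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    const Ray r = rays[i];
    const int eid = eids[i];

    // Four 16-byte loads (64 bytes in total) through the read-only data cache.
    const float4 n0 = __ldg(&normal[eid]);
    const float4 n1 = __ldg(&normal[eid + 1]);
    const float4 n2 = __ldg(&normal[eid + 2]);
    const float4 n3 = __ldg(&normal[eid + 3]);

    // Same arithmetic as the two lines in the question, using the register copies.
    S[i] = r.vec.x * n0 + r.vec.y * n1 + r.vec.z * n2;
    T[i] = n3 - (r.p0.x * n0 + r.p0.y * n1 + r.p0.z * n2);
}

// Hypothetical host helper: runs the kernel for a single ray with eid = 0 and returns S and T.
void runSingleRay(const float4 normalHost[4], Ray ray, float4* Sout, float4* Tout)
{
    float4 *normal, *S, *T;
    Ray* rays;
    int* eids;
    cudaMallocManaged(&normal, 4 * sizeof(float4));
    cudaMallocManaged(&S, sizeof(float4));
    cudaMallocManaged(&T, sizeof(float4));
    cudaMallocManaged(&rays, sizeof(Ray));
    cudaMallocManaged(&eids, sizeof(int));

    for (int k = 0; k < 4; ++k) normal[k] = normalHost[k];
    rays[0] = ray;
    eids[0] = 0;

    computeST<<<1, 32>>>(normal, rays, eids, S, T, 1);
    cudaDeviceSynchronize();  // wait for the kernel before reading managed memory on the host

    *Sout = S[0];
    *Tout = T[0];
    cudaFree(normal); cudaFree(S); cudaFree(T); cudaFree(rays); cudaFree(eids);
}
```

Calling `runSingleRay` with a small hand-made `normal` array is enough to check that the register version produces the same `S` and `T` as the original expressions before profiling it inside the real kernel.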

I previously thought that each global memory read in CUDA fetches a full 128-byte cache line, so that reading one `float` and reading four `float4` values would cost the same. However, the memory latency I observe for these two lines is dramatically higher than what I would expect for reading a single `float`.
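The cache-line reasoning above can be checked with a bit of arithmetic. The sketch below counts how many aligned chunks the 64-byte span `normal[eid]` to `normal[eid+3]` touches, depending on `eid`. The 128-byte line and 32-byte sector sizes are assumptions and should be checked for the target GPU, and the helper names (`chunksTouched`, `linesForNormals`, `sectorsForNormals`) are hypothetical. It also assumes that `normal` itself starts on a 128-byte boundary, which holds for pointers returned by `cudaMalloc`.

```cuda
#include <cstddef>

// Assumed granularities: 128-byte cache line and 32-byte sector; verify for the target GPU.
constexpr std::size_t kLineBytes   = 128;
constexpr std::size_t kSectorBytes = 32;
constexpr std::size_t kFloat4Bytes = 16;  // sizeof(float4)

// Number of aligned chunks of size `chunk` touched by the byte range [first, first + length).
__host__ __device__ constexpr std::size_t chunksTouched(std::size_t first, std::size_t length,
                                                        std::size_t chunk)
{
    return (first + length - 1) / chunk - first / chunk + 1;
}

// Cache lines touched when one thread reads normal[eid] .. normal[eid + 3].
__host__ __device__ constexpr std::size_t linesForNormals(std::size_t eid)
{
    return chunksTouched(eid * kFloat4Bytes, 4 * kFloat4Bytes, kLineBytes);
}

// 32-byte sectors touched by the same four reads.
__host__ __device__ constexpr std::size_t sectorsForNormals(std::size_t eid)
{
    return chunksTouched(eid * kFloat4Bytes, 4 * kFloat4Bytes, kSectorBytes);
}
```

Under these assumptions, the four reads stay within one 128-byte line only when `eid % 8` is at most 4; for `eid % 8` of 5, 6 or 7 they straddle two lines. They also always touch two or three 32-byte sectors, whereas a single `float` touches exactly one, so the two cases are not equivalent even before thread divergence is taken into account.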

Below is the output from `nvvp`. I would like to hear your thoughts on strategies for cutting the memory read cost of these two lines. One thing I should mention is that this code implements a Monte Carlo algorithm, so there is very little coalescing between threads due to the random nature of the execution.