Optimizing two lines of CUDA code to reduce global memory reading overhead

FangQ · April 16, 2020, 4:07am

I recently converted an OpenCL kernel to CUDA, ran nvvp, and found that the two lines of code below take a heavy toll on the speed of my code. Here are the two lines:

#define FL4(f) make_float4(f,f,f,f)
float4 S = FL4(r->vec.x)*normal[eid]+FL4(r->vec.y)*normal[eid+1]+FL4(r->vec.z)*normal[eid+2];
float4 T = normal[eid+3] - (FL4(r->p0.x)*normal[eid]+FL4(r->p0.y)*normal[eid+1]+FL4(r->p0.z)*normal[eid+2]);

where each normal[i] is a read-only float4 in global memory. As you can see, I need to read 3 normals to compute S and 4 normals (3 of which overlap with the previous line) to compute T. So a total of 64 bytes is needed for these two lines, and they account for nearly 90% of the memory latency of my code.
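
For illustration, here is a minimal sketch of one common approach to this access pattern: load the four normals once into registers through the read-only data cache with `__ldg`, then reuse them for both S and T. The `Ray` struct, the float4 operators, `computeST`, `demoKernel` and `runDemo` are hypothetical names introduced for this sketch and are not from the original kernel. `__ldg` needs compute capability 3.5 or later. The compiler may already reuse the overlapping loads by itself, so any gain would have to be confirmed in the profiler.

```cuda
#include <cuda_runtime.h>

#define FL4(f) make_float4(f, f, f, f)

// Hypothetical ray type: only the two fields used in the post are modeled.
struct Ray { float3 vec; float3 p0; };

// Minimal float4 helpers, defined here so the sketch is self-contained.
__host__ __device__ inline float4 operator*(float4 a, float4 b) {
    return make_float4(a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w);
}
__host__ __device__ inline float4 operator+(float4 a, float4 b) {
    return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}
__host__ __device__ inline float4 operator-(float4 a, float4 b) {
    return make_float4(a.x - b.x, a.y - b.y, a.z - b.z, a.w - b.w);
}

// Loads normal[eid..eid+3] once each, then computes S and T from registers.
__device__ inline void computeST(const float4* __restrict__ normal, int eid,
                                 const Ray* r, float4* S, float4* T) {
    // Four 16-byte loads issued up front via the read-only data cache.
    const float4 n0 = __ldg(normal + eid);
    const float4 n1 = __ldg(normal + eid + 1);
    const float4 n2 = __ldg(normal + eid + 2);
    const float4 n3 = __ldg(normal + eid + 3);
    *S = FL4(r->vec.x) * n0 + FL4(r->vec.y) * n1 + FL4(r->vec.z) * n2;
    *T = n3 - (FL4(r->p0.x) * n0 + FL4(r->p0.y) * n1 + FL4(r->p0.z) * n2);
}

__global__ void demoKernel(const float4* __restrict__ normal, const int* eid,
                           const Ray* rays, float4* S, float4* T, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) computeST(normal, eid[i], &rays[i], &S[i], &T[i]);
}

static bool same(float4 a, float4 b) {
    return a.x == b.x && a.y == b.y && a.z == b.z && a.w == b.w;
}

// Runs the kernel on small integer-valued test data and compares against the
// original two-line formula evaluated on the host.
// Returns the number of mismatches, or -1 if the kernel failed to run.
int runDemo() {
    const int n = 2, numNormals = 8;
    float4 *normal, *S, *T;
    int* eid;
    Ray* rays;
    cudaMallocManaged(&normal, numNormals * sizeof(float4));
    cudaMallocManaged(&S, n * sizeof(float4));
    cudaMallocManaged(&T, n * sizeof(float4));
    cudaMallocManaged(&eid, n * sizeof(int));
    cudaMallocManaged(&rays, n * sizeof(Ray));
    for (int k = 0; k < numNormals; ++k)
        normal[k] = make_float4(k, k + 1, k + 2, k + 3);
    for (int i = 0; i < n; ++i) {
        eid[i] = 4 * i;
        rays[i].vec = make_float3(1.f, 2.f, 3.f);
        rays[i].p0 = make_float3(4.f, 5.f, 6.f);
    }
    demoKernel<<<1, 32>>>(normal, eid, rays, S, T, n);
    int mismatches = (cudaDeviceSynchronize() == cudaSuccess) ? 0 : -1;
    for (int i = 0; mismatches >= 0 && i < n; ++i) {
        const int e = eid[i];
        const Ray* r = &rays[i];
        float4 s = FL4(r->vec.x) * normal[e] + FL4(r->vec.y) * normal[e + 1] + FL4(r->vec.z) * normal[e + 2];
        float4 t = normal[e + 3] - (FL4(r->p0.x) * normal[e] + FL4(r->p0.y) * normal[e + 1] + FL4(r->p0.z) * normal[e + 2]);
        if (!same(S[i], s) || !same(T[i], t)) ++mismatches;
    }
    cudaFree(normal); cudaFree(S); cudaFree(T); cudaFree(eid); cudaFree(rays);
    return mismatches;
}
```

Calling `runDemo()` from a `main()` should return 0 when the register-based version matches the original formula. The sketch only changes how the data is fetched, not how much is fetched: each thread still needs all 64 bytes.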

I previously thought that each global memory read in CUDA fetches a 128-byte cache line, so reading 1 float and reading 4 float4s would cost the same. However, the memory latency I observe for these two lines is dramatically higher than what I would expect for reading a single float.
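
One rough way to reason about the difference is to count how many memory sectors a single thread touches. The sketch below assumes 32-byte sectors and a sector-aligned base address for `normal`. Both are assumptions to check against the target GPU, and `sectorsTouched` and `sectorsForNormals` are hypothetical helper names.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Number of sectors touched by `bytes` contiguous bytes starting at byte
// `offset`, measured from a sector-aligned base address.
__host__ __device__ inline unsigned sectorsTouched(size_t offset, size_t bytes) {
    const size_t kSector = 32;  // assumed sector size in bytes
    const size_t first = offset / kSector;
    const size_t last = (offset + bytes - 1) / kSector;
    return (unsigned)(last - first + 1);
}

// Sectors touched by one thread reading normal[eid..eid+3] (4 x float4 = 64 bytes).
__host__ __device__ inline unsigned sectorsForNormals(int eid) {
    return sectorsTouched((size_t)eid * sizeof(float4), 4 * sizeof(float4));
}
```

Under these assumptions a single float touches 1 sector, while the four normals touch 2 sectors when eid is even and 3 when it is odd. With 32 threads in a warp each reading from an unrelated eid, that is up to 96 sectors per warp, compared with 4 sectors for a warp that reads 32 consecutive floats.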

Below is the output from nvvp. I would like to hear your thoughts on strategies to cut the memory read cost of these two lines. One thing I should mention is that this code implements a Monte Carlo algorithm, so there is very little coalescing between threads due to the random nature of the execution.
