I rewrote a grid-based PDE solver to run on CUDA from its serial version, and unfortunately am not getting the performance I would like. Going from a 3.3 GHz i5 to either a GTX 460 or a Tesla M2070, I only see about a 2x speedup. Originally I thought this would be fixed by buffering each block’s global memory accesses in shared memory, but compute capability 2.x devices already cache global accesses, and this method produced no speedup. I hear that ensuring coalesced global reads is important, but I am not sure exactly how to do this… I’ve looked at the Programming Guide but haven’t been able to work out the best practice on my own. Is this a likely cause of my simulation’s poor performance, and if so, how do I make sure my accesses are coalesced?
Also, are there any other likely culprits for this low speedup?
PS Here is the basic structure of the code (which I run for enough timesteps to minimize any effect the initialization sequence might have on performance)
-Initialize - construct initial condition, write to GPU, etc.
-Main loop - for each timestep,
- update boundary conditions (1 kernel)
- integrate (4 kernels – I do this to prevent race conditions) This is the bulk of the computational work. The data from the cudaMemcpy operations is read into 1D global vectors and written back into the arrays the host sees at the end of the integration.
- find a new timestep (1 kernel)
As a general rule, each grid element is updated by its own thread. blockDim.x = 128. Nothing too fancy going on.
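To make the structure concrete, the host side of the main loop looks roughly like this (kernel and variable names are placeholders, not the actual code):

```cuda
// Sketch of the timestep loop described above; all names are illustrative.
for (int step = 0; step < nSteps; ++step) {
    update_boundaries<<<gridBC,  128>>>(d_u);
    // Four integration stages, split into separate kernels to avoid
    // race conditions; each stage re-reads its inputs from global memory.
    integrate_stage1<<<grid, 128>>>(d_u,   d_tmp);
    integrate_stage2<<<grid, 128>>>(d_tmp, d_u);
    integrate_stage3<<<grid, 128>>>(d_u,   d_tmp);
    integrate_stage4<<<grid, 128>>>(d_tmp, d_u);
    compute_dt<<<grid, 128>>>(d_u, d_dtPartial);  // reduction for the new timestep
}
```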
Try minimizing the data movements in global memory.
Separating work into multiple kernels is good if it avoids global synchronization problems, but it means that data needs to be read (and possibly written) multiple times. Device memory is about 10× faster than host memory, but if 5 kernels lead to the data being accessed 5× as often as on the host, only a 2× speedup remains.
Why do you do this? Particularly, why do you do it for every timestep and not just at the beginning and end of a solver run?
Also, if you need the data to have different layouts on host and device, you can avoid one roundtrip through device memory by mapping the host memory to the device and doing host->device copy and data reshuffling in a single kernel.
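A minimal sketch of that mapped-memory approach, assuming the device reports canMapHostMemory (the kernel and buffer names are made up for illustration):

```cuda
// Zero-copy: map pinned host memory into the device address space so a
// single kernel can read host data and write the device-layout copy.
float *h_data, *d_hostView, *d_deviceLayout;

cudaSetDeviceFlags(cudaDeviceMapHost);          // must come before context creation
cudaHostAlloc(&h_data, n * sizeof(float), cudaHostAllocMapped);
cudaHostGetDevicePointer(&d_hostView, h_data, 0);
cudaMalloc(&d_deviceLayout, n * sizeof(float));

// reshuffle_kernel streams the data over PCIe once and stores it in the
// device layout, replacing a cudaMemcpy followed by a separate reshuffle pass.
reshuffle_kernel<<<(n + 127) / 128, 128>>>(d_hostView, d_deviceLayout, n);
```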
The program uses a fair amount of higher-order interpolation, so the multiple kernel calls are pretty hard to get around. Point about re-accessing global memory is duly noted though.
The solver outputs the solution for given timestep values periodically, and different kernels use their own 1D vectors; maybe it’s not perfect, but I’m hoping it’s not a major issue right now.
I checked my program in the Visual Profiler and it reports that local memory cache misses are high and that my access pattern is not coalesced. I was sort of expecting this because the data is stored in the global vectors as structs of 7 float values… is there a way to deal with this effectively in CUDA (padding etc.) or should I really split these structs up into vectors of each component?
It depends a lot on how you access the 7 floats.
Rearranging arrays of structs into structs of arrays probably is the best solution, but padding the struct to 8 floats and accessing them as 2 float4 vectors may already get you some of the way, if all 7 floats are read by the same thread.
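The padded-struct idea looks something like this (a sketch; the struct and kernel names are invented, and I’m assuming each thread reads all 7 values of its own element):

```cuda
// Pad 7 floats to 8 so each element is two 16-byte-aligned float4s.
// A thread then issues two aligned 16-byte loads instead of seven
// scattered 4-byte loads.
struct __align__(16) Cell {
    float4 lo;   // components 0-3
    float4 hi;   // components 4-6; hi.w is padding, never used
};

__global__ void integrate(const Cell* __restrict__ in, Cell* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float4 a = in[i].lo;   // one vectorized load
    float4 b = in[i].hi;   // one vectorized load
    // ... update a.x..a.w and b.x..b.z here ...
    out[i].lo = a;
    out[i].hi = b;
}
```

With a struct-of-arrays layout instead, each of the 7 components would live in its own float array, so consecutive threads read consecutive addresses and every load is fully coalesced; the padded AoS version is just the cheaper change to try first.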
I’m a bit surprised you get a lot of local memory cache misses though. How much local memory (automatic variables) do you use?