Write to global memory with grouping?


I am wondering whether I need to manually group data when writing to global memory from a kernel to improve efficiency.

This question came up while working on my Monte Carlo simulator: when a particle hits the boundary, I need to save a set of metrics to global memory. The number of metrics depends on the user's input; for example, users can choose to save the propagation time, the path length, etc.

I am planning to use shared memory to accumulate the metrics inside each thread, and then, when a particle hits the boundary, dump these numbers to global memory with a loop:



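(The original code snippet was not included; the following is a minimal sketch of the kind of per-thread dump loop described above. The names `shared_metric`, `global_metric`, and `numcount` come from the question; the `record_counter`/`base` indexing scheme is an assumption.)

```cuda
// Inside the kernel, after a particle hits the boundary.
// shared_metric : per-thread metrics staged in shared memory
// global_metric : output buffer in global memory
// numcount (1..6): number of metrics the user chose to save
// record_counter: assumed global counter used to claim an output slot
extern __shared__ float shared_metric[];      // numcount floats per thread
int tid  = threadIdx.x;
int base = atomicAdd(record_counter, 1);      // this particle's record index
for (int i = 0; i < numcount; i++)
    global_metric[base * numcount + i] = shared_metric[tid * numcount + i];
```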
numcount (1–6) is passed in from outside. For this kind of write, should I make shared_metric/global_metric a float4 to reduce the total number of write transactions, or will CUDA automatically group the writes for the best performance? Are there any potential efficiency issues?



You don’t need to make the type a float4, but structures of 4, 8, or 16 bytes will indeed be more efficient to write than other sizes.
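For instance, when four metrics are saved, they can be packed into a single 16-byte store (a sketch only, reusing the hypothetical `shared_metric`/`global_metric`/`base` names from above; it assumes `numcount == 4`):

```cuda
// Hypothetical: with numcount == 4, one float4 store issues a single
// 16-byte write instead of four separate 4-byte writes.
// Note: the float4 store requires global_metric + base*4 to be
// 16-byte aligned, which holds if global_metric itself is aligned.
float4 v = make_float4(shared_metric[0], shared_metric[1],
                       shared_metric[2], shared_metric[3]);
reinterpret_cast<float4 *>(global_metric)[base] = v;
```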

Are you writing from one thread only, or from all threads in a warp? If you write from one thread only (and you are on a compute capability 1.2 or higher device), try using multiple threads to write all of the structs in parallel.
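A sketch of that warp-cooperative variant, where several lanes of a warp each write one element of a record so that consecutive lanes touch consecutive addresses (names and indexing are assumptions carried over from the question):

```cuda
// Hypothetical warp-cooperative dump: lanes 0..numcount-1 each write one
// element of the record, so the stores land on consecutive addresses and
// can coalesce into a single memory transaction.
int lane = threadIdx.x & 31;                  // lane index within the warp
if (lane < numcount)
    global_metric[base * numcount + lane] = shared_metric[lane];
```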