Speeding up memory transfer to global memory


I use a struct:

struct DBDStruct {
    float Temperature;
    float DensO;
    float DensO3;
    float DensN;
    float DensNO;
    float DensN2A;
};

to save values located on gridpoints. Every gridpoint consists of those six float values and is calculated by its own thread. All the structs are saved in global memory space, so every thread reads an “old” struct value from its gridpoint, calculates a new one and writes it back to global memory.

The new values are calculated in dummy variables by the kernel, like

float DummyDensO = …
float DummyDensN = …

and so on, and later written to global memory.

When the calculation of the six values in my kernel is finished, I try to speed up the memory access by trying different ways of writing back to global memory. Surprisingly, all of them take nearly the same time. Perhaps someone can help me…

  1. First I tried just:

GridValues[GridPointIndexGlobal].DensO = DummyDensO;
GridValues[GridPointIndexGlobal].DensN = DummyDensN;
GridValues[GridPointIndexGlobal].DensNO = DummyDensNO;
GridValues[GridPointIndexGlobal].DensO3 = DummyDensO3;
GridValues[GridPointIndexGlobal].DensN2A = DummyDensN2A;
GridValues[GridPointIndexGlobal].Temperature = DummyTemperature;

I thought this would cost a lot of time, because there are 6 independent memory accesses.

  2. Then I tried changing the “Dummy” variables from 6 floats to a single struct of the same type, like:

struct DBDStruct DummyGridValue;


GridValues[GridPointIndexGlobal] = DummyGridValue;

Here I expected only one memory access, but both ways take nearly the same time. How is the second one implemented internally? Does it just do the same as the first one?
Is there a fast way, like “copy this amount of bytes from here to there”, which I can call from the device?

Thx for any help!!!

Consider writing your own device function… (By the way, I think we discussed the gridpoint problem before… or maybe I am wrong.)

Hi, this is a different problem I am working on, solving different equations…

I think the performance in this case has everything to do with coalescing. Generally speaking, adjacent threads should access adjacent locations, and be aligned with memory for coalescing to happen. The programming guide goes into more detail.

An operation like this:
GridValues[GridPointIndexGlobal].DensO = DummyDensO;

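translates into accesses to non-adjacent locations, because the DensO fields of neighbouring grid points are a whole struct (six floats, 24 bytes) apart. Adjacent threads therefore cannot be combined into one transaction, and each thread issues a separate memory transaction.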
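To make the spacing concrete, here is a minimal sketch that computes the byte offset each thread would write to. The helper names aosDensOOffset and plainFloatOffset are made up for illustration, and the struct is assumed to be laid out as six packed floats.

```cuda
#include <cstddef>
#include <cstdio>

struct DBDStruct {
    float Temperature;
    float DensO;
    float DensO3;
    float DensN;
    float DensNO;
    float DensN2A;
};

// Hypothetical helper: byte offset of GridValues[tid].DensO
// from the start of the array of structs.
__host__ __device__ inline size_t aosDensOOffset(size_t tid)
{
    return tid * sizeof(DBDStruct) + offsetof(DBDStruct, DensO);
}

// Hypothetical helper: byte offset of element tid in a plain float array,
// shown only for comparison.
__host__ __device__ inline size_t plainFloatOffset(size_t tid)
{
    return tid * sizeof(float);
}

int main()
{
    // Print the offsets for the first four threads of a warp.
    for (size_t tid = 0; tid < 4; ++tid) {
        printf("thread %zu: struct field at byte %zu, plain float at byte %zu\n",
               tid, aosDensOOffset(tid), plainFloatOffset(tid));
    }
    return 0;
}
```

For threads 0 to 3 the struct field lands at bytes 4, 28, 52 and 76, while a plain float array would give 0, 4, 8 and 12. The 24-byte gaps are what prevent the writes from being combined.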

If instead of a single array of structures, you had six arrays, then you could do something like this:
GridValuesDensO[GridPointIndexGlobal] = DummyDensO;
GridValuesDensO3[GridPointIndexGlobal] = DummyDensO3;
GridValuesDensN[GridPointIndexGlobal] = DummyDensN;
GridValuesTemperature[GridPointIndexGlobal] = DummyTemperature;
GridValuesDensNO[GridPointIndexGlobal] = DummyDensNO;
GridValuesDensN2A[GridPointIndexGlobal] = DummyDensN2A;

Then (assuming alignment is okay) each would become a single (larger) memory transaction for the entire half warp, which would give much higher throughput.
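A rough sketch of what the six-array version could look like as a complete program follows. The names DBDArrays and updateGrid, the grid size of 1024 points and the "add 1" update are all placeholders I made up. The real calculation would go where the placeholder is, and I have not timed this against the struct version.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical structure-of-arrays view: one device pointer per field.
struct DBDArrays {
    float *Temperature, *DensO, *DensO3, *DensN, *DensNO, *DensN2A;
};

__global__ void updateGrid(DBDArrays g, int numPoints)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numPoints) return;

    // Adjacent threads read adjacent floats from each array.
    float DummyTemperature = g.Temperature[i];
    float DummyDensO       = g.DensO[i];
    float DummyDensO3      = g.DensO3[i];
    float DummyDensN       = g.DensN[i];
    float DummyDensNO      = g.DensNO[i];
    float DummyDensN2A     = g.DensN2A[i];

    // Placeholder for the real calculation (assumption: just add 1).
    DummyTemperature += 1.0f;
    DummyDensO       += 1.0f;
    DummyDensO3      += 1.0f;
    DummyDensN       += 1.0f;
    DummyDensNO      += 1.0f;
    DummyDensN2A     += 1.0f;

    // Adjacent threads write adjacent floats to each array.
    g.Temperature[i] = DummyTemperature;
    g.DensO[i]       = DummyDensO;
    g.DensO3[i]      = DummyDensO3;
    g.DensN[i]       = DummyDensN;
    g.DensNO[i]      = DummyDensNO;
    g.DensN2A[i]     = DummyDensN2A;
}

int main()
{
    const int numPoints = 1024;  // assumed grid size
    const size_t fieldBytes = numPoints * sizeof(float);

    // One allocation holding the six arrays back to back, zero-initialised.
    float *base = nullptr;
    if (cudaMalloc(&base, 6 * fieldBytes) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(base, 0, 6 * fieldBytes);

    DBDArrays g = { base,
                    base + 1 * numPoints,
                    base + 2 * numPoints,
                    base + 3 * numPoints,
                    base + 4 * numPoints,
                    base + 5 * numPoints };

    const int threads = 256;
    const int blocks = (numPoints + threads - 1) / threads;
    updateGrid<<<blocks, threads>>>(g, numPoints);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        printf("kernel failed\n");
        cudaFree(base);
        return 1;
    }

    // Copy one field back to check the result on the host.
    std::vector<float> densO(numPoints);
    cudaMemcpy(densO.data(), g.DensO, fieldBytes, cudaMemcpyDeviceToHost);
    printf("DensO[0] = %f\n", densO[0]);

    cudaFree(base);
    return 0;
}
```

Keeping the six pointers in a small struct passed by value means the kernel signature stays short, while each field still lives in its own contiguous array.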