Coalescing Custom Data Structures

Hello everyone!

Let’s say I have the following custom data structure:

typedef struct

{

  int idata;

  float fdata;

}custom_type_t;

and the following kernel:

__global__ void kernel(custom_type_t *data, int stride)

{

  int cx = (blockIdx.x * blockDim.x) + threadIdx.x;

  int cy = (blockIdx.y * blockDim.y) + threadIdx.y;

int idx = (cy * stride) + cx;

data[idx].idata = 23;

  data[idx].fData = 2.0;

}

The data array has 64x64 elements and the execution configuration is a grid of 4x4 Blocks with each block having

16x16 threads. So it should be possible to coalesce the 16x8 byte transactions of a half-warp into 1x128 byte

transaction IIRC.

When I am running this kernel on my Geforce 9600 GT the profiler output tells me that all global stores are incoherent.

gputime=[ 12.672 ] cputime=[ 28.216 ] gridSize=[ 4, 4 ] blockSize=[ 16, 16, 1 ] occupancy=[ 1.000 ] gst_coherent=[ 0 ] gst_incoherent=[ 4096 ]

I already tried to add “align(8)” but that didn’t help. After having a look at the alignedTypes-Samples, I noticed that

the variables of the structures do all have the same type. Do I have to pass the structure members as seperate arrays or

is there another way to achieve coalescing with the above data structure?

__global__ void kernel(int *data, float *fdata, int stride)

(...)

When using the following data structure all global stores are coherent:

typedef struct __align__(8)

{

  float idata; // changed from int to float

  float fdata;

}custom_type_t;

Profiler Output:

gputime=[ 7.168 ] cputime=[ 20.952 ] gridSize=[ 4, 4 ] blockSize=[ 16, 16, 1 ] occupancy=[ 1.000 ] gst_coherent=[ 512 ] gst_incoherent=[ 0 ]

Thanks a lot in advance for your help!

Bye,

kameo

Yep, it’s uncoalesced… you’re writing every other word, then you go back and write the remaining words.

Two strategies to fix this.

Best idea: make an array in shared memory of perhaps 16 custom_type_ts. Update those, then write the whole “stripe” of 32 words to global memory as words (each thread writing one word).
This is powerful because it will adapt to many situations including odd structure sizes, updating parts of structures, etc.

The other approach sometimes works and that’s to have each thread fill in a copy of the structure stored in registers, then write the whole structure at once.
This would likely work with your example but it’s not as flexible in general (and I’m not sure if 64 bits-per-thread writes are coalesced in device 1.0).

Advanced point: Some people have measured slightly higher write bandwidth when writing 64 bits per thread. Try different sizes (16, 32, 64, 128) to see if it helps your bandwidth.