Coalescing Custom Data Structures

kameo · September 2, 2009, 10:36pm

Hello everyone!

Let’s say I have the following custom data structure:

typedef struct

{

  int idata;

  float fdata;

}custom_type_t;

and the following kernel:

__global__ void kernel(custom_type_t *data, int stride)

{

  int cx = (blockIdx.x * blockDim.x) + threadIdx.x;

  int cy = (blockIdx.y * blockDim.y) + threadIdx.y;

int idx = (cy * stride) + cx;

data[idx].idata = 23;

  data[idx].fData = 2.0;

}

The data array has 64x64 elements and the execution configuration is a grid of 4x4 Blocks with each block having

16x16 threads. So it should be possible to coalesce the 16x8 byte transactions of a half-warp into 1x128 byte

transaction IIRC.

When I am running this kernel on my Geforce 9600 GT the profiler output tells me that all global stores are incoherent.

gputime=[ 12.672 ] cputime=[ 28.216 ] gridSize=[ 4, 4 ] blockSize=[ 16, 16, 1 ] occupancy=[ 1.000 ] gst_coherent=[ 0 ] gst_incoherent=[ 4096 ]

I already tried to add “align(8)” but that didn’t help. After having a look at the alignedTypes-Samples, I noticed that

the variables of the structures do all have the same type. Do I have to pass the structure members as seperate arrays or

is there another way to achieve coalescing with the above data structure?

__global__ void kernel(int *data, float *fdata, int stride)

(...)

When using the following data structure all global stores are coherent:

typedef struct __align__(8)

{

  float idata; // changed from int to float

  float fdata;

}custom_type_t;

Profiler Output:

gputime=[ 7.168 ] cputime=[ 20.952 ] gridSize=[ 4, 4 ] blockSize=[ 16, 16, 1 ] occupancy=[ 1.000 ] gst_coherent=[ 512 ] gst_incoherent=[ 0 ]

Thanks a lot in advance for your help!

Bye,

kameo

SPWorley · September 2, 2009, 10:55pm

Yep, it’s uncoalesced… you’re writing every other word, then you go back and write the remaining words.

Two strategies to fix this.

Best idea: make an array in shared memory of perhaps 16 custom_type_ts. Update those, then write the whole “stripe” of 32 words to global memory as words (each thread writing one word).
This is powerful because it will adapt to many situations including odd structure sizes, updating parts of structures, etc.

The other approach sometimes works and that’s to have each thread fill in a copy of the structure stored in registers, then write the whole structure at once.
This would likely work with your example but it’s not as flexible in general (and I’m not sure if 64 bits-per-thread writes are coalesced in device 1.0).

Advanced point: Some people have measured slightly higher write bandwidth when writing 64 bits per thread. Try different sizes (16, 32, 64, 128) to see if it helps your bandwidth.

Topic		Replies	Views
hwo to make float2 and float4 data coalesced? CUDA Programming and Performance	1	3577	May 27, 2008
Aligned structures problem CUDA Programming and Performance	1	1307	July 31, 2008
coalesced access of a struct of double's is this rite? CUDA Programming and Performance	14	7935	June 29, 2009
Additional requirements for coalesced reads/writes structs must map to built-in type CUDA Programming and Performance	7	3194	December 31, 2007
Getting coalesced reads and stores using a struct CUDA Programming and Performance	3	3460	April 20, 2010
Structure of mixed types, Coalescing not possible? CUDA Programming and Performance	4	3022	April 7, 2009
Coalesced structures possible? CUDA Programming and Performance	3	2157	February 1, 2016
Coalesced Memory Access to Structs CUDA Programming and Performance	11	4713	September 19, 2009
coalescing struct loading problem CUDA Programming and Performance	21	12849	March 5, 2010
global memory coalescing data accessing problem CUDA Programming and Performance	0	1087	July 31, 2008

Coalescing Custom Data Structures

Related topics