Want to know more details about memory coalescing

Hi, everyone

As the CUDA programming guide says, a global memory access can be coalesced if all threads of a half-warp satisfy certain conditions. But someone said one memory access can read hundreds of bytes, so what I want to know is how many bytes are really read from global memory when a memory access is coalesced: 64 bytes, 128 bytes, or much more? If hundreds of bytes are read, does that mean lots of unused bytes will be dropped?

Furthermore, in my code below, if the “__syncthreads()” after the address computation is commented out, the program crashes; otherwise, everything is all right. It seems that computing the address without synchronizing confuses the addressing. Can anybody give me an explanation of this?

Thanks in advance

My code is as follows:

[codebox]__global__ void get_1itemset_Count6( int stream_start, int itemset_num, int trans_num, int *dk_buffer, ITEMCNT *dk_freq_itemset_cnt)
{   // blockDim.x must be a multiple of 16, or better, of 32
    extern __shared__ int sdata[];              // sdata[blockDim.x]

    unsigned int item_idx = stream_start + blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int item_cnt = 0, address = 0, tran_begin = 0, tran_sz;
    unsigned long i;                            // number of transactions, should be less than 2^32
    unsigned int j;

    // first load
    sdata[threadIdx.x] = dk_buffer[threadIdx.x];
    __syncthreads();

    // count, and read the next transaction data block if necessary
    tran_sz = sdata[1];                         // sdata[1]: stores the size of the current transaction
    for(i = 0; i < trans_num; i++) {
        for(j = 0; j < tran_sz; j++)            // both branches will be executed
            if(sdata[tran_begin + 2 + j] == item_idx) item_cnt++;
        tran_begin += (tran_sz + 2);
        if(tran_begin+1 > blockDim.x-1 || tran_begin + sdata[tran_begin+1] + 1 > blockDim.x-1) {
            address = address + (tran_begin & (~15));   // step back to a 16-int (64-byte) boundary
            __syncthreads();
            sdata[threadIdx.x] = dk_buffer[address + threadIdx.x];
            __syncthreads();
            tran_begin = tran_begin & (16-1);   // tran_begin % 16
        }
        tran_sz = sdata[tran_begin + 1];
    }
…[/codebox]

About your second question: you do need a __syncthreads() there, since every thread will be using every value in shared memory. You therefore need to make sure every value is visible to every thread, and not just the value each thread itself just wrote to shared memory.

About your first question: I was about to answer, only to find out I didn’t know the answer.
As per the programming guide, the memory interface can fetch 32, 64, or 128 bytes in one transaction. So if all threads of a half-warp are reading a float value, you need 16 × 32/8 = 64 bytes to service the 16 threads.
If you’re not doing coalesced reads, I’m guessing it’ll still read at least 64 bytes (on compute 1.0 and 1.1) and discard the rest.
Where I hit the wall is the fact that the memory bus is 384 bits wide… so where are those 64 or 128 bytes coming from?
Well… now that I think about it, it takes several clock cycles to execute one memory transaction, so there we go: over 4 clock cycles there is enough bandwidth to read 128 bytes.

Hope this helps!

edit: I read the original code too quickly; my remark about the __syncthreads() was referring to the first one, not the one in the “if”.

Using a __syncthreads() in an if block can be hazardous and lead to deadlocks, or so I was told ;)

Yes, __syncthreads() in an if() is dangerous. If the warp diverges, not all threads execute __syncthreads(), and you get a deadlock.

To answer the OP: the __syncthreads() in the code is not after the address computation, but after the read from shared memory that is inside the if statement itself.

if( sdata[tran_begin+1] > blockDim.x-1 ) {							

	__syncthreads();				

	sdata[threadIdx.x] = dk_buffer[address + threadIdx.x];

	__syncthreads();

}

SAME AS:

int val = sdata[tran_begin+1];

__syncthreads();

if( val > blockDim.x-1 ) {

	sdata[threadIdx.x] = dk_buffer[address + threadIdx.x];

}

__syncthreads();

You should use the 2nd version of the code, to remove possibility of deadlock.

Btw, to be better understood, keep in mind

‘annotated’ => commented

‘address computing’ => address computation

‘crash down’ => crash

I am aware that __syncthreads() in an if() is dangerous, and I make sure that all of the threads in a warp go through the same execution path in my program, but I didn’t realize the write-after-read problem, which was my mistake.

Thanks to all, especially to alex_dubinsky for your code and language advice. English isn’t my mother tongue, but I will make more of an effort to learn, as I think communicating with people from different countries is a good thing. ^_^