Want to know more details about memory coalescing

Hi, everyone

As the CUDA programming guide says, a global memory access can be coalesced if all threads of a half-warp satisfy certain conditions. But someone said one memory access can read hundreds of bytes, so what I want to know is how many bytes are really read from global memory when a memory access is coalesced: 64 bytes, 128 bytes, or much more? If hundreds of bytes are read, does that mean lots of unused bytes will be dropped?

Furthermore, in my code below, if the “__syncthreads()” after the address computation is commented out, the program crashes; otherwise, everything is all right. It seems that computing the address without synchronizing confuses the addressing. Can anybody give me an explanation of this?

Thanks in advance

My code is as follows:

[codebox]__global__ void get_1itemset_Count6( int stream_start, int itemset_num, int trans_num, int *dk_buffer, ITEMCNT *dk_freq_itemset_cnt)
{   // blockDim.x must be a multiple of 16, or better, of 32
    extern __shared__ int sdata[];              // sdata[blockDim.x]

    unsigned int item_idx = stream_start + blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int item_cnt = 0, address = 0, tran_begin = 0, tran_sz;
    unsigned long i;                            // number of transactions, should be less than 2^32
    unsigned int j;

    // first load
    sdata[threadIdx.x] = dk_buffer[threadIdx.x];
    __syncthreads();

    // count, and read the next transaction data block if necessary
    tran_sz = sdata[1];                         // sdata[1]: stores the size of the current transaction
    for(i = 0; i < trans_num; i++) {
        for(j = 0; j < tran_sz; j++)            // both branches will be executed
            if(sdata[tran_begin + 2 + j] == item_idx) item_cnt++;
        tran_begin += (tran_sz + 2);
        if(tran_begin+1 > blockDim.x-1 || tran_begin + sdata[tran_begin+1] + 1 > blockDim.x-1) {
            address = address + (tran_begin & (~15));   // step back to a 16-int (64-byte) boundary
            __syncthreads();
            sdata[threadIdx.x] = dk_buffer[address + threadIdx.x];
            __syncthreads();
            tran_begin = tran_begin & (16-1);   // tran_begin % 16
        }
        tran_sz = sdata[tran_begin + 1];
    }
…[/codebox]

About your second question: you do need a __syncthreads() there, since every thread will be using every value in shared memory. You therefore need to make sure every value is visible to every thread, and not just the value each thread itself just wrote to shared memory.

About your first question: I was about to answer, only to find out I didn’t know the answer.
As per the programming guide, the memory interface can fetch 32, 64, or 128 bytes in one transaction. So if all threads of a half-warp are reading a float value, you need 16 × 32/8 = 64 bytes to service the 16 threads.
If you’re not doing coalesced reads, I’m guessing it’ll still read at least 64 bytes (on compute 1.0 and 1.1) and discard the rest.
Where I hit the wall is the fact that the memory bus is 384 bits wide… so where are those 64 or 128 bytes coming from?
Well… now that I think about it, it takes several clock cycles to execute one memory transaction, so there we go: over 4 clock cycles there is enough bandwidth to read 128 bytes.

Hope this helps!

edit: I read the original code too quickly; my remark about the __syncthreads() was referring to the first one, not the one in the “if”.

Using a __syncthreads() in an if block can be hazardous and lead to deadlocks, or so I was told ;)

Yes, __syncthreads() in an if() is dangerous. If the warp diverges, not all threads execute __syncthreads(), and you get a deadlock.

To answer the OP: the __syncthreads() in the code is not after the address computation, but after the read from shared memory that is inside the if statement itself.

if( sdata[tran_begin+1] > blockDim.x-1 ) {							

	__syncthreads();				

	sdata[threadIdx.x] = dk_buffer[address + threadIdx.x];

	__syncthreads();

}

SAME AS:

int val = sdata[tran_begin+1];

__syncthreads();

if( val > blockDim.x-1 ) {

	sdata[threadIdx.x] = dk_buffer[address + threadIdx.x];

}

__syncthreads();

You should use the 2nd version of the code, to remove possibility of deadlock.

Btw, to be better understood, keep in mind

‘annotated’ => commented

‘address computing’ => address computation

‘crash down’ => crash

I am aware that __syncthreads() in an if() is dangerous, and I make sure that all of the threads in a warp go through the same execution path in my program, but I didn’t realize the write-after-read problem, which was my mistake.

Thanks to all, especially to alex_dubinsky for your code and language advice. English isn’t my mother tongue, but I will make more of an effort to learn, as I think communicating with people from different countries is a good thing. ^_^