How do coalesced writes work?

I have a problem understanding the performance of random writes.
Say threads t0, t1, t2, and t3 in a warp write to locations x0, x1, x0, x1 (x0 != x1). Will the accesses to x0 by t0 and t2 be coalesced, or combined into a single access?
In my experiments it looks like they are coalesced. However, this does not match the coalescing rules in the programming guide. Thanks.
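
A toy sketch of the pattern I mean (the kernel name and the concrete values of x0 and x1 are made up just for illustration):

__global__ void toy_scatter(int *d_out, const int *d_in)
{
    int tid = threadIdx.x;                 // t0..t3
    int x   = (tid % 2 == 0) ? 0 : 100;    // t0,t2 write x0; t1,t3 write x1 (x0 != x1)
    d_out[x] = d_in[tid];
}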

On G80 you will lose coalescing in this case. Threads should access sequential memory addresses in order to be coalesced. The coalescing rules are pretty clearly described in the programming guide.
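
Roughly speaking (a minimal sketch, not your actual code), the first write below coalesces and the second does not:

__global__ void write_patterns(int *d_out, const int *d_in, const int *d_loc)
{
    int tid = threadIdx.x;
    d_out[tid]        = d_in[tid];  // sequential addresses across the half-warp: coalesced
    d_out[d_loc[tid]] = d_in[tid];  // arbitrary addresses: not coalesced on G80
}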

Mark

thanks, Mark.

My further question is how many data transfers there are:
(a) two fetches to x0 and two fetches to x1; or
(b) one fetch to x0 and one fetch to x1?

Your question is unclear – are you talking about specific code?

Mark

Okay, we have code like the following:

int d_loc[4] = {0, 1, 100, 101};  // the write locations

The kernel (suppose we only have four threads in a thread block):

__global__ void scatter(int *d_input, int *d_output, int *d_loc)
{
    int tid = threadIdx.x;  // tid is the thread id within the block
    d_output[d_loc[tid]] = d_input[tid];
}
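
On the host I launch it roughly like this (a simplified sketch; d_input, d_output, and d_loc are device pointers I allocated with cudaMalloc, and the real data is much larger):

scatter<<<1, 4>>>(d_input, d_output, d_loc);  // one block of four threads for this toy example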

Since d_output[0] and d_output[1] are close to each other in device memory, as are d_output[100] and d_output[101], my question is: are there four data fetches from device memory, or two fetches (considering only the writes)?

It doesn’t make sense to discuss “fetches considering only writes”. When I think of fetch, I think of read, not write.

In your code, there are four reads and four writes.

Note that with < 16 threads per block, you’ll never get coalescing. Coalescing occurs on each half-warp = 16 threads.
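
For a real test you’d want something like this (a rough sketch, assuming the arrays and d_loc hold N elements and tid is computed as blockIdx.x * blockDim.x + threadIdx.x):

scatter<<<N / 256, 256>>>(d_input, d_output, d_loc);  // 256 threads per block = 16 half-warps, so coalescing can apply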

Mark

Thanks. I understand the “16” rule; I just used four for simplicity.

The problem is that if I perform the scatter in multiple passes, the performance is much better than in a single pass.

Given the code:

__global__ void scatter_opt(int *d_input, int *d_output, int *d_loc, int from, int to)
{
    int tid = threadIdx.x;  // tid is the thread id within the block
    if (d_loc[tid] >= from && d_loc[tid] < to)
        d_output[d_loc[tid]] = d_input[tid];
}

In our multi-pass scheme, I call scatter_opt twice: first with (from = 0, to = 2), then with (from = 100, to = 102). The performance is much better than calling scatter once.
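
The host side looks roughly like this (a simplified sketch with the toy sizes above):

scatter_opt<<<1, 4>>>(d_input, d_output, d_loc, 0, 2);      // first pass: targets near d_output[0]
scatter_opt<<<1, 4>>>(d_input, d_output, d_loc, 100, 102);  // second pass: targets near d_output[100]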

I don’t know the reason.

I assume you are doing much more than 4 elements in this test too, because otherwise I can’t believe you would see a difference (or it would be very noisy).

I expect that splitting into two passes is just improving coalescing, but I can’t tell without a real repro app.

I think your colleague at HKUST asked a very similar question, but I wasn’t able to answer it either, unfortunately.

Mark

Yes indeed, we’re in the same room :)

btw, it’s weird that I use

// main.cu

texture<AttrType, 2, cudaReadModeElementType> tex;

...

	tex.addressMode[0] = cudaAddressModeClamp;
	tex.addressMode[1] = cudaAddressModeClamp;
	tex.filterMode = cudaFilterModePoint;
	tex.normalized = 0;

	cudaChannelFormatDesc desc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindUnsigned);

	cudaArray* a;
	cudaMallocArray(&a, &desc, 1024, 2048);
	cudaMemcpyToArray(a, 0, 0, d_data, 1024 * 2048 * sizeof(AttrType), cudaMemcpyDeviceToDevice);  // size is in bytes
	cudaBindTextureToArray(tex, a, desc);  // bind the cudaArray (not linear memory) to the texture

// kernel.cu

	extern texture<AttrType, 2, cudaReadModeElementType> tex;

...

	texfetch(tex, x, y);

...

and still can’t see any speedup in comparison with using global memory.

I access the texture randomly. Is this the reason it still behaves as if it were uncached?

Just now, to rule out cache thrashing, I shrank the array to 32*32 ints, which can reside entirely in the cache, but I still found no speedup at all.

Thanks.

Are you sure you are memory bound? If you are not memory bound, optimizing for the cache will not help.
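
One way to check (a rough sketch; variable names are placeholders, and I’m assuming n, the launch configuration, and the device pointers are set up as in your code):

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start, 0);
	scatter<<<numBlocks, threadsPerBlock>>>(d_input, d_output, d_loc);
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);

	// the kernel reads d_input and d_loc and writes d_output once per element
	double bytes = 3.0 * n * sizeof(int);
	printf("effective bandwidth: %f GB/s\n", bytes / (ms / 1000.0) / 1e9);

If that number is nowhere near the board’s peak bandwidth, the kernel is probably not memory bound.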

Mark