how do coalesced writes work

I have a problem understanding the performance of random writes.
Say threads t0, t1, t2, and t3 in a warp write to locations x0, x1, x0, x1 (x0 != x1). Will the accesses to x0 by t0 and t2 be coalesced, or combined into one transaction?
In my experiments, it looks like they are coalesced. However, this does not match the coalescing rules in the programming guide. thanks.

On G80 you will lose coalescing in this case. Threads should access sequential memory addresses in order to be coalesced. The coalescing rules are pretty clearly described in the programming guide.
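As an illustration of that rule, here is a CPU-side C++ sketch (a simplified model I am assuming for illustration, not the exact hardware algorithm) that counts the memory transactions a half-warp of 16 threads generates for 32-bit accesses on G80: one transaction if thread k accesses word k of a 64-byte-aligned segment, otherwise one transaction per thread:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified model of the G80 coalescing rule for 32-bit accesses:
// a half-warp of 16 threads is coalesced into a single transaction
// only if thread k accesses word k of a 64-byte-aligned segment
// (i.e. the base word index is a multiple of 16 and the accesses are
// strictly sequential). Otherwise each thread's access is serialized.
int transactions(const std::vector<std::size_t>& wordIndex) {
    assert(wordIndex.size() == 16);            // one half-warp
    std::size_t base = wordIndex[0];
    if (base % 16 != 0) return 16;             // segment not aligned
    for (std::size_t k = 0; k < 16; ++k)
        if (wordIndex[k] != base + k) return 16; // out of sequence
    return 1;                                  // fully coalesced
}
```

Under this model, sequential accesses cost one transaction, while the x0, x1, x0, x1 pattern from the question falls back to one transaction per thread.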


thanks, Mark.

My further question is how many data transfers there are:
(a) two fetches to x0 and two fetches to x1; or
(b) one fetch to x0 and one fetch to x1?

Your question is unclear – are you talking about specific code?


okay. We have code like the following:

int d_loc[4] = {0, 1, 100, 101}; // the write locations

The kernel (suppose we only have four threads in a thread block):

__global__ void scatter(int *d_input, int *d_output, int *d_loc)
{
    d_output[d_loc[tid]] = d_input[tid]; // tid is the thread id
}

Since d_output[0] and d_output[1] are close to each other in device memory, as are d_output[100] and d_output[101], my question is: are there four data transfers to device memory, or two (considering only the writes)?

It doesn’t make sense to discuss “fetches” considering only writes. When I think of a fetch, I think of a read, not a write.

In your code, there are four reads and four writes.

Note that with < 16 threads per block, you’ll never get coalescing. Coalescing occurs on each half-warp = 16 threads.


thanks. I understand the “16” rule; I just used four threads for simplicity.

The problem is that if I perform the scatter in multiple passes, the performance is much better than in a single pass.

Given the code:

__global__ void scatter_opt(int *d_input, int *d_output, int *d_loc, int from, int to)
{
    if (d_loc[tid] >= from && d_loc[tid] < to)
        d_output[d_loc[tid]] = d_input[tid]; // tid is the thread id
}

In our multi-pass scheme, I call scatter_opt twice: first with (from = 0, to = 2), then with (from = 100, to = 102). The performance is much better than calling scatter once.

I don’t know the reason.

I assume you are doing much more than 4 in this test too, because otherwise I can’t believe you would see a difference (or it would be very noisy).

I expect that splitting into two passes is just improving coalescing, but I can’t tell without a real repro app.
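For what it's worth, the pass structure being discussed can be sketched on the CPU like this (the function name and setup are mine, not from the original code): each pass only performs the writes whose destination falls in [from, to), so the live writes of a given pass land in one contiguous region of the output, which is the property that would let the GPU coalesce them:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU sketch of the two-pass scatter described in the thread: a pass
// skips any write whose destination lies outside [from, to), so the
// writes that do happen in a pass cluster into one contiguous region.
void scatter_pass(const std::vector<int>& input,
                  std::vector<int>& output,
                  const std::vector<int>& loc,
                  int from, int to) {
    for (std::size_t tid = 0; tid < input.size(); ++tid)
        if (loc[tid] >= from && loc[tid] < to)
            output[loc[tid]] = input[tid];
}
```

Running the pass twice, with (0, 2) and then (100, 102), produces the same output as a single full-range pass; only the order in which the writes are issued changes.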

I think your colleague at HKUST asked a very similar question, but I wasn’t able to answer it either, unfortunately.


Yes indeed, we’re in the same room :)

btw, it’s weird that I use


texture<AttrType, 2, cudaReadModeElementType> tex;

tex.addressMode[0] = cudaAddressModeClamp;
tex.addressMode[1] = cudaAddressModeClamp;
tex.filterMode = cudaFilterModePoint;
tex.normalized = 0;

cudaChannelFormatDesc desc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindUnsigned);
cudaArray* a;
cudaMallocArray(&a, &desc, 1024, 2048);
cudaMemcpyToArray(a, 0, 0, d_data, 1024 * 2048 * sizeof(int), cudaMemcpyDeviceToDevice); // size is in bytes
cudaBindTextureToArray(tex, a); // a cudaArray must be bound with cudaBindTextureToArray, not cudaBindTexture

and in the kernel file:

extern texture<AttrType, 2, cudaReadModeElementType> tex;

texfetch(tex, x, y);


and still can’t see any speedup in comparison with using global memory.

I access the texture randomly. Is that the reason it’s still uncached?

Just now, to rule out the cache-thrashing factor, I shrank the array to 32*32 ints, which can reside in the cache, but still found no speedup at all.


Are you sure you are memory bound? If you are not memory bound, optimizing for the cache will not help.
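One rough way to check that (my own rule of thumb, not something stated in this thread): measure the bytes your kernel moves and the time it takes, and compare the achieved bandwidth against the board's peak. A C++ sketch, where the 80 GB/s peak is a placeholder figure of roughly G80 class:

```cpp
#include <cassert>
#include <cmath>

// Fraction of peak memory bandwidth a kernel achieves. If this is a
// large fraction, the kernel is likely memory bound and cache/coalescing
// optimizations can pay off; if it is small, the bottleneck is probably
// elsewhere (instruction throughput, divergence, launch overhead, ...).
// peakGBps = 80.0 is a placeholder, not a measured value.
double bandwidth_fraction(double bytesMoved, double seconds,
                          double peakGBps = 80.0) {
    double achievedGBps = bytesMoved / seconds / 1e9;
    return achievedGBps / peakGBps;
}
```

For example, moving 8 GB in 0.1 s is 80 GB/s, i.e. the full placeholder peak, so such a kernel would be memory bound.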