how to syncthreads between more than 512 threads


You know the function __syncthreads can only barrier the threads on the same block. And the block size is at most 512. Now I have a large number of threads to synchronize, say 1024, I need to put them to different blocks. Then how to synchronize threads in this situation? Thank you very much!


There is no way.

But you can use 2 kernel to do it.

First kernel for calculate data and store in global memory.

Second Kernel reads calculated data.


Great, thanks a lot. Then the following question is about reduction. If I have a 1024*1024 matrix and I need to compute every sum of the 1024 elemetns on the same column. How to implement reduction on this kind of problem? (note that 1024 > 512 is large :))

You might want to have a look at the reduction example in the SDK or use CUDPP for the reduction.

You can reduce 1024 elements by using less than 1024 threads by looping. Again, just see how the reduction is implemented in the SDK. I suggest you avoid two kernels (and you can do it) because you won’t have the extra overhead of calling a second kernel.

To be exact, a bog-standard reduction (as in, it is in the SDK-example) reduces 1024 elements with 512 threads, no looping required, so a 1024x1024 matrix can be reduced as you want with 1024 blocks, each consisting of 512 threads.

I guess it’s better to look through CUDPP sources, because SDK example does not use warp-sized reductions (which should be faster because they don’t require syncs btw each step

and have no power-of-two strides)

in my opinion one can use 256 threads,

first, each thread reduces its quad-word, then execute 8 warp-sized reductions followed by 1 reduction of 8 elements.

In case you don’t need the full prefix sum but only the highest value, it should be even easier…

Yes. I’ve tried and find it is more than 2 times faster for using one kernel with a easy loop than using two kernels. The extra overhead for calling kernel is much. Thanks a lot.

Your advise seems great. I’ll have a try. Another question is about when I need to do some processing on a 2048 * 2048 matrix, I divide them to 4 submatrix each is 512 * 2048. For every submatrix I make a kernel (1024 blocks) * (512 threads per block), for every block I should have a shared memory about 12K bytes to perform the reduction. Then the shared memory would exceed the limit. Does it mean I should also divide the 2048 colums too.(such as 1024 * 2). Or is there some good technique to solve the problem of shared memory limitation. (My gpu is gtx 280) Thanks.

if I understand it correctly, you want to reduce 2048 elements with 512 threads.

Then again you can run 16 warp-sized reductions (16*32 = 512) followed by 1 16-element reduction.

For each reduction you then need 32 words + 16 words “overrun” filled with 0s (i.e., to avoid unnecessary branching when offsets go beyond the boundary).

That is, in total 48 * 16 = 3K shared memory. In fact, you don’t need to load all 2048 elements to shared memory - only the highest word of each quad (others are kept in registers)

Hi, Thanks. I only can find a junior reduction in the cudpp1.0a source. In the cuda 2.1 sdk, the reduction example is detailed (7 versions) and it can make my kernel faster. I do not clearly understand the “16 warp-sized reductions”, does it mean I should call the kernel 16 times to do the reduction? And about “others are kept in registers”, I think they should be kept in global memory. If they are only kept in registers, how can they be got by other threads in the same block. :">

well, in my opinion cudpp reduction is pretty nice done.

as for 2048 reduction you can try smth along these lines (i don’t guarantee that it works but perhaps gives you a better insight):

#define WS 32  

#define HF 16

//! block with 512 threads

//! OP(a, b) - operation to reduce

template < class OP >

__global__ void _2048_reduce_kernel(float *g_out, const float *g_in) {

	//! memory consumption: 49 * 16 + 32 words

	extern __shared__ float shared[];

	float *data = shared;

	uint thid = threadIdx.x, thid_in_warp = thid & WS-1;

	// 2048 elements per block

	uint batch_ofs = blockIdx.x << 11;

	const float4 *in = (const float4 *)(g_in + batch_ofs);

	float4 a = in[thid]; // get quad from global mem

	// each thread reduces its part in place

	a.y = OP(a.y, a.x);

	a.z = OP(a.z, a.y);

	a.w = OP(a.w, a.z);

	float t = a.w; // element to be scanned

	// allocate space for 16 warp scans - 49 words of shared mem per one

	// to void BCs later

	volatile float *scan = data + HF + thid_in_warp +

		__umul24(thid >> 5, WS + HF + 1);

	scan[- HF] = Identity; // put identity element, i.e., OP(a, Identity) = a

	scan[0] = t; 

	float t;

	//! do warp scans

	t = OP(t, scan[-1]), scan[0] = t;

	t = OP(t, scan[-2]), scan[0] = t;

	t = OP(t, scan[-4]), scan[0] = t;

	t = OP(t, scan[-8]), scan[0] = t;

	t = OP(t, scan[-16]), scan[0] = t;

	float *postscan = data + HF +

		__umul24(WS + HF + 1, (WS*16) >> 5); // directly after 16 warp groups

	__syncthreads(); // ensure all warps have saved the data to shared mem

	if(thid < 16) { // 1st 16 threads do postscan

					// (scan highest elements of warp scans)


		unsigned loc_ofs = HF + WS-1 + __umul24(thid, WS + HF + 1);

		volatile float *ps = (volatile float *)postscan + thid;

		ps[- HF] = Identity; // load identity symbol

		t = data[loc_ofs], ps[0] = t;

		t = OP(t, ps[-1]), ps[0] = t;

		t = OP(t, ps[-2]), ps[0] = t;

		t = OP(t, ps[-4]), ps[0] = t;

		t = OP(t, ps[-8]), ps[0] = t;



	t = scan[0];

	// now update warp-scan results, postscan[-1] = Identity

	t = OP(t, postscan[(thid >> 5) - 1]), scan[0] = t;


	// and update elements kept in registers with previous scan vals

		a.w = t;

	t = scan[-1];

	a.x = OP(a.x, t);

	a.y = OP(a.y, a.x);

	a.z = OP(a.z, a.y); 

	//save back

	(float4 *)(g_out + batch_ofs)[thid] = a;


Hi, asm

Thank you very much. You are really experienced. I’ve watched the code several times and I think I understand it now. It’s greatly efficient. One question is why need 49 words for every warp sized reduction? I think 48(32+16) is enough. And, when update elements kept in registers, I think a.y = OP(a.y, t); a.z = OP(a.z, t).

By the way, I reviewed my kernel and find that 512 threads per block is not suitable. Because for every multiprocessor there are at most 1024 active threads, at most 8 active blocks, at most 16KB shared memory (my gpu is gtx280). So 512 threads caused that there are only 2 active blocks active, so the speed is slowed. When I make about 64 or 128 threads per block (use a loop to scan several elements), the active blocks per multiprocessor is 8 and the speed is faster.

And, I think maybe the code I find in cudpp is not the one you refer to. It’s just as


device void reduce_sum(T *s_data)


int thid = threadIdx.x;

int n = blockDim.x;

for (unsigned int d = n>>1; d > 0; d >>= 1)



    if (thid < d)


        s_data[thid] += s_data[thid + d];





It’s a junior one. May I ask how do you learn the parallel techniques before, is there some good materials or methods to learn it. Thanks again! :)

well, you should thank Mark Harris and all other people who made CUDPP available ))

they call it “Hillis-and-Steele-style scan” - my code is only a slight modification of that one

you can find it in 1.0a sources (1.0 version uses the old style)

right, 49 word offset is not critical here - this is to avoid bank conflicts when accessing ‘data[loc_ofs]’ during the second scan

as for the number of threads, 512 might be indeed too much, you can surely adapt this version for 64 or 128 threads

about learning… well, I didn’t learn these techniques specifically, i guess the best method is when you have some conrete problem to tackle,

then you inspect a similar code written by others to get the deeper insight, and this is how it goes ))

Great, thanks a lot. It helps much.