As you know, __syncthreads() can only act as a barrier for threads within the same block, and the block size is at most 512. Now I have a large number of threads to synchronize, say 1024, so I have to spread them across different blocks. How can I synchronize threads in this situation? Thank you very much!
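(For reference, there is no grid-wide barrier inside a kernel on this generation of hardware, so the usual answer is to split the work at the point that needs synchronization and launch a second kernel: launches issued to the same, default stream run in order, so the launch boundary acts as the barrier. A minimal sketch, where phase1/phase2 and the launch configuration are placeholders:

```
// No in-kernel barrier exists across blocks here, so the grid-wide sync point
// is the boundary between two kernel launches on the default stream.
__global__ void phase1(float *data) { /* ... work on data ... */ }
__global__ void phase2(float *data) { /* ... work that needs all of phase1 done ... */ }

void run(float *d_data)
{
    phase1<<<2, 512>>>(d_data);   // 2 blocks x 512 threads = 1024 threads
    phase2<<<2, 512>>>(d_data);   // starts only after every block of phase1 has finished
}
```

As the discussion below notes, the second launch costs extra overhead, which is why a single kernel with a loop is often preferred when it is possible.)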
Great, thanks a lot. My next question is about reduction. If I have a 1024*1024 matrix and I need to compute the sum of the 1024 elements in each column, how should I implement the reduction for this kind of problem? (Note that 1024 > 512.)
You can reduce 1024 elements with fewer than 1024 threads by looping. Again, just look at how the reduction is implemented in the SDK. I suggest you avoid two kernels (and you can), because then you don't pay the extra overhead of launching a second kernel.
To be exact, a bog-standard reduction (as in the SDK example) reduces 1024 elements with 512 threads, no looping required, so a 1024x1024 matrix can be reduced the way you want with 1024 blocks of 512 threads each.
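(A rough sketch of that bog-standard scheme applied to the column sums; this is not the SDK code itself, the function and variable names are mine, and a row-major float matrix is assumed. One block of 512 threads handles one column: each thread first adds two of the 1024 column elements, then a shared-memory tree reduction finishes the job.

```
__global__ void columnSum(const float *matrix, float *colSums)
{
    __shared__ float sdata[512];

    const int col = blockIdx.x;          // one block per column
    const int tid = threadIdx.x;
    const int n   = 1024;                // number of rows

    // Each of the 512 threads adds two elements of its column (rows tid and tid+512).
    sdata[tid] = matrix[tid * n + col] + matrix[(tid + 512) * n + col];
    __syncthreads();

    // Standard shared-memory tree reduction.
    for (int s = 256; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        colSums[col] = sdata[0];
}
```

It would be launched as columnSum<<<1024, 512>>>(d_matrix, d_colSums), i.e. 1024 blocks of 512 threads, matching the configuration described above.)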
I suggest looking through the CUDPP sources instead, because the SDK example does not use warp-sized reductions (which should be faster, since they need no syncs between steps and have no power-of-two strides).
In my opinion you can use 256 threads: first, each thread reduces its quad-word, then you execute 8 warp-sized reductions followed by one reduction of 8 elements (a sketch follows below).
In case you don’t need the full prefix sum but only the highest value, it should be even easier…
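(A hedged sketch of that 256-thread scheme, assuming the 1024 floats handled by one block are contiguous in memory, e.g. column-major storage, so they can be fetched as float4 quad-words; the names are mine and this is not the CUDPP code.

```
__global__ void reduce1024(const float4 *in, float *out)
{
    // volatile: the warp-level steps run without __syncthreads, so stores must
    // not be cached in registers.
    volatile __shared__ float warpSums[256];  // one slot per thread

    const int tid  = threadIdx.x;             // 0..255
    const int lane = tid & 31;                // position within the warp
    // warp index = tid >> 5, i.e. 8 warps per block

    // Step 1: each thread reduces its quad-word in registers.
    float4 v = in[blockIdx.x * 256 + tid];
    warpSums[tid] = v.x + v.y + v.z + v.w;

    // Step 2: 8 warp-sized reductions, each warp folding its own 32 slots.
    // Threads of one warp execute in lock-step, so no barrier is needed here.
    if (lane < 16) warpSums[tid] += warpSums[tid + 16];
    if (lane <  8) warpSums[tid] += warpSums[tid +  8];
    if (lane <  4) warpSums[tid] += warpSums[tid +  4];
    if (lane <  2) warpSums[tid] += warpSums[tid +  2];
    if (lane <  1) warpSums[tid] += warpSums[tid +  1];
    __syncthreads();                          // make the 8 per-warp totals visible

    // Step 3: one small reduction of the 8 per-warp results.
    if (tid == 0) {
        float total = 0.0f;
        for (int w = 0; w < 8; ++w)
            total += warpSums[w * 32];
        out[blockIdx.x] = total;
    }
}
```
)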
Yes. I've tried it and found that using one kernel with a simple loop is more than 2 times faster than using two kernels. The overhead of launching an extra kernel is significant. Thanks a lot.
Your advice sounds great. I'll give it a try. Another question: when I need to process a 2048 * 2048 matrix, I divide it into 4 submatrices of 512 * 2048 each. For every submatrix I launch a kernel of (1024 blocks) * (512 threads per block), and every block would need about 12 KB of shared memory to perform the reduction, which exceeds the limit. Does that mean I should also split the 2048 columns (e.g., into 1024 * 2)? Or is there some good technique to work around the shared memory limitation? (My GPU is a GTX 280.) Thanks.
If I understand correctly, you want to reduce 2048 elements with 512 threads.
Then again you can run 16 warp-sized reductions (16*32 = 512) followed by one 16-element reduction.
For each warp-sized reduction you then need 32 words plus a 16-word "overrun" filled with zeros (to avoid unnecessary branching when the offsets go beyond the boundary).
That is, 48 * 16 = 768 words = 3 KB of shared memory in total. In fact, you don't need to load all 2048 elements into shared memory, only the highest word of each quad (the others are kept in registers).
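(A sketch of that layout under the same assumptions as before, i.e. contiguous float data loaded as quads and my own naming: 16 warps times (32 + 16) words gives the 3 KB of shared memory mentioned above, the 16 zeroed overrun words make the strided adds branch-free, and only each thread's quad total ever touches shared memory.

```
#define WARPS 16
#define SEG   48                              // 32 values + 16 zeroed overrun words

__global__ void reduce2048(const float4 *in, float *out)
{
    volatile __shared__ float s[WARPS * SEG]; // 768 floats = 3 KB

    const int tid  = threadIdx.x;             // 0..511
    const int lane = tid & 31;
    const int warp = tid >> 5;                // 0..15
    volatile float *w = s + warp * SEG;       // this warp's 48-word segment

    // Zero the 16-word overrun once per warp.
    if (lane < 16) w[32 + lane] = 0.0f;

    // Each thread reduces its quad in registers; only the quad total goes to shared memory.
    float4 q = in[blockIdx.x * 512 + tid];
    w[lane] = q.x + q.y + q.z + q.w;

    // Warp-sized reduction: unguarded strided adds, branch-free thanks to the zero
    // padding. Relies on the lock-step warp execution of this hardware generation,
    // so no __syncthreads is needed between the steps.
    w[lane] += w[lane + 16];
    w[lane] += w[lane +  8];
    w[lane] += w[lane +  4];
    w[lane] += w[lane +  2];
    w[lane] += w[lane +  1];
    __syncthreads();                          // make the 16 per-warp totals visible

    // Final 16-element reduction of the per-warp totals.
    if (tid == 0) {
        float total = 0.0f;
        for (int i = 0; i < WARPS; ++i)
            total += s[i * SEG];
        out[blockIdx.x] = total;
    }
}
```
)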
Hi, thanks. I can only find a basic reduction in the cudpp1.0a source. In the CUDA 2.1 SDK, the reduction example is detailed (7 versions) and it helps make my kernel faster. I don't clearly understand the "16 warp-sized reductions": does it mean I should call the kernel 16 times to do the reduction? And about "the others are kept in registers": I think they would have to be kept in global memory. If they are only kept in registers, how can other threads in the same block access them?
Thank you very much. You are really experienced. I've gone over the code several times and I think I understand it now. It's very efficient. One question: why are 49 words needed for every warp-sized reduction? I think 48 (32+16) is enough. And when updating the elements kept in registers, I think it should be a.y = OP(a.y, t); a.z = OP(a.z, t).
By the way, I reviewed my kernel and found that 512 threads per block is not a good choice, because each multiprocessor allows at most 1024 active threads, at most 8 active blocks, and at most 16 KB of shared memory (my GPU is a GTX 280). With 512 threads per block, only 2 blocks can be active per multiprocessor, which slows things down. When I use about 64 or 128 threads per block (with a loop so each thread scans several elements), 8 blocks are active per multiprocessor and the kernel is faster.
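(A minimal sketch of that higher-occupancy variant; my own code, assuming a row-major matrix and one block of 128 threads per column. Each thread loops over the column with a stride of 128 and accumulates in a register, so the block only needs 128 floats (512 bytes) of shared memory and up to 8 blocks can be resident per multiprocessor.

```
// Launch as columnSum128<<<cols, 128>>>(d_matrix, d_colSums, rows, cols).
__global__ void columnSum128(const float *matrix, float *colSums, int rows, int cols)
{
    __shared__ float sdata[128];

    const int col = blockIdx.x;
    const int tid = threadIdx.x;

    // Loop over the column with a stride of 128, accumulating in a register.
    float sum = 0.0f;
    for (int row = tid; row < rows; row += 128)
        sum += matrix[row * cols + col];
    sdata[tid] = sum;
    __syncthreads();

    // Shared-memory tree reduction of the 128 partial sums.
    for (int s = 64; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        colSums[col] = sdata[0];
}
```
)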
And I think maybe the code I found in CUDPP is not the one you are referring to. It's just as