I have one simple question: can I take advantage of the fact that the threads in a warp execute simultaneously and drop the __syncthreads() calls in a parallel scan algorithm, in my case a sum of an array? I won't get a race condition, will I? My hardware is a GTX 280, so it has a warp size of 32.
This is the kernel example:
extern "C" __global__ void reduce_float_32_elts(float *dst_data,float *src_data,uint num_elts) {
__shared__ __device__ float tmp[32];
uint dst_idx,me;
dst_idx=blockIdx.x*(blockDim.x*blockDim.y)+threadIdx.y*blockDim.x;
me=dst_idx+threadIdx.x;
tmp[threadIdx.x]=0.0f;
if (me<num_elts) {
tmp[threadIdx.x]=src_data[me];
if (threadIdx.x<16) {
tmp[threadIdx.x]=tmp[threadIdx.x]+tmp[threadIdx.x+16];
}
if (threadIdx.x<8) {
tmp[threadIdx.x]=tmp[threadIdx.x]+tmp[threadIdx.x+8];
}
if (threadIdx.x<4) {
tmp[threadIdx.x]=tmp[threadIdx.x]+tmp[threadIdx.x+4];
}
if (threadIdx.x<2) {
tmp[threadIdx.x]=tmp[threadIdx.x]+tmp[threadIdx.x+2];
}
tmp[threadIdx.x]=tmp[threadIdx.x]+tmp[threadIdx.x+1];
}
if (threadIdx.x==0) dst_data[dst_idx]=tmp[0];
}
It should be faster to call many small kernels like this from the CPU until the sum is complete than to write one complex kernel with __syncthreads() or more "if"s to handle the special cases. Am I right?
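To make the iteration concrete, here is roughly the host loop I have in mind (an untested sketch; reduce_float_32_elts_compact is a hypothetical variant of the kernel above that writes warp w's partial sum to dst_data[w] instead of dst_data[w*32], so every pass shrinks the data by a factor of 32):

#include <cuda_runtime.h>

// Hypothetical compacted-output variant of the kernel above (declaration only;
// the body is the same warp reduction, with the store changed to dst_data[blockIdx.x]).
extern "C" __global__ void reduce_float_32_elts_compact(float *dst_data, float *src_data, unsigned int num_elts);

// Repeatedly launch the one-warp reduction, ping-ponging between two device
// buffers, until a single value is left, then copy it back to the host.
float reduce_sum(float *d_a, float *d_b, unsigned int n) {
    while (n > 1) {
        unsigned int n_out = (n + 31) / 32;            // one partial sum per 32 inputs
        reduce_float_32_elts_compact<<<n_out, dim3(32, 1)>>>(d_b, d_a, n);
        float *t = d_a; d_a = d_b; d_b = t;            // output becomes next pass's input
        n = n_out;
    }
    float sum;
    cudaMemcpy(&sum, d_a, sizeof(float), cudaMemcpyDeviceToHost);
    return sum;
}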
Correct. __syncthreads() is superfluous unless you need to transfer data between warps. But you don't need all those "if"s: the other threads in the warp aren't doing anything anyway, so it doesn't hurt to have them do useless computation if it ultimately shaves some clock cycles. (In fact they will execute the instructions either way; they will just be predicated off on a warp divergence.)
        if (threadIdx.x < 16) {
            tmp[threadIdx.x] = tmp[threadIdx.x] + tmp[threadIdx.x + 16];
            tmp[threadIdx.x] = tmp[threadIdx.x] + tmp[threadIdx.x + 8];
            tmp[threadIdx.x] = tmp[threadIdx.x] + tmp[threadIdx.x + 4];
            tmp[threadIdx.x] = tmp[threadIdx.x] + tmp[threadIdx.x + 2];
            tmp[threadIdx.x] = tmp[threadIdx.x] + tmp[threadIdx.x + 1];
        }
    }
    if (threadIdx.x == 0) dst_data[dst_idx] = tmp[0];
}
Often when you use the "a warp doesn't need __syncthreads()" trick, you need to declare the shared memory as volatile; otherwise the compiler is free to keep intermediate values in registers instead of re-reading shared memory, which silently breaks the communication between lanes. Sometimes it happens to work without it, but don't count on that.
Also, if you're really micro-optimizing a tight inner loop, you can drop the last "tmp[threadIdx.x]=tmp[threadIdx.x]+tmp[threadIdx.x+1];" line and fold it into the final output with "if (threadIdx.x==0) dst_data[dst_idx]=tmp[0]+tmp[1];". This saves a couple of instructions.
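Putting both of those together with the single-"if" version above, the whole kernel would look something like this (an untested sketch, assuming one warp per block):

// Untested sketch combining the suggestions in this thread: volatile shared
// memory, one "if" around the warp-synchronous steps, and the last addition
// folded into the final store. Indexing is unchanged from the original kernel.
extern "C" __global__ void reduce_float_32_elts(float *dst_data, float *src_data, unsigned int num_elts) {
    __shared__ volatile float tmp[32];  // volatile: forces re-reads from shared memory
    unsigned int dst_idx = blockIdx.x * (blockDim.x * blockDim.y) + threadIdx.y * blockDim.x;
    unsigned int me = dst_idx + threadIdx.x;
    tmp[threadIdx.x] = (me < num_elts) ? src_data[me] : 0.0f;  // zero-pad the tail
    if (threadIdx.x < 16) {
        // One warp runs these in lockstep, so no __syncthreads() in between.
        tmp[threadIdx.x] += tmp[threadIdx.x + 16];
        tmp[threadIdx.x] += tmp[threadIdx.x + 8];
        tmp[threadIdx.x] += tmp[threadIdx.x + 4];
        tmp[threadIdx.x] += tmp[threadIdx.x + 2];
    }
    if (threadIdx.x == 0) dst_data[dst_idx] = tmp[0] + tmp[1];
}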