parallel scan without syncthreads

Hi,

I have a simple question: can I take advantage of the fact that threads in a warp execute simultaneously and drop the __syncthreads() calls in a parallel scan/reduction, in my case a sum of an array? I won't get a race condition, or will I? My hardware is a GTX 280, so the warp size is 32.

This is the kernel example:

extern "C" __global__ void reduce_float_32_elts(float *dst_data,float *src_data,uint num_elts) {

	__shared__ __device__ float tmp[32];

	uint dst_idx,me;

	dst_idx=blockIdx.x*(blockDim.x*blockDim.y)+threadIdx.y*blockDim.x;

	me=dst_idx+threadIdx.x;

	tmp[threadIdx.x]=0.0f;

	if (me<num_elts) {

		tmp[threadIdx.x]=src_data[me];

		if (threadIdx.x<16) {

			tmp[threadIdx.x]=tmp[threadIdx.x]+tmp[threadIdx.x+16];

		}

		if (threadIdx.x<8) {

			tmp[threadIdx.x]=tmp[threadIdx.x]+tmp[threadIdx.x+8];

		}

		if (threadIdx.x<4) {

			tmp[threadIdx.x]=tmp[threadIdx.x]+tmp[threadIdx.x+4];

		}

		if (threadIdx.x<2) {

			tmp[threadIdx.x]=tmp[threadIdx.x]+tmp[threadIdx.x+2];

		}

		tmp[threadIdx.x]=tmp[threadIdx.x]+tmp[threadIdx.x+1];

	}

	if (threadIdx.x==0) dst_data[dst_idx]=tmp[0];

}

It should be faster to launch several kernels like this from the CPU until the sum is complete than to write a more complex kernel with __syncthreads() or extra "if"s to handle special cases. Am I right?
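
For reference, the host-side loop I have in mind is roughly the sketch below. It assumes a variant of the kernel that writes each block's single result compactly at dst_data[blockIdx.x] (the kernel above writes it at a stride of blockDim.x instead), and d_in/d_out are placeholder device buffers:

// Host-side sketch: keep reducing 32-to-1 until a single value remains.
// Assumes the kernel writes one partial sum per block at dst_data[blockIdx.x].
unsigned int n = num_elts;
while (n > 1) {
	unsigned int blocks = (n + 31) / 32;              // one 32-thread block per 32 inputs
	reduce_float_32_elts<<<blocks, 32>>>(d_out, d_in, n);
	float *t = d_in; d_in = d_out; d_out = t;         // this pass's output feeds the next
	n = blocks;                                       // 32x fewer values each pass
}
float sum;
cudaMemcpy(&sum, d_in, sizeof(float), cudaMemcpyDeviceToHost);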

Any comments would be much appreciated.

Regards

What is your thread block configuration?

If you have a 2D thread block, for example

dim3 block(32,2)

then two warps will write to the same shared memory locations when executing

tmp[threadIdx.x]=src_data[me];

and a race condition occurs.
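
Spelled out (just a sketch of what the posted index math does with blockDim.x==32, blockDim.y==2):

// With dim3 block(32,2):
//   warp 0 (threadIdx.y==0): tmp[threadIdx.x] = src_data[blockIdx.x*64      + threadIdx.x];
//   warp 1 (threadIdx.y==1): tmp[threadIdx.x] = src_data[blockIdx.x*64 + 32 + threadIdx.x];
// Both warps store into the same tmp[0..31], so whichever warp runs later overwrites
// the other warp's values before (or while) they are being reduced.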

Correct. __syncthreads() is superfluous unless you need to transfer data between warps. But you don't need all those ifs: the other threads in the warp aren't doing anything anyway, so it doesn't hurt to have them do useless computation if it ultimately shaves some clock cycles. (In fact, either way they will execute the instructions; they will just be "predicated" on a warp divergence.)

if (threadIdx.x<16) {

			tmp[threadIdx.x]=tmp[threadIdx.x]+tmp[threadIdx.x+16];

			tmp[threadIdx.x]=tmp[threadIdx.x]+tmp[threadIdx.x+8];

			tmp[threadIdx.x]=tmp[threadIdx.x]+tmp[threadIdx.x+4];

			tmp[threadIdx.x]=tmp[threadIdx.x]+tmp[threadIdx.x+2];

			tmp[threadIdx.x]=tmp[threadIdx.x]+tmp[threadIdx.x+1];

		}

	}

	if (threadIdx.x==0) dst_data[dst_idx]=tmp[0];

}

Correct. I will have to add another dimension to the shared array, something like

__shared__ __device__ float tmp[2][32];

and access it as

tmp[threadIdx.y][threadIdx.x]

to eliminate the race condition.
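
So the whole kernel would look roughly like this (a sketch, not tested yet):

extern "C" __global__ void reduce_float_32_elts(float *dst_data,float *src_data,uint num_elts) {

	__shared__ __device__ float tmp[2][32];   // one 32-float row per warp (threadIdx.y)

	uint dst_idx,me;

	dst_idx=blockIdx.x*(blockDim.x*blockDim.y)+threadIdx.y*blockDim.x;
	me=dst_idx+threadIdx.x;

	tmp[threadIdx.y][threadIdx.x]=0.0f;

	if (me<num_elts) {
		tmp[threadIdx.y][threadIdx.x]=src_data[me];
		if (threadIdx.x<16) tmp[threadIdx.y][threadIdx.x]+=tmp[threadIdx.y][threadIdx.x+16];
		if (threadIdx.x<8)  tmp[threadIdx.y][threadIdx.x]+=tmp[threadIdx.y][threadIdx.x+8];
		if (threadIdx.x<4)  tmp[threadIdx.y][threadIdx.x]+=tmp[threadIdx.y][threadIdx.x+4];
		if (threadIdx.x<2)  tmp[threadIdx.y][threadIdx.x]+=tmp[threadIdx.y][threadIdx.x+2];
		tmp[threadIdx.y][threadIdx.x]+=tmp[threadIdx.y][threadIdx.x+1];
	}

	if (threadIdx.x==0) dst_data[dst_idx]=tmp[threadIdx.y][0];
}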

Thanks!

Often when you use the “a warp doesn’t need __syncthreads()” trick, you need to declare the shared memory as volatile. Sometimes it works without it.

Also, if you’re really micro-optimizing a tight inner loop, you can drop the last “tmp[threadIdx.x]=tmp[threadIdx.x]+tmp[threadIdx.x+1];” line and fold it into the final output as “if (threadIdx.x==0) dst_data[dst_idx]=tmp[0]+tmp[1];”. This saves a couple of instructions.
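
Putting these together with the earlier suggestions (the two-row shared array and the single "if"), the kernel might end up looking roughly like this; just a sketch, assuming a dim3 block(32,2) launch and the same names as above:

extern "C" __global__ void reduce_float_32_elts(float *dst_data,float *src_data,uint num_elts) {

	// volatile shared memory, one row per warp, a single "if", and the last add
	// folded into the output write
	__shared__ volatile float tmp[2][32];

	uint dst_idx=blockIdx.x*(blockDim.x*blockDim.y)+threadIdx.y*blockDim.x;
	uint me=dst_idx+threadIdx.x;

	volatile float *row=tmp[threadIdx.y];     // each warp reduces within its own row

	row[threadIdx.x]=(me<num_elts)?src_data[me]:0.0f;

	if (threadIdx.x<16) {
		row[threadIdx.x]+=row[threadIdx.x+16];
		row[threadIdx.x]+=row[threadIdx.x+8];
		row[threadIdx.x]+=row[threadIdx.x+4];
		row[threadIdx.x]+=row[threadIdx.x+2];
	}

	if (threadIdx.x==0) dst_data[dst_idx]=row[0]+row[1];
}

The volatile qualifier is what stops the compiler from keeping the partial sums in registers between the unsynchronized steps.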

It should work correctly without it on pre-Fermi hardware, but on Fermi it must be volatile, otherwise compiler optimisation can cause it to fail.
