Will unrolling loops speed up the kernel?

Hi,

I am using CUDA to activate neural networks. It is a simple artificial neural network (ANN), and I activate it layer by layer in two phases: first I calculate the sumproducts of [weight] x [input value], then I sum all the sumproducts and apply the activation function. Because I use shared memory, my kernels are limited to certain array sizes, but that limitation is not the problem. The problem is that the kernels are too slow, and I want to accelerate them.

In both kernels I have a loop that calculates the sum, so I was wondering whether unrolling that loop (which will increase kernel size) will increase speed? Both loops iterate only 8 times to perform a parallel sum operation.

Here are my kernels (still under development). I run these two kernels for each layer of the ANN:

  1. Kernel to multiply weights x inputs:
// params: 

//	block.x -> destination node number multiplied by [num_anns] /* division is needed */

//	  block.y -> offset ( should range from 0 to _MAX_THREADS_PER_KERNEL_ , because of _MAX_THREADS_PER_KERNEL_ limitation of sumproduct calculation )

//	  thread.x -> link number (ranges from 0 to _MAX_THREADS_PER_KERNEL_ )

extern "C" __global__ void act_layer(uint cur_layer,uint num_outputs,uint num_offsets) {// run: one thread per link

	__shared__ node_value_t tmp[_MAX_THREADS_PER_KERNEL_]; // __device__ must not be combined with __shared__ here

	uint i_ann,dst_num,val_idx,weight_idx,offset=0,pow,prev_pow,aux;

	node_value_t sum=0.0f,weight,value,tmpval1,tmpval2,*data;

	node_value_t *sumproducts;

	ann_t *a;

	ann_layer_t *al;

	i32_t *idxptr;

	sumproducts=(node_value_t*) d_gpu_mem.sumproducts_base;

	idxptr=(i32_t*) d_gpu_mem.ann_index_list;

	data=(node_value_t*) d_gpu_mem.ad_base;

	a=(ann_t*) d_gpu_mem.anns_base;

	dst_num=blockIdx.x%num_outputs;

	aux=blockIdx.x/num_outputs;

	i_ann=idxptr[aux];

	a=&a[i_ann];

	al=&a->layers[cur_layer];

	offset=blockIdx.y;

	/// here the sum begins

	val_idx=offset*_MAX_THREADS_PER_KERNEL_+threadIdx.x;

	weight_idx=(dst_num*al->num_inputs)+val_idx;

	value=0.0f;weight=0.0f;

	if ((val_idx<al->num_inputs)&&(threadIdx.x<_MAX_THREADS_PER_KERNEL_)) {

		value=data[al->ia_inputs+val_idx];

		weight=data[al->ia_weights+weight_idx];

	}

	__syncthreads();

	tmp[threadIdx.x]=value*weight;

	__syncthreads();

	prev_pow=0;pow=2;

	for(pow=2;pow<=512;pow=pow<<1) {

		aux=pow>>1;

		if (aux==0) aux=1;

		if ( (!(threadIdx.x&(pow-1))) || (prev_pow==threadIdx.x) ) {

			if ((threadIdx.x+aux)<blockDim.x) {

				tmpval1=tmp[threadIdx.x+aux];

			}

		}

		__syncthreads();

		if ( (!(threadIdx.x&(pow-1))) || (prev_pow==threadIdx.x) ) {

			if ((threadIdx.x+aux)<blockDim.x) {

				tmpval2=tmp[threadIdx.x];

			}

		}

		__syncthreads();

		if ( (!(threadIdx.x&(pow-1))) || (prev_pow==threadIdx.x) ) {

			if ((threadIdx.x+aux)<blockDim.x) {

				tmp[threadIdx.x]=tmpval1+tmpval2;

			}

		}

		prev_pow=pow;

	}

	__syncthreads();

	if (threadIdx.x==0) sum=sum+tmp[0];

	__syncthreads();

	if (threadIdx.x==0) {

		aux=i_ann*(num_outputs*num_offsets)+dst_num*num_offsets+offset;

		sumproducts[aux]=sum;   // stores corresponding sumproduct, the output of the whole kernel.

	}

}
  2. Kernel that sums sumproducts and applies the activation function:
extern "C" __global__ void merge_subproducts(uint cur_layer,uint num_outputs,uint num_offsets) { // run: one thread per sumproduct, number of blocks = 1

	__shared__ node_value_t tmp[_MAX_THREADS_PER_KERNEL_]; // __device__ must not be combined with __shared__ here

	node_value_t *sumproducts,sum=0.0f,*data,tmpval1,tmpval2;

	i32_t *idxptr;

	uint i_ann,pow,prev_pow,aux,dst_num,src_idx;

	ann_t *a;

	ann_layer_t *al;

	sumproducts=(node_value_t*) d_gpu_mem.sumproducts_base;

	a=(ann_t*) d_gpu_mem.anns_base;

	idxptr=(i32_t*) d_gpu_mem.ann_index_list;

	data=(node_value_t*) d_gpu_mem.ad_base;

	dst_num=blockIdx.x;

	i_ann=idxptr[blockIdx.y];

	a=&a[i_ann];

	al=&a->layers[cur_layer];

	/// get all sumproducts

	src_idx=i_ann*(num_outputs*num_offsets)+dst_num*num_offsets+threadIdx.x;

	__syncthreads();

	tmp[threadIdx.x]=sumproducts[src_idx];

	__syncthreads();

	prev_pow=0;pow=2;

	for(pow=2;pow<=512;pow=pow<<1) {

		aux=pow>>1;

		if (aux==0) aux=1;

		if ( (!(threadIdx.x&(pow-1))) || (prev_pow==threadIdx.x) ) {

			if ((threadIdx.x+aux)<blockDim.x) {

				tmpval1=tmp[threadIdx.x+aux];

			}

		}

		__syncthreads();

		if ( (!(threadIdx.x&(pow-1))) || (prev_pow==threadIdx.x) ) {

			if ((threadIdx.x+aux)<blockDim.x) {

				 tmpval2=tmp[threadIdx.x];

			}

		}

		__syncthreads();

		if ( (!(threadIdx.x&(pow-1))) || (prev_pow==threadIdx.x) ) {

			if ((threadIdx.x+aux)<blockDim.x) {

				tmp[threadIdx.x]=tmpval1+tmpval2;

			}

		}

		prev_pow=pow;

	}

	__syncthreads();

	if (threadIdx.x==0) sum=tmp[0]; 

	__syncthreads();

	if (threadIdx.x==0) {

		sum=sum+data[al->ia_biases_o+dst_num]; // add the Bias

		if (sum>20.0f) sum=20.0f;	// limit ranges to prevent 'infinite'-type errors

		if (sum<-20.0f) sum=-20.0f;

		data[al->ia_sum_outputs+dst_num]=sum;

		// activation function

//		data[al->ia_outputs+dst_num]=1/(1+exp(-sum));		// standard sigmoid

		data[al->ia_outputs+dst_num]=2*(1/(1+exp(-sum)))-1;	// sigmoid rescaled to (-1,+1); equivalent to tanh(sum/2)

	}

}

I will appreciate any comments on improving my code.

Regards


unrolling loops has advantages and disadvantages.

the advantage is that, besides giving you more compute instructions per flow-control instruction, unrolling lets you decrease register contention by spacing read-after-write dependencies further apart. that is in fact the main benefit and the primary reason for doing it. however, it necessarily requires more registers, and there precisely lies the disadvantage.

so if you have enough registers to spare, it's essentially a free optimization (kernel size really isn't going to be a limiting factor). but once you run out of registers and start spilling into local memory, your performance will drop like a rock. so that's your primary concern.

(and on that note, i'm really happy that nvidia doubled the register count per core from the 8800 series to the 400 series, and tripled the shared memory. much more balanced, if you ask me.)


see if you can do without so many __syncthreads(). those are big stallers.
and maybe you can use a lookup table for the activation function. the exponential is a transcendental function, and combined with the division it makes the sigmoid quite slow.
with a lookup table in the right part of the memory hierarchy you can turn "really slow" into the cost of a couple of integer additions.
you can store it in constant memory, or if you store it as a texture you get linear interpolation for free.

other than that i see a bunch of small improvements. for example, pow << 1 can be replaced by pow = pow+pow; additions are faster than shifts (twice as fast, if i recall correctly).
at that point you might as well start the mask off at 1 and do mask = mask+mask+1 instead, so you don't need the extra (pow-1) to compute your mask each iteration.

i might look deeper into the code and give you more; these kinds of things are fun for me. but in any case, that's my first impression.


I was benchmarking the code and found that the __syncthreads() calls do indeed take a lot of time. Now I am changing everything.

Great tips. I think I can speed up about 100x compared to what I have now if I apply all these optimizations.

Thanks
