Sin/cos functions vs. tables of values: strange comparison (for me)

Beleys · April 8, 2008, 9:19am

Hi,

I have implemented a small example that computes the rotation of points.

For this, I used two different methods.

The first uses the trigonometric functions directly; the second uses tables of values loaded through the texture unit.

As a result, both GPU versions are faster than the classical C version (built with the tables of values), but it is the trigonometric GPU version that gets the best result (8818 ms for the C version, 1636 ms for the GPU version using tables, and 1188 ms for the trigonometric GPU version).

I am a little disappointed by this result... What do you think about it? Is this normal, or is my code not optimal?

Thanks for your advice.

++ Beleys

PS: Here are the two kernels.

Texture version

__global__ void calculAlphaBetaBrutFor(float *dom, float *res, float AzimutX100, float cosTl, float sinTl, float MMPI, float pi, int nbPoint)
{
	int alpha, beta;
	int indice, Pai, Pdir;
	const unsigned int tidX = threadIdx.x;   // get the thread position within the block
	const unsigned int bx = blockIdx.x;      // block index
	const unsigned int idBlocX = blockDim.x; // threads per block
	indice = tidX + bx * idBlocX;            // global point index
	__syncthreads();
	if (indice < nbPoint) {
		// Each point is stored as two consecutive floats in dom
		Pdir = (int)dom[indice*2];
		Pai = (int)dom[indice*2+1];
		// All tex* names are lookup tables bound to textures elsewhere in the program
		float sinai = tex1Dfetch(texSinus100P, Pai + 9000);
		float cosai = tex1Dfetch(texCosinus100P, Pai + 9000);
		float sinda_cosai = tex1Dfetch(texSinus100, Pdir);
		sinda_cosai = sinda_cosai * cosai;
		beta = tex1Dfetch(texACosinusP, (int)(sinTl*sinda_cosai + sinai*cosTl) + 120000);
		float acos100 = tex1Dfetch(texCosinus100, Pdir);
		float unSsinus = tex1Dfetch(texUnSurSinus1024_100k, beta);
		alpha = tex1Dfetch(texACosinusP, (int)(acos100*cosai*unSsinus) + 120000);
		if (cosTl*sinda_cosai - sinai*sinTl < 0)
			alpha = -alpha;

		res[indice*2] = alpha;
		res[indice*2+1] = beta;
	}
}

Trigonometric version

__global__ void calculAlphaBetaBrutTer(float *dom, float *res, float AzimutX100, float cosTl, float sinTl, float MMPI, float pi, int nbPoint)
{
	// Internal variables
	float ai, dir, aai, cosai, sinai, ddir, sinda_cosai;
	float beta, beta1, cosDir, alpha1, alpha, Pai, Pdir;
	float unSurSinus1024;
	int indice;
	const unsigned int tidX = threadIdx.x; // get the thread position within the block
	const unsigned int bx = blockIdx.x;    // block index
	indice = tidX + bx * numTh;            // numTh is defined elsewhere in the program; *idBlocX in the texture version

	if (indice < nbPoint) {
		// Fetch the data
		Pdir = dom[indice*2];
		Pai = dom[indice*2+1];
		ai = Pai;
		dir = Pdir + AzimutX100;
		aai = ai/100 * MMPI;
		cosai = cosf(aai);
		sinai = sinf(aai);
		ddir = dir/100 * MMPI;
		sinda_cosai = sinf(ddir);
		sinda_cosai = sinda_cosai * cosai;
		beta1 = sinTl * sinda_cosai;
		beta1 = beta1 + sinai*cosTl;
		beta1 += 120000;
		if (beta1 < 20000) {
			beta = 1024 * 180;
		} else {
			if (beta1 > 220000) {
				beta = 0;
			} else {
				beta = 1024 * acosf(((beta1-200)/100000.0 - 1)) * 180/pi;
			}
		}
		cosDir = cos(dir/100 * MMPI);
		unSurSinus1024 = 100000.0 / (sinf(beta/1024 * MMPI));
		alpha1 = cosDir * cosai * unSurSinus1024;
		alpha1 += 120000;
		if (alpha1 < 20000) {
			alpha = 1024 * 180;
		} else {
			if (alpha1 > 220000) {
				alpha = 0;
			} else {
				alpha = 1024 * acosf(((alpha1-200)/100000.0 - 1)) * 180/pi;
			}
		}
		if (cosTl*sinda_cosai - sinai*sinTl < 0)
			alpha = -alpha;
		res[indice*2] = alpha;
		res[indice*2+1] = beta;
	}
}

Simon_Green · April 8, 2008, 11:35am

Our GPUs support single-cycle sin and cos instructions in hardware, so using look-up tables will not be faster.

Some performance tips - it looks like your loads and stores are not coalesced (you should use the int2 type), and there’s no need for the __syncthreads().
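A minimal sketch of the paired-access idea, not the poster's actual code. It assumes the input stays in floating point, so it uses float2 where the tip says int2 (both are 8-byte vector types); the names pairKernel and runPairKernel are hypothetical, and the arithmetic is only a placeholder for the real alpha/beta computation. Each thread reads its pair with one 8-byte load and writes its result with one 8-byte store, so consecutive threads touch consecutive elements, and there is no __syncthreads() because no shared memory is used.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: one float2 load and one float2 store per thread.
__global__ void pairKernel(const float2 *dom, float2 *res, int nbPoint)
{
    int indice = blockIdx.x * blockDim.x + threadIdx.x;
    if (indice < nbPoint) {
        float2 p = dom[indice];  // both values of the pair in a single 64-bit load
        float2 r;
        r.x = p.x + p.y;         // placeholder for alpha
        r.y = p.x - p.y;         // placeholder for beta
        res[indice] = r;         // both results in a single 64-bit store
    }
}

// Hypothetical host helper: copy in, launch, copy out.
// Error checking is omitted to keep the sketch short.
std::vector<float2> runPairKernel(const std::vector<float2> &in)
{
    int n = (int)in.size();
    std::vector<float2> out(n);
    if (n == 0) return out;

    float2 *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float2));
    cudaMalloc(&dOut, n * sizeof(float2));
    cudaMemcpy(dIn, in.data(), n * sizeof(float2), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    pairKernel<<<blocks, threads>>>(dIn, dOut, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(out.data(), dOut, n * sizeof(float2), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

Calling runPairKernel with a host vector of float2 pairs returns one float2 per input point. In the original kernels the same change would mean declaring dom and res as float2 pointers and indexing them with indice instead of indice*2.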

Simon_Green · April 9, 2008, 9:07am

A couple of other tips:

(1) When evaluating both sine and cosine of the same argument, use of
sincosf() is recommended. E.g. instead of

  cosai = cosf ( aai);
  sinai = sinf (aai);

try

  sincosf(aai, &sinai, &cosai);

(2) If some loss of accuracy can be tolerated, you can use the hardware sin/cos instructions directly (without refinement):

 __sincosf(aai, &sinai, &cosai);
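To judge whether the loss of accuracy is tolerable for a given range of angles, the two variants can be compared directly on the device. The following is a rough sketch under stated assumptions: the names sincosDiffKernel and maxSincosDiff are hypothetical, and the result is only meaningful for the angles actually passed in. Note that when compiling with -use_fast_math, sincosf() is itself mapped to the fast intrinsic, so the reported difference would be zero.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: per angle, the larger of the sine and cosine
// differences between sincosf() and the fast intrinsic __sincosf().
__global__ void sincosDiffKernel(const float *angles, float *diff, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float s, c, sFast, cFast;
        sincosf(angles[i], &s, &c);            // standard single-precision version
        __sincosf(angles[i], &sFast, &cFast);  // hardware-based approximation
        diff[i] = fmaxf(fabsf(s - sFast), fabsf(c - cFast));
    }
}

// Hypothetical host helper: returns the largest difference over all angles.
// Error checking is omitted to keep the sketch short.
float maxSincosDiff(const std::vector<float> &angles)
{
    int n = (int)angles.size();
    if (n == 0) return 0.0f;

    float *dAngles = nullptr, *dDiff = nullptr;
    cudaMalloc(&dAngles, n * sizeof(float));
    cudaMalloc(&dDiff, n * sizeof(float));
    cudaMemcpy(dAngles, angles.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    sincosDiffKernel<<<blocks, threads>>>(dAngles, dDiff, n);

    std::vector<float> diff(n);
    cudaMemcpy(diff.data(), dDiff, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dAngles);
    cudaFree(dDiff);

    float worst = 0.0f;
    for (int i = 0; i < n; ++i)
        if (diff[i] > worst) worst = diff[i];
    return worst;
}
```

Passing the angles that the application really uses (in radians) gives a quick estimate of how far the fast results drift from the standard ones. The fast intrinsics are documented as most accurate for arguments close to zero, so large angles deserve particular attention.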

Beleys · April 15, 2008, 9:45am

Thanks for your advice,

Using the __sinf and __cosf (and __sincosf) functions instead of the classical sin and cos saved me 30% of the evaluation time without reducing my overall accuracy.

Thanks a lot

++ beleys
