error in modulo operation

Dear all:

I want to use a 2D grid and 2D blocks to index a 3D data set. For my method I need a modulo operation: given a positive integer denominator n, for any positive integer x, compute x = k * n + r with 0 <= r < n.

for example, n = 33

x = 1 ==> k = 0, r = 1

x = 33 ==> k = 1, r = 0

x = 35 ==> k = 1, r = 2

the CPU code is

[codebox]int k = (int) floor( ((float)x) / ((float)n) );
int r = x - k * n;[/codebox]

However, I found that some (x, n) pairs give incorrect results when computed on the GPU.

The test kernel on the GPU is

[codebox]/*
 * A[j] = floor( j / n )
 * numerator[j] = j
 * denominator[j] = n
 */
static __global__ void modulo( unsigned int *A, float *numerator, float *denominator,
    unsigned int n, unsigned int size )
{
    float tmp1, tmp2 ;
    unsigned int idx = blockIdx.x * BLOCK_DIM + threadIdx.x;

    if ( idx < size ){
        tmp1 = __uint2float_rz( idx ) ;
        numerator[idx] = tmp1 ;
        tmp2 = __uint2float_rz( n ) ;
        denominator[idx] = tmp2 ;
        tmp1 = floorf( tmp1 / tmp2 ) ;
        A[idx] = __float2uint_rz( tmp1 ) ;
    }
}[/codebox]

for example, when n = 33:

x = 33 ==> floor(x/n) = 0 on the GPU
x = 66 ==> floor(x/n) = 1 on the GPU
x = 132 ==> floor(x/n) = 3 on the GPU
x = 264 ==> floor(x/n) = 7 on the GPU

However, __uint2float_rz(x) itself gives the correct result (and of course the CPU version gives the correct result).

I found in the Programming Guide 2.3, page 120, that x/y has a maximum error of 2 ulp, so a safer coding style is

int k = (int) floor( ((float)x + 0.5) / ((float)n) );

However, I wonder why floor(x/x) is not 1 on the GPU when x = 33, 37, or 41.

any idea?

You mean that x/x is giving you something like 0.999999999, right?

What results do you get for 34/33 and 65/33?

I wrote another kernel to dump the intermediate results:

[codebox]/*
 * A_modulo[j] = floor( j / factor )
 * A[j] = j / factor
 */
static __global__ void modulo_v2( unsigned int *A_modulo, float *A,
    unsigned int factor, unsigned int N )
{
    float tmp1, tmp2 ;
    unsigned int idx = blockIdx.x * BLOCK_DIM + threadIdx.x;

    if ( idx < N ){
        tmp1 = __uint2float_rz( idx ) ;
        tmp2 = __uint2float_rz( factor ) ;
        tmp1 = tmp1 / tmp2 ;
        A[idx] = tmp1 ;
        tmp1 = floorf( tmp1 ) ;
        A_modulo[idx] = __float2uint_rz( tmp1 ) ;
    }
}[/codebox]

A_modulo[33] = 0, but floor(33/33) = 1

A[33] = 33/33 = 9.9999994e-001

A_modulo[34] = 1, equal to floor(34/33) = 1

A[34] = 34/33 = 1.0303030e+000

A_modulo[65] = 1, equal to floor(65/33) = 1

A[65] = 65/33 = 1.9696969e+000

A_modulo[66] = 1, but floor(66/33) = 2

A[66] = 66/33 = 1.9999999e+000

I think the results for 34/33 and 65/33 are good enough.

From the CUDA guide, Appendix A: "(floating point) division is implemented via the reciprocal in a non-standard-compliant way." This means you don't have a guaranteed 0-ulp result even when you're dividing x by x.

So you're getting some small roundoff error on the divide, and you're magnifying it by making the floor() function sensitive exactly at the transition where the value sits. A trivial 1-ulp error in the division will make your floor() results wrong.
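To make this concrete: floorf() flips from 1 to 0 when the quotient lands just one ulp below 1.0, which is exactly what the fast division produces for 33/33. A minimal host-side illustration:

[codebox]#include <math.h>
#include <stdio.h>

int main(void)
{
    float exact = 1.0f;                    /* the true quotient 33/33 */
    float off   = nextafterf(1.0f, 0.0f);  /* one ulp below: 0.99999994f */

    printf("floorf(%.8f) = %.0f\n", exact, floorf(exact)); /* prints 1 */
    printf("floorf(%.8f) = %.0f\n", off,   floorf(off));   /* prints 0 */
    return 0;
}[/codebox]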

Three solutions (sketched in code after this list):

  1. Make your floor() robust to roundoff by just making it a rounding. rint() is exactly designed for this. You could also use floor(x+0.5f), but that's not necessary.
  2. Tell CUDA you want IEEE division by using __fdiv_rn().
  3. Why cast to floats and back? Just keep the math as ints; then it's deterministic. Integer divides are slow in CUDA, but this looks like a once-per-thread computation, so it probably wouldn't be a big deal.
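A minimal device-side sketch of the three alternatives (assuming x < 2^24 so the float conversions are exact; option 1 is shown in the floor(x+0.5f) form that the thread adopts below):

[codebox]/* sketch only: three ways to get k = x / n for unsigned x, n > 0 */

/* 1. robust floor: nudge the numerator away from the k*n transition */
__device__ unsigned int quot_offset(unsigned int x, unsigned int n)
{
    return __float2uint_rz(
        floorf( (__uint2float_rz(x) + 0.5f) / __uint2float_rz(n) ) );
}

/* 2. IEEE round-to-nearest division instead of the fast reciprocal */
__device__ unsigned int quot_ieee(unsigned int x, unsigned int n)
{
    return __float2uint_rz(
        floorf( __fdiv_rn( __uint2float_rz(x), __uint2float_rz(n) ) ) );
}

/* 3. stay in integers: slower on this hardware, but deterministic */
__device__ unsigned int quot_int(unsigned int x, unsigned int n)
{
    return x / n;
}[/codebox]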

Don't forget that floats only have 23 bits of mantissa, while ints have 32 bits. If you convert a large int to a float you will lose some of the low-order bits, which will result in an incorrect modulo.
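To illustrate: 2^24 + 1 is the first integer a float cannot represent, so the int-to-float round trip already changes the value there:

[codebox]#include <stdio.h>

int main(void)
{
    unsigned int x = (1u << 24) + 1;  /* 16777217 needs 25 significant bits */
    float        f = (float)x;        /* rounds to 16777216.0f */
    unsigned int y = (unsigned int)f;

    printf("%u -> %.1f -> %u\n", x, f, y);  /* 16777217 -> 16777216.0 -> 16777216 */
    return 0;
}[/codebox]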

thank you, SPWorley. Regarding your three solutions:

  1. rint(x) returns the integral value nearest to x; this is what I want for the modulo operation

  2. floor of __fdiv_rn(x, y) is the same as the CPU version, good

  3. integer modulo also works

I wrote three versions to test 3-D data copy in double precision, say A(1:n1,1:n2,1:n3) → B(1:n1,1:n2,1:n3) via B(i,j,k) := A(i,j,k).

platform: WinXP Pro x64, VC2005, icpc 11.1.035 -O2, Tesla C1060, driver 190.38, CUDA 2.3

In my kernel code, I need to represent x = n3 * t1 + zIndex.

method 1: use floor( (x+0.5)/y)

[codebox]float tmp1, tmp2 ;
unsigned int t1, t2, zIndex ;
float eps = 0.5f ;

tmp1 = __uint2float_rz( x ) ;
tmp2 = __uint2float_rz( n3 ) ;
tmp1 = floorf( (tmp1 + eps) / tmp2 ) ;
t1 = __float2uint_rz( tmp1 ) ;
zIndex = x - n3*t1 ;[/codebox]

method 2: use x/y where x and y are integers

[codebox]unsigned int t1, t2, zIndex ;

t1 = x / n3 ;
zIndex = x - n3*t1 ;[/codebox]

method 3: use the IEEE division intrinsic __fdiv_rz(x, y)

[codebox]float tmp1, tmp2 ;
unsigned int t1, t2, zIndex ;

tmp1 = __uint2float_rz( x ) ;
tmp2 = __uint2float_rz( n3 ) ;
tmp1 = __fdiv_rz( tmp1, tmp2 ) ;
t1 = __float2uint_rz( tmp1 ) ;
zIndex = x - n3*t1 ;[/codebox]

the experimental result is

[codebox]
-------------+-------------+-----------------------+----------------------+----------------------+-----------------------+
 n1,n2,n3    | size of one | CPU (time, bandwidth) | GPU method 1         | GPU method 2         | GPU method 3          |
             | matrix      |                       | (time, bandwidth)    | (time, bandwidth)    | (time, bandwidth)     |
-------------+-------------+-----------------------+----------------------+----------------------+-----------------------+
 32,32,32    | 256 kB      | 0.214 ms, 2.28 GB/s   | 0.017 ms, 28.27 GB/s | 0.021 ms, 23.54 GB/s | 0.017 ms, 29.24 GB/s  |
 64,64,64    | 2 MB        | 1.437 ms, 2.72 GB/s   | 0.065 ms, 60.35 GB/s | 0.092 ms, 42.27 GB/s | 0.190 ms, 20.58 GB/s  |
 128,128,128 | 16 MB       | 10.825 ms, 2.89 GB/s  | 0.413 ms, 75.67 GB/s | 0.659 ms, 47.40 GB/s | 1.428 ms, 21.88 GB/s  |
 256,256,256 | 128 MB      | 80.349 ms, 3.11 GB/s  | 4.979 ms, 50.21 GB/s | 5.059 ms, 49.42 GB/s | 11.414 ms, 21.90 GB/s |
 512,512,512 | 1024 MB     | 1232.807 ms, 1.62 GB/s| 36.537 ms, 54.74 GB/s| 40.389 ms, 49.52 GB/s| 90.605 ms, 22.07 GB/s |
-------------+-------------+-----------------------+----------------------+----------------------+-----------------------+
[/codebox]

From the above table we conclude that method 1 is the best, then method 2, and method 3 is the worst.

We understand that method 2 is slow because of the integer division, but why is method 3 so slow, at only half the performance of method 2?

So I think that method 1 is good enough for me so far.

Oh, that’s ALL of your kernel? I thought it was just the start of a larger kernel.

In that case, I would expect you to be memory-bandwidth limited. Your timing measurements show you're not, though.

Since your value of n is constant, you can likely precompute 1.0/n and use that everywhere. Just pass 1.0/n in as another float kernel argument.

But in fact you can do even better. You can precompute (on the CPU) a nice one-integer-multiply version of division by a known constant.

This is a common trick used by compilers… it’s outlined in the great book Hacker’s Delight.
The basic idea is to find an integer constant c ~= 2^32/n and then use that as an integer multiply. The x/n result is then in the top bits and can be pulled out with just a shift.
There are lots of little details in the book, in a supplemental PDF file, and at the web site.
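A rough host-side sketch of the simplest variant, assuming x stays small enough that no correction add or shift is needed (Hacker's Delight gives the general recipe):

[codebox]#include <stdint.h>
#include <stdio.h>

/* simplest magic-number division: c = ceil(2^32/n), then x/n is the high
   32 bits of x*c; valid while x times the rounding error of c stays
   below 2^32/n (see the analysis later in this thread) */
int main(void)
{
    uint32_t n = 33;
    uint32_t c = (uint32_t)(( ((uint64_t)1 << 32) + n - 1 ) / n); /* ceil(2^32/n) */

    for (uint32_t x = 0; x < 1000000; ++x) {
        uint32_t q = (uint32_t)(((uint64_t)x * c) >> 32);  /* __umulhi(x, c) on the GPU */
        if (q != x / n) { printf("mismatch at x = %u\n", x); return 1; }
    }
    printf("c = %u reproduces x/%u over the tested range\n", c, n);
    return 0;
}[/codebox]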

thank you again, SPWorley. This is the first time I have heard of "integer division by constants"; it is an interesting topic.

As you said, I didn't post all of my kernel code. This is a 3-D data copy problem, just a model problem; I want to use it as a baseline and then try to optimize the 3-D data transpose problem, for example A(x,y,z) → B(y,z,x).

Remark: the SDK has a 2-D matrix transpose example, A(x,y) → B(y,x); I just use the same idea to do the 3-D data transpose. The point is how to transform a 2-D grid with 2-D thread blocks into an index of the 3-D data.

Now, following your suggestions, I did another experiment.

method 1: use floor((x+0.5)/y)

[codebox]static __global__ void copy_3Ddata( doublereal *dst, doublereal *src,
    unsigned int n1, unsigned int n2, unsigned int n3,
    unsigned int Gy, unsigned int Gz, unsigned int k2 )
{
    float eps = 0.5f ;
    float tmp1, tmp2 ;
    unsigned int t1, t2, xIndex, yIndex, zIndex, index_in ;
    unsigned int x = blockIdx.x * BLOCK_DIM_X + threadIdx.x;
    unsigned int y = blockIdx.y * BLOCK_DIM_Y + threadIdx.y;
    if( (x < Gz) && (y < Gy) ) {
        // step 1: transform grid index to 3-D data index
        //   x = n3 * t1 + zIndex
        //   y = n2 * t2 + yIndex
        // where (zIndex, yIndex): index into the z-y slice, (t1, t2): index in the x direction
        tmp1 = __uint2float_rz( x ) ;
        tmp2 = __uint2float_rz( n3 ) ;
        tmp1 = floorf( (tmp1 + eps) / tmp2 ) ;
        t1 = __float2uint_rz( tmp1 ) ;
        zIndex = x - n3*t1 ;

        tmp1 = __uint2float_rz( y ) ;
        tmp2 = __uint2float_rz( n2 ) ;
        tmp1 = floorf( (tmp1 + eps) / tmp2 ) ;
        t2 = __float2uint_rz( tmp1 ) ;
        yIndex = y - n2*t2 ;
        xIndex = t2 * k2 + t1 ;

        // (xIndex, yIndex, zIndex) = (i-1, j-1, k-1)
        // row-major(i,j,k) = (i-1)*n2*n3 + (j-1)*n3 + (k-1)
        if ( xIndex < n1 ){
            index_in = ( xIndex * n2 + yIndex )*n3 + zIndex ;
            dst[index_in] = src[index_in];
        }
    } // (x < Gz) && (y < Gy)
}[/codebox]

method 2: pre-compute 1/y in host code, then pass it into the kernel function

[codebox]static __global__ void copy_3Ddata( doublereal *dst, doublereal *src,
    unsigned int n1, unsigned int n2, unsigned int n3,
    unsigned int Gy, unsigned int Gz, unsigned int k2,
    float one_over_n2, float one_over_n3 )
{
    float tmp1 ;
    unsigned int t1, t2, xIndex, yIndex, zIndex, index_in ;
    unsigned int x = blockIdx.x * BLOCK_DIM_X + threadIdx.x;
    unsigned int y = blockIdx.y * BLOCK_DIM_Y + threadIdx.y;
    if( (x < Gz) && (y < Gy) ) {
        tmp1 = __uint2float_rz( x ) ;
        tmp1 = tmp1 * one_over_n3 ;
        t1 = __float2uint_rz( tmp1 ) ;
        zIndex = x - n3*t1 ;

        tmp1 = __uint2float_rz( y ) ;
        tmp1 = tmp1 * one_over_n2 ;
        t2 = __float2uint_rz( tmp1 ) ;
        yIndex = y - n2*t2 ;

        xIndex = t2 * k2 + t1 ;
        if ( xIndex < n1 ){
            index_in = ( xIndex * n2 + yIndex )*n3 + zIndex ;
            dst[index_in] = src[index_in];
        }
    } // (x < Gz) && (y < Gy)
}[/codebox]

the experimental result is

[codebox]
-------------+-------------+----------------------+----------------------+
 n1,n2,n3    | size of one | GPU method 1         | GPU method 2         |
             | matrix      | (time, bandwidth)    | (time, bandwidth)    |
-------------+-------------+----------------------+----------------------+
 32,32,32    | 256 kB      | 0.017 ms, 28.27 GB/s | 0.015 ms, 31.93 GB/s |
 64,64,64    | 2 MB        | 0.065 ms, 60.35 GB/s | 0.062 ms, 63.03 GB/s |
 128,128,128 | 16 MB       | 0.413 ms, 75.67 GB/s | 0.408 ms, 76.63 GB/s |
 256,256,256 | 128 MB      | 4.979 ms, 50.21 GB/s | 5.102 ms, 49.00 GB/s |
 512,512,512 | 1024 MB     | 36.537 ms, 54.74 GB/s| 37.268 ms, 53.67 GB/s|
-------------+-------------+----------------------+----------------------+
[/codebox]

From the above table, "pre-compute 1/y in host code" is not faster than "compute 1/y in the kernel". Why? I think this is a memory-bound problem; the cost of computing 1/y in the kernel is probably hidden by the memory latency.

On the other hand, for "integer division by constants" I tried divisor = 100 with the following formula:

[codebox]// compute n = 100 * qe + re, 0 <= re < 100
q = (n>>1) + (n>>3) + (n>>6) - (n>>10) +
    (n>>12) + (n>>13) - (n>>16) ;
q = q + (q >> 20) ;
q = q >> 6 ;
r = n - q*100 ;
qe = q + ((r+28) >> 7) ;
re = n - 100*qe ;[/codebox]
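Such shift-and-add sequences are easy to get subtly wrong, so a host-side sanity check against the plain operators is worthwhile; a sketch that samples the 32-bit range:

[codebox]#include <stdint.h>
#include <stdio.h>

/* check the divide-by-100 sequence against n/100 and n%100 */
int main(void)
{
    for (uint64_t i = 0; i <= 0xFFFFFFFFull; i += 9999) {
        uint32_t n = (uint32_t)i;
        uint32_t q = (n>>1) + (n>>3) + (n>>6) - (n>>10)
                   + (n>>12) + (n>>13) - (n>>16);
        q = q + (q >> 20);
        q = q >> 6;
        uint32_t r  = n - q*100;
        uint32_t qe = q + ((r + 28) >> 7);
        uint32_t re = n - 100*qe;
        if (qe != n/100 || re != n%100) {
            printf("mismatch at n = %u\n", n);
            return 1;
        }
    }
    printf("all sampled values agree\n");
    return 0;
}[/codebox]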

I call this method 3. "divisor = 100" corresponds to my problem with n1 = n2 = n3 = 100; the result is

[codebox]
-------------+-------------+----------------------+----------------------+
 n1,n2,n3    | size of one | GPU method 1         | GPU method 3         |
             | matrix      | (time, bandwidth)    | (time, bandwidth)    |
-------------+-------------+----------------------+----------------------+
 100,100,100 | 7.6 MB      | 0.220 ms, 67.89 GB/s | 0.315 ms, 47.38 GB/s |
-------------+-------------+----------------------+----------------------+
[/codebox]

From the above result, I think that directly computing floor(n/100) is still faster than "division by constants" for the 3-D matrix copy.

conclusion: I will try another plan to map a 2-D grid with 2-D thread blocks onto the index of the 3-D data, and then try the above methods again.

I absolutely admire the fact that you are trying and timing every option!
That’s the kind of hands-on experimentation that answers questions.

Your method 3 test is not conclusive, though. You can’t trust any sub-millisecond timing to be accurate.
I admit that I don’t even follow your implementation of method 3… but that’s not a criticism since the whole technique is intricate and hard to follow!

You might try this block-wise test harness to get clock-level accurate measurements. That code can time arbitrary user functions too… though you have a complication in that you’re computing both the division and the remainder.

Dear SPWorley: thanks for your opinion.

  1. "Your method 3 test is not conclusive, though. You can't trust any sub-millisecond timing to be accurate."

    In my experiment, I do the time profiling in two steps (sketched with CUDA events after this list):

    step 1: execute the kernel function 200 times to remove the warm-up time of the GPU

    step 2: execute the kernel function 200 times again, use cutStartTimer(timer) and cutStopTimer(timer) to record the total execution time, then compute the average time

  2. "You might try this block-wise test harness to get clock-level accurate measurements"

    I saw your post "Measurements of different CUDA operator throughputs" and found two links:

cuda_op_throughput.zip ( 3.32K )

timings.txt ( 4.76K )

However, cuda_op_throughput.zip only contains one file, timeop.cu, and I cannot find the include file "timeop.h". I think I am missing something; could you give me an explicit link to the source code?
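An equivalent sketch of that two-pass averaging using the runtime's event timers instead of the cutil timer (kernel name, launch configuration, and arguments are placeholders):

[codebox]cudaEvent_t start, stop;
float total_ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

for (int i = 0; i < 200; i++)          // step 1: warm-up pass
    my_kernel<<<grid, block>>>(/* args */);
cudaThreadSynchronize();

cudaEventRecord(start, 0);
for (int i = 0; i < 200; i++)          // step 2: timed pass
    my_kernel<<<grid, block>>>(/* args */);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

cudaEventElapsedTime(&total_ms, start, stop);  // total time in ms
printf("average kernel time = %f ms\n", total_ms / 200.0f);[/codebox]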

Besides, I must correct my approach to computing the modulo operation x = k * n + r, 0 <= r < n.

My original idea was to use float to compute the quotient k; the CPU code is

[codebox]int k = (int) floor( ((float)x) / ((float)n) );
int r = x - k * n;[/codebox]

I now call the GPU float version below method 1.

[codebox]float tmp1 ;

tmp1 = __uint2float_rz( x ) ;
tmp1 = tmp1 * one_over_n ; // one_over_n = 1/n is computed in host code
k = __float2uint_rz( tmp1 ) ;
r = x - n*k ;[/codebox]

However, the dividend x may be very large in my 3-D data, for example x = 512^3, and shifter1 reminded me: "Don't forget that floats only have 23 bits of accuracy, while ints have 32 bits of accuracy. If you convert from a float to an int you will lose some of the low order bits which will result in an incorrect modulo."

Hence I must use "double" to compute the quotient without losing digits. I call this method 2.

method 2:

[codebox]double tmp1 ; // must be double, not float, or the extra precision is lost again

tmp1 = __uint2double_rn( x ) ;
tmp1 = tmp1 * one_over_n ;  // one_over_n = (1 + 1.E-12)/n, a double computed in host code
k = __double2uint_rz( tmp1 ) ;
r = x - n*k ;[/codebox]

Third, in order to find the upper bound on the bandwidth, I tried the special dimensions N = 32, 64, 128, 256, 512, so that I can use a shift operation to do the division. That is the fastest way I can expect; I call it method 3.

method 3:

[codebox]k = x >> n_exp ; // n = 2^n_exp
r = x - n*k ;[/codebox]

For 3-D data copy (the data type is "double"), the experimental result is

[codebox]
-------------+-------------+----------------------+----------------------+----------------------+
 n1,n2,n3    | size of one | GPU method 1         | GPU method 2         | GPU method 3         |
             | matrix      | (time, bandwidth)    | (time, bandwidth)    | (time, bandwidth)    |
-------------+-------------+----------------------+----------------------+----------------------+
 32,32,32    | 256 kB      | 0.012 ms, 40.78 GB/s | 0.013 ms, 38.55 GB/s | 0.012 ms, 41.53 GB/s |
 64,64,64    | 2 MB        | 0.057 ms, 69.06 GB/s | 0.058 ms, 67.40 GB/s | 0.056 ms, 69.14 GB/s |
 128,128,128 | 16 MB       | 0.415 ms, 75.25 GB/s | 0.417 ms, 74.93 GB/s | 0.415 ms, 75.25 GB/s |
 256,256,256 | 128 MB      | 3.288 ms, 76.03 GB/s | 3.293 ms, 75.92 GB/s | 3.289 ms, 76.02 GB/s |
 512,512,512 | 1024 MB     | 26.285 ms, 76.09 GB/s| 26.313 ms, 76.01 GB/s| 26.285 ms, 76.09 GB/s|
-------------+-------------+----------------------+----------------------+----------------------+
[/codebox]

under the platform WinXP Pro x64, VC2005, Tesla C1060, driver 190.38, CUDA 2.3.

From the table, all three methods attain the practical maximum bandwidth (from bandwidthTest.exe, my Tesla C1060 only reaches about 74 GB/s for pageable memory), so method 3 (the fastest one) gives no further speedup.

The 3-D data copy is too simple to distinguish the three methods; the overhead of the modulo operation is hidden by the memory latency.

However, I believe the difference among the three methods might become significant for the transpose operation; so far I can only reach 16~30 GB/s on the 3-D data transpose, say A(x,y,z) → B(y,z,x).

question: "You might try this block-wise test harness to get clock-level accurate measurements.

That code can time arbitrary user functions too"

now I use “block-wise test harness” to test several types of modulo operation.

in order to profile modulo operation, I add proto-type for “unsigned int” in timeop.h

[codebox]#define UIDEF(num, func) \
struct uitest_ ## num { \
    __device__ static unsigned int op(unsigned int &a, unsigned int &b) \
        { return func; } \
    const char *describe() { return #func ; } \
}

#define UITEST(num) \
    uikernel<uitest_ ## num, EVALLOOPS> <<<1,192>>>(12345, d_result); \
    cudaMemcpy( &h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost); \
    printf("UI" #num " %4.1lf  %s\n", \
        8*(h_result-ibase+1.0)/(192*EVALLOOPS*UNROLLCOUNT), \
        uitest_ ## num ().describe());[/codebox]

and

[codebox]template <class functor, unsigned int LOOPS>
__global__ void uikernel(int initValue, int *cu_result)
{
    unsigned int arg1 = initValue; // unknown at compile time so it can't be constant-folded
    unsigned int arg2 = arg1 + 7353251;
    unsigned int accumulator = 0;
    int start = clock();
    for (unsigned int i = 0; i < LOOPS; i++)
#pragma unroll 99999
        for (int j = 0; j < UNROLLCOUNT/4; j++) {
            arg1 = arg1^arg2; arg2 += arg1;
            accumulator ^= functor().op(arg1, arg2);
            arg1 = arg1^arg2; arg2 += arg1;
            accumulator &= functor().op(arg1, arg2);
            arg1 = arg1^arg2; arg2 += arg1;
            accumulator ^= functor().op(arg1, arg2);
            arg1 = arg1^arg2; arg2 += arg1;
            accumulator &= functor().op(arg1, arg2);
        }
    if (threadIdx.x == 0) {
        *cu_result = clock() - start;
        *(cu_result+1) = accumulator + arg1 + arg2; // unused, but now the compiler can't optimize it away
    }
}[/codebox]
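Presumably the harness is then driven like this (a hypothetical usage sketch inferred from the macros above, not taken from timeop.cu):

[codebox]// declare the timing cases: UIDEF builds the functor, UITEST runs it
UIDEF(1, a % b);   // hardware unsigned modulo, as a baseline
UIDEF(2, a / b);   // hardware unsigned division

// in main(), after ibase has been measured with an empty-loop calibration:
UITEST(1);
UITEST(2);[/codebox]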

Also, I tested four methods to verify the modulo operation x = k*n + r, 0 <= r < n.

method 1: use "double" to do the modulo, say k = floor(x/n); this is the general form

[codebox]__device__ unsigned int db_modulo(unsigned int &x )
{
    double tmp1 ;
    unsigned int k, r ;
    unsigned int n = 256 ;
    double one_over_n = 0.00390625 ; // 1/n

    tmp1 = __uint2double_rn( x ) ;
    tmp1 = tmp1 * one_over_n ;
    k = __double2uint_rz( tmp1 ) ;
    r = x - n*k ;
    return r ;
}[/codebox]

method 2: use "float" to do the modulo; this only works for small x and n

[codebox]__device__ unsigned int float_modulo(unsigned int &x )
{
    float tmp1 ;
    unsigned int k, r ;
    unsigned int n = 256 ;
    float one_over_n = 0.00390625f ; // 1/n

    tmp1 = __uint2float_rz( x ) ;
    tmp1 = tmp1 * one_over_n ;
    k = __float2uint_rz( tmp1 ) ;
    r = x - n*k ;
    return r ;
}[/codebox]

method 3: use a shift to do the division, only for n a power of 2; this is the fastest method

[codebox]__device__ unsigned int shift_modulo(unsigned int &x )
{
    unsigned int k, r ;
    unsigned int n = 256 ;
    unsigned int n_bit = 8 ; // n = 2^n_bit

    k = x >> n_bit ;
    r = x - n*k ;
    return r ;
}[/codebox]

method 4: use integer division, with the formula for n = 100

[codebox]__device__ unsigned int int_modulo_100(unsigned int &n )
{
    unsigned int qe, re ;
    unsigned int q, r ;

    q = (n>>1) + (n>>3) + (n>>6) - (n>>10) + (n>>12) + (n>>13) - (n>>16) ;
    q = q + (q >> 20) ;
    q = q >> 6 ;
    r = n - q*100 ;
    qe = q + ((r+28) >> 7) ;
    re = n - 100*qe ;
    return re ;
}[/codebox]

The experiment on a Tesla C1060 gives the result

[codebox]
-----------+-----------+-----------+-----------+
 method 1  | method 2  | method 3  | method 4  |
-----------+-----------+-----------+-----------+
 24.7      | 5.0       | 2.0       | 40.9      |
-----------+-----------+-----------+-----------+
[/codebox]

conclusion:

  1. method 3 is the fastest method, as we expected

  2. method 2 is 5x faster than method 1; this is also reasonable, since 8 SPs share one 64-bit FMAD unit

  3. method 4 is the slowest; I think method 4 has 10 serial shift operations, and that is why it comes out slowest

Hence this explains the previous post: integer division is not fast for general n.

Great experiments.

Did you also try the fixed-point multiply method that Steve proposed?

It works with non-power-of-two divisors and is more accurate (but a bit slower) than the float multiplication method:

[codebox]__device__ unsigned int fixedpoint_modulo(unsigned int &x )
{
    unsigned int k, r ;
    unsigned int n = 256 ;
    unsigned int one_over_n = 0x01000000; // 1/n * 2^32

    k = __umulhi(x, one_over_n);
    r = x - n*k ;
    return r ;
}[/codebox]

Thanks Sylvain; the fixed-point method is better than the previous integer-division formula for n = 100.

First I define your fixed-point method as method 5:

[codebox]__device__ unsigned int fixedpoint_modulo(unsigned int &x )
{
    unsigned int k, r ;
    unsigned int n = 256 ;
    unsigned int one_over_n = 0x01000000; // 1/n * 2^32

    k = __umulhi(x, one_over_n);
    r = x - n*k ;
    return r ;
}[/codebox]

As far as I understand, the fixed-point method solves x = k*n + r, 0 <= r < n, by the following steps:

step 1: compute the reciprocal of the divisor, writing 2^32/n = ceil(2^32/n) - e with 0 <= e < 1 (so 0 <= n*e <= n-1); define one_over_n = ceil(2^32/n) and pass it into the kernel (for example, for n = 100, one_over_n = ceil(2^32/100) = 42949673)

step 2: multiply x (32-bit) by one_over_n (32-bit) to get a 64-bit result and extract the most significant 32 bits:

__umulhi(x, one_over_n) = floor( (x/2^32)*(2^32/n + e) ) = floor( x/n + x*e/2^32 )

Hence if we want k = __umulhi(x, one_over_n) = floor(x/n), we must require x*e/2^32 < 1/n.

problem: if the pair (x,n) satisfies, say, x*n*e <= 2^31, then it is O.K. to use the fixed-point method; however, in my problem x can be large, for example x = n^3 = 256^3.

In order to keep the condition x*e/2^32 < 1/n, I need another variant of the fixed-point method: fixed point with a p-bit shift.

Suppose the divisor satisfies 2^p < n <= 2^(p+1); we use the same idea as the fixed-point method:

step 1: write 2^(32+p)/n = ceil(2^(32+p)/n) - e, 0 <= e < 1 (so 0 <= n*e <= n-1), and define one_over_n = ceil(2^(32+p)/n)

step 2: find k = floor(x/n) in two steps:

k_hat = __umulhi(x, one_over_n) ;
k = k_hat >> p ;

However, k_hat >> p = floor( (x/2^(32+p))*(2^(32+p)/n + e) ) = floor( x/n + x*e/2^(32+p) ), so if we want k = floor(x/n), then x*e/2^(32+p) < 1/n is necessary; in other words (using e < 1), x*(n/2^p) < 2^32 is sufficient.

This condition relaxes the bound on x, so x can be larger.
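A host-side check can emulate __umulhi with a 64-bit multiply to verify the construction (a sketch; here n = 100, so p = 6, and x runs up to 512^3 as in my data):

[codebox]#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t n = 100;
    uint32_t p = 0;
    while ((1u << (p+1)) < n) p++;     /* 2^p < n <= 2^(p+1), so p = 6 */

    uint32_t one_over_n =
        (uint32_t)(( ((uint64_t)1 << (32+p)) + n - 1 ) / n);  /* ceil(2^(32+p)/n) */

    for (uint64_t x = 0; x <= 512ull*512ull*512ull; x++) {
        uint32_t k_hat = (uint32_t)((x * one_over_n) >> 32);  /* __umulhi on the GPU */
        uint32_t k     = k_hat >> p;
        if (k != (uint32_t)(x / n)) {
            printf("fails at x = %llu\n", (unsigned long long)x);
            return 1;
        }
    }
    printf("p = %u, one_over_n = %u verified up to 512^3\n", p, one_over_n);
    return 0;
}[/codebox]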

I call "fixed point with a p-bit shift" method 6.

method 6: p = 7

[codebox]__device__ unsigned int fixedpoint_modulo_v2(unsigned int &x )
{
    unsigned int k, r ;
    unsigned int n = 256 ;
    unsigned int one_over_n = 0x80000000; // ceil(2^(32+7)/n) = 2^39/256

    k = __umulhi(x, one_over_n);
    k = k >> 7 ;
    r = x - n*k ;
    return r ;
}[/codebox]

The experiment on a Tesla C1060, with

method 1: use "double" to do the modulo
method 2: use "float" to do the modulo
method 3: use a shift to do the modulo, only feasible when n is a power of 2
method 4: use integer division for n = 100
method 5: fixed-point method without shift
method 6: fixed-point method with shift

gives the result:

[codebox]
-----------+-----------+-----------+-----------+-----------+-----------+
 method 1  | method 2  | method 3  | method 4  | method 5  | method 6  |
-----------+-----------+-----------+-----------+-----------+-----------+
 24.7      | 5.0       | 2.0       | 40.9      | 11.5      | 13.2      |
-----------+-----------+-----------+-----------+-----------+-----------+
[/codebox]

From the above table, the performance order is method 5 > method 6 > method 1. This means the fixed-point method is better.

Next I show two examples on a Tesla C1060.

Example 1: given 3-D data real X(1:n1, 1:n2, 1:n3), do zero extension to real Y(1:n1, 1:n2, 0:2*n3+1) by

Y(1:n1, 1:n2, 0) = 0
Y(1:n1, 1:n2, 1:n3) = X(1:n1, 1:n2, 1:n3)
Y(1:n1, 1:n2, n3+1:2*n3+1) = 0

method 1: use "double" to do the modulo
method 5: fixed-point method without shift
method 6: fixed-point method with shift

[codebox]
-------------+-------------+----------------------+----------------------+----------------------+
 n1,n2,n3    | device      | GPU method 1         | GPU method 5         | GPU method 6         |
             | memory      | (time, bandwidth)    | (time, bandwidth)    | (time, bandwidth)    |
-------------+-------------+----------------------+----------------------+----------------------+
 32,32,32    | 784 kB      | 0.025 ms, 29.50 GB/s | 0.022 ms, 33.60 GB/s | 0.022 ms, 33.29 GB/s |
 64,64,64    | 6.1 MB      | 0.111 ms, 53.42 GB/s | 0.103 ms, 57.24 GB/s | 0.104 ms, 57.05 GB/s |
 128,128,128 | 48.3 MB     | 0.883 ms, 53.35 GB/s | 0.825 ms, 57.09 GB/s | 0.829 ms, 56.83 GB/s |
 256,256,256 | 385 MB      | 6.161 ms, 61.02 GB/s | 5.739 ms, 65.51 GB/s | 5.740 ms, 65.50 GB/s |
 512,512,512 | 3076 MB     | 47.180 ms, 63.67 GB/s| 45.726 ms, 65.69 GB/s| 45.766 ms, 65.64 GB/s|
-------------+-------------+----------------------+----------------------+----------------------+
[/codebox]

conclusion:

  1. method 6 is almost the same as method 5; this is reasonable, since method 6 only has one more shift operation than method 5

  2. method 6 is better than method 1, but there is no significant speedup, because of the memory latency

In fact, method 1 is good enough, since the bandwidth of a Tesla C1060 is about 74 GB/s according to bandwidthTest.exe in the SDK.

Example 2: given 3-D data real X(1:n1, 1:n2, 1:n3), do the transpose operation to Y(1:n2, 1:n3, 1:n1) by

Y(j,k,i) ← X(i,j,k)

method 2: use "float" to do the modulo
method 5: fixed-point method without shift

[codebox]
-------------+-----------------------+-----------------------+
 n1,n2,n3    | GPU method 2          | GPU method 5          |
             | (time, bandwidth)     | (time, bandwidth)     |
-------------+-----------------------+-----------------------+
 63,63,63    | 0.092 ms, 40.63 GB/s  | 0.092 ms, 40.47 GB/s  |
 64,64,64    | 0.092 ms, 42.62 GB/s  | 0.092 ms, 42.66 GB/s  |
 127,127,127 | 0.654 ms, 46.70 GB/s  | 0.636 ms, 47.98 GB/s  |
 128,128,128 | 1.015 ms, 30.77 GB/s  | 1.018 ms, 30.70 GB/s  |
 255,255,255 | 5.973 ms, 41.36 GB/s  | 5.955 ms, 41.49 GB/s  |
 256,256,256 | 15.789 ms, 15.83 GB/s | 15.776 ms, 15.85 GB/s |
 511,511,511 | 40.983 ms, 48.51 GB/s | 41.807 ms, 47.56 GB/s |
 512,512,512 | 127.575 ms, 15.68 GB/s| 127.547 ms, 15.68 GB/s|
-------------+-----------------------+-----------------------+
[/codebox]

conclusion: although "modulo by float" is 2x faster than "modulo by fixed point", in this example they have the same performance. I think the latency between global memory and shared memory dominates, so the modulo operations are hidden.

What is interesting: if one looks at the IEEE standard, then

  1. "float" has a 23-bit mantissa, so the critical path of a multiplication of floats should be 23 bits wide

  2. "double" has a 52-bit mantissa, so the critical path of a multiplication of doubles should be 52 bits wide

  3. the critical path of a multiplication of 32-bit integers should be 64 bits wide

so at first glance one may wonder why the time of method 5 is only half the time of method 1.

Finally, I must offer my heartfelt thanks to you; I have learned a lot about integer division.