Thanks Sylvain, the fixed-point method is better than the previous integer-division formula for n = 100.
First I define your fixed-point method as method 5:
[codebox]__device__ unsigned int fixedpoint_modulo(unsigned int x)
{
    unsigned int k, r;
    unsigned int n = 256;
    unsigned int one_over_n = 0x01000000; // ceil(2^32 / n) = 2^24
    k = __umulhi(x, one_over_n);          // k = floor(x / n) when the error bound holds
    r = x - n * k;
    return r;
}[/codebox]
As far as I understand, the fixed-point method solves x = k*n + r, 0 <= r < n, by the following steps:
step 1: compute the reciprocal of the divisor, 2^32/n = ceil(2^32/n) - e, where 0 <= e < 1 and 0 <= n*e <= n-1,
define one_over_n = ceil(2^32/n), then pass it into the kernel.
step 2: multiply x (32-bit) by one_over_n (32-bit) into a 64-bit result and extract the most significant 32 bits:
__umulhi(x, one_over_n) = floor( (x/2^32)*(2^32/n + e) ) = floor( x/n + x*e/2^32 )
Hence if we want k = __umulhi(x, one_over_n) = floor(x/n), we must require x*e/2^32 < 1/n.
problem: this condition is equivalent to x*n*e < 2^32; as long as (x,n) satisfies it, the fixed-point method is O.K.
However in my problem x can be large, for example x = n^3 = 256^3.
In order to keep the condition x*e/2^32 < 1/n with large x, I need a variant of the fixed-point method:
fixed-point with a p-bit shift.
Suppose the divisor satisfies 2^p < n <= 2^(p+1); we use the same idea as the fixed-point method.
step 1: 2^(32+p)/n = ceil(2^(32+p)/n) - e, where 0 <= e < 1 and 0 <= n*e <= n-1;
define one_over_n = ceil(2^(32+p)/n)
step 2: find k = floor(x/n) in two steps:
k_hat = __umulhi(x, one_over_n) ;
k = k_hat >> p ;
Here k_hat >> p = floor( (x/2^(32+p))*(2^(32+p)/n + e) ) = floor( x/n + x*e/2^(32+p) ),
so k = floor(x/n) requires x*e/2^(32+p) < 1/n.
Since e < 1, it suffices that x*(n/2^p) < 2^32.
This condition relaxes the admissible range of x, i.e. x can be larger.
I call “fixed-point with p-bit shift” method 6.
method 6: p = 7 (since 2^7 < 256 <= 2^8)
[codebox]__device__ unsigned int fixedpoint_modulo_v2(unsigned int x)
{
    unsigned int k, r;
    unsigned int n = 256;
    unsigned int one_over_n = 0x80000000; // ceil(2^(32+7) / n) = 2^31
    k = __umulhi(x, one_over_n);
    k = k >> 7;                           // p = 7
    r = x - n * k;
    return r;
}[/codebox]
The experiment on Tesla C1060 gives:
method 1: use “double” to do modulo
method 2: use “float” to do modulo
method 3: use shift to do modulo, only feasible when n is a power of 2
method 4: use integer division for n = 100
method 5: fixed-point method without shift
method 6: fixed-point method with shift
[codebox]+------------+------------+-----------+------------+------------+-----------+
|  method 1  |  method 2  |  method 3 |  method 4  |  method 5  |  method 6 |
+------------+------------+-----------+------------+------------+-----------+
|    24.7    |     5.0    |    2.0    |    40.9    |    11.5    |    13.2   |
+------------+------------+-----------+------------+------------+-----------+[/codebox]
From the table above, the performance order is: method 5 > method 6 > method 1.
This means the fixed-point method is better.
Next I show two examples on Tesla C1060.
Example 1: given 3-D data real X(1:n1, 1:n2, 1:n3), do zero extension to real Y(1:n1, 1:n2, 0:2*n3+1) by
Y(1:n1, 1:n2, 0) = 0
Y(1:n1, 1:n2, 1:n3) = X(1:n1, 1:n2, 1:n3)
Y(1:n1, 1:n2, n3+1:2*n3+1) = 0
method 1: use “double” to do modulo
method 5: fixed-point method without shift
method 6: fixed-point method with shift
[codebox]+-------------+----------+----------------------+----------------------+----------------------+
| n1,n2,n3    | device   | GPU method 1         | GPU method 5         | GPU method 6         |
|             | memory   | (time, bandwidth)    | (time, bandwidth)    | (time, bandwidth)    |
+-------------+----------+----------------------+----------------------+----------------------+
| 32,32,32    | 784 kB   | 0.025 ms, 29.50 GB/s | 0.022 ms, 33.60 GB/s | 0.022 ms, 33.29 GB/s |
+-------------+----------+----------------------+----------------------+----------------------+
| 64,64,64    | 6.1 MB   | 0.111 ms, 53.42 GB/s | 0.103 ms, 57.24 GB/s | 0.104 ms, 57.05 GB/s |
+-------------+----------+----------------------+----------------------+----------------------+
| 128,128,128 | 48.3 MB  | 0.883 ms, 53.35 GB/s | 0.825 ms, 57.09 GB/s | 0.829 ms, 56.83 GB/s |
+-------------+----------+----------------------+----------------------+----------------------+
| 256,256,256 | 385 MB   | 6.161 ms, 61.02 GB/s | 5.739 ms, 65.51 GB/s | 5.740 ms, 65.50 GB/s |
+-------------+----------+----------------------+----------------------+----------------------+
| 512,512,512 | 3076 MB  | 47.180 ms, 63.67 GB/s| 45.726 ms, 65.69 GB/s| 45.766 ms, 65.64 GB/s|
+-------------+----------+----------------------+----------------------+----------------------+[/codebox]
conclusion:
-
method 6 is almost the same as method 5; this is reasonable, since method 6 only has one more shift operation
than method 5
-
method 6 is better than method 1, but with no significant speedup, because memory latency dominates.
In fact, method 1 is good enough, since the bandwidth of Tesla C1060 is about 74 GB/s from bandwidthTest.exe in the SDK.
Example 2: given 3-D data real X(1:n1, 1:n2, 1:n3), do a transpose operation to Y(1:n2, 1:n3, 1:n1) by
Y(j,k,i) <-- X(i,j,k)
method 2: use “float” to do modulo
method 5: fixed-point method without shift
[codebox]+-------------+-----------------------+-----------------------+
| n1,n2,n3    | GPU method 2          | GPU method 5          |
|             | (time, bandwidth)     | (time, bandwidth)     |
+-------------+-----------------------+-----------------------+
| 63,63,63    | 0.092 ms, 40.63 GB/s  | 0.092 ms, 40.47 GB/s  |
+-------------+-----------------------+-----------------------+
| 64,64,64    | 0.092 ms, 42.62 GB/s  | 0.092 ms, 42.66 GB/s  |
+-------------+-----------------------+-----------------------+
| 127,127,127 | 0.654 ms, 46.70 GB/s  | 0.636 ms, 47.98 GB/s  |
+-------------+-----------------------+-----------------------+
| 128,128,128 | 1.015 ms, 30.77 GB/s  | 1.018 ms, 30.70 GB/s  |
+-------------+-----------------------+-----------------------+
| 255,255,255 | 5.973 ms, 41.36 GB/s  | 5.955 ms, 41.49 GB/s  |
+-------------+-----------------------+-----------------------+
| 256,256,256 | 15.789 ms, 15.83 GB/s | 15.776 ms, 15.85 GB/s |
+-------------+-----------------------+-----------------------+
| 511,511,511 | 40.983 ms, 48.51 GB/s | 41.807 ms, 47.56 GB/s |
+-------------+-----------------------+-----------------------+
| 512,512,512 |127.575 ms, 15.68 GB/s |127.547 ms, 15.68 GB/s |
+-------------+-----------------------+-----------------------+[/codebox]
conclusion: although “modulo by float” is 2x faster than “modulo by fixed-point”, in this
example they have the same performance. I think the latency between global memory and shared memory
dominates, such that the modulo operations are hidden.
What is interesting:
if one looks at the IEEE 754 standard, then
-
“float” has a 24-bit significand (23 stored bits), so the critical path of a “float” multiplication should be about 24 bits
-
“double” has a 53-bit significand (52 stored bits), so the critical path of a “double” multiplication should be about 53 bits
-
the critical path of a 32-bit integer multiplication (with a 64-bit product) should be 64 bits
At first glance, one may therefore wonder why the time of method 5 is only 1/2 the time of method 1.
Finally, I must offer my heartfelt thanks to you; I have learned a lot about integer division.