Thanks Sylvain, the fixed-point method is better than the previous integer-division formula for n=100.
First I define your fixed-point method as method 5:
[codebox]__device__ unsigned int fixedpoint_modulo(unsigned int &x)
{
    unsigned int k, r;
    unsigned int n = 256;
    unsigned int one_over_n = 0x01000000; // ceil(2^32 / n) = 2^24 for n = 256
    k = __umulhi(x, one_over_n);          // upper 32 bits of x * one_over_n
    r = x - n*k;
    return r;
}[/codebox]
The fixed-point method solves x = k*n + r, 0 <= r < n, by the following steps:
step 1: compute the reciprocal of the divisor, say 2^32/n = ceil(2^32/n) - e, 0 <= e < 1, 0 <= n*e <= n-1,
and define one_over_n = ceil(2^32/n), then pass it into the kernel.
step 2: multiply x (32-bit) by one_over_n (32-bit) into a 64-bit result and extract the most significant 32 bits:
__umulhi(x, one_over_n) = floor( (x/2^32)*(2^32/n + e) ) = floor( x/n + x*e/2^32 )
Hence if we want k = __umulhi(x, one_over_n) = floor(x/n), we must require x*e/2^32 < 1/n.
Problem: this bounds x. For example, if (x,n) satisfies x*n*e <= 2^31, then it is O.K. to use the fixed-point method.
However, in my problem x can be large, for example x = n^3 = 256^3.
In order to keep the condition x*e/2^32 < 1/n, I need another variant of the fixed-point method, say
“fixed-point with p-bit shift”.
Suppose the divisor satisfies 2^p < n <= 2^(p+1); we use the same idea as the fixed-point method.
step 1: 2^(32+p)/n = ceil(2^(32+p)/n) - e, 0 <= e < 1, 0 <= n*e <= n-1,
and define one_over_n = ceil(2^(32+p)/n).
step 2: find k = floor(x/n) in two steps:
k_hat = __umulhi(x, one_over_n);
k = k_hat >> p;
However, k_hat >> p = floor( (x/2^(32+p))*(2^(32+p)/n + e) ) = floor( x/n + x*e/2^(32+p) ),
so if we ask k = floor(x/n), then x*e/2^(32+p) < 1/n is required.
In other words (using e < 1), x*(n/2^p) < 2^32 suffices.
This condition relaxes the bound on x, i.e., x can be larger.
I call “fixed-point with p-bit shift” method 6.
Method 6: n = 256, so p = 7 (since 2^7 < 256 <= 2^8):
[codebox]__device__ unsigned int fixedpoint_modulo_v2(unsigned int &x)
{
    unsigned int k, r;
    unsigned int n = 256;
    unsigned int one_over_n = 0x80000000; // ceil(2^(32+7) / n) = 2^31 for n = 256
    k = __umulhi(x, one_over_n);
    k = k >> 7;
    r = x - n*k;
    return r;
}[/codebox]
Experiment on a Tesla C1060; the result is
method 1: use “double” to do modulo
method 2: use “float” to do modulo
method 3: use shift to do modulo, only feasible when n is a power of 2
method 4: use integer division for n=100
method 5: fixed-point method without shift
method 6: fixed-point method with shift
[codebox]+----------+----------+----------+----------+----------+----------+
| method 1 | method 2 | method 3 | method 4 | method 5 | method 6 |
+----------+----------+----------+----------+----------+----------+
|   24.7   |    5.0   |    2.0   |   40.9   |   11.5   |   13.2   |
+----------+----------+----------+----------+----------+----------+[/codebox]
From the table above, the performance order is method 5 > method 6 > method 1.
This means the fixed-point method is better.
Next I show two examples on a Tesla C1060.
Example 1: given 3D real data X(1:n1, 1:n2, 1:n3), do zero extension to real Y(1:n1, 1:n2, 0:2*n3+1) by
Y(1:n1, 1:n2, 0) = 0
Y(1:n1, 1:n2, 1:n3) = X(1:n1, 1:n2, 1:n3)
Y(1:n1, 1:n2, n3+1:2*n3+1) = 0
method 1: use “double” to do modulo
method 5: fixed-point method without shift
method 6: fixed-point method with shift
[codebox]+-------------+---------+-----------------------+-----------------------+-----------------------+
| n1,n2,n3    | device  | GPU method 1          | GPU method 5          | GPU method 6          |
|             | memory  | (time, bandwidth)     | (time, bandwidth)     | (time, bandwidth)     |
+-------------+---------+-----------------------+-----------------------+-----------------------+
| 32,32,32    | 784 kB  | 0.025 ms, 29.50 GB/s  | 0.022 ms, 33.60 GB/s  | 0.022 ms, 33.29 GB/s  |
+-------------+---------+-----------------------+-----------------------+-----------------------+
| 64,64,64    | 6.1 MB  | 0.111 ms, 53.42 GB/s  | 0.103 ms, 57.24 GB/s  | 0.104 ms, 57.05 GB/s  |
+-------------+---------+-----------------------+-----------------------+-----------------------+
| 128,128,128 | 48.3 MB | 0.883 ms, 53.35 GB/s  | 0.825 ms, 57.09 GB/s  | 0.829 ms, 56.83 GB/s  |
+-------------+---------+-----------------------+-----------------------+-----------------------+
| 256,256,256 | 385 MB  | 6.161 ms, 61.02 GB/s  | 5.739 ms, 65.51 GB/s  | 5.740 ms, 65.50 GB/s  |
+-------------+---------+-----------------------+-----------------------+-----------------------+
| 512,512,512 | 3076 MB | 47.180 ms, 63.67 GB/s | 45.726 ms, 65.69 GB/s | 45.766 ms, 65.64 GB/s |
+-------------+---------+-----------------------+-----------------------+-----------------------+[/codebox]
Conclusion:

- method 6 is almost the same as method 5; this is reasonable, since method 6 only has one more shift operation than method 5.

- method 6 is better than method 1, but there is no significant speedup, due to memory latency.
In fact, method 1 is good enough, since the bandwidth of the Tesla C1060 is about 74 GB/s according to bandwidthTest.exe in the SDK.
Example 2: given 3D real data X(1:n1, 1:n2, 1:n3), do a transpose operation to Y(1:n2, 1:n3, 1:n1) by
Y(j,k,i) <-- X(i,j,k)
method 2: use “float” to do modulo
method 5: fixed-point method without shift
[codebox]+-------------+------------------------+------------------------+
| n1,n2,n3    | GPU method 2           | GPU method 5           |
|             | (time, bandwidth)      | (time, bandwidth)      |
+-------------+------------------------+------------------------+
| 63,63,63    | 0.092 ms, 40.63 GB/s   | 0.092 ms, 40.47 GB/s   |
+-------------+------------------------+------------------------+
| 64,64,64    | 0.092 ms, 42.62 GB/s   | 0.092 ms, 42.66 GB/s   |
+-------------+------------------------+------------------------+
| 127,127,127 | 0.654 ms, 46.70 GB/s   | 0.636 ms, 47.98 GB/s   |
+-------------+------------------------+------------------------+
| 128,128,128 | 1.015 ms, 30.77 GB/s   | 1.018 ms, 30.70 GB/s   |
+-------------+------------------------+------------------------+
| 255,255,255 | 5.973 ms, 41.36 GB/s   | 5.955 ms, 41.49 GB/s   |
+-------------+------------------------+------------------------+
| 256,256,256 | 15.789 ms, 15.83 GB/s  | 15.776 ms, 15.85 GB/s  |
+-------------+------------------------+------------------------+
| 511,511,511 | 40.983 ms, 48.51 GB/s  | 41.807 ms, 47.56 GB/s  |
+-------------+------------------------+------------------------+
| 512,512,512 | 127.575 ms, 15.68 GB/s | 127.547 ms, 15.68 GB/s |
+-------------+------------------------+------------------------+[/codebox]
Conclusion: although “modulo by float” is 2x faster than “modulo by fixed-point” in isolation, in this
example they have the same performance. I think the latency between global memory and shared memory
dominates, such that the modulo operations are hidden.
What is interesting: if one looks at the IEEE 754 standard, then

- “float” has a 23-bit mantissa, so the critical path of a “float” multiplication should be 23 bits wide

- “double” has a 52-bit mantissa, so the critical path of a “double” multiplication should be 52 bits wide

- the critical path of a 32-bit integer multiplication should be 64 bits wide (a full 32x32 -> 64-bit product)

so at first glance, one may wonder why the time of method 5 is only 1/2 the time of method 1.
Finally, I must offer my heartfelt thanks to you; I have learned a lot about integer division.