Suppose I wanted to compute the minimum and maximum of an array using CUDA. Using the parallel reduction example as a starting point, there are (at least) three alternative strategies I can think of. I could simply modify the example to compute the minimum, and use a second copy to compute the maximum. But I presume it would be faster to compute the minimum and maximum simultaneously: on the first call to "reduce" I'd take the array of values and produce an array of (min/max) values, and the subsequent calls to reduce would then work on arrays of (min/max) partial results. Now I could store all the min values contiguously, followed by all the max values, or I could interleave the min and max values. From the perspective of coalesced memory accesses, any idea which approach would be faster/preferable?
Process min and max simultaneously, as this problem is entirely bandwidth bound and you will thus save half of the bandwidth.
Whether you store the min/max values separately or interleaved makes little difference. But in the interleaved case use a float2 to hold both the min and the max value, or you will kill coalescing and thus performance. Using float2 will also halve the number of memory transactions (each of them twice as wide), but as long as memory accesses are properly aligned, the benefits of that will be minimal.
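For the interleaved layout, the first pass might look like the following sketch. This is my own illustration, not code from the attached file: the kernel and variable names are made up, and it assumes a power-of-two blockDim.x and a dynamic shared-memory allocation of blockDim.x * sizeof(float2) passed at launch.

```cuda
#include <math_constants.h>   // CUDART_INF_F

// First reduction pass: reads plain floats, writes one interleaved
// float2 (x = min, y = max) per block, so later passes stay coalesced.
__global__ void minmax_first_pass(const float *in, float2 *out, int n)
{
    extern __shared__ float2 smem[];   // blockDim.x float2 slots
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    // Grid-stride load, seeded with +/-infinity so idle threads are neutral.
    float2 mm = make_float2(CUDART_INF_F, -CUDART_INF_F);
    for (; i < n; i += gridDim.x * blockDim.x) {
        float v = in[i];
        mm.x = fminf(mm.x, v);
        mm.y = fmaxf(mm.y, v);
    }
    smem[tid] = mm;
    __syncthreads();

    // Standard tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            smem[tid].x = fminf(smem[tid].x, smem[tid + s].x);
            smem[tid].y = fmaxf(smem[tid].y, smem[tid + s].y);
        }
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = smem[0];   // one coalesced float2 store per block
}
```

Launched as, e.g., `minmax_first_pass<<<blocks, threads, threads * sizeof(float2)>>>(d_in, d_partial, n);` — a second kernel then reduces the float2 partials the same way, combining mins with mins and maxes with maxes.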
Hi!
I took a few minutes and modified my previous reduction code that I posted on this forum to find max and min values. I managed to reach 83.6% of peak.
Results:
GTS250 @ 70.6 GB/s - Finding min and max
N [GB/s] [perc] [usec] test
1048576 47.47 67.24 % 88.3 Pass
2097152 51.15 72.45 % 164.0 Pass
4194304 53.20 75.35 % 315.4 Pass
8388608 55.84 79.09 % 600.9 Pass
16777216 58.09 82.28 % 1155.3 Pass
33554432 59.03 83.61 % 2273.9 Pass
Non-base 2 tests!
N [GB/s] [perc] [usec] test
14680102 58.03 82.20 % 1011.8 Pass
14680119 57.97 82.12 % 1012.9 Pass
18875600 57.74 81.78 % 1307.7 Pass
7434886 55.07 78.00 % 540.1 Pass
1501294 48.94 69.32 % 122.7 Pass
15052598 56.74 80.37 % 1061.2 Pass
3135229 50.21 71.12 % 249.8 Pass
8422202 55.23 78.23 % 610.0 Pass
The reduction code reached up to 91% of peak on Fermi cards so i would expect similar results from this implementation.
I haven't spent much time on it so I can't make any promises, but it seems stable at first look.
my_max_min.cu (8.63 KB)
nvcc my_max_min.cu --ptxas-options="-v" -arch=sm_11 -maxrregcount 40 -use_fast_math -O3
Jim
Hi, Jim
Did you test your code on Fermi? It seems that the results are not correct. Here is my testing of your code:
nvcc my_max_min.cu --ptxas-options="-v" -arch=sm_11 -maxrregcount 40 -use_fast_math -O3
my_max_min.cu(362): warning: variable “tail” was declared but never referenced
my_max_min.cu(362): warning: variable “tail” was declared but never referenced
ptxas info : Compiling entry function '_Z12find_min_maxILi32ELi64EEvPfS0_' for 'sm_11'
ptxas info : Used 4 registers, 528+16 bytes smem, 4 bytes cmem[1]
ptxas info : Compiling entry function '_Z20find_min_max_dynamicILi64EEvPfS0_iii' for 'sm_11'
ptxas info : Used 7 registers, 540+16 bytes smem, 4 bytes cmem[1]
ptxas info : Compiling entry function '_Z12find_min_maxILi32768ELi64EEvPfS0_' for 'sm_11'
ptxas info : Used 11 registers, 528+16 bytes smem, 12 bytes cmem[1]
ptxas info : Compiling entry function '_Z12find_min_maxILi16384ELi64EEvPfS0_' for 'sm_11'
ptxas info : Used 9 registers, 528+16 bytes smem, 12 bytes cmem[1]
ptxas info : Compiling entry function '_Z12find_min_maxILi8192ELi64EEvPfS0_' for 'sm_11'
ptxas info : Used 9 registers, 528+16 bytes smem, 12 bytes cmem[1]
ptxas info : Compiling entry function '_Z12find_min_maxILi4096ELi64EEvPfS0_' for 'sm_11'
ptxas info : Used 9 registers, 528+16 bytes smem, 12 bytes cmem[1]
ptxas info : Compiling entry function '_Z12find_min_maxILi2048ELi64EEvPfS0_' for 'sm_11'
ptxas info : Used 9 registers, 528+16 bytes smem, 12 bytes cmem[1]
[xzhu@gpu001 ~]$ ./a.out
GTS250 @ 70.6 GB/s - Finding min and max
N [GB/s] [perc] [usec] test
1048576 74.47 105.49 % 56.3 Fail
2097152 79.00 111.90 % 106.2 Fail
4194304 82.50 116.85 % 203.4 Fail
8388608 89.82 127.22 % 373.6 Fail
16777216 91.57 129.70 % 732.9 Fail
33554432 92.99 131.71 % 1443.4 Fail
Non-base 2 tests!
N [GB/s] [perc] [usec] test
14680102 90.97 128.85 % 645.5 Fail
14680119 90.97 128.85 % 645.5 Fail
18875600 90.47 128.14 % 834.6 Fail
7434886 88.37 125.18 % 336.5 Fail
1501294 75.65 107.15 % 79.4 Fail
15052598 86.64 122.71 % 695.0 Fail
3135229 74.61 105.68 % 168.1 Fail
8422203 88.57 125.45 % 380.4 Fail
any clue why it fails? thank you.
zhu
The shared memory variables need to be declared as volatile on Fermi (and it doesn't hurt pre-Fermi, as omitting the __syncthreads() without declaring the shared variables volatile was never really safe; it just happened to work due to the different architectures).
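To make the failure mode concrete: the classic reduction drops __syncthreads() for the final 32 threads and relies on lockstep warp execution, so the compiler must be told not to cache the shared values in registers. A hedged sketch of such a warp-synchronous tail (this pre-Volta pattern is my own illustration, not Jim's code; on Volta and later you would add __syncwarp() between steps or use shuffle intrinsics instead):

```cuda
// Warp-synchronous tail of a min/max reduction. The volatile qualifier
// forces every read and write to go through shared memory; without it
// the Fermi compiler may keep values in registers, so threads read
// stale data. Assumes blockDim.x >= 64 so tid + s stays in bounds.
__device__ void warp_reduce_minmax(volatile float *smin,
                                   volatile float *smax, int tid)
{
    if (tid < 32) {
        for (int s = 32; s > 0; s >>= 1) {
            smin[tid] = fminf(smin[tid], smin[tid + s]);
            smax[tid] = fmaxf(smax[tid], smax[tid + s]);
        }
    }
}
```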
thanks, tera. It works!
Ah, yes. I changed it in my code but forgot to update here on the forums. Many thanks to tera! :-)
EDIT: note also the hard-coded numbers for "GTS250 @ 70.6 GB/s". You would need to change that to whatever hardware you are using for the bandwidth utilization numbers to make any sense.
I looked at your previous code with index values, and the index values don't match. Any suggestions on what's wrong here?

// fix indices: only finding max, with index stored in indices[0] and max in out[0]
if (smem_max[0] == max)
{
    indices[blockIdx.x] = max_index; // MAX
}

// in the kernel, call _dynamic with the offset values
if (smem_max[0] == max)
{
    if (max_index >= num_blocks) // from tail part
    {
        indices[0] = max_index;
    }
    else
    {
        int index = indices[max_index];
        indices[0] = index;
    }
}
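A more robust way to carry the winning index through both passes is to reduce over (value, index) pairs from the start, so each partial result already knows its global index and no back-mapping through indices[] is needed. A hypothetical sketch of the combine step (names are mine, not from the posted code):

```cuda
// A (max value, global index) pair; the index travels with its value
// through every reduction pass.
struct MaxIdx { float val; int idx; };

__device__ MaxIdx combine_max(MaxIdx a, MaxIdx b)
{
    // On ties keep the lower index, mirroring a sequential left-to-right scan.
    if (b.val > a.val || (b.val == a.val && b.idx < a.idx))
        return b;
    return a;
}
```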
This code fails on the NVIDIA K80 even with the volatile added: the GPU just finds 0 for both max and min. It works on my laptop GPU (NVIDIA GeForce RTX 3050 Laptop GPU). I have no idea why.