I extended the code I posted earlier on these forums so that good performance is achieved for arbitrary array sizes as well.
I managed to achieve up to 83.5% of peak on my GTS250; maybe other cards can do better.
GTS250 @ 70.6 GB/s
N [GB/s] [% of peak] [usec] test
1048576 50.25 71.17 % 83.5 Pass
2097152 53.43 75.68 % 157.0 Pass
4194304 54.96 77.85 % 305.2 Pass
8388608 57.79 81.86 % 580.6 Pass
16777216 58.55 82.93 % 1146.2 Pass
33554432 58.96 83.51 % 2276.4 Fail ( cpu malloc issue ?)
Non-base 2 tests!
N [GB/s] [% of peak] [usec] test
14680102 57.71 81.74 % 1017.5 Pass
14680119 57.74 81.79 % 1017.0 Pass
18875600 58.02 82.19 % 1301.2 Pass
7434886 51.17 72.48 % 581.2 Pass
1501294 43.93 62.22 % 136.7 Pass
15052598 50.68 71.79 % 1188.0 Pass
3135229 50.58 71.65 % 247.9 Pass
8422202 54.93 77.81 % 613.3 Pass
So below is the code, which includes the test runs. Notice the failure at N = 33554432; this seems to be due to the CPU not giving me the right result… Anyone know why?
Anyway, this is the fastest general reduction code I’ve seen on GPUs; correct me if I’m wrong here!
[attachment=23305:my_reduction.cu]
compiled with: nvcc my_reduction.cu --ptxas-options="-v" -arch=sm_11 -maxrregcount 40 -use_fast_math
I’m not sure about this, but the blockSize parameter might need tweaking for newer, better cards than my old GTS250.
EDIT: Updated file…
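For reference, here is a minimal sketch of the general approach (this is not the attached my_reduction.cu, and the kernel and variable names are placeholders): each thread accumulates elements with a grid-stride loop, so any N works and loads stay coalesced; each block then does a shared-memory tree reduction and writes one partial sum, and the partial sums are reduced in a second, much smaller pass.

__global__ void reduce_sketch(const float *in, float *blockSums, int n)
{
    // Dynamic shared memory: blockDim.x floats, size passed at launch.
    extern __shared__ float smem[];

    // Grid-stride loop: handles arbitrary n and keeps loads coalesced.
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += in[i];

    // Shared-memory tree reduction (assumes blockDim.x is a power of two).
    smem[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            smem[threadIdx.x] += smem[threadIdx.x + s];
        __syncthreads();
    }

    // One partial sum per block; reduce these in a second, tiny pass.
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = smem[0];
}

// launch: reduce_sketch<<<numBlocks, threads, threads * sizeof(float)>>>(d_in, d_partial, N);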
Please also publish your QR code ;-)
Hello,
I’ve not looked carefully at the code, but there are lots of “mul24” calls. It seems these are a (potentially big) performance loss on Fermi cards!
Thank you for sharing this
Doesn’t quite do so well on my GTX470:
GTX470 @ 133.90 GB/s
N [GB/s] [% of peak] [usec] test
1048576 77.12 57.59 % 54.4 Pass
2097152 90.13 67.31 % 93.1 Pass
4194304 91.90 68.64 % 182.6 Pass
8388608 100.14 74.79 % 335.1 Pass
16777216 105.78 79.00 % 634.4 Pass
33554432 106.15 79.28 % 1264.4 Fail
Non-base 2 tests!
N [GB/s] [% of peak] [usec] test
14680102 110.63 82.62 % 530.8 Pass
14680119 110.66 82.64 % 530.7 Pass
18875600 97.65 72.93 % 773.2 Pass
7434886 79.38 59.28 % 374.6 Pass
1501294 56.49 42.19 % 106.3 Pass
15052598 78.84 58.88 % 763.7 Pass
3135229 78.75 58.81 % 159.2 Pass
8422203 96.67 72.20 % 348.5 Pass
Hi, thanks for looking into that. The “mul24” calls in the first “reduce” kernel are actually totally unnecessary on any architecture, since the compiler unrolls the for loop completely. I think it should be fine, but if you have the energy, perhaps you could change it and have a try.
The “reduce_dynamic” kernel does, however, perform better because of this on older architectures, but its relative runtime should be very small compared to the “reduce” kernel, so it shouldn’t incur a major performance penalty.
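To make the point concrete, here is a hedged toy example (my own illustration, not the attached kernel): once the loop is fully unrolled, the loop index is a compile-time constant, so swapping __mul24 for a plain multiply in the index computation shouldn’t change anything, while also avoiding the __mul24 path that the earlier post suggests is slow on Fermi.

__global__ void strided_sum_sketch(const float *in, float *out, int stride)
{
    int base = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;

    #pragma unroll
    for (int j = 0; j < 8; j++)
    {
        // sm_1x-style form:                 sum += in[base + __mul24(j, stride)];
        // Plain form (j is a literal after unrolling, so this costs the same):
        sum += in[base + j * stride];
    }
    out[base] = sum;
}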
It’s good that you’ve gotten similar peak results, but it’s a bit disturbing that it varies so much for arrays of similar size.
I have a suspicion that findBlockSize isn’t always doing its intended job (it was a quick hack…). Could you try manually setting whichSize = 5, or, even better, experiment with bigger blockSizes (65536*2)?
EDIT: What is your memory interface? I just realized this might really be prone to partition camping… NVIDIA says there is no partition camping on Fermi, but the good Mr. Volkov here recently showed that it is still present…
The QR code is being used commercially… What I just posted is something I did in my spare time because I simply enjoy it :-)
Here’s my warmed-up GTX275:
GTX275 @ 140 GB/s
N [GB/s] [% of peak] [usec] test
1048576 54.1 38.6 77.6 Pass
2097152 88.4 63.2 94.9 Pass
4194304 105.4 75.3 159.2 Pass
8388608 115.3 82.4 291.0 Pass
16777216 111.1 79.3 604.3 Pass
33554432 113.2 80.9 1185.4 Pass
Non-base 2 tests!
N [GB/s] [% of peak] [usec] test
14680102 128.3 91.6 457.8 Pass
14680119 128.4 91.7 457.2 Pass
18875600 120.6 86.2 625.9 Pass
7434886 99.2 70.9 299.7 Pass
9386455 98.7 70.5 380.3 Pass
16495925 87.1 62.2 757.8 Pass
4280953 102.4 73.2 167.2 Pass
8247688 90.8 64.8 363.4 Pass
I had to change the timing back to cutil on my Windows box (being lazy).
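As an aside, if cutil isn’t handy, CUDA events are a portable way to time the kernels on both Windows and Linux (a generic snippet, not taken from the attachment):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
// ... reduction kernel launch(es) being timed ...
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);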
Here is an update:
It now achieves 84% on my card, and I seem to be getting better overall results.
GTS250 @ 70.6 GB/s
N [GB/s] [% of peak] [usec] test
1048576 50.48 71.50 % 83.1 Pass 0
2097152 53.94 76.40 % 155.5 Pass 0
4194304 55.81 79.04 % 300.6 Pass 0
8388608 57.79 81.86 % 580.6 Pass 0
16777216 58.71 83.16 % 1143.1 Pass 0
33554432 59.44 84.19 % 2258.1 Fail 0
Non-base 2 tests!
N [GB/s] [% of peak] [usec] test
14680102 59.05 83.64 % 994.5 Pass 38
14680119 59.04 83.62 % 994.6 Pass 55
18875600 58.51 82.87 % 1290.5 Pass 1232
7434886 57.16 80.96 % 520.3 Pass 646
1501294 52.02 73.68 % 115.4 Pass 110
15052598 57.96 82.09 % 1038.9 Pass 1846
3135229 53.08 75.19 % 236.2 Pass 1789
8422202 57.25 81.09 % 588.5 Pass 826
[attachment=17986:my_reduction.cu]
cheers!
You can get high performance easier if you load multiple floats at once. Something like:
float a0 = in[0*threads];
float a1 = in[1*threads];
float a2 = in[2*threads];
float a3 = in[3*threads];
sum += a0+a1+a2+a3;
In fact, a simple sum += in[0*threads]+in[1*threads]+in[2*threads]+in[3*threads] should have the same effect.
In this case you don’t need high occupancy. Say, in memcopy I get 87% of pin bandwidth on GTX480 at only 17% occupancy if I do 16 loads at once. And if fetching data in float4, I can get 87% of pin bandwidth at only 8% occupancy!
Don’t get bogged down by occupancy considerations. Occupancy is overrated.
Vasily
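A rough sketch of that suggestion (my own illustration, not code from this thread’s attachment): each thread fetches a float4 per iteration, so four floats arrive per load and far fewer warps are needed to keep the memory pipeline busy; any leftover 1–3 elements would need a small tail pass.

__global__ void reduce_float4_sketch(const float4 *in, float *blockSums, int n4)
{
    extern __shared__ float smem[];
    float sum = 0.0f;

    // One 16-byte load brings in four floats per iteration.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n4;
         i += gridDim.x * blockDim.x)
    {
        float4 v = in[i];
        sum += v.x + v.y + v.z + v.w;
    }

    // Per-block tree reduction (assumes blockDim.x is a power of two).
    smem[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            smem[threadIdx.x] += smem[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = smem[0];
}

// n4 = N / 4; the remaining N % 4 elements are handled by a tiny extra pass.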
Yeah that version is a bit better on Fermi:
GTX470 @ 133.9 GB/s
N [GB/s] [% of peak] [usec] test
1048576 92.77 69.29 % 45.2 Pass 0
2097152 101.64 75.91 % 82.5 Pass 0
4194304 106.68 79.67 % 157.3 Pass 0
8388608 115.34 86.14 % 290.9 Pass 0
16777216 121.06 90.41 % 554.3 Pass 0
33554432 121.84 90.99 % 1101.6 Fail 0
Non-base 2 tests!
N [GB/s] [% of peak] [usec] test
14680102 121.44 90.70 % 483.5 Pass 38
14680119 121.46 90.71 % 483.4 Pass 55
18875600 120.28 89.83 % 627.7 Pass 1232
7434886 112.23 83.82 % 265.0 Pass 646
1501294 95.46 71.29 % 62.9 Pass 110
15052598 114.51 85.52 % 525.8 Pass 1846
3135229 93.84 70.08 % 133.6 Pass 1789
8422203 112.70 84.17 % 298.9 Pass 827
About the same on my GTX275 though.