Fastest approach to implementing Min/Max using Cuda

Suppose I wanted to compute the minimum and maximum of an array using Cuda. Using the parallel reduction example as a starting point there are (at least) three alternative strategies I can think of. I could simply modfy the example to compute the minimum, and use a second copy to compute the maximum. But I presume it would be faster to compute the minimum and maximum simultaneously. On the first call to “reduce” I’d take the array of values and produce an array of (min/max) values. The subsequent calls to reduce would then work on arrays of (min/max) partial results. Now I could store all the min values contiguously, followed by all the max values. Or I could interleave the min and max values. From the perspective of coalesced memory accesses any idea which approach would be faster/preferable?

Process min and max simultaneously, as this problem is entirely bandwidth bound and you will thus save half of the bandwidth.
Whether you store the min/max values separately or interleaved makes little difference. But in the interleaved case use a float2 to hold both the min and the max value, or you will kill coalescing and thus performance. Using float2 will also halve the number of memory transactions (each of them twice as wide), but as long as memory accesses are properly aligned, the benefits of that will be minimal.

Process min and max simultaneously, as this problem is entirely bandwidth bound and you will thus save half of the bandwidth.
Whether you store the min/max values separately or interleaved makes little difference. But in the interleaved case use a float2 to hold both the min and the max value, or you will kill coalescing and thus performance. Using float2 will also halve the number of memory transactions (each of them twice as wide), but as long as memory accesses are properly aligned, the benefits of that will be minimal.

Hi!

I took a few minutes and modified my previous reduction code that i posted on this forum to find max and min values. I manage to reach 83.6 % of peak

Results:

GTS250 @ 70.6 GB/s - Finding min and max

 N			   [GB/s]		  [perc]		  [usec]		  test

 1048576		 47.47		   67.24 %		 88.3			Pass

 2097152		 51.15		   72.45 %		 164.0		   Pass

 4194304		 53.20		   75.35 %		 315.4		   Pass

 8388608		 55.84		   79.09 %		 600.9		   Pass

 16777216		58.09		   82.28 %		 1155.3				  Pass

 33554432		59.03		   83.61 %		 2273.9				  Pass

Non-base 2 tests!

N			   [GB/s]		  [perc]		  [usec]		  test

 14680102		58.03		   82.20 %		 1011.8				  Pass

 14680119		57.97		   82.12 %		 1012.9				  Pass

 18875600		57.74		   81.78 %		 1307.7				  Pass

 7434886		 55.07		   78.00 %		 540.1		   Pass

 1501294		 48.94		   69.32 %		 122.7		   Pass

 15052598		56.74		   80.37 %		 1061.2				  Pass

 3135229		 50.21		   71.12 %		 249.8		   Pass

 8422202		 55.23		   78.23 %		 610.0		   Pass

The reduction code reached up to 91% of peak on Fermi cards so i would expect similar results from this implementation.

I haven’t spent much time on it so I can’t make any promises but its seems stable at first look.

my_max_min.cu (8.63 KB)

nvcc my_max_min.cu --ptxas-options="-v" -arch=sm_11 -maxrregcount 40 -use_fast_math -O3

Jim

Hi!

I took a few minutes and modified my previous reduction code that i posted on this forum to find max and min values. I manage to reach 83.6 % of peak

Results:

GTS250 @ 70.6 GB/s - Finding min and max

 N			   [GB/s]		  [perc]		  [usec]		  test

 1048576		 47.47		   67.24 %		 88.3			Pass

 2097152		 51.15		   72.45 %		 164.0		   Pass

 4194304		 53.20		   75.35 %		 315.4		   Pass

 8388608		 55.84		   79.09 %		 600.9		   Pass

 16777216		58.09		   82.28 %		 1155.3				  Pass

 33554432		59.03		   83.61 %		 2273.9				  Pass

Non-base 2 tests!

N			   [GB/s]		  [perc]		  [usec]		  test

 14680102		58.03		   82.20 %		 1011.8				  Pass

 14680119		57.97		   82.12 %		 1012.9				  Pass

 18875600		57.74		   81.78 %		 1307.7				  Pass

 7434886		 55.07		   78.00 %		 540.1		   Pass

 1501294		 48.94		   69.32 %		 122.7		   Pass

 15052598		56.74		   80.37 %		 1061.2				  Pass

 3135229		 50.21		   71.12 %		 249.8		   Pass

 8422202		 55.23		   78.23 %		 610.0		   Pass

The reduction code reached up to 91% of peak on Fermi cards so i would expect similar results from this implementation.

I haven’t spent much time on it so I can’t make any promises but its seems stable at first look.

[attachment=23323:my_max_min.cu]

nvcc my_max_min.cu --ptxas-options="-v" -arch=sm_11 -maxrregcount 40 -use_fast_math -O3

Jim

Hi, Jim

did you test your code on Fermi, it seems that the results is not correct. here is my testing of your code:

nvcc my_max_min.cu --ptxas-options="-v" -arch=sm_11 -maxrregcount 40 -use_fast_math -O3

my_max_min.cu(362): warning: variable “tail” was declared but never referenced

my_max_min.cu(362): warning: variable “tail” was declared but never referenced

ptxas info : Compiling entry function ‘Z12find_min_maxILi32ELi64EEvPfS0’ for ‘sm_11’

ptxas info : Used 4 registers, 528+16 bytes smem, 4 bytes cmem[1]

ptxas info : Compiling entry function ‘_Z20find_min_max_dynamicILi64EEvPfS0_iii’ for ‘sm_11’

ptxas info : Used 7 registers, 540+16 bytes smem, 4 bytes cmem[1]

ptxas info : Compiling entry function ‘Z12find_min_maxILi32768ELi64EEvPfS0’ for ‘sm_11’

ptxas info : Used 11 registers, 528+16 bytes smem, 12 bytes cmem[1]

ptxas info : Compiling entry function ‘Z12find_min_maxILi16384ELi64EEvPfS0’ for ‘sm_11’

ptxas info : Used 9 registers, 528+16 bytes smem, 12 bytes cmem[1]

ptxas info : Compiling entry function ‘Z12find_min_maxILi8192ELi64EEvPfS0’ for ‘sm_11’

ptxas info : Used 9 registers, 528+16 bytes smem, 12 bytes cmem[1]

ptxas info : Compiling entry function ‘Z12find_min_maxILi4096ELi64EEvPfS0’ for ‘sm_11’

ptxas info : Used 9 registers, 528+16 bytes smem, 12 bytes cmem[1]

ptxas info : Compiling entry function ‘Z12find_min_maxILi2048ELi64EEvPfS0’ for ‘sm_11’

ptxas info : Used 9 registers, 528+16 bytes smem, 12 bytes cmem[1]

[xzhu@gpu001 ~]$ ./a.out

GTS250 @ 70.6 GB/s - Finding min and max

N [GB/s] [perc] [usec] test

1048576 74.47 105.49 % 56.3 Fail

2097152 79.00 111.90 % 106.2 Fail

4194304 82.50 116.85 % 203.4 Fail

8388608 89.82 127.22 % 373.6 Fail

16777216 91.57 129.70 % 732.9 Fail

33554432 92.99 131.71 % 1443.4 Fail

Non-base 2 tests!

N [GB/s] [perc] [usec] test

14680102 90.97 128.85 % 645.5 Fail

14680119 90.97 128.85 % 645.5 Fail

18875600 90.47 128.14 % 834.6 Fail

7434886 88.37 125.18 % 336.5 Fail

1501294 75.65 107.15 % 79.4 Fail

15052598 86.64 122.71 % 695.0 Fail

3135229 74.61 105.68 % 168.1 Fail

8422203 88.57 125.45 % 380.4 Fail

any clue why it fails? thank you.

zhu

The [font=“Courier New”]shared[/font] variables need to be declared as [font=“Courier New”]volatile[/font] on Fermi (and it doesn’t hurt pre-Fermi, as omitting the __syncthreads() without declaring the shared variables volatile was never really safe, it just happened to work due to the different architectures).

thanks, tera. It works!

Ah, yes. I changed it in my code but forgot to update here on the forums. Many thanks to tera! :-)

EDIT: note also the hard coded numbers for “GTS250 @ 70.6 GB/s” - You would need to change that to whatever hardware u are using for the bandwidth utilization numbers to make any sense.

Looked at your previous code with index values and the index values don’t match. Any suggestions on what’s wrong here?

// fix indices only finding max with index stored in indices[0] and max in out[0]

	if(smem_max[0] == max)

	{

		indices[blockIdx.x] = max_index; // MAX

	}
//in the kernel call _dynamic with the offset values

	if(smem_max[0] == max)

	{

		if(max_index >= num_blocks) // from tail part

		{

			indices[0] = max_index;

		}

		else

		{

			int index = indices[max_index];

			indices[0] = index;

		}

	}