CUDA version of STL next_permutation()

CudaaduC · June 25, 2013, 3:04am

Here is a link to my CUDA version of the C++ Standard Template Library’s next_permutation() function.

There are two kernels: one generates each permutation of n elements, and the other also evaluates each permutation and caches the optimal result (along with the permutation that gave that result). The caching is done via a reduction/scan, which can certainly be improved.
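The linked code is not reproduced in this thread, so the following is only a sketch of one common way to give every thread its own permutation: decode the thread index in the factorial number system. It is written as plain host-side C++ so it can be checked without a GPU, and the name `kth_permutation` is hypothetical, not taken from the original kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (an assumption, not the original kernel code):
// returns the k-th permutation (0-based, lexicographic order) of the
// indices 0..n-1. On a GPU, k would typically be the global thread index.
std::vector<int> kth_permutation(int n, std::uint64_t k) {
    assert(n >= 0 && n <= 20);  // 20! is the largest factorial that fits in 64 bits

    // fact[i] = i!
    std::vector<std::uint64_t> fact(static_cast<std::size_t>(n) + 1, 1);
    for (int i = 1; i <= n; ++i) {
        fact[i] = fact[i - 1] * static_cast<std::uint64_t>(i);
    }
    assert(k < fact[n]);

    // Indices not yet placed, kept in ascending order.
    std::vector<int> pool(static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) pool[i] = i;

    std::vector<int> perm;
    perm.reserve(static_cast<std::size_t>(n));
    for (int i = n; i >= 1; --i) {
        const std::uint64_t block = fact[i - 1];  // permutations sharing one leading element
        const auto digit = static_cast<std::size_t>(k / block);  // which remaining index comes next
        k %= block;
        perm.push_back(pool[digit]);
        // Linear-time removal, so the whole decode is O(n^2) per permutation.
        pool.erase(pool.begin() + static_cast<std::ptrdiff_t>(digit));
    }
    return perm;
}
```

For example, `kth_permutation(3, 3)` yields the indices 1, 2, 0, which is the fourth permutation in the order `std::next_permutation` would visit them. A device version would replace the vectors with small fixed-size local arrays.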

CudaaduC · July 6, 2013, 10:18pm

I have continued testing generation of the larger permutation sets, from 13! upward. Just generating an array in device memory of all 13! permutations of indices takes (6,227,020,800 threads times (13 + (13*13))) = 1.1333178e+12 steps (iterations).
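The arithmetic behind that figure can be checked with a few lines of C++. This is only a sketch: the function names `factorial` and `total_steps` are hypothetical, and the per-thread cost of n + n*n steps is taken directly from the estimate above rather than measured.

```cpp
#include <cassert>
#include <cstdint>

// n! as a 64-bit integer; valid for n <= 20.
std::uint64_t factorial(int n) {
    assert(n >= 0 && n <= 20);
    std::uint64_t f = 1;
    for (int i = 2; i <= n; ++i) f *= static_cast<std::uint64_t>(i);
    return f;
}

// Total steps assuming one thread per permutation and n + n*n steps per thread.
// Limited to n <= 18 so the product still fits in 64 bits.
std::uint64_t total_steps(int n) {
    assert(n >= 0 && n <= 18);
    const std::uint64_t m = static_cast<std::uint64_t>(n);
    return factorial(n) * (m + m * m);
}
```

With n = 13 this gives 6,227,020,800 * 182 = 1,133,317,785,600, which matches the 1.1333178e+12 quoted above.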

Generating each unique permutation array in device memory, holding the indices of all 13! possible permutations, took my CUDA kernel 2 seconds. I do not have the time to wait for the CPU STL next_permutation() version, but I know it will take quite a bit longer, at least 100x.
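For anyone who wants to measure the CPU side rather than estimate it, a minimal baseline could look like the sketch below. The function name `count_permutations_cpu` and its optional timing parameter are hypothetical. It only walks the permutations without storing or evaluating them, so it is not an exact equivalent of the kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical CPU baseline: visits all n! permutations of 0..n-1 with
// std::next_permutation and returns how many were visited.
// If elapsed_seconds is non-null, it receives the wall-clock time of the loop.
std::uint64_t count_permutations_cpu(int n, double* elapsed_seconds = nullptr) {
    assert(n >= 0);
    std::vector<int> idx(static_cast<std::size_t>(n));
    std::iota(idx.begin(), idx.end(), 0);  // sorted order is the first permutation

    const auto start = std::chrono::steady_clock::now();
    std::uint64_t count = 0;
    do {
        ++count;  // a real test would store or evaluate idx here
    } while (std::next_permutation(idx.begin(), idx.end()));
    const auto stop = std::chrono::steady_clock::now();

    if (elapsed_seconds != nullptr) {
        *elapsed_seconds = std::chrono::duration<double>(stop - start).count();
    }
    return count;
}
```

Timing a smaller size such as n = 11 and scaling by 12 * 13 gives a rough estimate for 13! without waiting for the full run, assuming the per-permutation cost stays about the same.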

CudaaduC · February 24, 2014, 8:12pm

UPDATE: Got a 30-40% speedup, check latest code.

CudaaduC · February 28, 2014, 7:02am

Using the nvvp profiling output, I managed to get a surprising boost in performance for the generation/evaluation of permuted arrays at larger sizes (13 and up).

CudaaduC · March 4, 2014, 6:19am

Currently in the process of deleting my GitHub repositories, so if anyone is interested in this problem, now would be the time to take a look.

The new version was able to generate all 16! permutations of a local array, evaluate each permutation, scan/reduce, and return the optimal answer and a permutation responsible for that answer in 17.49 hours.

Not great given that a similar problem with a larger problem space:

[url]https://github.com/OlegKonings/CUDA_Matrix_Sum_Game[/url]

only took 25 minutes to generate, evaluate, scan, etc. Granted, the evaluation step is easier, but still…

That problem has 33,232,930,569,601 possible arrangements, while 16! is only 20,922,789,888,000.
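To put the gap in numbers, the two runs can be compared by throughput. This is a rough sketch using only the figures quoted in this thread; the constant and function names are hypothetical, and the 25 minute figure is treated as exact even though it is presumably rounded.

```cpp
#include <cassert>

// Arrangements processed per second, given a total count and a wall-clock time in seconds.
double arrangements_per_second(double arrangements, double seconds) {
    assert(seconds > 0.0);
    return arrangements / seconds;
}

// Figures quoted in this thread.
const double kPerm16Count   = 20922789888000.0;  // 16! permutations
const double kPerm16Seconds = 17.49 * 3600.0;    // 17.49 hours
const double kGameCount     = 33232930569601.0;  // Matrix Sum Game arrangements
const double kGameSeconds   = 25.0 * 60.0;       // 25 minutes

// How many times faster (per arrangement) the Matrix Sum Game run was.
double throughput_ratio() {
    return arrangements_per_second(kGameCount, kGameSeconds) /
           arrangements_per_second(kPerm16Count, kPerm16Seconds);
}
```

On these numbers the permutation run handles about 3.3e8 arrangements per second against about 2.2e10 for the Matrix Sum Game, so the per-arrangement gap is roughly 67x.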

When I profile the smaller incarnations of the code in nvvp, it comes out very positive (‘no issues’), so I am mystified by the huge time difference. It will take some time to figure this out, so I am going to pull it soon.
