Creating a CUDA version of the std::next_permutation() function was one of my early CUDA projects many years ago and uses factorial decomposition;
njuffa helped with some of the math optimizations.
I also did a magic square as well, which is a simpler problem than generating non-repeating permutations.
The easy part is generating the permutations, and the the time consuming part is the test of a given permutation against an optimization function.
Your approach is not going to have anywhere near the performance of an implementation which uses decomposition. That old code of mine run on a single GTX 1080 finishes generating all 13! permutations and evaluating them against a linear (to N) test function in about 800 ms. That code caches the results and returns the optimal value along with the permutation which generated that value. It is all explained well on that site, though the performance data is old and run on newer GPUs is about 50% faster.