Efficient way to solve small TSPs on a single CUDA core?

I am trying to use column generation to solve the Vehicle Routing Problem (VRP).

Each column is defined as a single TSP route, so I have to solve many TSPs to find a solution to the VRP.

Thus, I am trying to solve each small TSP on a single CUDA core with the CDT algorithm (https://www.researchgate.net/publication/234777567_Algorithm_750_CDT_A_Subroutine_for_the_exact_solution_of_large-scale_asymmetric_traveling_salesman_problems).

However, it is very slow on a CUDA core. The computation time reaches 1.0 seconds even when there are only 20 cities.

The TSPs I want to solve have about 50 cities or fewer.
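
To make the setup concrete, here is a minimal sketch of the one-instance-per-thread layout, written in plain C++ so it can be run on the host. It is not the CDT code: a nearest-neighbour construction followed by 2-opt (a heuristic, not an exact method) stands in for the solver, and the names `solve_small_tsp`, `solve_batch` and `MAX_CITIES` are hypothetical. The routines use only fixed-size local arrays and no recursion, on the assumption that they could later be marked `__host__ __device__`, with the loop in `solve_batch` replaced by a thread index.

```cpp
#include <cassert>

// Assumed upper bound, taken from the "about 50 cities or fewer" above.
constexpr int MAX_CITIES = 50;

// Cost of the closed tour tour[0..n-1] under a row-major n*n cost matrix.
// The matrix may be asymmetric: cost[a * n + b] is the cost of going a -> b.
inline float tour_cost(const float* cost, const int* tour, int n) {
    float total = 0.0f;
    for (int i = 0; i < n; ++i) {
        int a = tour[i];
        int b = tour[(i + 1) % n];
        total += cost[a * n + b];
    }
    return total;
}

// Reverse tour[i..j] in place.
inline void reverse_segment(int* tour, int i, int j) {
    while (i < j) {
        int tmp = tour[i];
        tour[i] = tour[j];
        tour[j] = tmp;
        ++i;
        --j;
    }
}

// Heuristic stand-in for the exact solver (NOT the CDT algorithm):
// nearest-neighbour construction starting at city 0, then 2-opt.
// Writes the tour into tour[0..n-1] and returns its cost.
inline float solve_small_tsp(const float* cost, int n, int* tour) {
    assert(n >= 1 && n <= MAX_CITIES);
    bool visited[MAX_CITIES] = {false};  // fixed-size, no heap allocation
    tour[0] = 0;
    visited[0] = true;
    for (int k = 1; k < n; ++k) {
        int last = tour[k - 1];
        int best = -1;
        for (int c = 0; c < n; ++c) {
            if (!visited[c] &&
                (best < 0 || cost[last * n + c] < cost[last * n + best])) {
                best = c;
            }
        }
        tour[k] = best;
        visited[best] = true;
    }

    // 2-opt: try reversing every segment and keep the reversal only if the
    // tour gets cheaper. The full tour cost is recomputed each time, which is
    // slower than a delta update but stays correct for asymmetric costs.
    float best_cost = tour_cost(cost, tour, n);
    bool improved = true;
    while (improved) {
        improved = false;
        for (int i = 1; i < n - 1; ++i) {
            for (int j = i + 1; j < n; ++j) {
                reverse_segment(tour, i, j);
                float c = tour_cost(cost, tour, n);
                if (c < best_cost - 1e-6f) {
                    best_cost = c;
                    improved = true;
                } else {
                    reverse_segment(tour, i, j);  // undo the move
                }
            }
        }
    }
    return best_cost;
}

// Host-side stand-in for a kernel launch: iteration t plays the role of one
// CUDA thread solving instance t. Instances are stored back to back.
inline void solve_batch(const float* costs, int n, int num_instances,
                        int* tours, float* out_costs) {
    for (int t = 0; t < num_instances; ++t) {
        out_costs[t] = solve_small_tsp(costs + t * n * n, n, tours + t * n);
    }
}
```

In this layout every instance needs only its own `n * n` cost matrix and an `n`-entry tour buffer, so the per-thread working set stays small. Whether an exact method such as CDT can be restructured the same way is the open part of my question.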

Is there an efficient way to solve many small TSPs, each on a single CUDA core?

From what I have found so far, applying cuSOLVER (https://docs.nvidia.com/cuda/cusolver/index.html) to a TSP on a single CUDA core looks like a promising solution, but I am not sure about this.

Thank you very much.