Is there a non-linear least squares optimization library that supports CUDA C++?

I am trying to port dlib's C++ solve_least_squares_lm() function to a CUDA C++ GPU version. After several attempts, I found that CUDA C++ does not recognize the solve_least_squares_lm() function.

My question is: what is the best way to work around this issue? Should I rewrite a non-linear least squares algorithm myself and mark the routine with __device__? Or is there an existing CUDA C++ library that provides something similar?
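
For illustration, a hand-written routine of the kind described above might look like the following sketch. It is not dlib's implementation. It is a basic Levenberg-Marquardt loop for a hypothetical two-parameter model y = a * exp(b * x), and the names HD, model, sum_sq, and fit_lm are made up for this example. It uses no heap allocation and no STL, so under nvcc the same function can be called from device code, while a plain C++ compiler builds it as an ordinary host function.

```cpp
#include <math.h>

// Under nvcc, mark functions as callable from both host and device code;
// under a plain C++ compiler the macro expands to nothing.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical model for this example: y = p[0] * exp(p[1] * x).
HD inline double model(double x, const double p[2]) {
    return p[0] * exp(p[1] * x);
}

// Sum of squared residuals for parameters p.
HD inline double sum_sq(const double* xs, const double* ys, int n, const double p[2]) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) {
        double r = model(xs[i], p) - ys[i];
        s += r * r;
    }
    return s;
}

// Basic Levenberg-Marquardt for the two-parameter model above.
// p holds the initial guess on entry and the fitted parameters on return.
// Returns the final sum of squared residuals.
HD inline double fit_lm(const double* xs, const double* ys, int n,
                        double p[2], int max_iter = 200) {
    double lambda = 1e-3;
    double cost = sum_sq(xs, ys, n, p);
    for (int it = 0; it < max_iter; ++it) {
        // Accumulate J^T J (symmetric 2x2) and J^T r.
        double a00 = 0.0, a01 = 0.0, a11 = 0.0, g0 = 0.0, g1 = 0.0;
        for (int i = 0; i < n; ++i) {
            double e = exp(p[1] * xs[i]);
            double r = p[0] * e - ys[i];
            double j0 = e;                  // d r / d p[0]
            double j1 = p[0] * xs[i] * e;   // d r / d p[1]
            a00 += j0 * j0; a01 += j0 * j1; a11 += j1 * j1;
            g0 += j0 * r;   g1 += j1 * r;
        }
        // Solve (J^T J + lambda * diag(J^T J)) d = -J^T r in closed form.
        double m00 = a00 * (1.0 + lambda);
        double m11 = a11 * (1.0 + lambda);
        double det = m00 * m11 - a01 * a01;
        if (det == 0.0) break;              // singular system: give up
        double d0 = (-g0 * m11 + g1 * a01) / det;
        double d1 = (-g1 * m00 + g0 * a01) / det;

        double trial[2] = { p[0] + d0, p[1] + d1 };
        double trial_cost = sum_sq(xs, ys, n, trial);
        if (trial_cost < cost) {
            // Accept the step and reduce the damping.
            double gain = cost - trial_cost;
            p[0] = trial[0]; p[1] = trial[1];
            cost = trial_cost;
            lambda *= 0.1;
            if (gain < 1e-15 * (1.0 + cost)) break;  // negligible improvement
        } else {
            // Reject the step (also covers a NaN trial cost) and increase the damping.
            lambda *= 10.0;
            if (lambda > 1e12) break;
        }
    }
    return cost;
}
```

When compiled with nvcc, a kernel could call fit_lm() with each thread solving one independent small problem. That design is an assumption of this sketch and says nothing about how a single large problem should be parallelized.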