I have a function that computes a value on a vector. I have checked, and it can be parallelized, so I would like to make it a CUDA device function. I am in the process of doing this.

The computation in this function involves some sums and multiplications and, at the end, a square root.

The internal calculations will be done in device memory. The function will then return the value, which I will transfer to the host in the main function, to display it or for any other purpose.

However, I still have to calculate the square root of the value, and I think I cannot use `std::sqrt()` to do this.

My question is: how can I compute the square root with CUDA?

(One suggestion, of course, is to return the value before taking the square root, transfer it to the host and apply the square root there, but that defeats the purpose of keeping the function modular. I wonder if there is another way.)
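To illustrate that workaround, here is a minimal sketch of what I mean. The kernel name `sumOfSquares`, the sum-of-squares computation and the test data are stand-ins I made up for this post, not my real function, and error checking is omitted for brevity:

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for my real function: each thread adds the square
// of one element of v to *out. The real code does other sums and products.
__global__ void sumOfSquares(const float* v, int n, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(out, v[i] * v[i]);
    }
}

int main()
{
    const int n = 4;
    const float h_v[n] = {1.0f, 2.0f, 2.0f, 4.0f};  // sum of squares = 25

    float* d_v = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_v, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_v, h_v, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    // One block of 32 threads is enough for this tiny example.
    sumOfSquares<<<1, 32>>>(d_v, n, d_out);

    // Copy the value back before the square root has been applied.
    float h_sum = 0.0f;
    cudaMemcpy(&h_sum, d_out, sizeof(float), cudaMemcpyDeviceToHost);

    // The square root is taken on the host; this is the step I would
    // prefer to keep inside the device function.
    float result = std::sqrt(h_sum);
    std::printf("%f\n", result);

    cudaFree(d_v);
    cudaFree(d_out);
    return 0;
}
```

This works, but the caller has to remember to finish the calculation on the host, which is what I would like to avoid.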