I have a function that computes a value from a vector. I have checked, and it can be parallelized, so I would like to turn it into a CUDA device function. I am in the process of doing this.
However, the operation this function performs involves some sums, some multiplications and, at the end, a square root.
The internal calculations will be done in device memory. The function will then return the value, which I will transfer to the host in the main function, to display it or for any other purpose.
However, I still have to take the square root of the value, and I think I cannot call std::sqrt() in device code to do this.
My question is, how can I do this with CUDA?
(One suggestion, of course, is to return the value before taking the square root, transfer it to the host and apply sqrt there, but that defeats the purpose of keeping the function modular. I wonder if there is another way.)
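
To make the workaround in the parentheses concrete, here is a minimal sketch of what I mean. It is not my real code: it assumes the value is a plain sum of squares (so the final result is a Euclidean norm), the kernel name sumOfSquares and the tiny test vector are made up for illustration, and error checking on the CUDA calls is left out for brevity.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel (name and computation are assumptions):
// each thread adds the square of one element into *result.
__global__ void sumOfSquares(const float* v, int n, float* result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(result, v[i] * v[i]);
    }
}

int main()
{
    const int n = 4;
    const float h_v[n] = {1.0f, 2.0f, 2.0f, 4.0f};  // sum of squares is 25

    // Allocate device memory for the vector and for the single result.
    float* d_v = nullptr;
    float* d_result = nullptr;
    cudaMalloc(&d_v, n * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));

    cudaMemcpy(d_v, h_v, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_result, 0, sizeof(float));  // accumulator starts at 0.0f

    // One block of 32 threads is enough for this tiny example.
    sumOfSquares<<<1, 32>>>(d_v, n, d_result);

    // Copy the value *before* the square root back to the host.
    // This blocking copy also waits for the kernel to finish.
    float h_sum = 0.0f;
    cudaMemcpy(&h_sum, d_result, sizeof(float), cudaMemcpyDeviceToHost);

    // The square root is taken on the host, outside the device code.
    float norm = std::sqrt(h_sum);
    std::printf("norm = %f\n", norm);

    cudaFree(d_v);
    cudaFree(d_result);
    return 0;
}
```

The part I dislike is the last step: the std::sqrt call lives in main rather than next to the rest of the calculation, so every caller has to remember to apply it.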