How do you calculate the square root of a scalar using the CUDA math API?

pettes · May 17, 2017, 12:49pm

I’m trying to convert a C program I have into CUDA. In the original C program I took the square root of a scalar (wtw) by first adding #include "math.h" and then calling the sqrt function like this: *wtw = sqrt(*wtw);.

Now that I’m converting the program into CUDA code I want to do the exact same thing, but I can’t get it to work. I’ve been reading the CUDA math API reference manual but I don’t understand it and there are no clear examples. This is what my code looks like:

…
/* Create a wtw variable on the device */
double *wtw;
cudaMalloc(&wtw, (p*p) * sizeof(double));
…
/* wtw becomes the output from a cublasDgemv call */
cublasDgemv(handle, CUBLAS_OP_T, n, p, &alpha, w, n, w, incx, &beta, wtw, incx); 

/* Try to calculate the square-root of wtw (but fail) */
wtw = sqrt(wtw);

When I run it I get the error: “no instance of overloaded function “sqrt” matches the argument list”.
The Mathematical function documentation simply says: ‘CUDA mathematical functions are always available in device code’, and regarding the use of sqrt: ‘__device__ double sqrt (double x)’. What does that even mean?

Should I include something at the start of my program to make it work? What am I doing wrong? If someone could provide a simple example on how to do this I would greatly appreciate it.

Thanks.

cbuchner1 · May 17, 2017, 1:18pm

It’s amazing how people trip over C/C++ language problems when trying to do CUDA.

The compiler even tells you what’s wrong. It does not know a sqrt function that accepts what you fed it.

double *wtw; // this is a pointer.

try

*wtw = sqrt(*wtw);

to compute and store the square root of the element that the pointer points to (which is a double, for which there is a sqrt function defined)

Alternatively you could use array notation which may become handy if wtw contains several elements that need their square roots taken.

wtw[0] = sqrt(wtw[0]); // if needed you can add a loop to go over several elements of wtw

pettes · May 17, 2017, 1:27pm

Hi. Thanks for the answer. That is exactly how I wrote it in my original C program (which is really a MEX function for Matlab). So initially I tried writing it like that in my CUDA conversion of the program as well.

And it compiles without errors. But when I try to run it with *wtw = sqrt(*wtw), Matlab crashes completely. As soon as I comment out that line it works again and I get the correct results without the square root…

Robert_Crovella · May 17, 2017, 2:05pm

It’s crashing because wtw is a pointer to a device memory location (as a result of cudaMalloc).

This:

*wtw

dereferences that device pointer in host code.

That is illegal in CUDA, and if you do it you will get a seg fault (crash).

If you want to do the in-place square root of the first element of the array pointed to by wtw, you will need to do that in device code, e.g. by launching a kernel.
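A minimal sketch of the kernel approach, assuming a single thread is enough because only one element is touched. The names sqrt_in_place and sqrt_first_element_on_device are made up for illustration and are not part of CUDA.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Kernel: replaces the first element of the array with its square root.
// The dereference happens in device code, where a device pointer is valid.
__global__ void sqrt_in_place(double *x)
{
    x[0] = sqrt(x[0]);
}

// Hypothetical host-side helper: launches one block of one thread on the
// default stream and waits for the kernel to finish.
cudaError_t sqrt_first_element_on_device(double *d_x)
{
    sqrt_in_place<<<1, 1>>>(d_x);
    cudaError_t err = cudaGetLastError();   // reports launch errors
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();         // reports execution errors
}
```

In the code from the question this would be called as sqrt_first_element_on_device(wtw); right after the cublasDgemv call, assuming the cuBLAS handle has not been bound to a separate stream.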

Or you could do a cudaMemcpy to copy that element to the host (say, a temporary double variable), take the square root of it in host code, then cudaMemcpy the result back to the first location in the wtw array.
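The copy-based route could look roughly like this. It is a sketch only; the helper name sqrt_first_element_via_host is hypothetical, and d_x is assumed to be a pointer obtained from cudaMalloc.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Hypothetical helper: takes the square root of the first element of a
// device array by staging it through a temporary host variable.
cudaError_t sqrt_first_element_via_host(double *d_x)
{
    double tmp = 0.0;

    // Device -> host: copy one double into the temporary.
    cudaError_t err = cudaMemcpy(&tmp, d_x, sizeof(double),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) return err;

    tmp = sqrt(tmp);   // ordinary host-side sqrt on a host variable

    // Host -> device: write the result back to the first location.
    return cudaMemcpy(d_x, &tmp, sizeof(double), cudaMemcpyHostToDevice);
}
```

cudaMemcpy is blocking with respect to the host here, so no extra synchronization is needed before reading tmp.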

cbuchner1 · May 17, 2017, 2:51pm

Alternatively we might try to use the CUDA unified memory architecture. The required allocation function is called cudaMallocManaged(). Data can then be accessed by both the device and the host.

I am not sure whether cuBLAS API calls play nicely with unified memory; this is something to try and find out.
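As a rough sketch of the managed-memory idea: the helper below is hypothetical and assumes m_x was allocated with cudaMallocManaged instead of cudaMalloc. It does not settle the cuBLAS question; it only shows the host-side access pattern.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Hypothetical helper: host-side in-place square root of the first element
// of a managed allocation (one obtained from cudaMallocManaged).
cudaError_t sqrt_first_element_managed(double *m_x)
{
    // Wait for outstanding GPU work before the host touches managed memory.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) return err;

    m_x[0] = sqrt(m_x[0]);   // legal on the host because the memory is managed
    return cudaSuccess;
}
```

The allocation in the question would change to cudaMallocManaged(&wtw, (p*p) * sizeof(double)); with the rest of the code left as it is.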

Christian

pettes · May 17, 2017, 3:10pm

Thank you very much. By temporarily transferring the variable to the host and doing the sqrt there, it worked just like you said!

Do you know which of the two methods would be considered preferable in terms of performance? Is it slow to cudaMemcpy a single scalar between device->host->device compared to launching a kernel on that scalar?

cbuchner1 · May 17, 2017, 3:15pm

Yes, there is an overhead associated with the memcpy, as a transaction has to be made over the PCIe bus.

But either way you will probably need to have the result available on the host. It would probably not make a notable difference whether sqrt() is called on the host or the device side, unless we’re talking about thousands of required sqrt() calls, in which case the GPU might become faster.