Originally published at: https://developer.nvidia.com/blog/accelerate-r-applications-cuda/

R is a free software environment for statistical computing and graphics that provides a programming language and built-in libraries of mathematical operations for statistics, data analysis, machine learning, and much more. Many domain experts and researchers use the R platform and contribute R software, resulting in a large ecosystem of free software packages available through…

Thanks Patric, great overview!

We work in a Windows environment and have found very few compiled R applications that can take advantage of CUDA. The matrix operations we do are highly divisible (perfect for CUDA), but most of what we do multi-threads over CPU cores.

Can you recommend an R solution that will allow us to parallel-process with CUDA in a Windows environment?

Thanks

Pat

Hi Patrick,

Thanks for your comments.

In your case, for matrix operations, I suggest you use the cublasXt API ( https://developer.nvidia.co... ) and/or the MAGMA library (https://developer.nvidia.co... ) from R. To use these libraries, you need to write your own interface functions, as this post describes.
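As an illustration of what such an interface function might look like, here is an untested sketch of a cublasXt-based matrix multiply callable from R via .C(). The function name gemm_gpu and all the details are illustrative, and error checking is omitted:

```c
// Sketch of an R-callable wrapper around cublasXt for C = A * B.
// Assumes the cuBLAS-XT headers and libraries are available; A, B, C
// are column-major double matrices, which matches R's storage layout.
#include <cublasXt.h>

void gemm_gpu(int *m, int *n, int *k, double *A, double *B, double *C)
{
    cublasXtHandle_t handle;
    cublasXtCreate(&handle);

    int devices[1] = {0};                     /* use GPU 0 */
    cublasXtDeviceSelect(handle, 1, devices);

    double alpha = 1.0, beta = 0.0;
    /* cublasXt tiles the matrices and manages host<->device transfers
     * internally, so A, B and C can be ordinary host (R) memory. */
    cublasXtDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  *m, *n, *k, &alpha,
                  A, *m, B, *k, &beta, C, *m);

    cublasXtDestroy(handle);
}
```

After compiling this into a shared library and loading it with dyn.load(), it could be invoked from R with something like .C("gemm_gpu", ...), passing the dimensions and matrices as shown in the post.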

On Windows, you can compile your interface functions into a DLL and then load it with dyn.load() in R. One simple way is to use Visual Studio (IDE) to create and build a DLL project containing your CUDA library calls or CUDA C/C++/Fortran code. You can refer to this document (http://docs.nvidia.com/cuda... ) to set up the development environment.
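As a plain-C illustration (no CUDA involved) of the calling convention R's .C() expects, here is a minimal sketch; the function name vec_scale is made up for this example:

```c
/* R's .C() interface passes every argument as a pointer, so a function
 * exported for .C("vec_scale", ...) takes only pointer arguments and
 * returns void; results are written back through the arguments. */
void vec_scale(int *n, double *x, double *factor)
{
    for (int i = 0; i < *n; i++)
        x[i] *= *factor;
}
```

In R, after dyn.load()-ing the DLL, this could be called as, e.g., .C("vec_scale", as.integer(4), x = as.double(1:4), as.double(2))$x.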

Hope this can help you!

-Patric

Hi Patric,

Thank you very much for this nice tutorial!

I have a question especially regarding the FFT.

I was using your example:

num <- 4

set.seed(5)

z <- complex(real = stats::rnorm(num), imaginary = stats::rnorm(num))

cpu <- fft(z)

gpu <- cufft1D(z)

However, if I compute the sums, I get (very slightly) different results in the imaginary part:

> sum(cpu)

[1] -3.363422+6.845763i

> sum(gpu)

[1] -3.363422+6.845764i

It does not look severe, but since I am using the FFT at a very precise level, I found that this numerical error compounds rather severely. I am using CUDA 6.5.12 for my analysis.

Do you have any suggestion what I could do?

Hi Christian,

It's great that you can run R with CUDA and leverage the GPU's power :)

I'm sorry the data type in my code was a little vague, resulting in slightly different results; I will update the code later. Thanks for catching this.

Actually, in my example, I used the single-precision cuFFT flag and API (CUFFT_C2C / cufftExecC2C). For high-precision calculations, you can use the double-precision cuFFT API by making the following changes in the code:

type        single precision    double precision

data type   cufftComplex        cufftDoubleComplex

flag        CUFFT_C2C           CUFFT_Z2Z

API         cufftExecC2C        cufftExecZ2Z

Refer: http://docs.nvidia.com/cuda...
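The gap you saw is ordinary single-precision rounding rather than anything specific to cuFFT. A small plain-C illustration (the function name sum_demo and the series are made up for this example) of how a float accumulation drifts while a double stays accurate:

```c
#include <math.h>

/* Sum the same series in single and double precision. In float, once the
 * running sum grows large, each added term loses low-order bits, so the
 * two totals diverge in the last digits -- the same effect that made the
 * single-precision cufftExecC2C result differ from R's double-precision
 * fft() in the last decimal place. */
double sum_demo(int n, double *fsum_out)
{
    float  fsum = 0.0f;
    double dsum = 0.0;
    for (int i = 0; i < n; i++) {
        fsum += 0.1f;
        dsum += 0.1;
    }
    *fsum_out = (double)fsum;
    return dsum;
}
```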

With double precision, the results match:

> num <- 4

> set.seed(5)

> z <- complex(real = stats::rnorm(num), imaginary = stats::rnorm(num))

> cpu <- fft(z)

> gpu <- cufft1D(z)

> sum(cpu)

[1] -3.363422+6.845763i

> sum(gpu)

[1] -3.363422+6.845763i

> cpu

[1] -0.6418452+0.0009952i 0.4470997+0.8693907i -3.5508495+2.4775538i

[4] 0.3821731+3.4978238i

> gpu

[1] -0.6418452+0.0009952i 0.4470997+0.8693907i -3.5508495+2.4775538i

[4] 0.3821731+3.4978238i

Enjoy CUDA and GPU programming!

Thanks,

-Patric

Hi Patric,

thanks a lot, that worked for me!

I am using the FFT to convolve distribution functions, to obtain something like a yearly loss distribution from several single losses that happened over the year. As can be shown, this convolution is quite tedious when done in the space of distributions, but rather easy when done via the Fourier transform.

Interestingly, the GPU does not always beat the CPU (at least in my case); it only wins when I calculate the FFT on a very large grid. I will play around with it to find the break-even point, so I can define rules for when to use the CPU and when to use the GPU. In any case, CUDA adds a great deal to my performance, especially with the R interface.

Kind regards,

Christian

Hi Christian,

Thanks for your explanation of the background and the performance improvements in your application.

I suggest you try the batch mode of cuFFT for small transforms, if the data are independent.

"Execution of multiple 1D, 2D and 3D transforms simultaneously. These batched transforms have higher performance than single transforms."

Check out the cuFFT documentation, which includes an example of batch-mode computation.

http://docs.nvidia.com/cuda...
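For reference, a batched plan is created with cufftPlanMany. The sketch below is untested; the function name fft_batched and the layout choices are only illustrative:

```c
#include <cufft.h>

/* Sketch: one plan that executes `batch` independent 1D Z2Z FFTs of
 * length n in a single call, instead of looping over cufftExecZ2Z.
 * d_data points to device memory holding all batches back to back. */
void fft_batched(cufftDoubleComplex *d_data, int n, int batch)
{
    cufftHandle plan;
    /* Simple layout: transforms stored contiguously one after another. */
    cufftPlanMany(&plan, 1, &n,
                  NULL, 1, n,     /* input: default layout, distance n  */
                  NULL, 1, n,     /* output: same layout                */
                  CUFFT_Z2Z, batch);
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);
    cufftDestroy(plan);
}
```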

Hope this helps.

BR,

-Patric

Hi Patric, I am getting this error "LIBCMTD.lib(crt0.obj) : error LNK2019: unresolved external symbol _main referenced in function ___tmainCRTStartup" on VS2013 using CUDA 6.5. What could it be?

Thanks Patric. I am getting Error in .C("cufft", as.integer(n), as.integer(inverse), as.double(Re(z)), :

C symbol name "cufft" not in load table

It is strange, since I am using extern "C" as you indicated.

Thanks for your interest in our post :)

Actually, I was compiling and testing my code on Linux, so I am not sure about this problem in the VS IDE. But I think you're correct to add the 'extern "C"' keyword :)

Several high level suggestions to debug the problem:

0) save the source code with a .c extension (not .cpp)

1) ensure the compiled function's architecture (32-bit/64-bit) matches your R installation

2) try compiling pure C/C++ code (without CUDA code) and loading it in R

3) build pure CUDA code in VS (with a main function to run and test)

4) finally, combine 2) and 3)

Hope this can help you. Feel free to let us know if you have further problems or questions about compiling or using CUDA in R.

Best Regards,

--Patric

Thank you very much Patric, I really appreciate your help. I double-checked my code in VS2013 in 64-bit mode, and now it works like a charm. I wasn't declaring the host function properly; it should be, for example, 'extern "C" __declspec(dllexport) void vectorAdd(double* param){ '. My code is a .cu file, so there were no problems caused by the file type. By the way, the R function getLoadedDLLs() may show TRUE for our DLL, but if we forget to add "__declspec(dllexport)", we still won't be able to call the functions, as was happening to me (which naturally makes sense, since we are not flagging those functions for export).
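One way to keep such code portable between Windows (where __declspec(dllexport) is required) and Linux (where it is not) is a small macro. This is a generic pattern rather than anything from the post, and the function body is just a placeholder:

```c
/* Expands to __declspec(dllexport) only when building on Windows,
 * so the same source compiles unchanged with VS2013 and with gcc. */
#ifdef _WIN32
#define DLLEXPORT __declspec(dllexport)
#else
#define DLLEXPORT
#endif

/* Every function R should see via .C() gets the macro
 * (plus extern "C" when the file is compiled as C++ or .cu). */
DLLEXPORT void vectorAdd(int *n, double *a, double *b)
{
    for (int i = 0; i < *n; i++)
        a[i] += b[i];
}
```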

Cool! Thanks for these VS2013 tips for Windows.

Hi Patric, I need similar steps for Windows. Could you suggest some good links?

Hi Randy,

Thanks for your interest in our blog.

I have updated the blog to show how to build R applications with CUDA using VS2013 on Windows. Hope it helps.

Thanks.

Hi Patric,

Can we expect the same at some point for Stata, either running under Windows or Unix?

Best regards,

Eric

Does Stata have a programming language with a foreign function interface? Does it rely on any standard math library interfaces (e.g. BLAS) where CUDA libraries could be used as a replacement?

Hi Patric, although this was posted 2 years ago, I am really happy to see this article. Thank you! :D

Thank you for the interesting article and practical examples.

I would like to point out what I suspect are two small errors:

1) In Figure 2, the CUDA code for the FFT, the first line is an incomplete include statement. My guess is that line 1 should read:

"""

#include <R.h>

"""

And the file compiles when this change is made.

2) In Figure 4, the Vector Add function, line 30, where the vector-add kernel is called, references the variables "gridsize" and "blocksize", whereas they were defined with a capital S: "gridSize" and "blockSize".

Fixed, thanks!

Thanks for these detailed instructions.

I have a performance issue with CUDA and R. I achieved good performance when using cuBLAS for matrix multiplication: about 6 seconds for MatrixA(10000,10000) and MatrixB(10000,10000).

However, the performance is not good when I use the matrix multiplication kernel from the NVIDIA SDK: about 40 seconds for the same MatrixA(10000,10000) and MatrixB(10000,10000).

I have no idea why this occurs. I would really appreciate any suggestions.