Originally published at: https://developer.nvidia.com/blog/accelerate-r-applications-cuda/

R is a free software environment for statistical computing and graphics that provides a programming language and built-in libraries of mathematical operations for statistics, data analysis, machine learning, and much more. Many domain experts and researchers use the R platform and contribute R software, resulting in a large ecosystem of free software packages available through…

Thanks Patric, great overview!

We work in a Windows environment and have found very few compiled R applications that can take advantage of CUDA. The matrix operations we do are highly divisible (perfect for CUDA), but most of what we do multi-threads over CPU cores.

Can you recommend an R solution that will allow us to parallel-process with CUDA in a Windows environment?

Thanks

Pat

Hi Patrick,

Thanks for your comments.

In your case, for matrix operations, I suggest you use the cublasXt API ( https://developer.nvidia.co... ) and/or the MAGMA library (https://developer.nvidia.co... ) from R. To use these libraries, you need to write your own interface functions, as this post describes.
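As an illustration of what such an interface function might look like, here is an untested sketch of a cublasXt-based matrix multiply callable from R via .C(). The function name gemm_gpu and all the details are illustrative, and error checking is omitted:

```c
// Sketch of an R-callable wrapper around cublasXt for C = A * B.
// Assumes the cuBLAS-XT headers and libraries are available; A, B, C
// are column-major double matrices, which matches R's storage layout.
#include <cublasXt.h>

void gemm_gpu(int *m, int *n, int *k, double *A, double *B, double *C)
{
    cublasXtHandle_t handle;
    cublasXtCreate(&handle);

    int devices[1] = {0};                     /* use GPU 0 */
    cublasXtDeviceSelect(handle, 1, devices);

    double alpha = 1.0, beta = 0.0;
    /* cublasXt tiles the matrices and manages host<->device transfers
     * internally, so A, B and C can be ordinary host (R) memory. */
    cublasXtDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  *m, *n, *k, &alpha,
                  A, *m, B, *k, &beta, C, *m);

    cublasXtDestroy(handle);
}
```

After compiling this into a shared library and loading it with dyn.load(), it could be invoked from R with something like .C("gemm_gpu", ...), passing the dimensions and matrices as shown in the post.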

On Windows, you can compile your interface functions into a DLL and then load it with dyn.load() in R. One simple way is to use Visual Studio (IDE) to create and build a DLL project containing your CUDA library calls or CUDA C/C++/Fortran code. You can refer to this document (http://docs.nvidia.com/cuda... ) to set up the development environment.
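As a plain-C illustration (no CUDA involved) of the calling convention R's .C() expects, here is a minimal sketch; the function name vec_scale is made up for this example:

```c
/* R's .C() interface passes every argument as a pointer, so a function
 * exported for .C("vec_scale", ...) takes only pointer arguments and
 * returns void; results are written back through the arguments. */
void vec_scale(int *n, double *x, double *factor)
{
    for (int i = 0; i < *n; i++)
        x[i] *= *factor;
}
```

In R, after dyn.load()-ing the DLL, this could be called as, e.g., .C("vec_scale", as.integer(4), x = as.double(1:4), as.double(2))$x.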

Hope this can help you!

-Patric

Hi Patric,

Thank you very much for this nice tutorial!

I have a question especially regarding the FFT.

I was using your example:

num <- 4

set.seed(5)

z <- complex(real = stats::rnorm(num), imaginary = stats::rnorm(num))

cpu <- fft(z)

gpu <- cufft1D(z)

However, if I compute the sums, I get (very slightly) different results in the imaginary part:

> sum(cpu)

[1] -3.363422+6.845763i

> sum(gpu)

[1] -3.363422+6.845764i

It does not look severe, but since I am using the FFT at a very precise level, I found that this numerical error compounds rather severely. I am using CUDA 6.5.12 for my analysis.

Do you have any suggestion what I could do?

Hi Christian,

It's great that you can run R with CUDA and leverage the GPU's power :)

I'm sorry the data type in my code was a little vague, resulting in slightly different results; I will update the code later. Thanks for catching this.

Actually, in my example, I used the single-precision cuFFT flag and API (CUFFT_C2C / cufftExecC2C). For high-precision calculations, you can use the double-precision cuFFT API by making the following changes in the code:

type        single precision    double precision

data type   cufftComplex        cufftDoubleComplex

flag        CUFFT_C2C           CUFFT_Z2Z

API         cufftExecC2C        cufftExecZ2Z

Refer: http://docs.nvidia.com/cuda...
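The gap you saw is ordinary single-precision rounding rather than anything specific to cuFFT. A small plain-C illustration (the function name sum_demo and the series are made up for this example) of how a float accumulation drifts while a double stays accurate:

```c
#include <math.h>

/* Sum the same series in single and double precision. In float, once the
 * running sum grows large, each added term loses low-order bits, so the
 * two totals diverge in the last digits -- the same effect that made the
 * single-precision cufftExecC2C result differ from R's double-precision
 * fft() in the last decimal place. */
double sum_demo(int n, double *fsum_out)
{
    float  fsum = 0.0f;
    double dsum = 0.0;
    for (int i = 0; i < n; i++) {
        fsum += 0.1f;
        dsum += 0.1;
    }
    *fsum_out = (double)fsum;
    return dsum;
}
```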

With double precision, the results match:

> num <- 4

> set.seed(5)

> z <- complex(real = stats::rnorm(num), imaginary = stats::rnorm(num))

> cpu <- fft(z)

> gpu <- cufft1D(z)

> sum(cpu)

[1] -3.363422+6.845763i

> sum(gpu)

[1] -3.363422+6.845763i

> cpu

[1] -0.6418452+0.0009952i 0.4470997+0.8693907i -3.5508495+2.4775538i

[4] 0.3821731+3.4978238i

> gpu

[1] -0.6418452+0.0009952i 0.4470997+0.8693907i -3.5508495+2.4775538i

[4] 0.3821731+3.4978238i

Enjoy CUDA and GPU programming!

Thanks,

-Patric

Hi Patric,

thanks a lot, that worked for me!

I am using the FFT to convolve distribution functions, to obtain something like a yearly loss distribution from several single losses that happened over the year. As can be shown, this convolution is quite tedious when done in the space of distributions, but rather easy when done via the Fourier transform.

Interestingly, the GPU does not always beat the CPU (at least in my case); it only wins when I calculate the FFT on a very large grid. I will play around with it to find the break-even point, so I can define rules for when to use the CPU and when to use the GPU. In any case, CUDA adds a great deal to my performance, especially with the R interface.

Kind regards,

Christian

Hi Christian,

Thanks for your explanation of the background and the performance improvements in your application.

I suggest you try the batch mode of cuFFT for small transforms, if the data are independent.

"Execution of multiple 1D, 2D and 3D transforms simultaneously. These batched transforms have higher performance than single transforms."

Check out the cuFFT documentation, which includes an example of batch-mode computation.

http://docs.nvidia.com/cuda...
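For reference, a batched plan is created with cufftPlanMany. The sketch below is untested; the function name fft_batched and the layout choices are only illustrative:

```c
#include <cufft.h>

/* Sketch: one plan that executes `batch` independent 1D Z2Z FFTs of
 * length n in a single call, instead of looping over cufftExecZ2Z.
 * d_data points to device memory holding all batches back to back. */
void fft_batched(cufftDoubleComplex *d_data, int n, int batch)
{
    cufftHandle plan;
    /* Simple layout: transforms stored contiguously one after another. */
    cufftPlanMany(&plan, 1, &n,
                  NULL, 1, n,     /* input: default layout, distance n  */
                  NULL, 1, n,     /* output: same layout                */
                  CUFFT_Z2Z, batch);
    cufftExecZ2Z(plan, d_data, d_data, CUFFT_FORWARD);
    cufftDestroy(plan);
}
```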

Hope this helps.

BR,

-Patric

Hi Patric, I am getting this error "LIBCMTD.lib(crt0.obj) : error LNK2019: unresolved external symbol _main referenced in function ___tmainCRTStartup" on VS2013 using CUDA 6.5. What could it be?

Thanks Patric. I am getting Error in .C("cufft", as.integer(n), as.integer(inverse), as.double(Re(z)), :

C symbol name "cufft" not in load table

It is strange, since I am using extern "C" as you indicated.

Thanks for your interest in our post :)

Actually, I was compiling and testing my code on Linux, so I am not sure about this problem in the VS IDE. But I think you're correct to add the 'extern "C"' keyword :)

Several high level suggestions to debug the problem:

0) save the source code with a .c extension (not .cpp)

1) ensure the compiled function's architecture (32-bit/64-bit) matches your R installation

2) try compiling pure C/C++ code (without CUDA code) and loading it in R

3) build pure CUDA code in VS (with a main function to run and test)

4) finally, combine 2) and 3)

Hope this can help you. Feel free to let us know if you have further problems or questions about compiling or using CUDA in R.

Best Regards,

--Patric

Thank you very much Patric, I really appreciate your help. I double-checked my code in VS2013 in 64-bit mode, and now it works like a charm. I wasn't declaring the host function properly; it should be, for example, 'extern "C" __declspec(dllexport) void vectorAdd(double* param){ '. My code is a .cu file, so there were no problems caused by the file type. By the way, the R function getLoadedDLLs() may show TRUE for our DLL, but if we forget to add "__declspec(dllexport)", we still won't be able to call the functions, as was happening to me (which naturally makes sense, since we are not flagging those functions for export).
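One way to keep such code portable between Windows (where __declspec(dllexport) is required) and Linux (where it is not) is a small macro. This is a generic pattern rather than anything from the post, and the function body is just a placeholder:

```c
/* Expands to __declspec(dllexport) only when building on Windows,
 * so the same source compiles unchanged with VS2013 and with gcc. */
#ifdef _WIN32
#define DLLEXPORT __declspec(dllexport)
#else
#define DLLEXPORT
#endif

/* Every function R should see via .C() gets the macro
 * (plus extern "C" when the file is compiled as C++ or .cu). */
DLLEXPORT void vectorAdd(int *n, double *a, double *b)
{
    for (int i = 0; i < *n; i++)
        a[i] += b[i];
}
```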

Cool! Thanks for these VS2013 tips for Windows.

Hi Patric, I need similar steps for Windows. Could you suggest some good links?

Hi Randy,

Thanks for your interest in our blog.

I have updated the blog to show how to build R applications with CUDA using VS2013 on Windows. Hope it helps.

Thanks.

Hi Patric,

Can we expect the same at some point for Stata, either running under Windows or Unix?

Best regards,

Eric

Does Stata have a programming language with a foreign function interface? Does it rely on any standard math library interfaces (e.g. BLAS) where CUDA libraries could be used as a replacement?

Hi Patric, although this was posted 2 years ago, I am really happy to see this article. Thank you! :D

Thank you for the interesting article and practical examples.

I would like to point out what I suspect are two small errors:

1) In Figure 2, the CUDA code for the FFT, the first line is an incomplete include statement. My guess is that line 1 should read:

"""

#include <R.h>

"""

And the file compiles when this change is made.

2) In Figure 4, the Vector Add function, line 30, where the vector-add kernel is called, references the variables "gridsize" and "blocksize", whereas they were defined with a capital S: "gridSize" and "blockSize".

Fixed, thanks!

Thanks for these detailed instructions.

I have a performance issue with CUDA and R. I achieved good performance when using cuBLAS for matrix multiplication: about 6 seconds for MatrixA(10000,10000) and MatrixB(10000,10000).

However, the performance is not good when I use the matrix multiplication kernel from the NVIDIA SDK: about 40 seconds for the same MatrixA(10000,10000) and MatrixB(10000,10000).

I have no idea why this occurs. I would really appreciate any suggestions.