Hi,
I’m currently working on integrating CUDA with the statistical language R. As a proof of concept I’m porting one standard R function to CUDA, namely the density of the normal distribution (called dnorm() in R). The density is calculated with this formula (x = input, mu = mean, sigma = standard deviation):
f(x) = 1/(sqrt(2 pi) sigma) e^-((x - mu)^2/(2 sigma^2))
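In plain C, a straightforward single-threaded reference version of this formula (what the kernel computes per element; the function name is just illustrative) would look roughly like this:

#include <math.h>

/* Reference CPU version of the normal density for a single value. */
float dnorm_ref(float x, float mu, float sigma)
{
    float d = x - mu;
    return 1.0f / (sqrtf(2.0f * M_PI) * sigma) * expf(-(d * d) / (2.0f * sigma * sigma));
}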
I have the function working perfectly (and callable from within R) and I’m getting an average 5x speedup of GPU over CPU (a quad-core 2.4 GHz Core 2 [R only uses one core] versus an 8800 GTX). The biggest bottleneck is that I need to cast all input/output from double precision to single precision, since R only uses double precision. So, to optimize this algorithm, I have a few questions:
- Is casting the input/output with a for() loop on the CPU, iterating over all values, the fastest method? E.g.:

for (i = 0; i < N; i++)
{
    output_float[i] = (float) input_double[i];
}
- Since this CUDA function only uses each input value once, do I need to use shared memory? I’ve read somewhere that even when the input is used only once, shared memory might speed things up (I couldn’t notice it myself, though). The variant I tried is sketched further down, next to the occupancy numbers.
- Do I need to use the specialized CUDA intrinsic functions for multiplication, division, power and exponent to get the best performance? (A sketch of what I mean follows this list.)
- Can I keep the input in one vector for all input sizes? If I use a vector of, say, 10 million elements, is a single vector still the optimal data structure, or should it be split up?
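On the intrinsics question, what I mean is rewriting the arithmetic with the fast single-precision device functions, roughly like this (just a sketch of the idea, inside the kernel):

/* Same expression, but with the fast intrinsics __fdividef and __expf,
   and (d * d) instead of __powf(d, 2). */
float d = a[index] - m;
o[index] = __fdividef(1.0f, sqrtf(2.0f * M_PI) * s)
         * __expf(-__fdividef(d * d, 2.0f * s * s));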
I’ve used the CUDA GPU Occupancy Calculator with the following numbers (when using shared memory):
- Threads Per Block: 256
- Registers Per Thread: 4
- Shared Memory Per Block: 1064 bytes
(the register and shared memory figures are from the .cubin file)
According to the calculator, the Multiprocessor Warp Occupancy is 24 warps, which is the maximum.
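The shared-memory version those numbers come from is roughly shaped like this (a sketch from memory, so the details may differ slightly):

__global__ void deviceDnormShared(float* a, float* o, float m, float s)
{
    __shared__ float buf[256];            /* one float per thread, 256 threads per block = 1024 bytes */
    int tid = threadIdx.x;
    int index = blockIdx.x * blockDim.x + tid;

    buf[tid] = a[index];                  /* stage the input value in shared memory */
    /* no __syncthreads() needed here, since each thread only reads back its own slot */

    float d = buf[tid] - m;
    o[index] = (1.0f / (sqrtf(2.0f * M_PI) * s)) * expf(-(d * d) / (2.0f * s * s));
}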
The kernel code I currently use (without shared memory) is this one:

__global__ void deviceDnorm(float* a, float* o, float m, float s)
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int index = bid * blockDim.x + tid;   /* one thread per element; no bounds check, so N must be a multiple of blockDim.x */

    o[index] = (1.0f / (sqrtf(2.0f * M_PI) * s)) * expf(-((__powf((a[index] - m), 2)) / (2.0f * __powf(s, 2))));
}
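For context, the host-side launch is roughly this (a sketch; the wrapper that R actually calls into is omitted and the buffer names are made up):

/* 256 threads per block, enough blocks to cover all N elements.
   N is assumed to be a multiple of 256 here, matching the kernel's
   lack of a bounds check. */
int threads = 256;
int blocks  = N / threads;
deviceDnorm<<<blocks, threads>>>(d_input, d_output, mu, sigma);
cudaThreadSynchronize();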
Any advice on (further) optimizations is very much appreciated and welcome!
- Marcel.