Hi,
I’m currently working on integrating CUDA with the statistical language R. As a proof of concept I’m porting one standard R function to CUDA, namely the normal distribution density (called dnorm() in R). The density is calculated with this formula (x = input, mu = mean, sigma = standard deviation):
f(x) = 1/(sqrt(2 pi) sigma) * e^(-(x - mu)^2 / (2 sigma^2))
I have the function working perfectly (and callable from within R) and am getting an average 5x speedup comparing CPU vs. GPU (quad-core 2.4 GHz Core 2 [R only uses one core] with an 8800 GTX). The biggest bottleneck is that I need to cast all input/output from double precision to single precision, since R only uses double precision. So for optimizing this algorithm, I have a few questions:
- Is casting the input/output with a for() loop (on the CPU) iterating over all values the fastest method? e.g.:
for (i = 0; i < N; i++) {
    output_float[i] = (float) input_double[i];
}
- Since this CUDA function only uses each input value once, do I need to use shared memory?
I’ve read somewhere that even when the input is used only once, shared memory might speed things up (I couldn’t measure any difference myself, though).

- Do I need to use the specialized CUDA intrinsics for multiplication, division, power and exponentiation to get the best performance?
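For example, I could imagine rewriting the arithmetic with the fast-math intrinsics `__expf` and `__fdividef`, and replacing `__powf(x, 2)` with a plain multiply (a sketch; I haven’t verified these are actually faster here):

```cuda
// Sketch: same computation using fast-math intrinsics.
// __expf and __fdividef trade accuracy for speed; d * d avoids __powf entirely.
__global__ void deviceDnormFast(float* a, float* o, float m, float s) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    float d = a[index] - m;
    o[index] = __fdividef(1.0f, sqrtf(2.0f * M_PI) * s)
             * __expf(-__fdividef(d * d, 2.0f * s * s));
}
```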

- Can I keep the input in one vector for all input sizes? With a vector of, say, 10 million elements, is that still the optimal data structure, or should it be changed or split up?
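One pattern I’ve seen suggested for this is a grid-stride loop (a sketch, with the length `n` passed in as an extra parameter), which keeps the whole input in one vector and lets a fixed launch configuration cover any input size, including sizes that aren’t a multiple of the block size:

```cuda
// Grid-stride loop: each thread processes every (gridDim.x * blockDim.x)-th
// element, so a single vector of any length n works with one fixed launch.
__global__ void deviceDnormStride(const float* a, float* o,
                                  float m, float s, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        float d = a[i] - m;
        o[i] = (1.0f / (sqrtf(2.0f * M_PI) * s))
             * expf(-(d * d) / (2.0f * s * s));
    }
}
```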
I’ve used the CUDA GPU Occupancy Calculator with the following numbers (when using shared memory):

Threads Per Block: 256
Registers Per Thread: 4
Shared Memory Per Block: 1064

(information is from the .cubin file)
The Multiprocessor Warp Occupancy is 24 according to this calculator, which is the Max Occupancy.
The kernel code I currently use (without shared mem usage) is this one:
__global__ void deviceDnorm(float* a, float* o, float m, float s) {
    int tid = threadIdx.x;
    int bid = blockIdx.x;
    int index = bid * blockDim.x + tid;
    // f(x) = 1/(sqrt(2 pi) * s) * e^(-(x - m)^2 / (2 s^2))
    o[index] = (1.0f / (sqrtf(2.0f * M_PI) * s))
             * expf(-(__powf(a[index] - m, 2)) / (2.0f * __powf(s, 2)));
}
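For completeness, the host side I use looks roughly like this (a sketch with error checking omitted; it assumes N is a multiple of the block size of 256):

```cuda
// Host-side sketch: cast double -> float, run the kernel, cast back.
void dnormGPU(const double* x, double* out, int n, double mu, double sigma) {
    float* h_in  = (float*) malloc(n * sizeof(float));
    float* h_out = (float*) malloc(n * sizeof(float));
    for (int i = 0; i < n; i++)            // R hands us doubles; the G80
        h_in[i] = (float) x[i];            // only supports single precision

    float *d_in, *d_out;
    cudaMalloc((void**) &d_in,  n * sizeof(float));
    cudaMalloc((void**) &d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    deviceDnorm<<<n / 256, 256>>>(d_in, d_out, (float) mu, (float) sigma);

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++)
        out[i] = (double) h_out[i];

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
}
```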
Any advice on (further) optimizations is very much appreciated and welcome!
- Marcel.