Computing mean and standard deviation in parallel Cna we extend Parallel Reduction?


We want to compute the arithmetic mean (average) and standard deviation of a large set of floating point numbers. The count of the numbers is NOT known at launch of kernel and it can vary from 0 to 20 million.

Is this problem parallelizable using CUDA?

Is it necessary to know the count of numbers before launching the kernel?

Here is what I do currently:


cudaStub(float* d_pElements)


// This uses parallel-scan technique to accumulate sum and count of the elements

launch kernel_SumCount<<<Blocks, Threads>>>(d_pElelements, d_pCountResults, d_pSumResults);

// all result arrays are as long as the number of blocks

// get results from device onto host

cudaMemcpy(h_pCountResults, d_pCountResults);

cudaMemcpy(h_pSumResults, d_pSumResults);

// initialize final count and sum

count = 0;

sum = 0;

// consolidate the results from kernel

for(i = 0; i < Blocks; i++)


 count += h_pCountResults[i];

 sum += h_pSumResults[i];


mean = sum / count; // Now I have the mean.

// Now I need to compute the variance. So another scan over the data using kernel similar to first one except it takes the mean as an argument.

launch kernel_Variance(d_pElements, d_pVarianceResults, mean);

// get results

cudaMemcpy(h_pVarianceResults, d_pVarianceResults);

tempVarianceSum = 0;

for(int i = 0; i < Blocks, i++)


tempVarianceSum += h_pVarianceResults[i];


variance = tempVarianceSum / count;

standardDeviation = squareRoot(variance);


So in all I made 2 scans over the data plus the two for loops on the host side. For non-trivial datasets with about 3 million elements, each kernel takes about 10 msecs to run.

Is there a better way to compute these statistics?

Is there some existing algorithm in the parallel-computing world that I am not aware of?

I have not come across an existing algorithm yet. If someone knows of an existing technique please give some direction.

Any help would be appreciated. Thanks in anticipation!

Yes, there is a better way. First, you should be calculating the standard deviation in a single pass. Here is how you do it:

foreach v in array







Now, the process of calculating the standard deviation is just a reduction. See the SDK for an optimized reduction operation.

Will this be any faster? That depends. If you have to transfer the entire array onto the GPU, calculate, then return the answer, you might not be saving any time. You can at least implement the one pass standard deviation calculation to speed up the host code.

Thanks ‘ctierney42’ !
That should reduce work in half (atleast). :thumbup:

So I must ask :rolleyes: - Is there a way to compute the median of a large set of floating points in parallel?

Thanks again!

Do a parallel sort and just have the CPU grab the middle value?

You can check out the Radix sort SDK example for sorting…

In principle, one can find the median of a list in linear time, as this is a specific case of the general selection problem, in which the goal is to find the kth smallest element in a list. (For the median, k = n/2) It’s not clear whether these algorithms could be implemented efficiently on the GPU, as they tend to look kind of like quicksort.

If performance is not the highest priority here, the suggestion to sort the list with some standard CUDA algorithm like radix sort is best. However, if you find you need the median calculation to go faster, take a look at this page for some interesting ideas on linear time algorithms:

Going that route, it might also be helpful to checkout this paper on optimizing qsort on the GPU:

Be careful of that accumulator method of computing mean and standard deviation. Its accuracy will decay quickly after about 1M points because of floating point precision.
After you’ve summed a million values, you’re now adding tiny offsets only 1/1000000 as large as the sum, which is WORST CASE for floating point.

Using double precision is an easy solution.

For single precision, a simple hack works pretty well. Accumulate 500 values or so. Then add that accumulated sum to a “master accumulator” and reset your base accumulator back to 0.0.
This isn’t ideal for preserving every bit of entropy but it’s significantly better (by several orders of magnitude) than a single accumulator variable.

Fortunately, the standard tree reduction pattern on the GPU does this automatically…