Hey,

We want to compute the arithmetic mean (average) and standard deviation of a large set of floating-point numbers. The count of the numbers is NOT known at kernel-launch time, and it can vary from 0 to 20 million.

Is this problem parallelizable using CUDA?

Is it necessary to know the count of numbers before launching the kernel?

Here is what I do currently:

[codebox]
cudaStub(float* d_pElements)
{
    // This uses a parallel-scan technique to accumulate the sum and count of the elements.
    // All result arrays are as long as the number of blocks.
    launch kernel_SumCount<<<Blocks, Threads>>>(d_pElements, d_pCountResults, d_pSumResults);

    // Get the per-block results from device onto host.
    cudaMemcpy(h_pCountResults, d_pCountResults);
    cudaMemcpy(h_pSumResults, d_pSumResults);

    // Initialize final count and sum, then consolidate the results from the kernel.
    count = 0;
    sum = 0;
    for (i = 0; i < Blocks; i++)
    {
        count += h_pCountResults[i];
        sum += h_pSumResults[i];
    }
    mean = sum / count; // Now I have the mean.

    // Now I need to compute the variance: another scan over the data, using a kernel
    // similar to the first one except that it takes the mean as an argument.
    launch kernel_Variance(d_pElements, d_pVarianceResults, mean);

    // Get results back and consolidate.
    cudaMemcpy(h_pVarianceResults, d_pVarianceResults);
    tempVarianceSum = 0;
    for (i = 0; i < Blocks; i++)
    {
        tempVarianceSum += h_pVarianceResults[i];
    }
    variance = tempVarianceSum / count;
    standardDeviation = sqrt(variance);
}
[/codebox]
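For reference, here is a minimal serial C++ sketch of the same two-pass computation. Each inner loop over a "block" stands in for what one CUDA block's partial result would be from `kernel_SumCount` / `kernel_Variance`; the function name `meanAndStdDev` and the chunk size are my own illustrative choices, not part of the actual kernels:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Serial reference for the two-pass mean/variance flow above.
void meanAndStdDev(const std::vector<float>& elements,
                   double& mean, double& stdDev)
{
    const std::size_t blockSize = 256;   // stand-in for Threads per block
    std::vector<double> partialSums;     // analogue of h_pSumResults
    std::vector<std::size_t> partialCounts; // analogue of h_pCountResults

    // Pass 1: per-"block" sum and count.
    for (std::size_t start = 0; start < elements.size(); start += blockSize) {
        const std::size_t end = std::min(start + blockSize, elements.size());
        double s = 0.0;
        for (std::size_t i = start; i < end; ++i)
            s += elements[i];
        partialSums.push_back(s);
        partialCounts.push_back(end - start);
    }

    // Host-side consolidation of the partial results.
    double sum = 0.0;
    std::size_t count = 0;
    for (std::size_t b = 0; b < partialSums.size(); ++b) {
        sum += partialSums[b];
        count += partialCounts[b];
    }
    if (count == 0) { mean = 0.0; stdDev = 0.0; return; } // empty input guard
    mean = sum / count;

    // Pass 2: per-"block" sum of squared deviations from the mean.
    double varianceSum = 0.0;
    for (std::size_t start = 0; start < elements.size(); start += blockSize) {
        const std::size_t end = std::min(start + blockSize, elements.size());
        double partial = 0.0; // analogue of one h_pVarianceResults entry
        for (std::size_t i = start; i < end; ++i) {
            const double d = elements[i] - mean;
            partial += d * d;
        }
        varianceSum += partial;
    }
    stdDev = std::sqrt(varianceSum / count); // population std deviation
}
```

Note the partials are accumulated in double even though the inputs are float; with millions of elements, single-precision accumulation loses noticeable accuracy.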

So in all, I make two passes over the data, plus the two for-loops on the host side. For non-trivial datasets of about 3 million elements, each kernel takes about 10 ms to run.

Is there a better way to compute these statistics?

Is there some existing algorithm in the parallel-computing world that I am not aware of?

I have not come across an existing algorithm yet; if someone knows of an existing technique, please point me in the right direction.

Any help would be appreciated. Thanks in anticipation!