Hey,
We want to compute the arithmetic mean (average) and standard deviation of a large set of floating-point numbers. The count of the numbers is NOT known when the kernel is launched, and it can vary from 0 to 20 million.
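Concretely, for n elements x_1, ..., x_n the quantities I need are the population statistics:
mean = (x_1 + ... + x_n) / n
variance = (sum of (x_i - mean)^2) / n
standardDeviation = sqrt(variance)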
Is this problem parallelizable using CUDA?
Is it necessary to know the count of numbers before launching the kernel?
Here is what I do currently:
[codebox]
void cudaStub(float* d_pElements)
{
    // First pass: parallel reduction that accumulates the per-block
    // sum and count of the elements
    kernel_SumCount<<<Blocks, Threads>>>(d_pElements, d_pCountResults, d_pSumResults);
    // All result arrays are as long as the number of blocks.

    // Copy the per-block results from device to host
    cudaMemcpy(h_pCountResults, d_pCountResults, Blocks * sizeof(int),   cudaMemcpyDeviceToHost);
    cudaMemcpy(h_pSumResults,   d_pSumResults,   Blocks * sizeof(float), cudaMemcpyDeviceToHost);

    // Consolidate the per-block results on the host
    int   count = 0;
    float sum   = 0.0f;
    for (int i = 0; i < Blocks; i++)
    {
        count += h_pCountResults[i];
        sum   += h_pSumResults[i];
    }
    if (count == 0) return;    // empty dataset: mean/stddev undefined
    float mean = sum / count;  // Now I have the mean.

    // Second pass: another reduction over the data, with a kernel similar
    // to the first one except it takes the mean as an argument and
    // accumulates the per-block sum of squared deviations.
    kernel_Variance<<<Blocks, Threads>>>(d_pElements, d_pVarianceResults, mean);

    // Copy the per-block results and consolidate as before
    cudaMemcpy(h_pVarianceResults, d_pVarianceResults, Blocks * sizeof(float), cudaMemcpyDeviceToHost);
    float tempVarianceSum = 0.0f;
    for (int i = 0; i < Blocks; i++)
    {
        tempVarianceSum += h_pVarianceResults[i];
    }
    float variance = tempVarianceSum / count;
    float standardDeviation = sqrtf(variance);
}
[/codebox]
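For context, kernel_SumCount is essentially a grid-stride loop feeding a standard shared-memory tree reduction, writing one partial sum and one partial count per block. Here is a condensed sketch rather than the exact code: the capacity parameter, the fixed block size of 256, and the isValid() test (NaN marking an empty slot) are placeholders I am assuming for this post.
[codebox]
#define THREADS 256  // block size assumed for this sketch; power of two

// Placeholder validity test: in this sketch an empty slot is marked
// with NaN; the real criterion may differ.
__device__ bool isValid(float x)
{
    return !isnan(x);
}

// Each thread strides over the whole buffer accumulating a private
// sum and count; the block then tree-reduces those partials in
// shared memory and thread 0 writes one result pair per block.
__global__ void kernel_SumCount(const float* d_pElements, int capacity,
                                int* d_pCountResults, float* d_pSumResults)
{
    __shared__ float s_sum[THREADS];
    __shared__ int   s_cnt[THREADS];

    int tid = threadIdx.x;
    float mySum = 0.0f;
    int   myCnt = 0;

    // Grid-stride loop: the grid size need not match the (unknown)
    // element count, only the buffer capacity.
    for (int i = blockIdx.x * blockDim.x + tid; i < capacity;
         i += gridDim.x * blockDim.x)
    {
        float x = d_pElements[i];
        if (isValid(x))
        {
            mySum += x;
            myCnt += 1;
        }
    }
    s_sum[tid] = mySum;
    s_cnt[tid] = myCnt;
    __syncthreads();

    // Tree reduction in shared memory (assumes blockDim.x == THREADS)
    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
        {
            s_sum[tid] += s_sum[tid + s];
            s_cnt[tid] += s_cnt[tid + s];
        }
        __syncthreads();
    }

    // Thread 0 writes this block's partial results
    if (tid == 0)
    {
        d_pCountResults[blockIdx.x] = s_cnt[0];
        d_pSumResults[blockIdx.x]   = s_sum[0];
    }
}
[/codebox]
kernel_Variance has the same shape, except each thread accumulates (x - mean)^2 instead of x.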
So in all I make two passes over the data on the device, plus the two consolidation loops on the host. For non-trivial datasets of about 3 million elements, each kernel takes about 10 ms to run.
Is there a better way to compute these statistics?
Is there some existing algorithm in the parallel-computing world that I am not aware of?
I have not come across an existing algorithm yet; if someone knows of a suitable technique, please point me in the right direction.
Any help would be appreciated. Thanks in advance!