CPU works faster than GPU - what's done wrong ?


As a first test of CUDA I decided to create an app that would calculate simple moving average. Finally, I’ve got the code that works and computes correct results, but GPU part works 10 times slower than simple CPU computation. Looks like my code is not suitable. Please, help me understand what I’m doing wrong.

I’ve attached the solution for VC8, also providing key code fragments here:

This is how the ‘gold’ routine looks like:
extern “C” void SMAGold(float* pOut, float* pIn, float fPeriod, long lSize)
long lPeriod = long(fPeriod);

float fSum = 0.0;
for (int i = 0; i < lPeriod; i++)
	pOut[i] = ((fSum = fSum + pIn[i]) / float(i + 1));

for (int i = lPeriod; i < lSize; i++)
	pOut[i] = ((fSum = fSum - pIn[i - lPeriod] + pIn[i]) / float(lPeriod));

Just an ordinary simple MA calculation.

This is the cuda implementation:
const int nNumThreadsInBlock = blockDim.x * blockDim.y * blockDim.z;
const int nNumBlocksInGrid = gridDim.x * gridDim.y;
const int nNumThreads = nNumBlocksInGrid * nNumThreadsInBlock;
const int nBlockIDInGrid = blockIdx.x + blockIdx.y * gridDim.x;
const int nThreadIDInBlock = threadIdx.x + threadIdx.y * blockDim.x +
threadIdx.z * blockDim.x * blockDim.y;
const int nThreadID = nBlockIDInGrid * nNumThreadsInBlock + nThreadIDInBlock;

global void SMACudaKernel(float* pOut, float* pIn, int nPeriod, long lSize)
//_tprintf(_T(“Thread ID: %d, Total num threads in grid: %d\n”), nThreadID, nNumThreads);

for (int nBar = nThreadID; nBar < lSize; nBar += nNumThreads)
	int nStartBar = (nBar < nPeriod ? 0 : nBar - nPeriod + 1);

	float fSum = 0.0f;
	for (int i = nStartBar; i <= nBar; i++)
		fSum += pIn[i];
	pOut[nBar] = fSum / float(nBar - nStartBar + 1);


extern “C” void SMACuda(float* pOut, float* pIn, float fPeriod, long lSize)
float* d_pIn = NULL;
float* d_pOut = NULL;

CUDA_SAFE_CALL(cudaMalloc((void**)&d_pIn, lSize * sizeof(float)));
CUDA_SAFE_CALL(cudaMalloc((void**)&d_pOut, lSize * sizeof(float)));
CUDA_SAFE_CALL(cudaMemcpy(d_pIn, pIn, lSize * sizeof(float), cudaMemcpyHostToDevice));

unsigned int hTimer;

SMACudaKernel<<<512, 512>>>(d_pOut, d_pIn, fPeriod, lSize);
//SMACudaKernel<<<int(lSize / 512) + 1, 512>>>(d_pOut, d_pIn, fPeriod, lSize);

double dGPUPureTime = cutGetTimerValue(hTimer);
_tprintf(_T("GPU pure time: %f msec\n"), dGPUPureTime);

CUDA_SAFE_CALL(cudaMemcpy(pOut, d_pOut, lSize * sizeof(float), cudaMemcpyDeviceToHost));


And finally, this is the result:
Running gold on CPU …
CPU time: 91.325577 msec

Running kernel on GPU …
GPU pure time: 943.364319 msec
GPU full time: 1043.030884 msec

Comparing results …
L1 norm: 2.292061E-005
Max absolute error: 2.677917E-003

I’d really appreciate any help. Looks like I’m doing something very nonoptimal, but I don’t understand the reason.

Thanks in advance.
MovingAverages.zip (873 KB)

You GPU is far more faster,
because according you code
it calculates the movavg
serveral times per time of measure.
Your Cuda code hat two inner loops as the
gold code has.
Only one Loop is necessary,
the outer loop is done by the CUDA driver/SDK.
Calculating multiple times with CUDA :wave:
does not make wrong results though.

Hi ZeZe,

Looks like I’ve found the reason - but it is not due to the outer loop that I duplicate as you say. Such an outer loop is taken from the Black-Scholes CUDA sample and it works perfectly as it should work.

The difference is in the way the gold version and cuda version are computed - cuda algorithm is simply inoptimal as it takes all the source data items inside the period of MA into consideration for each nBar. Gold version does not recalculate the sum of elements on each bar - that’s the key difference.

However, I’ve created another sample that calculates adaptive MA (with variable period) - and, definitely, I’ve started to understand the reason why the CUDA has been created and what for :-)

I am working on coding up a moving average for CUDA implemenation and am very curious to see where you ended up on this issue, intuitively this seems to be an inherently serial process.

Thanks in advance.

A moving average is just a convolution with a filter kernel. The slight uniqueness is that the kernel is all one sided (entirely stretching to the left).

A convolution can be computed in parallel in several ways… an FFT method comes to mind immediately.

But even keeping it into the time domain is still parallizable. One idea would be to build summed table (using a prefix sum algorithm in parallel) then you can read off an (unweighted) average for any region you like with two reads, a subtract, and a divide. [Overflow/precision issues may make you split your region into independent parts though.] Weighted averages can also be done with a little more bookkeeping.

Finally, if your moving average is narrow, you could just use the serial algorithm in parallel, and overlap the work ranges… if your MA had a width of 10, you might have thread 0 compute the MA for time 0 to 100. Thread 1 would start at time 90 but compute to time 200 (and not report those first 10 samples as it “warmed up.”), etc.

Anyway, there are lots of strategies to parallelize this.