# SAXPY / forward-propagation optimization in CUDA?

Do you have a suggestion on how to best implement a forward propagation method with CUDA?

I would like to know if there is something more efficient than using BLAS-SAXPY in a loop, or at least the best approach to setting up a BLAS-SAXPY loop in CUDA.

General Method:

```c
for (uint inDX = 0; inDX < m_sInputs; inDX++)
{
    // Row of the weight matrix belonging to this input node
    tNetUnit* __restrict pdWt = pdWeights + (inDX * m_sOutputs);

    // SAXPY: accumulate this input's contribution into every output node
    for (uint outDX = 0; outDX < m_sOutputs; outDX++)
    {
        m_pdOutNodes[outDX] += pdWt[outDX] * m_pdInNodes[inDX];
    }
}
```
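For what it's worth, the loop nest above is just an accumulating matrix-vector product (out += Wᵀ·in, with the weights stored input-major), so a single SGEMV call (e.g. `cublasSgemv` on the GPU) may be a better fit than many SAXPY passes. A plain-C sketch of the equivalence, with hypothetical function names and toy sizes:

```c
/* Same computation as the loop nest above: for each input node,
 * a SAXPY of that input's weight row into the output vector.
 * Weights are stored input-major: w[i*nOut + o].
 * (forward_saxpy / forward_gemv are hypothetical names.)        */
static void forward_saxpy(const float *w, const float *in, float *out,
                          unsigned nIn, unsigned nOut)
{
    for (unsigned i = 0; i < nIn; i++)
        for (unsigned o = 0; o < nOut; o++)
            out[o] += w[i * nOut + o] * in[i];
}

/* The same result viewed as a single matrix-vector product,
 * out += W^T * in, which is what one SGEMV call computes.       */
static void forward_gemv(const float *w, const float *in, float *out,
                         unsigned nIn, unsigned nOut)
{
    for (unsigned o = 0; o < nOut; o++) {
        float acc = out[o];
        for (unsigned i = 0; i < nIn; i++)
            acc += w[i * nOut + o] * in[i];
        out[o] = acc;
    }
}
```

Both orderings give identical results; the GEMV view lets one library call (with its own blocking and memory reuse) replace m_sInputs separate sweeps over the output vector.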

I currently use a CPU SSE implementation as follows:

```c
uint   xM = m;
float *xR = r;

for (uint j = 0; j < n; j++)
{
    m = xM;
    r = xR;

    float Xb = *b++;

    // Peel 1-3 leading elements so r reaches a 16-byte boundary
    // (uintptr_t avoids truncating the pointer on 64-bit builds)
    switch (((uintptr_t)r) & 15)
    {
    case  4: r[0] += a[0]*Xb; r[1] += a[1]*Xb; r[2] += a[2]*Xb; r += 3; a += 3; m -= 3; break;
    case  8: r[0] += a[0]*Xb; r[1] += a[1]*Xb;                  r += 2; a += 2; m -= 2; break;
    case 12: r[0] += a[0]*Xb;                                   r++;    a++;    m--;    break;
    }

    if (((uintptr_t)a) & 15)
    {
        SAXPYU_SSE(m, Xb, a, r); // BLAS SAXPY, unaligned a
    }
    else
    {
        SAXPY_SSE(m, Xb, a, r);  // BLAS SAXPY, aligned a
    }

    a += m;
}

return a;
```
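In case the intent of the switch isn't obvious: it peels one to three leading elements so that r lands on a 16-byte boundary before the vectorized routine runs. A scalar sketch of the same peel (saxpy_scalar is a hypothetical stand-in for the SSE intrinsics; like the snippet above, it assumes m is at least 3):

```c
#include <stdint.h>

/* Hypothetical scalar stand-in for the aligned/unaligned SSE
 * SAXPY routines: r[k] += s * a[k] for k in [0, m).             */
static void saxpy_scalar(unsigned m, float s, const float *a, float *r)
{
    for (unsigned k = 0; k < m; k++)
        r[k] += s * a[k];
}

/* Peel up to three leading elements so r reaches a 16-byte
 * boundary, then hand the remainder to the fast path.
 * Floats are 4-byte aligned, so the residue is 0, 4, 8, or 12.  */
static void saxpy_peeled(unsigned m, float s, const float *a, float *r)
{
    switch (((uintptr_t)r) & 15)
    {
    case  4: r[0] += s*a[0]; r[1] += s*a[1]; r[2] += s*a[2];
             r += 3; a += 3; m -= 3; break;
    case  8: r[0] += s*a[0]; r[1] += s*a[1];
             r += 2; a += 2; m -= 2; break;
    case 12: r[0] += s*a[0];
             r++;   a++;   m--;   break;
    }
    saxpy_scalar(m, s, a, r);  /* the SSE routine would go here */
}
```

The result is the same for every alignment case; only how many elements take the scalar prologue changes.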

Any and all help is greatly appreciated!

Thanks,
tim