SAXPY optimization for forward propagation in CUDA?

Do you have a suggestion on how to best implement a forward propagation method with CUDA?

I would like to know if there is something more efficient than using BLAS-SAXPY in a loop, or at least the best approach to setting up a BLAS-SAXPY loop in CUDA.

General Method:

for (uint inDX = 0; inDX < m_sInputs; inDX++)
{
    // Row inDX of the weight matrix (m_sInputs x m_sOutputs, row-major):
    // the weights from this input to every output node.
    tNetUnit* __restrict pdWt = pdWeights + (inDX * m_sOutputs);

    // SAXPY: m_pdOutNodes += m_pdInNodes[inDX] * pdWt
    for (uint outDX = 0; outDX < m_sOutputs; outDX++)
    {
        m_pdOutNodes[outDX] += pdWt[outDX] * m_pdInNodes[inDX];
    }
}
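
For reference, the double loop above is a single matrix-vector product: each output node accumulates the dot product of the input vector with one column of the row-major weight matrix. On the GPU the whole per-input SAXPY loop therefore collapses into one cuBLAS SGEMV call, which also avoids issuing m_sInputs separate kernel launches the way a cublasSaxpy loop would. A minimal sketch, assuming the weights and node vectors already live in device memory (d_W, d_in, d_out and forwardLayer are hypothetical names):

#include <cublas_v2.h>

// d_W : m_sInputs x m_sOutputs, row-major (so m_sOutputs x m_sInputs
//       when viewed column-major, as cuBLAS expects)
void forwardLayer(cublasHandle_t handle, const float* d_W,
                  const float* d_in, float* d_out,
                  int nInputs, int nOutputs)
{
    const float alpha = 1.0f;
    const float beta  = 1.0f; // accumulate into d_out, like the += above

    // d_out = alpha * A * d_in + beta * d_out, with A the weight matrix
    // viewed column-major (nOutputs rows, nInputs columns, lda = nOutputs).
    cublasSgemv(handle, CUBLAS_OP_N,
                nOutputs, nInputs,
                &alpha, d_W, nOutputs,
                d_in, 1,
                &beta, d_out, 1);
}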

On the CPU I currently use an SSE implementation, as follows:

uint   xM = m;
float *xR = r;

for (uint j = 0; j < n; j++)
{
    m = xM;
    r = xR;

    // Peel off up to three leading elements so r is 16-byte aligned
    // (if r is already aligned, no case matches and nothing is peeled).
    // uintptr_t (from <stdint.h>) keeps the pointer cast valid on 64-bit.
    switch (((uintptr_t)r) & 15)
    {
       case  4: (*r) += (*a)*(*b); r[1] += a[1]*(*b); r[2] += a[2]*(*b); r += 3; a += 3; m -= 3; break;
       case  8: (*r) += (*a)*(*b); r[1] += a[1]*(*b);                    r += 2; a += 2; m -= 2; break;
       case 12: (*r) += (*a)*(*b);                                       r++;   a++;   m--;    break;
    }

    float Xb = *b++;

    if (((uintptr_t)a) & 15)
    {
        SAXPYU_SSE (m, Xb, a, r); // custom SSE SAXPY, unaligned loads from a
    }
    else
    {
        SAXPY_SSE (m, Xb, a, r);  // custom SSE SAXPY, aligned loads
    }

    a += m;
}

return a;
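
If pulling in cuBLAS is not an option, the same accumulation maps naturally onto a hand-rolled kernel with one thread per output node; consecutive threads then read consecutive weights, so the global loads coalesce and none of the alignment peeling above is needed. A sketch under the same row-major layout (forwardProp and the d_* names are hypothetical):

__global__ void forwardProp(const float* __restrict__ W,
                            const float* __restrict__ in,
                            float* __restrict__ out,
                            unsigned int nInputs, unsigned int nOutputs)
{
    unsigned int outDX = blockIdx.x * blockDim.x + threadIdx.x;
    if (outDX >= nOutputs) return;

    float acc = out[outDX];                            // accumulate, like the += above
    for (unsigned int inDX = 0; inDX < nInputs; ++inDX)
        acc += W[inDX * nOutputs + outDX] * in[inDX];  // same layout as pdWeights
    out[outDX] = acc;
}

// Launch with one thread per output node, e.g.:
// forwardProp<<<(nOutputs + 255) / 256, 256>>>(d_W, d_in, d_out, nInputs, nOutputs);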

Any and all help is greatly appreciated!

Thanks,
tim