# SAXPY / forward-propagation optimization in CUDA?

Do you have a suggestion on how to best implement a forward propagation method with CUDA?

I would like to know if there is something more efficient than using BLAS-SAXPY in a loop, or at least the best approach to setting up a BLAS-SAXPY loop in CUDA.

General Method:

```c
for (uint inDX = 0; inDX < m_sInputs; inDX++)
{
    // Row of the weight matrix belonging to this input node
    tNetUnit* __restrict pdWt = pdWeights + (inDX * m_sOutputs);

    // SAXPY: accumulate this input's contribution into every output node
    for (uint outDX = 0; outDX < m_sOutputs; outDX++)
    {
        m_pdOutNodes[outDX] += pdWt[outDX] * m_pdInNodes[inDX];
    }
}
```
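For what it's worth, the loop nest above is just an accumulating matrix-vector product (out += Wᵀ·in, with the weights stored input-major), so a single SGEMV call (e.g. `cublasSgemv` on the GPU) may be a better fit than many SAXPY passes. A plain-C sketch of the equivalence, with hypothetical function names and toy sizes:

```c
/* Same computation as the loop nest above: for each input node,
 * a SAXPY of that input's weight row into the output vector.
 * Weights are stored input-major: w[i*nOut + o].
 * (forward_saxpy / forward_gemv are hypothetical names.)        */
static void forward_saxpy(const float *w, const float *in, float *out,
                          unsigned nIn, unsigned nOut)
{
    for (unsigned i = 0; i < nIn; i++)
        for (unsigned o = 0; o < nOut; o++)
            out[o] += w[i * nOut + o] * in[i];
}

/* The same result viewed as a single matrix-vector product,
 * out += W^T * in, which is what one SGEMV call computes.       */
static void forward_gemv(const float *w, const float *in, float *out,
                         unsigned nIn, unsigned nOut)
{
    for (unsigned o = 0; o < nOut; o++) {
        float acc = out[o];
        for (unsigned i = 0; i < nIn; i++)
            acc += w[i * nOut + o] * in[i];
        out[o] = acc;
    }
}
```

Both orderings give identical results; the GEMV view lets one library call (with its own blocking and memory reuse) replace m_sInputs separate sweeps over the output vector.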

I currently use a CPU SSE implementation as follows:

```c
uint   xM = m;
float *xR = r;

for (uint j = 0; j < n; j++)
{
    m = xM;
    r = xR;

    float Xb = *b++;

    // Peel 1-3 leading elements so r reaches a 16-byte boundary
    // (uintptr_t avoids truncating the pointer on 64-bit builds)
    switch (((uintptr_t)r) & 15)
    {
    case  4: r[0] += a[0]*Xb; r[1] += a[1]*Xb; r[2] += a[2]*Xb; r += 3; a += 3; m -= 3; break;
    case  8: r[0] += a[0]*Xb; r[1] += a[1]*Xb;                  r += 2; a += 2; m -= 2; break;
    case 12: r[0] += a[0]*Xb;                                   r++;    a++;    m--;    break;
    }

    if (((uintptr_t)a) & 15)
    {
        SAXPYU_SSE(m, Xb, a, r); // BLAS SAXPY, unaligned a
    }
    else
    {
        SAXPY_SSE(m, Xb, a, r);  // BLAS SAXPY, aligned a
    }

    a += m;
}

return a;
```
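In case the intent of the switch isn't obvious: it peels one to three leading elements so that r lands on a 16-byte boundary before the vectorized routine runs. A scalar sketch of the same peel (saxpy_scalar is a hypothetical stand-in for the SSE intrinsics; like the snippet above, it assumes m is at least 3):

```c
#include <stdint.h>

/* Hypothetical scalar stand-in for the aligned/unaligned SSE
 * SAXPY routines: r[k] += s * a[k] for k in [0, m).             */
static void saxpy_scalar(unsigned m, float s, const float *a, float *r)
{
    for (unsigned k = 0; k < m; k++)
        r[k] += s * a[k];
}

/* Peel up to three leading elements so r reaches a 16-byte
 * boundary, then hand the remainder to the fast path.
 * Floats are 4-byte aligned, so the residue is 0, 4, 8, or 12.  */
static void saxpy_peeled(unsigned m, float s, const float *a, float *r)
{
    switch (((uintptr_t)r) & 15)
    {
    case  4: r[0] += s*a[0]; r[1] += s*a[1]; r[2] += s*a[2];
             r += 3; a += 3; m -= 3; break;
    case  8: r[0] += s*a[0]; r[1] += s*a[1];
             r += 2; a += 2; m -= 2; break;
    case 12: r[0] += s*a[0];
             r++;   a++;   m--;   break;
    }
    saxpy_scalar(m, s, a, r);  /* the SSE routine would go here */
}
```

The result is the same for every alignment case; only how many elements take the scalar prologue changes.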

Any and all help is greatly appreciated!

Thanks,
tim