Optimization tips for 3D elementwise matrix multiply

Hi,

I am writing a 3D elementwise matrix multiply (and add) kernel. Are there any optimization tips I should be aware of beyond the basic implementation:

if (idx < nelements) C[idx] += A[idx]*B[idx];
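
For context, here is a minimal sketch of the full basic version that line comes from. It assumes CUDA, float data, and 3D arrays stored contiguously so that a single flat index covers every element. The names multiply_add and multiply_add_host, and the block size of 256, are placeholders chosen for illustration, not tuned or measured values.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Basic kernel: one thread per element, C[idx] += A[idx] * B[idx].
// Assumption: the 3D arrays are contiguous, so nelements = nx * ny * nz.
__global__ void multiply_add(const float* A, const float* B, float* C,
                             size_t nelements)
{
    size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < nelements) C[idx] += A[idx] * B[idx];
}

// Hypothetical host wrapper: copies inputs to the device, launches the
// kernel, and copies C back. Returns the first CUDA error encountered.
cudaError_t multiply_add_host(const float* hA, const float* hB, float* hC,
                              size_t n)
{
    if (n == 0) return cudaSuccess;  // a launch with zero blocks is invalid

    const size_t bytes = n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;

    cudaError_t err = cudaMalloc(&dA, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dB, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dC, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dC, hC, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        const unsigned int threads = 256;  // placeholder block size
        const unsigned int blocks = (unsigned int)((n + threads - 1) / threads);
        multiply_add<<<blocks, threads>>>(dA, dB, dC, n);
        err = cudaGetLastError();  // catches launch configuration errors
    }

    // The blocking copy back also waits for the kernel to finish.
    if (err == cudaSuccess) err = cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return err;
}
```

Each thread reads one element of A, B and C and writes one element of C, and consecutive threads touch consecutive addresses.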

Thanks,
zhmukc