Hi,
I am trying to generate good SSE code with the PGI compiler, but I am running into issues: the compiler refuses to generate SSE code for a block such as this:
for (t4 = 0; t4 <= 14; t4++)
{
#pragma ivdep
    // #pragma vector aligned
    z[0] = z[0] + A[t4 * 15 + 0] * x[t4];
    z[0 + 1] = z[0 + 1] + A[t4 * 15 + 0 + 1] * x[t4];
    z[0 + 2] = z[0 + 2] + A[t4 * 15 + 0 + 2] * x[t4];
    z[0 + 3] = z[0 + 3] + A[t4 * 15 + 0 + 3] * x[t4];
    z[0 + 4] = z[0 + 4] + A[t4 * 15 + 0 + 4] * x[t4];
    z[0 + 5] = z[0 + 5] + A[t4 * 15 + 0 + 5] * x[t4];
    z[0 + 6] = z[0 + 6] + A[t4 * 15 + 0 + 6] * x[t4];
    z[0 + 7] = z[0 + 7] + A[t4 * 15 + 0 + 7] * x[t4];
    z[0 + 8] = z[0 + 8] + A[t4 * 15 + 0 + 8] * x[t4];
    z[0 + 9] = z[0 + 9] + A[t4 * 15 + 0 + 9] * x[t4];
    z[0 + 10] = z[0 + 10] + A[t4 * 15 + 0 + 10] * x[t4];
    z[0 + 11] = z[0 + 11] + A[t4 * 15 + 0 + 11] * x[t4];
    z[0 + 12] = z[0 + 12] + A[t4 * 15 + 0 + 12] * x[t4];
    z[0 + 13] = z[0 + 13] + A[t4 * 15 + 0 + 13] * x[t4];
    z[0 + 14] = z[0 + 14] + A[t4 * 15 + 0 + 14] * x[t4];
}
I can re-roll the entire block into a loop. When I do this, the compiler unrolls the loop and vectorizes it, but it uses only two SSE registers, which restricts instruction-level parallelism. Is there a way to get around this? The block contains a lot of independent instructions that are perfect for SSE.
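For reference, the re-rolled form would look roughly like the sketch below. The function name matvec15, the float element type, and the restrict qualifiers are assumptions made for illustration and are not taken from the original code. The restrict qualifiers only state that z, A and x do not overlap. Whether a given compiler then uses more SSE registers is not something this sketch guarantees.

```c
enum { N = 15 };  /* row length and trip count, matching the 15 statements above */

/* Re-rolled form of the block: z[i] += A[t4 * N + i] * x[t4] for all t4 and i.
 * Assumption: z, A and x do not overlap, which the restrict qualifiers express. */
void matvec15(float *restrict z, const float *restrict A, const float *restrict x)
{
    for (int t4 = 0; t4 < N; t4++) {
        const float xt = x[t4];          /* hoist the scalar that is reused N times */
        for (int i = 0; i < N; i++)
            z[i] += A[t4 * N + i] * xt;  /* the N updates are independent across i */
    }
}
```

A call such as matvec15(z, A, x) with A holding 15 * 15 elements and x and z holding 15 elements each computes the same result as the unrolled block.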
Thanks,
Shreyas