(I’m using PGI 12.8 on 64-bit Linux, with a GeForce GTX 280 and CUDA 4.1.)
I’m doing some experiments with OpenACC, and this is puzzling me:
I had the following code to perform matrix multiplications:
typedef float ff;

void mmul(const ff* restrict a,
          const ff* restrict b,
          ff* restrict c,
          const int n) {
#pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
    {
#pragma acc region
        {
#pragma acc loop independent vector(16)
            for (int i = 0; i < n; ++i) {
#pragma acc loop independent vector(16)
                for (int j = 0; j < n; ++j) {
                    ff sum = 0;
                    for (int k = 0; k < n; ++k) {
                        sum += a[i + n * k] * b[k + n * j];  /* column-major indexing */
                    }
                    c[i + n * j] = sum;
                }
            }
        }
    }
}
This code runs well, but I’m looking to optimize it.
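For context, this is roughly how I call it; a minimal sketch where the matrix size and fill values are purely illustrative:

#include <stdlib.h>

typedef float ff;

void mmul(const ff* restrict a, const ff* restrict b, ff* restrict c, const int n);

int main(void) {
    const int n = 1024;                       /* illustrative size */
    ff *a = (ff*)malloc(n * n * sizeof(ff));
    ff *b = (ff*)malloc(n * n * sizeof(ff));
    ff *c = (ff*)malloc(n * n * sizeof(ff));
    for (int i = 0; i < n * n; ++i) {         /* fill inputs with something simple */
        a[i] = 1.0f;
        b[i] = 2.0f;
    }
    mmul(a, b, c, n);                         /* c = a * b, column-major layout */
    free(a); free(b); free(c);
    return 0;
}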
I then do a small transformation:
void mmul(const ff* restrict a,
          const ff* restrict b,
          ff* restrict c,
          const int n) {
#pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
    {
#pragma acc region
        {
            for (int is = 0; is < n; is += 32) {
#pragma acc loop independent
                for (int i = is; i < is + 32; ++i) {
#pragma acc loop independent
                    for (int j = 0; j < n; ++j) {
                        ff sum = 0;
                        for (int k = 0; k < n; ++k) {
                            sum += a[i + n * k] * b[k + n * j];
                        }
                        c[i + n * j] = sum;
                    }
                }
            }
        }
    }
}
I simply added an outer strip-mining loop over i; the iteration space remains essentially the same (see the small check below).
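To make that concrete, here is a tiny host-side check (my own addition, assuming n is a multiple of 32, as in my 1024x1024 tests) showing that the strip-mined loops visit exactly the same i values in the same order as the original single loop:

#include <assert.h>

int main(void) {
    const int n = 1024;
    int count = 0;
    for (int is = 0; is < n; is += 32)
        for (int i = is; i < is + 32; ++i) {
            assert(i == count);   /* same i values, same order as the original loop */
            ++count;
        }
    assert(count == n);
    return 0;
}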
While this isn’t an optimization by itself, the result is very strange: about half of the time I run this code I get the following error:
call to ctxSynchronize/after/__pgi_cu_uploadx returned error 702: Launch timeout
The other half of the time it simply runs, taking about 8 seconds (for 1024x1024 matrices).
For smaller matrices it always works, so I suspect some kind of timeout issue here.
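In case it’s relevant, this is a minimal sketch (my own addition, not part of the program above) of how one could check whether the driver’s execution watchdog is active on this card, using the CUDA runtime API; device index 0 is an assumption:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);   /* device 0: the GTX 280 */
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    /* If this prints 1, kernels on this GPU are subject to a run-time limit. */
    printf("%s: kernelExecTimeoutEnabled = %d\n", prop.name, prop.kernelExecTimeoutEnabled);
    return 0;
}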
I’m not worried about performance here; I just want to understand this strange behaviour.