Hello,
after upgrading from 21.2 to 21.3, I get an obscure compiler error on code that has been compiling and working for quite a long while (PGI 19.10 through 21.2, with no issues AFAIK):
NVC++-F-0000-Internal compiler error. mr_precedes: too many st/br 31
The compiler then reports a line that is actually the closing } of a particular function that contains some OpenACC regions. I am not the developer of the application, so I don’t know the exact rationale behind the function’s design, but I provide the code below. If I comment out the acc pragmas in this function, compilation continues until it breaks again in another function that is very similar to this one. The application’s other OpenACC compute regions (async etc.) don’t seem to get in the way. As the exact same code compiles and works with versions up to 21.2, I suspect that some new behavior in 21.3 introduces this bug. The whole program uses the Unified Memory feature, which is why you’ll notice quite a bit of pointer chasing in the device code (it’s adapted from a non-heterogeneous version).
Compiler reports:
NVC++-F-0000-Internal compiler error. mr_precedes: too many st/br 31 (kernel_particles.c: 330)
Here is the corresponding part of kernel_particles.c:
// Organize the particles in tiles (Bucket Sort)
void spec_organize_in_tiles(t_species *spec, const int limits_y[2], const int device)
{
    int iy, ix;

    const int size = spec->main_vector.size;
    const int n_tiles_x = spec->n_tiles_x;
    const int n_tiles_y = spec->n_tiles_y;

    int *restrict tile_offset = spec->tile_offset;
    int *restrict pos = alloc_align_buffer(DEFAULT_ALIGNMENT, size * sizeof(int));

#ifdef ENABLE_PREFETCH
    spec_prefetch_openacc(&spec->main_vector, device, NULL);
    cudaMemPrefetchAsync(spec->tile_offset, (n_tiles_x * n_tiles_y + 1) * sizeof(int), device, NULL);
    cudaMemPrefetchAsync(pos, size * sizeof(int), device, NULL);
#endif

    // Calculate the histogram (number of particles per tile)
    #pragma acc parallel loop private(ix, iy)
    for (int i = 0; i < size; i++)
    {
        ix = spec->main_vector.ix[i] / TILE_SIZE;
        iy = (spec->main_vector.iy[i] - limits_y[0]) / TILE_SIZE;

        #pragma acc atomic capture
        pos[i] = tile_offset[ix + iy * n_tiles_x]++;
    }

    // Prefix sum to find the initial idx of each tile in the particle vector
    prefix_sum_openacc(tile_offset, n_tiles_x * n_tiles_y + 1);

    // Calculate the target position of each particle
    #pragma acc parallel loop private(ix, iy)
    for (int i = 0; i < size; i++)
    {
        ix = spec->main_vector.ix[i] / TILE_SIZE;
        iy = (spec->main_vector.iy[i] - limits_y[0]) / TILE_SIZE;

        pos[i] += tile_offset[ix + iy * n_tiles_x];
    }

    const int final_size = tile_offset[n_tiles_x * n_tiles_y];
    spec->main_vector.size = final_size;

    // Move the particles to the correct position
    spec_move_vector_int_full(spec->main_vector.ix, pos, size);
    spec_move_vector_int_full(spec->main_vector.iy, pos, size);
    spec_move_vector_float_full(spec->main_vector.x, pos, size);
    spec_move_vector_float_full(spec->main_vector.y, pos, size);
    spec_move_vector_float_full(spec->main_vector.ux, pos, size);
    spec_move_vector_float_full(spec->main_vector.uy, pos, size);
    spec_move_vector_float_full(spec->main_vector.uz, pos, size);

    // Validate all the particles
    #pragma acc parallel loop
    for (int k = 0; k < final_size; k++)
        spec->main_vector.invalid[k] = false;

    free_align_buffer(pos); // Clean position vector
} // This is line 330 from the compiler error
As I can’t decipher the compiler error, I thought it would be worth having someone take a look. Some context (definitions etc.) is obviously missing, but as this code compiled fine until now, it may not even be relevant. I’ll fill in those gaps as needed.
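To make the intent of the function easier to follow without the rest of the source, here is a serial sketch of the same bucket-sort pattern in plain C. It is only an illustration and not the application's code: the names tile_sort_positions and scatter_int are hypothetical, and it assumes that the offset array starts out zeroed and that prefix_sum_openacc performs an exclusive scan, neither of which is shown in the snippet above.

```c
#include <assert.h>

/* Hypothetical serial sketch of the tile bucket sort (not the application's code).
 * tile_of[i] : tile index of particle i, with 0 <= tile_of[i] < n_tiles
 * offset     : n_tiles + 1 ints, ASSUMED to be zeroed on entry; on exit,
 *              offset[t] is the first slot of tile t and offset[n_tiles] == n
 * pos        : n ints; on exit, pos[i] is the target slot of particle i */
void tile_sort_positions(const int *tile_of, int n, int n_tiles,
                         int *offset, int *pos)
{
    /* Step 1: histogram. pos[i] captures the rank of particle i inside its
     * tile; the OpenACC version does this with an atomic capture. */
    for (int i = 0; i < n; i++) {
        assert(tile_of[i] >= 0 && tile_of[i] < n_tiles);
        pos[i] = offset[tile_of[i]]++;
    }

    /* Step 2: exclusive prefix sum (an assumption about prefix_sum_openacc).
     * Counts become start offsets; the last entry ends up as the total. */
    int running = 0;
    for (int t = 0; t <= n_tiles; t++) {
        int count = offset[t];
        offset[t] = running;
        running += count;
    }

    /* Step 3: target slot = start of the tile + rank inside the tile. */
    for (int i = 0; i < n; i++)
        pos[i] += offset[tile_of[i]];
}

/* Move every element to its target slot: dst[pos[i]] = src[i]. */
void scatter_int(const int *src, int *dst, const int *pos, int n)
{
    for (int i = 0; i < n; i++)
        dst[pos[i]] = src[i];
}
```

Calling tile_sort_positions once and then scatter_int for each per-particle array mirrors the structure of the function above, where a single pos vector is reused for ix, iy, x, y, ux, uy and uz.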
Thanks!