Compiler error with 21.3 and OpenACC

Hello,

After upgrading from 21.2 to 21.3, I get an obscure compiler error with code that has been compiling fine for a long time (PGI 19.10 through 21.2, with no issues AFAIK):

NVC++-F-0000-Internal compiler error. mr_precedes: too many st/br 31

The compiler then reports a line that is actually the closing } of a function containing several OpenACC regions. I am not the developer of the application, so I don’t know the exact rationale behind the function’s design, but I will provide the code here. If I comment out the acc pragmas in this function, compilation continues until it breaks again in another, very similar function. (The application has multiple other computational OpenACC regions, async etc., that don’t seem to get in the way.) As the exact same code compiles and works with versions up to 21.2, I suspect that some new behavior in 21.3 introduces this bug. The whole program uses the Unified Memory feature, which is why you’ll notice quite a bit of pointer chasing in the device code (it’s adapted from a non-heterogeneous version).

Function code

Compiler reports:
NVC++-F-0000-Internal compiler error. mr_precedes: too many st/br 31 (kernel_particles.c: 330)

So here is the corresponding part of kernel_particles.c

// Organize the particles in tiles (Bucket Sort)
void spec_organize_in_tiles(t_species *spec, const int limits_y[2], const int device)
{
	int iy, ix;

	const int size = spec->main_vector.size;
	const int n_tiles_x = spec->n_tiles_x;
	const int n_tiles_y = spec->n_tiles_y;

	int *restrict tile_offset = spec->tile_offset;
	int *restrict pos = alloc_align_buffer(DEFAULT_ALIGNMENT, size * sizeof(int));

#ifdef ENABLE_PREFETCH
		spec_prefetch_openacc(&spec->main_vector, device, NULL);
		cudaMemPrefetchAsync(spec->tile_offset, (n_tiles_x * n_tiles_y + 1) * sizeof(int), device, NULL);
		cudaMemPrefetchAsync(pos, size * sizeof(int), device, NULL);
#endif

	// Calculate the histogram (number of particles per tile)
	#pragma acc parallel loop private(ix, iy)
	for (int i = 0; i < size; i++)
	{
		ix = spec->main_vector.ix[i] / TILE_SIZE;
		iy = (spec->main_vector.iy[i] - limits_y[0]) / TILE_SIZE;

		#pragma acc atomic capture
		pos[i] = tile_offset[ix + iy * n_tiles_x]++;
	}

	// Prefix sum to find the initial idx of each tile in the particle vector
	prefix_sum_openacc(tile_offset, n_tiles_x * n_tiles_y + 1);

	// Calculate the target position of each particle
	#pragma acc parallel loop private(ix, iy)
	for (int i = 0; i < size; i++)
	{
		ix = spec->main_vector.ix[i] / TILE_SIZE;
		iy = (spec->main_vector.iy[i] - limits_y[0]) / TILE_SIZE;

		pos[i] += tile_offset[ix + iy * n_tiles_x];
	}

	const int final_size = tile_offset[n_tiles_x * n_tiles_y];
	spec->main_vector.size = final_size;

	// Move the particles to the correct position
	spec_move_vector_int_full(spec->main_vector.ix, pos, size);
	spec_move_vector_int_full(spec->main_vector.iy, pos, size);
	spec_move_vector_float_full(spec->main_vector.x, pos, size);
	spec_move_vector_float_full(spec->main_vector.y, pos, size);
	spec_move_vector_float_full(spec->main_vector.ux, pos, size);
	spec_move_vector_float_full(spec->main_vector.uy, pos, size);
	spec_move_vector_float_full(spec->main_vector.uz, pos, size);

	// Validate all the particles
	#pragma acc parallel loop
	for (int k = 0; k < final_size; k++)
		spec->main_vector.invalid[k] = false;

	free_align_buffer(pos);  // Clean position vector
} // This is line 330 from the compiler error

As I can’t decipher the compiler error, I thought it’d be interesting for someone to take a look. There’s obviously some context (definitions etc.) missing, but as this was working till now maybe it’s not even relevant. I’ll fill in those informational gaps as needed.
Thanks!

It’s an internal compiler error (ICE) so a compiler issue, not an issue with the code.

Would you be able to provide a small but complete example that reproduces the issue? Unfortunately, the code snippet you provided is missing definitions, so I can’t compile it. Once I have a reproducer, I’ll send a report to our compiler engineers so we can get the issue resolved.

Thanks,
Mat

Hi Mat, thanks for answering.

I tried writing a dummy (with pointer chasing and assignments in a loop with private locals) before posting, but I don’t know what triggers the bug (it certainly isn’t obvious), so I was not able to reproduce it in a PoC. Any ideas?

Maybe ‘too many st/br’ tells you (hopefully you can find out what it means) what we need in order to reproduce it?

In the meantime, the full code is actually open and accessible:

In the master branch, go to parallel/openacc_multigpu/ and simply make.

Cheers
Orestis

Thanks Orestis. I was able to reproduce the issue here and have filed a problem report (TPR #29945).

It looks to be an issue with our new atomic support in 21.3 (I’ve seen a few other codes with similar issues). You can work around it by adding the internal compiler flag “-Mx,231,0x01” to revert to the older atomics. Be sure to remove this flag once the problem is fixed.

-Mat
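
For readers trying to isolate the problem, the pattern Mat points at (an atomic capture inside a parallel loop, followed by a prefix sum) can be reduced to something like the sketch below. It is only an illustration: the function name bucket_positions and the 1D key layout are hypothetical and not taken from the application, explicit data clauses stand in for the Unified Memory the original relies on, and it has not been confirmed to trigger the ICE on 21.3.

```c
#include <stdlib.h>

// Hypothetical 1D reduction of the tile bucket sort: key[i] is the bucket of
// element i, and pos[i] receives its target index in the sorted order.
// Returns 0 on success, -1 if the scratch allocation fails.
int bucket_positions(const int *key, int n, int n_buckets, int *pos)
{
	int *offset = calloc((size_t)n_buckets + 1, sizeof(int));
	if (offset == NULL)
		return -1;

	// Histogram: atomic capture stores the old counter value (the element's
	// rank inside its bucket) and increments the counter in one step.
	#pragma acc parallel loop copyin(key[0:n]) copy(offset[0:n_buckets + 1]) copyout(pos[0:n])
	for (int i = 0; i < n; i++)
	{
		#pragma acc atomic capture
		pos[i] = offset[key[i]]++;
	}

	// Exclusive prefix sum on the host: offset[b] becomes the first index of bucket b
	int running = 0;
	for (int b = 0; b <= n_buckets; b++)
	{
		const int count = offset[b];
		offset[b] = running;
		running += count;
	}

	// Target position = start of the bucket + rank inside the bucket
	#pragma acc parallel loop copyin(key[0:n], offset[0:n_buckets + 1]) copy(pos[0:n])
	for (int i = 0; i < n; i++)
		pos[i] += offset[key[i]];

	free(offset);
	return 0;
}
```

A compiler without OpenACC enabled ignores the pragmas, so the function can also be checked serially. When it runs in parallel, the order of elements within a bucket depends on scheduling, but every element still lands inside its bucket's index range.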

Hi Orestis,

Engineering let me know that this issue, TPR #29945, has been fixed in the 21.5 release.

Example:

Fails with 21.3:

% pgcc -c kernel_particles.c -o kernel_particles.o -O3 -fast -cudalibs -DTEST -Iinclude/ -lm -acc -gpu=managed -V21.3 -w
NVC++-F-0000-Internal compiler error. mr_precedes: too many st/br      31  (kernel_particles.c: 330)
NVC++/x86-64 Linux 21.3-0: compilation aborted

Compiles correctly with 21.5:

% pgcc -c kernel_particles.c -o kernel_particles.o -O3 -fast -cudalibs -DTEST -Iinclude/ -lm -acc -gpu=managed -V21.5 -w
%

Thanks for the update, Mat!

I can confirm that after upgrading my installations to 21.5, it appears to be working again.