Hi,
I’m on a node with two GPUs attached, running a CUDA-aware MPI (installed as part of the PGI toolkit), and I hit a strange segfault that I narrowed down to the following minimal reproducer:
#include <stdio.h>
#include <mpi.h>
#include "openacc.h"
#include <cuda.h>
#include <cuda_runtime.h>
#include "mpi-ext.h" /* Needed for CUDA-aware check */

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    if (1 == MPIX_Query_cuda_support()) {
        printf("This MPI library has CUDA-aware support.\n");
    } else {
        printf("This MPI library does not have CUDA-aware support.\n");
    }

    int rank = -1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    //printf("rank=%d\n", rank);

    /* Bind each rank to one of the GPUs, round-robin */
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    int devicenum = rank % ngpus;
    //printf("devicenum=%d\n", devicenum);
    acc_set_device_num(devicenum, acc_device_nvidia);
    acc_init(acc_device_nvidia);

    //int buffer[10];
    int *buffer = acc_malloc((size_t)10 * sizeof(int));
    for (int i = 0; i < 10; i++) buffer[i] = i;
    #pragma acc enter data copyin(buffer[:10])

    if (rank == 0) {
        MPI_Send(acc_deviceptr(buffer), 10, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }
    else {
        MPI_Recv(acc_deviceptr(buffer), 10, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        #pragma acc update host(buffer[:10])
        printf("rank=1, %d\n", buffer[2]);
    }

    //acc_free(buffer);
    #pragma acc exit data delete(buffer)
    MPI_Finalize();
}
If I comment out
int *buffer = acc_malloc((size_t)10*sizeof(int));
and replace it with
int buffer[10];
then it works and no segfault is thrown.
Here’s the error trace:
[pgi-nc12-openacc:43092] *** Process received signal ***
[pgi-nc12-openacc:43092] Signal: Segmentation fault (11)
[pgi-nc12-openacc:43092] Signal code: Invalid permissions (2)
[pgi-nc12-openacc:43092] Failing at address: 0x1f03a5a000
[pgi-nc12-openacc:43093] *** Process received signal ***
[pgi-nc12-openacc:43093] Signal: Segmentation fault (11)
[pgi-nc12-openacc:43093] Signal code: Invalid permissions (2)
[pgi-nc12-openacc:43093] Failing at address: 0x1f03a5a000
[pgi-nc12-openacc:43092] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f1056fb0890]
[pgi-nc12-openacc:43092] [ 1] ./mfe[0x401381]
[pgi-nc12-openacc:43092] [ 2] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f10560c2b97]
[pgi-nc12-openacc:43092] [ 3] ./mfe[0x4011ea]
[pgi-nc12-openacc:43092] *** End of error message ***
[pgi-nc12-openacc:43093] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7f752da52890]
[pgi-nc12-openacc:43093] [ 1] ./mfe[0x401381]
[pgi-nc12-openacc:43093] [ 2] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f752cb64b97]
[pgi-nc12-openacc:43093] [ 3] ./mfe[0x4011ea]
[pgi-nc12-openacc:43093] *** End of error message ***
In particular, note the signal code “Invalid permissions” reported by both ranks, at the same failing address.
Everything is open source and running on a VM, so if you’re at a loss too (I am :-) ), I’m willing to give ssh access to the machine along with the commands to reproduce.
BTW, I’m using pgcc 19.10.
Thanks a lot!