All,
I’ve recently been having trouble getting PGI 16.5 + Open MPI working on our cluster. I eventually had to abandon Open MPI 1.10.x and move to Open MPI 2.0.0, which seems to work, except that it can’t use our InfiniBand network and falls back to the tcp btl. The details are in this mailing list thread:
https://www.open-mpi.org/community/lists/users/2016/07/29656.php
One of our gurus here created a small reproducer:
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <infiniband/verbs.h>
struct ompi_device {
    struct ibv_device *ib_dev;
    struct ibv_exp_device_attr ib_exp_dev_attr;
    struct ibv_context *ib_dev_context;
};
#define btl_error(...) fprintf(stderr, __VA_ARGS__); fprintf(stderr, "\n");
#define BTL_ERROR(args) btl_error args
int main() {
    struct ibv_exp_device_attr exp_dev_attr;
    struct ibv_device **device_list;
    struct ibv_context *ib_dev_context;
    struct ompi_device *device;
    device=malloc(sizeof(struct ompi_device));
    device_list = ibv_get_device_list(NULL);
    if (!device_list)
        return -1;
    device->ib_dev=device_list[0];
    device->ib_dev_context = ibv_open_device(device->ib_dev);
    if (!device->ib_dev_context) {
        fprintf(stderr, "Error, failed to open the device '%s'\n",
                ibv_get_device_name(device->ib_dev));
        return -1;
    }
    /** Begin code snippet from OpenMPI **/
    device->ib_exp_dev_attr.comp_mask = IBV_EXP_DEVICE_ATTR_RESERVED - 1;
    if(ibv_exp_query_device(device->ib_dev_context, &device->ib_exp_dev_attr)){
        BTL_ERROR(("error obtaining device attributes for %s errno says %s",
                   ibv_get_device_name(device->ib_dev), strerror(errno)));
        goto error;
    }
    /** End code snippet from OpenMPI **/
    printf("hca_id: %s\n", ibv_get_device_name(device_list[0]));
    printf("\tfw ver: %s\n", device->ib_exp_dev_attr.fw_ver);
    printf("\tnode guid: %02x%02x:%02x%02x:%02x%02x:%02x%02x\n",
           (device->ib_exp_dev_attr.node_guid & 0xFF),
           (( device->ib_exp_dev_attr.node_guid >> 8 ) & 0xFF),
           (( device->ib_exp_dev_attr.node_guid >> 16 ) & 0xFF),
           (( device->ib_exp_dev_attr.node_guid >> 24 ) & 0xFF),
           (( device->ib_exp_dev_attr.node_guid >> 32 ) & 0xFF),
           (( device->ib_exp_dev_attr.node_guid >> 40 ) & 0xFF),
           (( device->ib_exp_dev_attr.node_guid >> 48 ) & 0xFF),
           (( device->ib_exp_dev_attr.node_guid >> 56 ) & 0xFF)
    );
    return 0;
error:
    return -1;
}
which seems to show that pgcc is acting differently:
(1023) $ gcc --version
gcc (GCC) 6.1.0
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
(1024) $ gcc -libverbs ./ib_verbs_q.c -o ib_verbs_q && ./ib_verbs_q
hca_id: mlx5_0
fw ver: 10.12.1100
node guid: e41d:2d03:000d:c7e0
(1032) $ icc -V
Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 16.0.3.210 Build 20160415
Copyright (C) 1985-2016 Intel Corporation. All rights reserved.
(1033) $ icc -libverbs ./ib_verbs_q.c -o ib_verbs_q && ./ib_verbs_q
hca_id: mlx5_0
fw ver: 10.12.1100
node guid: e41d:2d03:000d:c7e0
(1035) $ pgcc -V
pgcc 16.5-0 64-bit target on x86-64 Linux -tp haswell
The Portland Group - PGI Compilers and Tools
Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
(1036) $ pgcc -libverbs ./ib_verbs_q.c -o ib_verbs_q && ./ib_verbs_q
error obtaining device attributes for mlx5_0 errno says Cannot allocate memory
Any ideas on what is happening? Maybe some extra flag needs to be passed in?
Thanks,
Matt