Odd Error with pgcc 16.5, Infiniband, and Open MPI 2.0.0

All,

I’ve recently been having trouble getting PGI 16.5 and Open MPI to work together on our cluster, to the point where I abandoned Open MPI 1.10.x and moved to Open MPI 2.0.0. That seems to work, except that it can’t use our InfiniBand network and has to fall back to the tcp btl. All of this is detailed in a mailing list thread here:

https://www.open-mpi.org/community/lists/users/2016/07/29656.php

One of our gurus here created a small reproducer:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <infiniband/verbs.h>
struct ompi_device {
    struct ibv_device *ib_dev;
    struct ibv_exp_device_attr ib_exp_dev_attr;
    struct ibv_context *ib_dev_context;
};

#define btl_error(...) fprintf(stderr, __VA_ARGS__); fprintf(stderr, "\n");
#define BTL_ERROR(args) btl_error args

int main() {
    struct ibv_exp_device_attr exp_dev_attr;
    struct ibv_device **device_list;
    struct ibv_context *ib_dev_context;
    struct ompi_device *device;
    device=malloc(sizeof(struct ompi_device));

    device_list = ibv_get_device_list(NULL);
    if (!device_list)
        return -1;

    device->ib_dev=device_list[0];

    device->ib_dev_context = ibv_open_device(device->ib_dev);
    if (!device->ib_dev_context) {
        fprintf(stderr, "Error, failed to open the device '%s'\n",
                ibv_get_device_name(device->ib_dev));
        return -1;
    }

/** Begin code snippet from OpenMPI **/

    device->ib_exp_dev_attr.comp_mask = IBV_EXP_DEVICE_ATTR_RESERVED - 1;
    if(ibv_exp_query_device(device->ib_dev_context, &device->ib_exp_dev_attr)){
        BTL_ERROR(("error obtaining device attributes for %s errno says %s",
                    ibv_get_device_name(device->ib_dev), strerror(errno)));
        goto error;
    }

/** End code snippet from OpenMPI **/

    printf("hca_id: %s\n", ibv_get_device_name(device_list[0]));
    printf("\tfw ver: %s\n", device->ib_exp_dev_attr.fw_ver); 
    printf("\tnode guid: %02x%02x:%02x%02x:%02x%02x:%02x%02x\n",
        (device->ib_exp_dev_attr.node_guid & 0xFF),
        (( device->ib_exp_dev_attr.node_guid >> 8 ) & 0xFF),
        (( device->ib_exp_dev_attr.node_guid >> 16 ) & 0xFF),
        (( device->ib_exp_dev_attr.node_guid >> 24 ) & 0xFF),
        (( device->ib_exp_dev_attr.node_guid >> 32 ) & 0xFF),
        (( device->ib_exp_dev_attr.node_guid >> 40 ) & 0xFF),
        (( device->ib_exp_dev_attr.node_guid >> 48 ) & 0xFF),
        (( device->ib_exp_dev_attr.node_guid >> 56 ) & 0xFF)
    );

    return 0;

error:
    return -1;
}

which seems to show that pgcc is acting differently:

(1023) $ gcc --version
gcc (GCC) 6.1.0
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

(1024) $ gcc -libverbs ./ib_verbs_q.c -o ib_verbs_q && ./ib_verbs_q
hca_id: mlx5_0
	fw ver: 10.12.1100
	node guid: e41d:2d03:000d:c7e0



(1032) $ icc -V
Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 16.0.3.210 Build 20160415
Copyright (C) 1985-2016 Intel Corporation.  All rights reserved.

(1033) $ icc -libverbs ./ib_verbs_q.c -o ib_verbs_q && ./ib_verbs_q
hca_id: mlx5_0
	fw ver: 10.12.1100
	node guid: e41d:2d03:000d:c7e0



(1035) $ pgcc -V

pgcc 16.5-0 64-bit target on x86-64 Linux -tp haswell 
The Portland Group - PGI Compilers and Tools
Copyright (c) 2016, NVIDIA CORPORATION.  All rights reserved.
(1036) $ pgcc -libverbs ./ib_verbs_q.c -o ib_verbs_q && ./ib_verbs_q
error obtaining device attributes for mlx5_0 errno says Cannot allocate memory

Any ideas on what is happening? Maybe some extra flag needs to be passed in?

Thanks,
Matt

Here’s an updated reproducer said guru has created.

I think the problem with openmpi here is some weird way pgcc is handling
a left shifted value in an enum that ultimately causes the
ibv_exp_query_device() call to fail.

I came up with this reproducer from the libibverbs source:


#include <stdint.h>
#include <stdio.h>
enum verbs_context_mask {
         VERBS_CONTEXT_XRCD         = (uint64_t)1 << 0,
         VERBS_CONTEXT_SRQ          = (uint64_t)1 << 1,
         VERBS_CONTEXT_QP           = (uint64_t)1 << 2,
         VERBS_CONTEXT_RESERVED     = (uint64_t)1 << 3,
         VERBS_CONTEXT_EXP          = (uint64_t)1 << 62
};

int main() {
         uint64_t verbs_context_exp          = (uint64_t)1 << 62;
         printf("VERBS_CONTEXT_EXP=%lx\n", VERBS_CONTEXT_EXP);
         printf("verbs_context_exp=%lx\n", verbs_context_exp);

	// Make sure there's an actual inequality, not some
	// weird problem with me not printf()'ing them properly
         if ( VERBS_CONTEXT_EXP != verbs_context_exp ) {
                 printf("Ruh-roh! They don't match\n");
                 return 1;
         }

         return 0;
}

Here’s the output:

$ ~/src/ib_verbs_q> ./test.pgi
VERBS_CONTEXT_EXP=0
verbs_context_exp=4000000000000000
Ruh-roh! They don't match

$ ~/src/ib_verbs_q> ./test.gcc
VERBS_CONTEXT_EXP=4000000000000000
verbs_context_exp=4000000000000000

I really would appreciate it if someone from Portland Group would please reply to either this forum post or to the multiple emails that I have sent now on this issue to trs@pgroup.com. It isn’t clear that the emails are making it through as we don’t get even an automated response. I have tried both email and the form on the website.

The first example

pgcc -libverbs -o ib_verbs ib_verbs.c

fails to compile with gcc, icc, and pgcc, producing many errors. I am not sure
we are replicating your test.

The second example

#include <stdint.h>
#include <stdio.h>
enum verbs_context_mask {
    VERBS_CONTEXT_XRCD     = (uint64_t)1 << 0,
    VERBS_CONTEXT_SRQ      = (uint64_t)1 << 1,
    VERBS_CONTEXT_QP       = (uint64_t)1 << 2,
    VERBS_CONTEXT_RESERVED = (uint64_t)1 << 3,
    VERBS_CONTEXT_EXP      = (uint64_t)1 << 62
};

int main() {
    uint64_t verbs_context_exp = (uint64_t)1 << 62;
    printf("VERBS_CONTEXT_EXP=%lx\n", VERBS_CONTEXT_EXP);
    printf("verbs_context_exp=%lx\n", verbs_context_exp);

    // Make sure there's an actual inequality, not some
    // weird problem with me not printf()'ing them properly
    if ( VERBS_CONTEXT_EXP != verbs_context_exp ) {
        printf("Ruh-roh! They don't match\n");
        return 1;
    }

    return 0;
}


We were able to reproduce this as you did with icc, gcc, and pgcc, and we have logged the problem as TPR 22753.

dave

Dave,
Thanks for responding.

-Nick

I’m running into the exact same issue with the latest 17.1 compiler version.

Are there any workarounds for this? Has any progress been made on bug TPR 22753?

This has been addressed in PGI 17.7.