nvblas + numpy + Intel MKL >= 2018.3 does not work.

Platform: Ubuntu 16.04.5, NVIDIA GTX 1070 GPU, CUDA 9.2, Python 3.5.2.

I use Python with NVBLAS support by compiling numpy against Intel MKL. It works with MKL 2018.1 and 2018.2, but the GPU is not used with MKL 2018.3, 2018.4, or the 2019 preview.

Here is a simple experiment (I have tried different versions of numpy with the same result):

LD_PRELOAD=/usr/local/cuda/targets/x86_64-linux/lib/libnvblas.so python3
[NVBLAS] NVBLAS_CONFIG_FILE environment variable is set to '/home/bernard/.config/nvblas.conf'
Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> np.__version__
'1.15.2'
>>> np.show_config()
lapack_mkl_info:
    include_dirs = ['/opt/intel/mkl/include']
    library_dirs = ['/opt/intel/mkl/lib/intel64/']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    libraries = ['mkl_rt', 'pthread']
blas_mkl_info:
    include_dirs = ['/opt/intel/mkl/include']
    library_dirs = ['/opt/intel/mkl/lib/intel64/']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    libraries = ['mkl_rt', 'pthread']
blas_opt_info:
    include_dirs = ['/opt/intel/mkl/include']
    library_dirs = ['/opt/intel/mkl/lib/intel64/']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    libraries = ['mkl_rt', 'pthread']
lapack_opt_info:
    include_dirs = ['/opt/intel/mkl/include']
    library_dirs = ['/opt/intel/mkl/lib/intel64/']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    libraries = ['mkl_rt', 'pthread']
>>> a = np.random.rand(10000, 10000)
>>> b = np.random.rand(10000, 10000)
>>> a@b

With Intel MKL versions 2018.1 and 2018.2, nvidia-smi shows Volatile GPU-Util at 100%, and nvblas.log shows:
[NVBLAS] Using devices :0 
[NVBLAS] Config parsed
[NVBLAS] dgemm[gpu]: ta=N, tb=N, m=10000, n=10000, k=10000

But with the newer Intel MKL versions (2018.3, 2018.4, and the 2019.0 preview), Volatile GPU-Util stays at 0% and nvblas.log remains blank, so the GPU is not used at all.

Switching back to MKL 2018.2 makes it work again.
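Besides watching nvidia-smi, a rough way to see which backend does the work is to time the multiply under the same LD_PRELOAD with each MKL version; a run that is not offloaded to the GPU should take noticeably longer for large matrices. This is just a sanity-check sketch (matrix size reduced from the 10000x10000 experiment above so it stays quick), not a benchmark:

```python
import time

import numpy as np

# Time a large dgemm. Run once under LD_PRELOAD=.../libnvblas.so with
# each MKL version; a CPU-only run should be noticeably slower for
# large n than a GPU-offloaded one.
n = 2000  # smaller than the 10000x10000 case above, to keep it quick
a = np.random.rand(n, n)
b = np.random.rand(n, n)

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start
print(f"{n}x{n} dgemm: {elapsed:.3f} s")
```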

I checked my nvblas.conf. Everything seems to be in order (the paths are not tied to a specific Intel MKL version):

# This is the configuration file to use NVBLAS Library
# Setup the environment variable NVBLAS_CONFIG_FILE to specify your own config file.
# By default, if NVBLAS_CONFIG_FILE is not defined, 
# NVBLAS Library will try to open the file "nvblas.conf" in its current directory
# Example : NVBLAS_CONFIG_FILE  /home/cuda_user/my_nvblas.conf
# The config file should have restricted write permissions accesses

# Specify which output log file (default is stderr)
NVBLAS_LOGFILE  nvblas.log

# Enable trace log of every intercepted BLAS calls
NVBLAS_TRACE_LOG_ENABLED

#Put here the CPU BLAS fallback Library of your choice
#It is strongly advised to use full path to describe the location of the CPU Library
#NVBLAS_CPU_BLAS_LIB  /usr/lib/libblas.so
NVBLAS_CPU_BLAS_LIB /opt/intel/mkl/lib/intel64/libmkl_rt.so
#NVBLAS_CPU_BLAS_LIB/home/bernard/opt/openblas-mpi/lib/libopenblas.so

# List of GPU devices Id to participate to the computation 
# Use ALL if you want all your GPUs to contribute
# Use ALL0, if you want all your GPUs of the same type as device 0 to contribute
# However, NVBLAS consider that all GPU have the same performance and PCI bandwidth
# By default if no GPU are listed, only device 0 will be used

#NVBLAS_GPU_LIST 0 2 4
#NVBLAS_GPU_LIST ALL
NVBLAS_GPU_LIST ALL0

# Tile Dimension
NVBLAS_TILE_DIM 2048

# Autopin Memory
NVBLAS_AUTOPIN_MEM_ENABLED

#List of BLAS routines that are prevented from running on GPU (use for debugging purpose
# The current list of BLAS routines supported by NVBLAS are
# GEMM, SYRK, HERK, TRSM, TRMM, SYMM, HEMM, SYR2K, HER2K

#NVBLAS_GPU_DISABLED_SGEMM 
#NVBLAS_GPU_DISABLED_DGEMM 
#NVBLAS_GPU_DISABLED_CGEMM 
#NVBLAS_GPU_DISABLED_ZGEMM 

# Computation can be optionally hybridized between CPU and GPU
# By default, GPU-supported BLAS routines are ran fully on GPU
# The option NVBLAS_CPU_RATIO_<BLAS_ROUTINE> give the ratio [0,1] 
# of the amount of computation that should be done on CPU
# CAUTION : this option should be used wisely because it can actually
# significantly reduced the overall performance if too much work is given to CPU
#NVBLAS_CPU_RATIO_CGEMM 0.07

numpy may choose to use cblas_gemm interface/API:

https://software.intel.com/en-us/mkl-developer-reference-c-cblas-gemm

rather than the Fortran-style BLAS dgemm or dgemm_ interface/API. If it does, NVBLAS will not intercept the call:

https://docs.nvidia.com/cuda/nvblas/index.html#symbols-interception

I don’t know for sure that this is the issue; you wouldn’t expect that simply changing the linked library would have this effect. However, it may be that numpy inspects the linked BLAS implementation and chooses CBLAS instead of the ordinary Fortran BLAS interface according to some heuristic. This should be testable with a tool like strace.
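Part of this hypothesis can be checked directly: NVBLAS only exports the Fortran-style names (dgemm_ and friends), not the cblas_* ones, so a cblas_dgemm call from numpy would resolve straight into MKL and bypass NVBLAS. A minimal ctypes sketch for checking which gemm symbols a given shared library exports (the commented-out paths are examples; substitute your own):

```python
import ctypes

def exported_symbols(libpath, names):
    """Report which of the given symbols a shared library exports."""
    lib = ctypes.CDLL(libpath)
    found = {}
    for name in names:
        try:
            getattr(lib, name)  # raises AttributeError if absent
            found[name] = True
        except AttributeError:
            found[name] = False
    return found

# Example (paths are assumptions; adjust to your installation):
# exported_symbols("/usr/local/cuda/targets/x86_64-linux/lib/libnvblas.so",
#                  ["dgemm_", "cblas_dgemm"])
# If dgemm_ is present but cblas_dgemm is absent, any cblas_dgemm call
# from numpy would bypass NVBLAS entirely.
```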

Is there any difference in the np.show_config() output in the two cases?
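To compare the two cases mechanically, you could capture the np.show_config() output as text under each MKL version and diff the saved files. A small helper sketch (np.show_config() prints to stdout, so we redirect it):

```python
import contextlib
import io

import numpy as np

def config_text():
    """Capture numpy's build configuration output as a string."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        np.show_config()
    return buf.getvalue()

# Save once per MKL version, e.g.:
#   with open("show_config_mkl2018.2.txt", "w") as f:
#       f.write(config_text())
# then diff the two files.
```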