Managed Memory with OpenMP appears to be slow and buggy

Hi Programmers,

I refer to the “Example, Multi-threaded BLAS Test” section in “Fortran CUDA Library Interfaces”.

I have a Tesla P100 in my workstation and I would like to use Managed Memory instead of Device Memory.

I have increased the size of the matrix from 10k to 100k.
I have changed “, device” to “, managed” and recompiled using PGI 19.9.

When I used 44 threads, I received the following error message:

CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=44 ./a.out
Running with 44 threads, each section = 2272
0: copyin Memcpy (dev=0x7e4914000000, host=0x604b60, size=40000000000) FAILED: 700(an illegal memory access was encountered)
0: copyin Memcpy (dev=0x7d8a6e000000, host=0x1748758b60, size=908800000) FAILED: 700(an illegal memory access was encountered)
0: copyin Memcpy (dev=0x7e650a000000, host=0x604b60, size=40000000000) FAILED: 700(an illegal memory access was encountered)
0: copyin Memcpy (dev=0x7d92be000000, host=0x10eea29b60, size=908800000) FAILED: 700(an illegal memory access was encountered)

Hi pcheechoung89936,

For arrays larger than 2GB (with the size set to 100K, the arrays are about 37GB), please compile with the flag “-mcmodel=medium”.
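
As a rough check on these sizes, the arithmetic can be sketched in Python. It assumes 4-byte (single-precision) elements, which is inferred from the size=40000000000 value in the log rather than stated in the thread. The variable names are illustrative only.

```python
# Rough size check for the 100k x 100k case reported in the log.
N = 100_000                # matrix dimension after the change from 10k
BYTES_PER_ELEMENT = 4      # assumption: inferred from size=40000000000 in the log
THREADS = 44               # OMP_NUM_THREADS used in the failing run

full_matrix_bytes = N * N * BYTES_PER_ELEMENT
section_len = N // THREADS                      # "each section" in the program output
section_bytes = section_len * N * BYTES_PER_ELEMENT
two_gib = 2 * 1024**3                           # the 2GB threshold mentioned above

print(full_matrix_bytes)                        # 40000000000
print(round(full_matrix_bytes / 1024**3, 2))    # 37.25
print(section_len)                              # 2272
print(section_bytes)                            # 908800000
print(full_matrix_bytes > two_gib)              # True
```

The 37GB figure is the 40,000,000,000-byte matrix expressed in units of 1024^3 bytes, and the 908,800,000-byte copies in the log match a section of 2272.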

Also, per the comments at the top of the source, the test asks that you set the number of OMP threads to the number of devices on the system.

Hope this helps,
Mat