Unified memory: Error at execution

Hi,

I am currently using a GeForce GTX 1080 Ti.
I compile my program with the managed flag, but upon execution I get the cryptic output below. What could the reason be, and how would I go about resolving it? Is there something about the current data that CUDA will not accept?

I hope I can get this resolved! Thanks!

I have not installed CUDA other than what came with the PGI 17.10 Community Edition.

I compile with the flag:
-ta=tesla:cc60,managed,cuda9.0
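For completeness, the full command line is along these lines (the file name here is just a placeholder; the real build goes through our makefiles):

pgfortran -ta=tesla:cc60,managed,cuda9.0 -Minfo=accel -c some_solver_file.f90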

OUTPUT:
pool allocator error: there cannot be two blocks ( existing and new) at same address 140373217501472
Block -> Location: 140373217501472, Status: 1, Size: 2304, Header Loc: 0x7fab18000960
Block -> Location: 140373217501472, Status: 1, Size: 2304, Header Loc: 0x7fab10000960
pool allocator error: there cannot be two blocks ( existing and new) at same address 140373217505200
Block -> Location: 140373217505200, Status: 1, Size: 2304, Header Loc: 0x7fab18000e60
Block -> Location: 140373217505200, Status: 1, Size: 2304, Header Loc: 0x7fab10000a00
pool allocator error: there cannot be two blocks ( existing and new) at same address 140373217503984
Block -> Location: 140373217503984, Status: 1, Size: 336, Header Loc: 0x7fab18001390
Block -> Location: 140373217503984, Status: 1, Size: 336, Header Loc: 0x7fab10000b40
pool allocator error: there cannot be two blocks ( existing and new) at same address 140373217503856
Block -> Location: 140373217503856, Status: 1, Size: 128, Header Loc: 0x7fab18001890
Block -> Location: 140373217503856, Status: 1, Size: 128, Header Loc: 0x7fab10000be0
pool allocator error: there cannot be two blocks ( existing and new) at same address 140373217508928
Block -> Location: 140373217508928, Status: 1, Size: 2304, Header Loc: 0x7fab20001ba0
Block -> Location: 140373217508928, Status: 1, Size: 2304, Header Loc: 0x7fab10000d20
pool allocator error: there cannot be two blocks ( existing and new) at same address 140373217495440
Block -> Location: 140373217495440, Status: 1, Size: 2304, Header Loc: 0x7fab200008c0
Block -> Location: 140373217495440, Status: 1, Size: 2304, Header Loc: 0x7fab080008c0
pool allocator error: there cannot be two blocks ( existing and new) at same address 140373217499168
Block -> Location: 140373217499168, Status: 1, Size: 2304, Header Loc: 0x7fab20000d50
Block -> Location: 140373217499168, Status: 1, Size: 2304, Header Loc: 0x7fab08000960
pool allocator error: there cannot be two blocks ( existing and new) at same address 140373217533936
pool allocator error: there cannot be two blocks ( existing and new) at same address 140373217514960
Block -> Location: 140373217514960, Status: 1, Size: 2304, Header Loc: 0x7fab20002560
Block -> Location: 140373217514960, Status: 1, Size: 2304, Header Loc: 0x7fab180026b0
Segmentation fault (core dumped)


This is the output from pgaccelinfo:
CUDA Driver Version: 9010
NVRM version: NVIDIA UNIX x86_64 Kernel Module 390.25 Wed Jan 24 20:02:43 PST 2018

Device Number: 0
Device Name: GeForce GTX 1080 Ti
Device Revision Number: 6.1
Global Memory Size: 11706630144
Number of Multiprocessors: 28
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1582 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 5505 MHz
Memory Bus Width: 352 bits
L2 Cache Size: 2883584 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
PGI Compiler Option: -ta=tesla:cc60

Hi olavaaf,

I have not seen this error before, so I am not sure what’s wrong. As a workaround, you can try disabling the pool allocator (via the environment variable “PGI_ACC_POOL_ALLOC=0”).
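For example, in a bash shell (substituting your own executable name):

export PGI_ACC_POOL_ALLOC=0
./your_app

With the pool disabled, each allocation goes straight to cuMemAllocManaged rather than being sub-allocated from pooled blocks, so it can be slower but sidesteps the pool’s bookkeeping.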

Are you able to share the code? I’d be interested in investigating the issue.
If so, please either post a reproducing example here or send one to PGI Customer Service (trs@pgroup.com).

Thanks,
Mat

Hi Mat,

Thank you for the quick response. I set the environment variable you mentioned, and now the output upon launching is this:

malloc: call to cuMemAllocManaged returned error 201: Invalid context
malloc: call to cuMemAllocManaged returned error 201: Invalid context
0: ALLOCATE: 1440 bytes requested; not enough memory
0: ALLOCATE: 1440 bytes requested; not enough memory
malloc: call to cuMemAllocManaged returned error 201: Invalid context
malloc: call to cuMemAllocManaged returned error 201: Invalid context
0: ALLOCATE: 1440 bytes requested; not enough memory
0: ALLOCATE: 1440 bytes requested; not enough memory

It is quite a large piece of software; I will compress it and send you the code.
Thanks!

Best regards,
Olav

UPDATE: I sent you the link; please look in hecmw1/src/solver for the source files I mentioned.

Thanks Olav,

I’m downloading the code now and will take a look when I get a chance.

It’s interesting that the problem is running out of memory. Do you know how much memory you expect your program to use? Does the program run correctly when not using the GPU (i.e., compiled to target the host with “-ta=multicore”)?
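If it helps, a quick standalone check like this (a minimal CUDA C sketch, nothing to do with your application code) will show how much device memory is actually free:

/* check_mem.cu - print free/total device memory
   build with: nvcc check_mem.cu -o check_mem */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t free_bytes, total_bytes;
    cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("free: %zu MB, total: %zu MB\n", free_bytes >> 20, total_bytes >> 20);
    return 0;
}

Running it before and while your program is up can tell you whether something else is holding GPU memory.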

-Mat

Thanks Mat,

It will consume about 450 MB with the example input data I mentioned. Yes, it runs perfectly fine with “-ta=multicore” (in that case I cannot add “managed”), converging after 2969 iterations.

I hope we can get to the bottom of this.

Thanks again,
Olav

Mat,

About the issue: it only occurs when using “managed” as a flag.
Are you able to reproduce it?

Could there be something additional that I might have overlooked?

My understanding of the “managed” flag is that the runtime will automatically place the data needed by the kernels in GPU memory and keep it there for as long as needed. In plain CUDA terms, I picture it behaving like the sketch below.
Is that understanding correct?
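Here is a minimal CUDA C sketch of what I have in mind (just an illustration of the managed-memory mechanism, not our actual Fortran code):

/* managed.cu - unified memory: one allocation, the driver migrates pages on demand
   build with: nvcc managed.cu -o managed */
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void)
{
    const int n = 1024;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float)); /* one pointer, visible to host and device */
    for (int i = 0; i < n; i++) x[i] = 1.0f;  /* first touched on the host */
    scale<<<(n + 255) / 256, 256>>>(x, n);    /* pages migrate to the GPU on access */
    cudaDeviceSynchronize();                  /* sync before the host reads again */
    printf("x[0] = %f\n", x[0]);              /* prints 2.000000 */
    cudaFree(x);
    return 0;
}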

Thanks.
Olav

Hi Olav,

Sorry that I missed your follow-up emails to TRS. Yes, I was running with managed enabled, though as you noted, my runs converged in 7 time steps, while you said it should have taken 2969 steps. Most likely I wasn’t getting to the point where it was running out of memory.

I’ll see if I can figure out what’s wrong, though I’m a bit swamped right now, so it may take a while.

-Mat

Hi Mat,

Thank you for your time. From the printout you sent me, I can see that it diverged after the 7 iterations in that case, as if the preconditioner became unstable with managed memory enabled. As you mentioned, it could be an out-of-memory error; however, that problem should only consume about 450 MB, so if that is the cause I do not understand it.

Step control not defined! Using default step=1
fstr_setup: OK

3x3 B-SSOR-CG(0) 1

1 1.217365E+00
2 1.327884E+00
3 1.366001E+00
4 1.340422E+00
5 1.375788E+00
6 1.384742E+00

HEC-MW-SOLVER-W-3003:

diverged due to indefinite preconditioner

If managed memory is disabled, it converges after those 2969 iterations. If you can reproduce that by disabling “managed”, I think we may have found a real issue, i.e., a difference in behavior when managed is enabled compared to disabled (which gives the correct result).

Thanks again.

Regards,
Olav