cuEventCreate error when switching from 17.10 to 18.4

Hi,

after switching the PGI compiler version from 17.10 to 18.4, all my tests fail with the following error:

call to cuEventCreate returned error 201: Invalid context

Error: _mp_pcpu_reset: lost thread

I suppose the second line is just a consequence of the invalid context issue.

Searching through the forum, this seems to point at an issue with the PGI runtime checking whether a valid context to connect to already exists. Some posts mention that valgrind flags this as an error, but that it shouldn’t really be one and shouldn’t stop the program from executing normally.

Tracking down the position in the code where it most likely happens, using the output of PGI_ACC_NOTIFY with the various integer arguments, I believe it occurs on a call to #pragma acc update self(…).
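
For reference, the failing pattern looks roughly like this (a sketch only, with a hypothetical array name, not the actual code):

// Illustration only: data created on the device earlier in the run,
// later copied back to the host with an update directive.
#pragma acc enter data copyin(field[0:n])
// ... device computation ...
#pragma acc update self(field[0:n])   // the error is reported at a call like this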

Thanks,
LS

Hi LS,

Given the “_mp_pcpu_reset” error, I’m assuming that the update directive is within an OpenMP parallel region?

In looking through our internal problem reports, the only similar issue I see is one where a user with a mixed OpenMP/OpenACC application didn’t have “-mp” on the link line. Not sure this applies to your case, but can you check whether you link with “-mp”?
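
For example, the link step should look something like “pgc++ -acc -ta=tesla -mp foo.o bar.o -o app” (just an illustration; adjust the target flags to your actual build).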

-Mat

Hi Mat,

the update directive is not within an OpenMP parallel region, but you are right that it is a mixed OpenMP/OpenACC application and that we were also missing “-mp” on the link line. I was under the impression that we were using it, but on double-checking it turns out we didn’t. Adding it to the link line fixed the problem in my first test case. I will run more checks, but I am hopeful that this was the root cause.

Thanks for your prompt response as always,
LS

Hi LS,

Sounds like you are running into the same issue as TPR#24412 so hopefully adding “-mp” to the link will fix the problem. We’re looking at adding “-mp” by default to the link line so this issue won’t occur in the future.

-Mat

Hi Mat,

all my single-GPU tests pass when adding “-mp” to the link line (setting this as the default would definitely help). BTW, do I have access to view the details of TPR#24412, or will I only see it once it has been fixed and published with a new release along with the release notes?

The “bad” news is that all my multi-GPU tests still fail despite the use of “-mp”. It may or may not be related to this issue, but since it also occurred when changing from 17.10 to 18.4 and happens at the same source line where acc update self is called, I will describe it here and not in a new ticket to keep the context.

I get the following error when acc update self tries to copy device data to the host for the process with rank 1 (for the process with rank 0 everything works fine):

FATAL ERROR: data in update host clause was not found on device 1: name=(null)

I understand the error message, so I extracted the host addresses of all the arrays in the update statement. I did an acc_present_dump after allocating and initializing the respective arrays on the device at the beginning of the program.
Just before the acc update self statement I put another acc_present_dump that only triggers for rank 1. Its output is

Present table dump for device[0]: , threadid=0       
...empty...

So it claims there is nothing on the device.
However, when the acc update self statement is executed and gives the above error, the runtime dumps the present table, which is NOT empty and actually contains the relevant arrays (host/device pointer mapping, correct size, and corresponding allocated blocks with size >= the array sizes). Given that output I would say the relevant data is on the device and it should be possible to copy it to the host.

I find it irritating that acc_present_dump executed by rank 1 just before the acc update self reports “Present table dump for device[0]” and not device[2], as it does at the beginning of the program and when exiting with the error.
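
For reference, the instrumentation is essentially the following (a simplified sketch with placeholder array names, not the real code):

// Simplified sketch of where the dumps sit (placeholder names).
#pragma acc enter data copyin(ex[0:n], ey[0:n], ez[0:n])   // beginning of program
acc_present_dump();        // dump 1: arrays are listed for device[2]
// ... time stepping ...
acc_present_dump();        // dump 2 (rank 1 only): reported as device[0], empty
#pragma acc update self(ex[0:n], ey[0:n], ez[0:n])         // FATAL ERROR for rank 1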

Anything else I can check or any idea what could be wrong here?
The following is my relevant program output:

Addresses of six arrays
0xa6763d0                                            
0xa685100                                            
0xa693e30                                            
0xa4cbe50                                            
0xa618f70                                            
0xa63ad90
Calling acc_present_dump for rank 1 at the beginning of the program
Present table dump for device[2]: NVIDIA Tesla GPU 1, compute capability 6.0, threadid=1                                                                                                                                 
host:0x73e3d20 device:0x2aab03000600 size:4 presentcount:0+1 line:107 name:(null)                                                                                                                                        
host:0x73e65c0 device:0x2aab03000400 size:4 presentcount:0+1 line:107 name:(null)                                                                                                                                        
host:0x73e7230 device:0x2aab03000000 size:248 presentcount:0+1 line:107 name:_T17372424_7326                                                                                                                             
host:0x73e8bf0 device:0x2aab03001400 size:1024 presentcount:0+1 line:7037 name:(null)                                                                                                                                    
host:0x73ece70 device:0x2aab03002000 size:16 presentcount:0+1 line:7075 name:(null)                                                                                                                                      
host:0xa421cb0 device:0x2aab03200000 size:40000 presentcount:0+12 line:6592 name:_T24949392_10803                                                                                                                        
host:0xa42b900 device:0x2aab03477000 size:69376 presentcount:0+1 line:7086 name:(null)                                                                                                                                   
host:0xa43c810 device:0x2aab03488000 size:69376 presentcount:0+1 line:7086 name:(null)                                                                                                                                   
host:0xa4c4090 device:0x2aab03209e00 size:17344 presentcount:0+1 line:5664 name:(null)                                                                                                                                   
host:0xa4c9220 device:0x2aab03001c00 size:256 presentcount:0+1 line:7042 name:(null)                                                                                                                                     
host:0xa4c9330 device:0x2aab03001e00 size:256 presentcount:0+1 line:7047 name:(null)                                                                                                                                     
host:0xa4c9440 device:0x2aab03000c00 size:256 presentcount:0+1 line:6842 name:(null)                                                                                                                                     
host:0xa4c9550 device:0x2aab03000e00 size:256 presentcount:0+1 line:6852 name:(null)                                                                                                                                     
host:0xa4cbe50 device:0x2aab03223600 size:60704 presentcount:0+1 line:3646 name:(null)                                                                                                                                   
host:0xa4df2e0 device:0x2aab03000200 size:248 presentcount:0+1 line:107 name:_T17372424_7326                                                                                                                             
host:0xa4e3be0 device:0x2aab03001000 size:1024 presentcount:0+1 line:7037 name:(null)                                                                                                                                    
host:0xa4e48d0 device:0x2aab0320e200 size:17344 presentcount:0+1 line:5664 name:(null)                                                                                                                                   
host:0xa4e8ca0 device:0x2aab03212600 size:17344 presentcount:0+1 line:5664 name:(null)                                                                                                                                   
host:0xa4ed070 device:0x2aab03216a00 size:17344 presentcount:0+1 line:5664 name:(null)                                                                                                                                   
host:0xa4f1440 device:0x2aab03001800 size:1024 presentcount:0+1 line:7037 name:(null)                                                                                                                                    
host:0xa4f1850 device:0x2aab03000800 size:1024 presentcount:0+1 line:6592 name:(null)                                                                                                                                    
host:0xa5641f0 device:0x2aab03400000 size:69376 presentcount:0+1 line:6994 name:(null)                                                                                                                                   
host:0xa575100 device:0x2aab03466000 size:69376 presentcount:0+1 line:7086 name:(null)                                                                                                                                   
host:0xa586010 device:0x2aab03411000 size:69376 presentcount:0+1 line:6994 name:(null)                                                                                                                                   
host:0xa596f20 device:0x2aab03422000 size:69376 presentcount:0+1 line:6994 name:(null)                                                                                                                                   
host:0xa5a7e30 device:0x2aab03433000 size:69376 presentcount:0+1 line:6994 name:(null)                                                                                                                                   
host:0xa5b8d40 device:0x2aab03444000 size:69376 presentcount:0+1 line:6994 name:(null)                                                                                                                                   
host:0xa5c9c50 device:0x2aab03455000 size:69376 presentcount:0+1 line:6994 name:(null)                                                                                                                                   
host:0xa5dab60 device:0x2aab0321ae00 size:17344 presentcount:0+1 line:5664 name:(null)                                                                                                                                   
host:0xa5def30 device:0x2aab0321f200 size:17344 presentcount:0+1 line:5664 name:(null)                                                                                                                                   
host:0xa618f70 device:0x2aab03232400 size:60704 presentcount:0+1 line:3646 name:(null)                                                                                                                                   
host:0xa63ad90 device:0x2aab03499000 size:69120 presentcount:0+1 line:3646 name:(null)                                                                                                                                   
host:0xa64bba0 device:0x2aab03241200 size:60480 presentcount:0+1 line:3646 name:(null)                                                                                                                                   
host:0xa65a7f0 device:0x2aab03250000 size:60480 presentcount:0+1 line:3646 name:(null)                                                                                                                                   
host:0xa669440 device:0x2aab0325ee00 size:53116 presentcount:0+1 line:3646 name:(null)                                                                                                                                   
host:0xa6763d0 device:0x2aab0326be00 size:60704 presentcount:0+1 line:3646 name:(null)                                                                                                                                   
host:0xa685100 device:0x2aab0327ac00 size:60704 presentcount:0+1 line:3646 name:(null)                                                                                                                                   
host:0xa693e30 device:0x2aab034a9e00 size:69120 presentcount:0+1 line:3646 name:(null)                                                                                                                                   
host:0xa6a4c40 device:0x2aab03289a00 size:60480 presentcount:0+1 line:3646 name:(null)                                                                                                                                   
host:0xa6b3890 device:0x2aab03298800 size:60480 presentcount:0+1 line:3646 name:(null)                                                                                                                                   
host:0xa6c24e0 device:0x2aab032a7600 size:53116 presentcount:0+1 line:3646 name:(null)                                                                                                                                   
allocated block device:0x2aab03000000 size:512 thread:1                                                                                                                                                                  
allocated block device:0x2aab03000200 size:512 thread:1                                                                                                                                                                  
allocated block device:0x2aab03000400 size:512 thread:1                                                                                                                                                                  
allocated block device:0x2aab03000600 size:512 thread:1                                                                                                                                                                  
allocated block device:0x2aab03000800 size:1024 thread:1                                                                                                                                                                 
allocated block device:0x2aab03000c00 size:512 thread:1                                                                                                                                                                  
allocated block device:0x2aab03000e00 size:512 thread:1                                                                                                                                                                  
allocated block device:0x2aab03001000 size:1024 thread:1                                                                                                                                                                 
allocated block device:0x2aab03001400 size:1024 thread:1                                                                                                                                                                 
allocated block device:0x2aab03001800 size:1024 thread:1                                                                                                                                                                 
allocated block device:0x2aab03001c00 size:512 thread:1                                                                                                                                                                  
allocated block device:0x2aab03001e00 size:512 thread:1                                                                                                                                                                  
allocated block device:0x2aab03002000 size:512 thread:1                                                                                                                                                                  
allocated block device:0x2aab03200000 size:40448 thread:1                                                                                                                                                                
allocated block device:0x2aab03209e00 size:17408 thread:1                                                                                                                                                                
allocated block device:0x2aab0320e200 size:17408 thread:1                                                                                                                                                                
allocated block device:0x2aab03212600 size:17408 thread:1                                                                                                                                                                
allocated block device:0x2aab03216a00 size:17408 thread:1                                                                                                                                                                
allocated block device:0x2aab0321ae00 size:17408 thread:1                                                                                                                                                                
allocated block device:0x2aab0321f200 size:17408 thread:1                                                                                                                                                                
allocated block device:0x2aab03223600 size:60928 thread:1                                                                                                                                                                
allocated block device:0x2aab03232400 size:60928 thread:1                                                                                                                                                                
allocated block device:0x2aab03241200 size:60928 thread:1                                                                                                                                                                
allocated block device:0x2aab03250000 size:60928 thread:1                                                                                                                                                                
allocated block device:0x2aab0325ee00 size:53248 thread:1                                                                                                                                                                
allocated block device:0x2aab0326be00 size:60928 thread:1                                                                                                                                                                
allocated block device:0x2aab0327ac00 size:60928 thread:1                                                                                                                                                                
allocated block device:0x2aab03289a00 size:60928 thread:1                                                                                                                                                                
allocated block device:0x2aab03298800 size:60928 thread:1                                                                                                                                                                
allocated block device:0x2aab032a7600 size:53248 thread:1                                                                                                                                                                
allocated block device:0x2aab03400000 size:69632 thread:1                                                                                                                                                                
allocated block device:0x2aab03411000 size:69632 thread:1                                                                                                                                                                
allocated block device:0x2aab03422000 size:69632 thread:1                                                                                                                                                                
allocated block device:0x2aab03433000 size:69632 thread:1                                                                                                                                                                
allocated block device:0x2aab03444000 size:69632 thread:1                                                                                                                                                                
allocated block device:0x2aab03455000 size:69632 thread:1                                                                                                                                                                
allocated block device:0x2aab03466000 size:69632 thread:1                                                                                                                                                                
allocated block device:0x2aab03477000 size:69632 thread:1                                                                                                                                                                
allocated block device:0x2aab03488000 size:69632 thread:1                                                                                                                                                                
allocated block device:0x2aab03499000 size:69120 thread:1                                                                                                                                                                
allocated block device:0x2aab034a9e00 size:69120 thread:1                                                                                                                                                                
deleted block   device:0x2aab03002200 size:512 thread 1      
Calling acc_present_dump for rank 1 before acc update self
Present table dump for device[0]: , threadid=0       
...empty...                                          

Present table dump for device[2]: NVIDIA Tesla GPU 1, compute capability 6.0, threadid=2
host:0x73d7bd0 device:0x2aab03007c00 size:32 presentcount:0+1 line:2358 name:(null)     
host:0x73e3d20 device:0x2aab03000600 size:4 presentcount:0+1 line:107 name:(null)       
host:0x73e4720 device:0x2aab03007800 size:32 presentcount:0+1 line:2358 name:(null)     
host:0x73e4750 device:0x2aab03007400 size:32 presentcount:0+1 line:2358 name:(null)     
host:0x73e65c0 device:0x2aab03000400 size:4 presentcount:0+1 line:107 name:(null)       
host:0x73e7230 device:0x2aab03000000 size:248 presentcount:0+1 line:107 name:_T17372424_7326
host:0x73e7990 device:0x2aab03007e00 size:32 presentcount:0+1 line:2358 name:(null)         
host:0x73e7a20 device:0x2aab03007a00 size:32 presentcount:0+1 line:2358 name:(null)         
host:0x73e7a50 device:0x2aab03007600 size:32 presentcount:0+1 line:2358 name:(null)         
host:0x73e7ab0 device:0x2aab03608800 size:1084 presentcount:0+1 line:2400 name:(null)       
host:0x73e7f00 device:0x2aab03608200 size:1084 presentcount:0+1 line:2400 name:(null)       
host:0x73e8350 device:0x2aab03607600 size:1084 presentcount:0+1 line:2358 name:(null)       
host:0x73e87a0 device:0x2aab03607000 size:1084 presentcount:0+1 line:2358 name:(null)       
host:0x73e8bf0 device:0x2aab03001400 size:1024 presentcount:0+1 line:7037 name:(null)       
host:0x73ece70 device:0x2aab03002000 size:16 presentcount:0+1 line:7075 name:(null)         
host:0xa421cb0 device:0x2aab03200000 size:40000 presentcount:0+6783 line:6592 name:_T24949392_10803
host:0xa42b900 device:0x2aab03477000 size:69376 presentcount:0+1 line:7086 name:(null)             
host:0xa43c810 device:0x2aab03488000 size:69376 presentcount:0+1 line:7086 name:(null)             
host:0xa4c4090 device:0x2aab03209e00 size:17344 presentcount:0+1 line:5664 name:(null)             
host:0xa4c9220 device:0x2aab03001c00 size:256 presentcount:0+1 line:7042 name:(null)               
host:0xa4c9330 device:0x2aab03001e00 size:256 presentcount:0+1 line:7047 name:(null)               
host:0xa4c9440 device:0x2aab03000c00 size:256 presentcount:0+1 line:6842 name:(null)               
host:0xa4c9550 device:0x2aab03000e00 size:256 presentcount:0+1 line:6852 name:(null)               
host:0xa4cbe50 device:0x2aab03223600 size:60704 presentcount:0+1 line:3646 name:(null)             
host:0xa4df2e0 device:0x2aab03000200 size:248 presentcount:0+1 line:107 name:_T17372424_7326       
host:0xa4e3640 device:0x2aab03009600 size:60 presentcount:0+4498 line:1444 name:bE                 
host:0xa4e3690 device:0x2aab03008000 size:60 presentcount:0+4498 line:1792 name:bH                 
host:0xa4e36e0 device:0x2aab03008400 size:60 presentcount:0+4498 line:1444 name:cE                 
host:0xa4e3730 device:0x2aab03008200 size:60 presentcount:0+4498 line:1792 name:cH                 
host:0xa4e3be0 device:0x2aab03001000 size:1024 presentcount:0+1 line:7037 name:(null)              
host:0xa4e4480 device:0x2aab03607c00 size:1084 presentcount:0+1 line:2358 name:(null)              
host:0xa4e48d0 device:0x2aab0320e200 size:17344 presentcount:0+1 line:5664 name:(null)             
host:0xa4e8ca0 device:0x2aab03212600 size:17344 presentcount:0+1 line:5664 name:(null)             
host:0xa4ed070 device:0x2aab03216a00 size:17344 presentcount:0+1 line:5664 name:(null)             
host:0xa4f1440 device:0x2aab03001800 size:1024 presentcount:0+1 line:7037 name:(null)              
host:0xa4f1850 device:0x2aab03000800 size:1024 presentcount:0+1 line:6592 name:(null)              
host:0xa5641f0 device:0x2aab03400000 size:69376 presentcount:0+1 line:6994 name:(null)             
host:0xa575100 device:0x2aab03466000 size:69376 presentcount:0+1 line:7086 name:(null)             
host:0xa586010 device:0x2aab03411000 size:69376 presentcount:0+1 line:6994 name:(null)             
host:0xa596f20 device:0x2aab03422000 size:69376 presentcount:0+1 line:6994 name:(null)             
host:0xa5a7e30 device:0x2aab03433000 size:69376 presentcount:0+1 line:6994 name:(null)             
host:0xa5b8d40 device:0x2aab03444000 size:69376 presentcount:0+1 line:6994 name:(null)             
host:0xa5c9c50 device:0x2aab03455000 size:69376 presentcount:0+1 line:6994 name:(null)             
host:0xa5dab60 device:0x2aab0321ae00 size:17344 presentcount:0+1 line:5664 name:(null)             
host:0xa5def30 device:0x2aab0321f200 size:17344 presentcount:0+1 line:5664 name:(null)             
host:0xa618f70 device:0x2aab03232400 size:60704 presentcount:0+1 line:3646 name:(null)             
host:0xa63ad90 device:0x2aab03499000 size:69120 presentcount:0+1 line:3646 name:(null)             
host:0xa64bba0 device:0x2aab03241200 size:60480 presentcount:0+1 line:3646 name:(null)             
host:0xa65a7f0 device:0x2aab03250000 size:60480 presentcount:0+1 line:3646 name:(null)             
host:0xa669440 device:0x2aab0325ee00 size:53116 presentcount:0+1 line:3646 name:(null)             
host:0xa6763d0 device:0x2aab0326be00 size:60704 presentcount:0+1 line:3646 name:(null)             
host:0xa685100 device:0x2aab0327ac00 size:60704 presentcount:0+1 line:3646 name:(null)             
host:0xa693e30 device:0x2aab034a9e00 size:69120 presentcount:0+1 line:3646 name:(null)             
host:0xa6a4c40 device:0x2aab03289a00 size:60480 presentcount:0+1 line:3646 name:(null)             
host:0xa6b3890 device:0x2aab03298800 size:60480 presentcount:0+1 line:3646 name:(null)             
host:0xa6c24e0 device:0x2aab032a7600 size:53116 presentcount:0+1 line:3646 name:(null)             
host:0xa6cf470 device:0x2aab03002600 size:1024 presentcount:0+1 line:3880 name:(null)              
host:0xa6cf880 device:0x2aab03002a00 size:1024 presentcount:0+1 line:3880 name:(null)              
host:0xa6cfc90 device:0x2aab03002e00 size:1024 presentcount:0+1 line:3880 name:(null)              
host:0xa6d00a0 device:0x2aab03003200 size:1024 presentcount:0+1 line:3880 name:(null)              
host:0xa6d04b0 device:0x2aab03003600 size:1024 presentcount:0+1 line:3880 name:(null)              
host:0xa6d08c0 device:0x2aab03003a00 size:1024 presentcount:0+1 line:3880 name:(null)              
host:0xa6d1e50 device:0x2aab03002200 size:256 presentcount:0+1 line:3903 name:(null)               
host:0xa6d1f60 device:0x2aab03003e00 size:256 presentcount:0+1 line:3903 name:(null)               
host:0xa6d23e0 device:0x2aab03004000 size:256 presentcount:0+1 line:3921 name:(null)               
host:0xa6d24f0 device:0x2aab03004200 size:256 presentcount:0+1 line:3921 name:(null)               
host:0xa6d2ce0 device:0x2aab032b4600 size:60704 presentcount:0+1 line:4052 name:(null)             
host:0xa6e1a10 device:0x2aab032c3400 size:60704 presentcount:0+1 line:4052 name:(null)             
host:0xa6f0740 device:0x2aab034bac00 size:69120 presentcount:0+1 line:4052 name:(null)             
host:0xa701fa0 device:0x2aab03004400 size:1024 presentcount:0+1 line:3880 name:(null)              
host:0xa7023b0 device:0x2aab03004800 size:1024 presentcount:0+1 line:3880 name:(null)              
host:0xa7027c0 device:0x2aab03004c00 size:1024 presentcount:0+1 line:3880 name:(null)              
host:0xa702bd0 device:0x2aab03005000 size:1024 presentcount:0+1 line:3880 name:(null)              
host:0xa702fe0 device:0x2aab03005400 size:1024 presentcount:0+1 line:3880 name:(null)              
host:0xa7033f0 device:0x2aab03005800 size:1024 presentcount:0+1 line:3880 name:(null)              
host:0xa704ca0 device:0x2aab03005c00 size:256 presentcount:0+1 line:3903 name:(null)               
host:0xa704db0 device:0x2aab03005e00 size:256 presentcount:0+1 line:3903 name:(null)               
host:0xa7055a0 device:0x2aab03006000 size:256 presentcount:0+1 line:3921 name:(null)               
host:0xa7056b0 device:0x2aab03006200 size:256 presentcount:0+1 line:3921 name:(null)               
host:0xa706cb0 device:0x2aab032d2200 size:60704 presentcount:0+1 line:4052 name:(null)             
host:0xa7159e0 device:0x2aab032e1000 size:60704 presentcount:0+1 line:4052 name:(null)             
host:0xa724710 device:0x2aab034cba00 size:69120 presentcount:0+1 line:4052 name:(null)             
host:0xa737190 device:0x2aab03600000 size:3360 presentcount:0+1 line:4652 name:(null)              
host:0xa737ec0 device:0x2aab03600e00 size:3360 presentcount:0+1 line:4652 name:(null)              
host:0xa738bf0 device:0x2aab03601c00 size:3360 presentcount:0+1 line:4652 name:(null)              
host:0xa739920 device:0x2aab03602a00 size:3360 presentcount:0+1 line:4652 name:(null)              
host:0xa73c930 device:0x2aab03603800 size:3360 presentcount:0+1 line:4652 name:(null)              
host:0xa73d660 device:0x2aab03604600 size:3360 presentcount:0+1 line:4652 name:(null)              
host:0xa73e390 device:0x2aab03605400 size:3360 presentcount:0+1 line:4652 name:(null)              
host:0xa73f0c0 device:0x2aab03606200 size:3360 presentcount:0+1 line:4652 name:(null)              
host:0xa740bb0 device:0x2aab03006400 size:256 presentcount:0+1 line:4745 name:(null)               
host:0xa740cc0 device:0x2aab03006600 size:256 presentcount:0+1 line:4745 name:(null)               
host:0xa740dd0 device:0x2aab03006800 size:256 presentcount:0+1 line:4745 name:(null)               
host:0xa740ee0 device:0x2aab03006a00 size:256 presentcount:0+1 line:4745 name:(null)               
host:0xa741db0 device:0x2aab03006c00 size:256 presentcount:0+1 line:4745 name:(null)               
host:0xa741ec0 device:0x2aab03006e00 size:256 presentcount:0+1 line:4745 name:(null)               
host:0xa741fd0 device:0x2aab03007000 size:256 presentcount:0+1 line:4745 name:(null)               
host:0xa7420e0 device:0x2aab03007200 size:256 presentcount:0+1 line:4745 name:(null)               
allocated block device:0x2aab03000000 size:512 thread:1                                            
allocated block device:0x2aab03000200 size:512 thread:1                                            
allocated block device:0x2aab03000400 size:512 thread:1                                            
allocated block device:0x2aab03000600 size:512 thread:1                                            
allocated block device:0x2aab03000800 size:1024 thread:1                                           
allocated block device:0x2aab03000c00 size:512 thread:1                                            
allocated block device:0x2aab03000e00 size:512 thread:1                                            
allocated block device:0x2aab03001000 size:1024 thread:1                                           
allocated block device:0x2aab03001400 size:1024 thread:1                                           
allocated block device:0x2aab03001800 size:1024 thread:1                                           
allocated block device:0x2aab03001c00 size:512 thread:1                                            
allocated block device:0x2aab03001e00 size:512 thread:1                                            
allocated block device:0x2aab03002000 size:512 thread:1                                            
allocated block device:0x2aab03002200 size:512 thread:1                                            
allocated block device:0x2aab03002600 size:1024 thread:1                                           
allocated block device:0x2aab03002a00 size:1024 thread:1                                           
allocated block device:0x2aab03002e00 size:1024 thread:1                                           
allocated block device:0x2aab03003200 size:1024 thread:1                                           
allocated block device:0x2aab03003600 size:1024 thread:1                                           
allocated block device:0x2aab03003a00 size:1024 thread:1                                           
allocated block device:0x2aab03003e00 size:512 thread:1                                            
allocated block device:0x2aab03004000 size:512 thread:1                                            
allocated block device:0x2aab03004200 size:512 thread:1                                            
allocated block device:0x2aab03004400 size:1024 thread:1                                           
allocated block device:0x2aab03004800 size:1024 thread:1                                           
allocated block device:0x2aab03004c00 size:1024 thread:1                                           
allocated block device:0x2aab03005000 size:1024 thread:1                                           
allocated block device:0x2aab03005400 size:1024 thread:1                                           
allocated block device:0x2aab03005800 size:1024 thread:1                                           
allocated block device:0x2aab03005c00 size:512 thread:1                                            
allocated block device:0x2aab03005e00 size:512 thread:1                                            
allocated block device:0x2aab03006000 size:512 thread:1                                            
allocated block device:0x2aab03006200 size:512 thread:1                                            
allocated block device:0x2aab03006400 size:512 thread:1                                            
allocated block device:0x2aab03006600 size:512 thread:1                                            
allocated block device:0x2aab03006800 size:512 thread:1                                            
allocated block device:0x2aab03006a00 size:512 thread:1                                            
allocated block device:0x2aab03006c00 size:512 thread:1                                            
allocated block device:0x2aab03006e00 size:512 thread:1                                            
allocated block device:0x2aab03007000 size:512 thread:1                                            
allocated block device:0x2aab03007200 size:512 thread:1                                            
allocated block device:0x2aab03007400 size:512 thread:1                                            
allocated block device:0x2aab03007600 size:512 thread:1                                            
allocated block device:0x2aab03007800 size:512 thread:1                                            
allocated block device:0x2aab03007a00 size:512 thread:1                                            
allocated block device:0x2aab03007c00 size:512 thread:1                                            
allocated block device:0x2aab03007e00 size:512 thread:1                                            
allocated block device:0x2aab03008000 size:512 thread:1                                            
allocated block device:0x2aab03008200 size:512 thread:1                                            
allocated block device:0x2aab03008400 size:512 thread:1                                            
allocated block device:0x2aab03009600 size:512 thread:1                                            
allocated block device:0x2aab03200000 size:40448 thread:1                                          
allocated block device:0x2aab03209e00 size:17408 thread:1                                          
allocated block device:0x2aab0320e200 size:17408 thread:1                                          
allocated block device:0x2aab03212600 size:17408 thread:1                                          
allocated block device:0x2aab03216a00 size:17408 thread:1                                          
allocated block device:0x2aab0321ae00 size:17408 thread:1                                          
allocated block device:0x2aab0321f200 size:17408 thread:1
allocated block device:0x2aab03223600 size:60928 thread:1
allocated block device:0x2aab03232400 size:60928 thread:1
allocated block device:0x2aab03241200 size:60928 thread:1
allocated block device:0x2aab03250000 size:60928 thread:1
allocated block device:0x2aab0325ee00 size:53248 thread:1
allocated block device:0x2aab0326be00 size:60928 thread:1
allocated block device:0x2aab0327ac00 size:60928 thread:1
allocated block device:0x2aab03289a00 size:60928 thread:1
allocated block device:0x2aab03298800 size:60928 thread:1
allocated block device:0x2aab032a7600 size:53248 thread:1
allocated block device:0x2aab032b4600 size:60928 thread:1
allocated block device:0x2aab032c3400 size:60928 thread:1
allocated block device:0x2aab032d2200 size:60928 thread:1
allocated block device:0x2aab032e1000 size:60928 thread:1
allocated block device:0x2aab03400000 size:69632 thread:1
allocated block device:0x2aab03411000 size:69632 thread:1
allocated block device:0x2aab03422000 size:69632 thread:1
allocated block device:0x2aab03433000 size:69632 thread:1
allocated block device:0x2aab03444000 size:69632 thread:1
allocated block device:0x2aab03455000 size:69632 thread:1
allocated block device:0x2aab03466000 size:69632 thread:1
allocated block device:0x2aab03477000 size:69632 thread:1
allocated block device:0x2aab03488000 size:69632 thread:1
allocated block device:0x2aab03499000 size:69120 thread:1
allocated block device:0x2aab034a9e00 size:69120 thread:1
allocated block device:0x2aab034bac00 size:69120 thread:1
allocated block device:0x2aab034cba00 size:69120 thread:1
allocated block device:0x2aab03600000 size:3584 thread:1
allocated block device:0x2aab03600e00 size:3584 thread:1
allocated block device:0x2aab03601c00 size:3584 thread:1
allocated block device:0x2aab03602a00 size:3584 thread:1
allocated block device:0x2aab03603800 size:3584 thread:1
allocated block device:0x2aab03604600 size:3584 thread:1
allocated block device:0x2aab03605400 size:3584 thread:1
allocated block device:0x2aab03606200 size:3584 thread:1
allocated block device:0x2aab03607000 size:1536 thread:1
allocated block device:0x2aab03607600 size:1536 thread:1
allocated block device:0x2aab03607c00 size:1536 thread:1
allocated block device:0x2aab03608200 size:1536 thread:1
allocated block device:0x2aab03608800 size:1536 thread:1
deleted block   device:0x2aab03008c00 size:512 thread 1
deleted block   device:0x2aab03008800 size:512 thread 1
deleted block   device:0x2aab03009400 size:512 thread 1
deleted block   device:0x2aab03009a00 size:512 thread 1
deleted block   device:0x2aab03008600 size:512 thread 1
deleted block   device:0x2aab03009200 size:512 thread 1
deleted block   device:0x2aab03008e00 size:512 thread 1
deleted block   device:0x2aab03009800 size:512 thread 1
deleted block   device:0x2aab03009000 size:512 thread 1
deleted block   device:0x2aab03008a00 size:512 thread 1
deleted block   device:0x2aab03608e00 size:1536 thread 1
FATAL ERROR: data in update host clause was not found on device 1: name=(null)
 file:/slowfs/scratch_nfs/lschneid/src/release_main/emsolve/emsolve/emsolve/sfdtd/Sfdtd3DSimulation.C _ZN17sfdtd3DSimulation13accUpdateSelfEN3EMS12field_type_tEi line:246

Thanks,
LS

Hi LS,

BTW, do I have access to view the details of TPR#24412 or only once it has been fixed and it is published with a new release along with the release notes?

Unfortunately, no. We tried to put together an external bug tracker many years ago but there were several problems. The first was logistical: for security reasons our web server is not connected to our internal systems, so it was very problematic to maintain and synchronize both an internal and an external system. There were also concerns about the confidentiality of our users’ bug reports. We didn’t want to accidentally make public confidential code sent to us, and tracking what could be public and what had to stay private was too laborious and error prone.

It’s actually one of the reasons why we started the User Forums. Given that the UF is inherently public, we didn’t need to worry about confidentiality of code and issues posted here.

As for this issue, first let me clarify terminology. I typically use the term “rank” for an MPI process, but here I think you’re using “rank” for what I would call an OpenMP thread.

How are you doing your device assignment for each thread?

What appears to be happening is that one of the threads is being assigned to device 2 and then allocating the data on that device. The other threads are being assigned a device, but not allocating data on it.

Hence, when you exit the OpenMP parallel region, the master thread is still assigned to device 1 and therefore does not have access to the data on device 2. If this is the case, you’ll need to call “acc_set_device_num”, or use “#pragma acc set device_num”, to set the master thread to use device 2. Then the update should work.
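
Something along these lines (just a sketch; “dev_with_data” and “field” are hypothetical names, and the device number should be whichever device actually owns the data):

// Sketch only: have the master thread re-select the device that owns the data
// before issuing the update (assumes <openacc.h> is included).
acc_set_device_num(dev_with_data, acc_device_nvidia);  // or: #pragma acc set device_num(dev_with_data)
#pragma acc update self(field[0:n])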

Keeping track of which thread is assigned to which device, and what data is allocated on each, can be tricky. It’s one of the reasons I prefer using MPI for multi-GPU programs. With MPI, you assign each rank to a single device, making it a one-to-one relationship between host process and device, as opposed to a many-to-many relationship when using OpenMP.

I’m thinking that if issues persist, it might be best that I contact you directly. Will you be attending SC18 next week? If so, we can meet and review the code.

-Mat

Hi Mat,

thanks for clarifying the bug tracking issue, I perfectly understand.

Regarding this issue, I probably wasn’t clear enough about how I do multi-GPU, but when I was referring to “rank”, I did mean MPI rank, i.e. we are using MPI for our multi-GPU implementation and each MPI process “talks” to exactly one GPU device, which is set at the beginning of the simulation. With 17.10 everything was fine, i.e. we could reproduce the CPU results.
We only use multi-threading outside of OpenACC code sections and we don’t call any OpenACC code within OpenMP sections. So I think we can rule out any thread-related issues, as only the master thread of each MPI process should process OpenACC code.
With PGI_ACC_NOTIFY set I also see that both devices are getting data transferred to them. If those arrays didn’t exist on both devices, the simulation should abort right at the beginning and not at the end when reading out the data.
When I call acc_present_dump for each MPI process right after I pcreate the arrays on the device, I don’t see anything suspicious.

Unfortunately, I won’t be able to make it to SC18 next week.

I will try to thoroughly verify all my observations and then get back to you directly if setting up a code review seems to be the most efficient way to sort this out.

Thanks,
LS

Hi Mat,

I have investigated a little further and found something strange.

After figuring out what accelerators are available to the various processes running on either the same or different nodes, I run the following code to assign each process its unique device:

#ifdef _OPENACC
    MPI::COMM_WORLD.Scatter(&proc_accelerator_id[0],1,MPI::INT,&acc_dev_id,1,MPI::INT,0);
    if(MPI::COMM_WORLD.Get_rank()==0){
      LogFile << "Accelerator (GPU) IDs: ";
      for(size_t i=0;i<proc_accelerator_id.size();++i){
	LogFile << proc_accelerator_id[i] << " ";
      }
      LogFile << "\n";
    }
    MPI::COMM_WORLD.Barrier();
    if(acc_dev_id<num_devs){
      acc_set_device_num(acc_dev_id,dev_type);
#pragma acc kernels
 {}
#ifdef _DEF_PGI_ACCEL
      char device_name[32];
      acc_get_current_device_name(device_name,32);
#endif
      std::cout << "Accelerator (GPU) device number that will be used for process "
	  << pid << ": " << acc_get_device_num(dev_type)
#ifdef _DEF_PGI_ACCEL
	  << " (" << device_name << ")"
#endif
	  << "\n";

    }else{
      EMS_Error("Error: Could not set accelerator device ID");
    }
    if(MPI::COMM_WORLD.Get_rank()==0){
      check_openacc_supported_input();
    }
    //acc_init(dev_type);
    MPI::COMM_WORLD.Barrier();
#endif

This is the only place in the entire code where I use acc_set_device_num().
The output is

<hostname>   <#procs per host>  (<pid1>, <pid2>, ... )  <#accelerators per host>
tesla-dell-lnx  2  (143028, 143029)  2                                          
Accelerator (GPU) IDs: 0 1                                                                                                                

Accelerator (GPU) device number that will be used for process 143028: 0 (Tesla V100-PCIE-32GB)
Accelerator (GPU) device number that will be used for process 143029: 1 (Tesla P100-PCIE-16GB)

When I look at nvidia-smi output I see

Thu Nov  8 16:53:49 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:04:00.0 Off |                    0 |
| N/A   30C    P0    38W / 250W |   2479MiB / 32510MiB |     12%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  On   | 00000000:83:00.0 Off |                    0 |
| N/A   29C    P0    32W / 250W |    305MiB / 16280MiB |     35%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    143028      C   ../../../../bin/s-gpu                     424MiB |
|    0    143029      C   ../../../../bin/s-gpu                     410MiB |
|    1    143029      C   ../../../../bin/s-gpu                     295MiB |

So, nvidia-smi is indicating that MPI process with rank 1 is actually running/accessing both GPUs!

Present table dumps at the beginning seem to indicate that everything is as it should be, i.e. rank 0 dumps a table for GPU 0 and rank 1 dumps a table for GPU 1:

Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 7.0, threadid=1



Present table dump for device[2]: NVIDIA Tesla GPU 1, compute capability 6.0, threadid=1

However, at the end of the program when I call acc update self, both ranks give the same output for acc_present_dump, i.e. they both refer to the same device:

Present table dump for device[0]: , threadid=0                                                   
...empty...

I have also printed the addresses of the host arrays for both ranks and they differ. I also put the acc update self inside an if(rank==0) or if(rank==1) statement to further check that it really is different ranks executing this statement and still producing the same output.

I am puzzled how the host-device assignment can change during execution if I don’t call acc_set_device_num() more than once per rank.

Any thoughts? Or do you think it is time for you to take a look into the code?

Thanks,
LS

Hi LS,

Here’s the code I typically use for device assignment. It uses an MPI-3 feature to determine which ranks are local to a node, and then round-robins the assignment. It’s written in C, so you may need to make some adjustments for your code (if you choose to use it).

#if defined(_OPENACC)
    int num_devices;
    int gpuId;
    MPI_Comm shmcomm;
    int local_rank;
    acc_device_t my_device_type;

    /* Find the rank local to this node, then round-robin the local
       ranks over the available devices. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &shmcomm);
    MPI_Comm_rank(shmcomm, &local_rank);
    my_device_type = acc_get_device_type();
    num_devices = acc_get_num_devices(my_device_type);
    gpuId = local_rank % num_devices;
    acc_set_device_num(gpuId, my_device_type);
#endif

However, I doubt your device selection is the root cause.

So, nvidia-smi is indicating that MPI process with rank 1 is actually running/accessing both GPUs!

Yes and no. I’m assuming that you’re using a CUDA-aware MPI. In some cases, when MPI initializes, it will create a context on the default device (in this case device 0). Other than wasting memory and causing problems with MPS at high rank counts, it shouldn’t matter. When you make the device selection later, the rank should move to the other device but does retain the context on the default device.

Now, I’ve not used a system that mixes different devices on the same node. I wouldn’t think that this would be a problem, but it’s possible.

You can work around this by setting the environment variable “CUDA_VISIBLE_DEVICES” per rank instead of making the device selection in the program. I’d recommend writing a wrapper script which sets this for each rank.

Note that this is not a long-term solution, but we can see whether the problem persists and thus rule out (or confirm) that the mixed devices, with the context being created on the default device, are the problem. The drawback of using CUDA_VISIBLE_DEVICES is that, for CUDA-aware MPI to work, it needs all devices visible, so you’ll lose out on the GPUs being able to transfer data directly to each other.

Present table dump for device[0]: , threadid=0

What’s odd to me is that the earlier present dumps enumerated the devices as 1 and 2. But here, it’s device 0. Why the change in enumeration? Unfortunately, I don’t have an answer.

One speculation is that each OpenMP thread has its own device assignment. Each thread should inherit the device number from the master thread upon creation, but perhaps this isn’t working as expected, hence reverting to the default device? I’ll look into this today and see what I can determine. One thing to try, though, is to retain the rank’s device number (when not using CUDA_VISIBLE_DEVICES) and, in an OpenMP parallel region, have each thread set the device number. No idea if this is the root cause, just trying to think of possibilities.
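
Roughly like this (a sketch only, not tested; assumes <openacc.h> and <omp.h> are included):

// Query the rank's device outside the parallel region, then have every
// OpenMP thread explicitly select that same device.
int dev = acc_get_device_num(acc_device_nvidia);
#pragma omp parallel
{
    acc_set_device_num(dev, acc_device_nvidia);
    // ... threaded work ...
}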

-Mat

Hi Mat,

thanks for the feedback, context, and for sharing the code snippet for the device selection, which looks very elegant.

Just a few quick remarks about the points you raised:

  1. The same problem also shows up on a node with two K20Xm, so that probably rules out different types of GPUs causing this issue.
  2. Unfortunately, we are still using MPICH, i.e. not a CUDA-aware MPI, but it is on the to-do list.
  3. I will try your suggestion of selecting the device for each rank via the environment variable instead of acc_set_device_num in the code.
  4. I will also try explicitly setting the device num for each thread of each rank to see whether that changes the picture.
  5. The enumeration is indeed confusing, and so is the change in the output of acc_present_dump. At the beginning of the simulation it mentions device[1] and device[2] and also reports “NVIDIA Tesla GPU 0/1, compute capability 6.0/7.0, threadid=1”, whereas later it reports device[0] and threadid=0 (instead of 1) and doesn’t mention “NVIDIA Tesla GPU …” at all. Maybe the different output of acc_present_dump can help us understand the context in which it was called. Could device[0] be the host (i.e. the default device) and therefore an empty present table is reported?

Thanks,
LS

Hi Mat,

regarding 4), I put acc_set_device_num just before the acc update self clause to explicitly select the correct device num for each rank, and then I get the CPU/single-GPU results again without any crash or table dump. Calling acc_get_device_num(acc_device_nvidia) just before explicitly setting it to the correct one returns 0 for both ranks.
The behavior does not depend on whether I use 1 or more OpenMP threads. If I only have 1 thread there shouldn’t be any issue with inheriting the device num, right?

Thanks,
Lutz

If I only have 1 thread there shouldn’t be any issue with inheriting the device num, right?

Correct. This was probably an incorrect tack on my part. My internal tests show that mixing MPI with OpenMP and OpenACC in the same code works as expected.

Can you instrument the code to add periodic calls to “acc_get_device_num” to determine when the device number changes?

Does the code call any CUDA libraries or routines? If so, could these be resetting the device number?

Is this “update” in a shared library? If so, maybe it’s linked against a different version of the compiler’s runtime libraries?

-Mat

Hi Mat,

no CUDA libraries or routines are used, just OpenACC pragmas and API calls. None of the OpenACC code gets packed into a shared library.

I will instrument the code with acc_get_device_num and hopefully be able to figure out the point where the device number changes by bisection. As the results are correct when I explicitly set the correct device number before the update call, I am assuming that everything is still fine during the time-stepping, i.e. the actual computation of the physics. During the time-stepping I need to exchange halo data, and as I am not yet using a CUDA-aware MPI implementation it has to go through the host. This implies update calls before and after the respective MPI send/receive calls. If the device numbers were wrong during that stage I would probably end up with garbage at the end (more likely the program would have aborted with an error similar to the one it gives now at the end).
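
The instrumentation will be roughly along these lines (a hypothetical helper, not the actual code):

// Hypothetical debug helper to trace which device a rank is currently using.
#include <openacc.h>
#include <cstdio>

static void report_device(const char* where, int rank)
{
    std::printf("rank %d at %s: device %d\n",
                rank, where, acc_get_device_num(acc_device_nvidia));
}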

As you may remember, we are mixing g++-compiled code with pg++-compiled code and use the PGI compiler for the link step. Since you mentioned some time ago, when I was running into odd problems, that mixing the OpenMP runtimes of g++ and pg++ wasn’t a good idea, I made sure to compile all files containing OpenMP or OpenACC statements with pg++. After having done that, all previous issues disappeared. So I hope this is still a safe approach.

Thanks,
LS

Note that with our LLVM back-end (-Mllvm), interoperating with the GNU OpenMP runtime is easier since we now have hooks into the GOMP runtime. I can’t guarantee it’s completely interoperable, so keeping them separate is still fine, but you can try mixing them.

-Mat

Hi Mat,

I have investigated this issue further and figured out the problem, which, given one of your earlier comments, might be a compiler bug if I understood correctly.

First, disabling device selection in the code and starting each MPI process explicitly using the following command for a dual GPU simulation:

mpiexec.hydra -n 1 env CUDA_VISIBLE_DEVICES="0" <myprogram> : -n 1 env CUDA_VISIBLE_DEVICES="1" <myprogram>

I observe only one process per GPU (so rank 1 doesn’t create a context on the default device) and my program runs without any errors, i.e. acc update self/host finds the right device for each MPI rank.

Tracking down where rank 1 is assigned to device 0 (the default) instead of 1, I came across OpenMP code that I had forgotten about. To optimize IO I had introduced two sections in an OpenMP parallel region. It turns out that the second section gets assigned the default device number, i.e. it doesn’t inherit the device number from the master thread. As I understood you earlier, it probably should.
There are two workarounds: one is to change the order of the OpenMP sections (from an OpenMP point of view this should be identical), and the other is to read the device number before entering the parallel region and then set it explicitly in the sections.
I chose the latter because it documents the intent/problem even without any comments along with it.
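
In sketch form, the workaround I chose looks like this (simplified; do_io() and do_compute() are placeholders for the actual section bodies):

// Capture the rank's device number before the parallel sections and
// re-apply it explicitly inside each section.
int dev = acc_get_device_num(acc_device_nvidia);
#pragma omp parallel sections num_threads(2)
{
#pragma omp section
  {
    acc_set_device_num(dev, acc_device_nvidia);
    do_io();       // placeholder
  }
#pragma omp section
  {
    acc_set_device_num(dev, acc_device_nvidia);
    do_compute();  // placeholder
  }
}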

Not sure whether this issue only occurs if nesting is enabled, so for completeness I summarize the full flow below:

omp_set_nested(1);
omp_set_max_active_levels(2);
#pragma omp parallel sections num_threads(2)
{
#pragma omp section
  {
    do_something();
  }
#pragma omp section
  {
    omp_set_num_threads(std::max(1, omp_get_max_threads() - 1));
#pragma omp parallel
    {
#pragma omp single
      {
        only_one_thread_is_supposed_to_do_this();
      }
      call_function_containing_omp_for_to_parallelize_for_loop();
      omp_set_num_threads(omp_get_max_threads() + 1);
    }
  }
} // end of parallel sections
omp_set_nested(0);

So your intuition was right, i.e. something related to OpenMP threads not being assigned the right device. But since I had forgotten about this single location and insisted that I wasn’t calling any acc code from within an OpenMP region, I didn’t help you much … sorry.

Can I expect a future PGI version to properly inherit the device number in parallel regions?

Thanks,
LS