Segmentation fault in pthread_mutex_lock ()

Hi,

I get a segmentation fault in pthread_mutex_lock() when trying to run my OpenACC code on the GPU. On the CPU it runs fine. The segmentation fault happens right at the first acc data copy directive. Do you have any idea what might be wrong? I can provide further information or access to the code if necessary.

Thank you and regards,
Thomas

Thread 1 "ftg_vdiff_up_te" received signal SIGSEGV, Segmentation fault.
0x00002aaabcce7714 in pthread_mutex_lock () from /lib64/libpthread.so.0
Missing separate debuginfos, use: zypper install libjasper1-debuginfo-1.900.14-195.3.1.x86_64 libjpeg62-debuginfo-62.1.0-30.1.x86_64 libjpeg8-debuginfo-8.0.2-30.3.x86_64 liblzma5-debuginfo-5.0.5-4.852.x86_64 libnuma1-debuginfo-2.0.9-9.1.x86_64 libpython2_7-1_0-debuginfo-2.7.13-27.1.x86_64 libxml2-2-debuginfo-2.9.4-46.3.2.x86_64 libz1-debuginfo-1.2.8-11.1.x86_64
(gdb) bt
#0  0x00002aaabcce7714 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1 0x00002aaaacdaff88 in ?? ()
from /usr/lib64/gcc/x86_64-suse-linux/4.8/…/…/…/…/lib64/libcuda.so.1
#2 0x00002aaaace66471 in ?? ()
from /usr/lib64/gcc/x86_64-suse-linux/4.8/…/…/…/…/lib64/libcuda.so.1
#3 0x00002aaaace665e5 in ?? ()
from /usr/lib64/gcc/x86_64-suse-linux/4.8/…/…/…/…/lib64/libcuda.so.1
#4 0x00002aaaacdb5eb4 in ?? ()
from /usr/lib64/gcc/x86_64-suse-linux/4.8/…/…/…/…/lib64/libcuda.so.1
#5 0x00002aaaacdb7707 in ?? ()
from /usr/lib64/gcc/x86_64-suse-linux/4.8/…/…/…/…/lib64/libcuda.so.1
#6 0x00002aaaacd8a266 in ?? ()
from /usr/lib64/gcc/x86_64-suse-linux/4.8/…/…/…/…/lib64/libcuda.so.1
#7 0x00002aaaacdd79ed in cuInit ()
from /usr/lib64/gcc/x86_64-suse-linux/4.8/…/…/…/…/lib64/libcuda.so.1
#8 0x00002aaaac9a9dd5 in ?? ()
from /apps/common/UES/pgi/17.10/linux86-64/2017/cuda/8.0/lib64/libcudart.so.8.0
#9 0x00002aaaac9a9e31 in ?? ()
from /apps/common/UES/pgi/17.10/linux86-64/2017/cuda/8.0/lib64/libcudart.so.8.0
#10 0x00002aaabcce3c13 in __pthread_once_slow () from /lib64/libpthread.so.0
#11 0x00002aaaac9dc919 in ?? ()
from /apps/common/UES/pgi/17.10/linux86-64/2017/cuda/8.0/lib64/libcudart.so.8.0
#12 0x00002aaaac9a600a in ?? ()
from /apps/common/UES/pgi/17.10/linux86-64/2017/cuda/8.0/lib64/libcudart.so.8.0
#13 0x00002aaaac9a9ceb in ?? ()
from /apps/common/UES/pgi/17.10/linux86-64/2017/cuda/8.0/lib64/libcudart.so.8.0
#14 0x00002aaaac9cbd2a in cudaFree ()
from /apps/common/UES/pgi/17.10/linux86-64/2017/cuda/8.0/lib64/libcudart.so.8.0
#15 0x00002aaaae61aeaf in __pgi_uacc_cuda_initdev ()
from /apps/common/UES/pgi/17.10/linux86-64/17.10/lib/libaccncmp.so
#16 0x00002aaaae402eaa in __pgi_uacc_enumerate ()
from /apps/common/UES/pgi/17.10/linux86-64/17.10/lib/libaccgmp.so
#17 0x00002aaaae4033c3 in __pgi_uacc_initialize ()
from /apps/common/UES/pgi/17.10/linux86-64/17.10/lib/libaccgmp.so
#18 0x00002aaaae3f9e3b in __pgi_uacc_dataenterstart ()
from /apps/common/UES/pgi/17.10/linux86-64/17.10/lib/libaccgmp.so
#19 0x000000000041f0fd in mo_vdiff_upward_sweep::vdiff_up (
kproma=, kbdim=, klev=,
klevm1=, ktrac=, ksfc_type=,
idx_wtr=, pdtime=, pfrc=…, pcfm_tile=…,
aa=…, pcptgz=…, pum1=…, pvm1=…, ptm1=…, pmair=…, pmdry=…,
pqm1=…, pxlm1=…, pxim1=…, pxtm1=…, pgeom1=…, pztkevn=…,
bb=…, pzthvvar=…, pxvar=…, pz0m_tile=…, pkedisp=…, pute_vdf=…,
pvte_vdf=…, pq_vdf=…, pqte_vdf=…, pxlte_vdf=…, pxite_vdf=…,
pxtte_vdf=…, pz0m=…, pthvvar=…, ptke=…, psh_vdiff=…,
pqv_vdiff=…) at …/…/…/src/mo_vdiff_upward_sweep.f90:141
#20 0x000000000040d629 in ftg_test_vdiff_up ()


==24456== Invalid read of size 4
==24456== at 0x16E4E714: pthread_mutex_lock (in /lib64/libpthread-2.22.so)
==24456== by 0x6F16F87: ??? (in /usr/lib64/libcuda.so.375.74)
==24456== by 0x6FCD470: ??? (in /usr/lib64/libcuda.so.375.74)
==24456== by 0x6FCD5E4: ??? (in /usr/lib64/libcuda.so.375.74)
==24456== by 0x6F1CEB3: ??? (in /usr/lib64/libcuda.so.375.74)
==24456== by 0x6F1E706: ??? (in /usr/lib64/libcuda.so.375.74)
==24456== by 0x6EF1265: ??? (in /usr/lib64/libcuda.so.375.74)
==24456== by 0x6F3E9EC: cuInit (in /usr/lib64/libcuda.so.375.74)
==24456== by 0x6B10DD4: ??? (in /apps/common/UES/pgi/17.10/linux86-64/2017/cuda/8.0/lib64/libcudart.so.8.0.44)
==24456== by 0x6B10E30: ??? (in /apps/common/UES/pgi/17.10/linux86-64/2017/cuda/8.0/lib64/libcudart.so.8.0.44)
==24456== by 0x16E4AC12: __pthread_once_slow (in /lib64/libpthread-2.22.so)
==24456== by 0x6B43918: ??? (in /apps/common/UES/pgi/17.10/linux86-64/2017/cuda/8.0/lib64/libcudart.so.8.0.44)
==24456== Address 0x3038 is not stack'd, malloc'd or (recently) free'd
==24456==
==24456==
==24456== Process terminating with default action of signal 11 (SIGSEGV)
==24456== Access not within mapped region at address 0x3038
==24456== at 0x16E4E714: pthread_mutex_lock (in /lib64/libpthread-2.22.so)
==24456== by 0x6F16F87: ??? (in /usr/lib64/libcuda.so.375.74)
==24456== by 0x6FCD470: ??? (in /usr/lib64/libcuda.so.375.74)
==24456== by 0x6FCD5E4: ??? (in /usr/lib64/libcuda.so.375.74)
==24456== by 0x6F1CEB3: ??? (in /usr/lib64/libcuda.so.375.74)
==24456== by 0x6F1E706: ??? (in /usr/lib64/libcuda.so.375.74)
==24456== by 0x6EF1265: ??? (in /usr/lib64/libcuda.so.375.74)
==24456== by 0x6F3E9EC: cuInit (in /usr/lib64/libcuda.so.375.74)
==24456== by 0x6B10DD4: ??? (in /apps/common/UES/pgi/17.10/linux86-64/2017/cuda/8.0/lib64/libcudart.so.8.0.44)
==24456== by 0x6B10E30: ??? (in /apps/common/UES/pgi/17.10/linux86-64/2017/cuda/8.0/lib64/libcudart.so.8.0.44)
==24456== by 0x16E4AC12: __pthread_once_slow (in /lib64/libpthread-2.22.so)
==24456== by 0x6B43918: ??? (in /apps/common/UES/pgi/17.10/linux86-64/2017/cuda/8.0/lib64/libcudart.so.8.0.44)

Hi Thomas,

It looks like the error occurs while the runtime is trying to initialize your device. Unfortunately, I have not seen this error before, so I'm not sure what's causing it; my guess would be some type of installation or configuration issue with your device.

What CUDA driver and device do you have installed? (you can see this by running “nvidia-smi” or “pgaccelinfo”)

Are you able to run a simple CUDA C program?
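For reference, a quick driver-level check can also be done without a CUDA compiler. This sketch (in Python via ctypes, assuming libcuda.so.1 is on the loader's search path) loads the driver library directly and calls cuInit, which is exactly where the backtrace above crashes:

```python
import ctypes

# Driver-level sanity check: load libcuda directly and call cuInit(0),
# bypassing the OpenACC and CUDA runtimes entirely. If even this crashes,
# the problem is in the driver installation rather than the application code.
try:
    libcuda = ctypes.CDLL("libcuda.so.1")
except OSError as exc:
    result = f"libcuda.so.1 could not be loaded: {exc}"
else:
    rc = libcuda.cuInit(0)  # returns a CUresult; 0 means CUDA_SUCCESS
    result = f"cuInit returned {rc}"

print(result)
```

If cuInit returns 0 here, the driver itself initializes cleanly and the problem is more likely in how the application's runtime libraries were linked.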

-Mat

I compile and link with -ta=nvidia:cc60,cuda8.0 -Mcuda, which works for other simple OpenACC programs.

When using the Cray compiler I don't get any error.

This is the output:

CUDA Driver Version: 8000
NVRM version: NVIDIA UNIX x86_64 Kernel Module 375.74 Wed Jun 14 01:39:39 PDT 2017

Device Number: 0
Device Name: Tesla P100-PCIE-16GB
Device Revision Number: 6.0
Global Memory Size: 17066885120
Number of Multiprocessors: 56
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1328 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: exclusive-process
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 715 MHz
Memory Bus Width: 4096 bits
L2 Cache Size: 4194304 bytes
Max Threads Per SMP: 2048
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
PGI Compiler Option: -ta=tesla:cc60

Thank you!

Great, thanks! So the error must have something to do with the program itself and not the device.

Can you try adding an "acc_init(acc_get_device_type())" call in the main program? This will move the device initialization earlier in the run, so we can see whether the problem is actually in device initialization or in the first kernel launch.

Also, would I be able to get a reproducing example of the code that I can try? If so, please either post or send information to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me.

-Mat

Calling acc_get_device_type() right at the beginning already results in a segmentation fault.

I’ll see if I can provide you an example code.

I just got a very similar error, if not the same.

Here is the backtrace of the error

#0  0x00007ffff7bc8c30 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00007fff3a7d4d0a in cudnn::ops::GetInternalStreams(cudnnContext*, int, CUstream_st**) ()
   from /home/edreis/.local/cuda/lib64/libcudnn_ops_infer.so.8
#2  0x00007fff29575742 in cudnn::cnn::PrecomputedGemmEngine<cudnn::cnn::convolve_launch_pg_pf<float, float, float, float>, 5, 2, 2>::execute_internal_impl(cudnn::backend::VariantPack const&, CUstream_st*) ()
   from /home/edreis/.local/cuda/lib64/libcudnn_cnn_infer.so.8
#3  0x00007fff292f59e1 in cudnn::cnn::EngineInterface::execute(cudnn::backend::VariantPack const&, CUstream_st*)
    () from /home/edreis/.local/cuda/lib64/libcudnn_cnn_infer.so.8
#4  0x00007fff29320915 in cudnn::cnn::EngineContainer<(cudnnBackendEngineName_t)1, 4096ul>::execute_internal_impl(cudnn::backend::VariantPack const&, CUstream_st*) () from /home/edreis/.local/cuda/lib64/libcudnn_cnn_infer.so.8
#5  0x00007fff292f59e1 in cudnn::cnn::EngineInterface::execute(cudnn::backend::VariantPack const&, CUstream_st*)
    () from /home/edreis/.local/cuda/lib64/libcudnn_cnn_infer.so.8
#6  0x00007fff293c77b0 in cudnn::cnn::AutoTransformationExecutor::execute_pipeline(cudnn::cnn::EngineInterface&, cudnn::backend::VariantPack const&, CUstream_st*) const ()
   from /home/edreis/.local/cuda/lib64/libcudnn_cnn_infer.so.8
#7  0x00007fff293c7a0f in cudnn::cnn::BatchPartitionExecutor::operator()(cudnn::cnn::EngineInterface&, cudnn::cnn::EngineInterface*, cudnn::backend::VariantPack const&, CUstream_st*) const ()
   from /home/edreis/.local/cuda/lib64/libcudnn_cnn_infer.so.8
#8  0x00007fff293d8585 in cudnn::cnn::GeneralizedConvolutionEngine<cudnn::cnn::EngineContainer<(cudnnBackendEngineName_t)1, 4096ul> >::execute_internal_impl(cudnn::backend::VariantPack const&, CUstream_st*) ()
   from /home/edreis/.local/cuda/lib64/libcudnn_cnn_infer.so.8
#9  0x00007fff292f59e1 in cudnn::cnn::EngineInterface::execute(cudnn::backend::VariantPack const&, CUstream_st*)
    () from /home/edreis/.local/cuda/lib64/libcudnn_cnn_infer.so.8
#10 0x00007fff29309b2f in cudnn::backend::execute(cudnnContext*, cudnn::backend::ExecutionPlan&, cudnn::backend::VariantPack&) () from /home/edreis/.local/cuda/lib64/libcudnn_cnn_infer.so.8
#11 0x00007fff294b38d9 in cudnn::backend::EnginesAlgoMap<cudnnConvolutionFwdAlgo_t, 8>::execute_wrapper(cudnnContext*, cudnnConvolutionFwdAlgo_t, cudnn::backend::ExecutionPlan&, cudnn::backend::VariantPack&) ()
   from /home/edreis/.local/cuda/lib64/libcudnn_cnn_infer.so.8
#12 0x00007fff294afff3 in cudnn::backend::convolutionForward(cudnnContext*, void const*, cudnnTensorStruct const*, void const*, cudnnFilterStruct const*, void const*, cudnnConvolutionStruct const*, cudnnConvolutionFwdAlgo_t, voi

I am using a remote machine on which I don't have admin privileges. Since I needed cuDNN to compile a Torch C++ extension, I had to untar cudnn-linux-x86_64-8.4.1.50_cuda10.2-archive.tar.xz into /home/edreis/.local/cuda/.

I am compiling the extension in the following way:

import os
import torch
from torch.utils.cpp_extension import load
from pathlib import Path


__cpp_ext__ = None
def __lazzy_load__(verbose=False):
  # Compile and load the extension on first use, then cache it.
  global __cpp_ext__
  if __cpp_ext__ is None:
    __parent__ = Path(__file__).absolute().parent
    __cpp_ext__ = load(
      name="cudnn_convolution",
      sources=[f"{__parent__}/cudnn_convolution.cpp", f"{__parent__}/cudnn_utils.cpp"],
      extra_include_paths=["/home/edreis/.local/cuda/include"],
      extra_ldflags=[
          "-L/home/edreis/.local/cuda/lib64",
          "-lcudnn",
          "-lnvToolsExt"],
      with_cuda=True,
      verbose=verbose,
    )
    if verbose:
      print(f"{os.path.basename(__file__)}: Cpp CuDNN Extension Compiled and Loaded!")
  return __cpp_ext__
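One thing that may be worth ruling out with a private cuDNN install like this: the extension is compiled against the headers under ~/.local/cuda, but at run time the dynamic loader may resolve libcudnn from a different location, and a version mismatch between the two can crash inside cuDNN. A hedged sketch of a preload check (the library names are taken from the backtrace above; the install path is an assumption):

```python
import ctypes
import os

# Sketch, not a confirmed fix: preload the privately untarred cuDNN libraries
# with RTLD_GLOBAL before the extension is loaded, so the dynamic loader
# resolves the same libcudnn the extension was compiled against.
CUDNN_LIB = os.path.expanduser("~/.local/cuda/lib64")  # assumed install path
preloaded = []
for name in ("libcudnn_ops_infer.so.8", "libcudnn_cnn_infer.so.8"):
    path = os.path.join(CUDNN_LIB, name)
    if os.path.exists(path):
        ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
        preloaded.append(name)

print("preloaded:", preloaded)
```

Alternatively, running ldd on the compiled extension .so shows which libcudnn the loader actually picks, which can be compared against the private install.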

This is the output of nvidia-smi

(base) [edreis@compute-0-2 PyTorch-cuDNN-Convolution]$ nvidia-smi 
Tue Jun 28 22:46:00 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P40           Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   23C    P0    48W / 250W |      0MiB / 22919MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Any advice?

Hi eduardo4jesus,

te85kibe never got back to me with an example or posted a follow-up so I don’t know what the conclusion was.

This forum is for the HPC Compilers and directive-based GPU programming, so I would normally move your question over to another forum; but since you posted a follow-up rather than creating a new topic, I can't.

What I’d suggest is posting this question over on the cuDNN forum (cuDNN - NVIDIA Developer Forums) since they should have better insight in this area.

-Mat