Error: internal error: invalid thread id

pinnacleman98 · August 6, 2020, 10:08pm

Hi, all:

I created a DLL written in CUDA Fortran and OpenACC and exported the function I want to call from C++/Python using C binding. I tried to test the DLL from Visual Studio 2013 with my existing test program (previously used to test the CPU version of the DLL built with the Intel Fortran compiler), but the test code gave me the following error:

internal error: invalid thread id

I checked that this error occurs in the first OpenACC kernel, which initializes a device array. After some research, I found the following thread, which could be related to my problem

Specifically, what steps are needed to link a VC++ .cpp file to a DLL created with the PGI Fortran compiler using CUDA Fortran and OpenACC?

My development platforms:
PGI fortran compiler community version 19.10
cuda 10.1
Windows 10
Visual studio 2013

dll linking setting:
LINKLIBS=--keeplnk -Xlinker /FORCE:MULTIPLE -lcudaforblas -lcusolver -lcusparse -lmkl_core_dll -lmkl_intel_lp64 -lmkl_intel_thread_dll -lmagma
DLLLINKLIBS = "C:\Program Files\PGI/win64/19.10/lib/acc_init_link_cuda.obj" "C:\Program Files\PGI/win64/19.10/lib/libaccapi.lib" "C:\Program Files\PGI/win64/19.10/lib/libaccg.lib" "C:\Program Files\PGI/win64/19.10/lib/libaccnc.lib" "C:\Program Files\PGI/win64/19.10/lib/libaccg2.lib" "C:\Program Files\PGI/win64/19.10/lib/libcudadevice.lib" "C:\Program Files\PGI/win64/19.10/lib/pgc.lib" "C:\Program Files\PGI/win64/19.10/lib/libnspgc.lib"

Thanks in advance

John

pinnacleman98 · August 7, 2020, 3:02am

Just a few updates.

If I build my CUDA Fortran and OpenACC code into a static library and link my test code against it, it produces the correct result.
I also tried a C version of my test code, compiled with pgcc; it gives me the same error message, invalid thread id.

Thanks

John

MatColgrove · August 7, 2020, 2:04pm

Hi John,

First, you’ll need to compile the CUDA Fortran and OpenACC code without RDC (Relocatable Device Code) support (-ta=tesla:nordc -Mcuda=nordc). RDC requires a link step which can only be performed on static objects and libraries. DLLs require runtime linking, which currently is not available. Unfortunately this means that features that require linking, such as calling device routines or using device module variables found in external files and modules, can’t be used.

Second, you’ll need to add a “DllMain” to your DLL. This routine gets implicitly called when the DLL is loaded, so we can use it to initialize the compiler runtime. The lack of this initialization is why you’re getting the “invalid thread id” error.

Below is an example of the DllMain that I’ve used successfully with 19.10, though it does require some configuration depending on the code. I’ve uncommented the three preinit routines I think you need. Note that since this is C++ code, you’ll want to compile it using the Microsoft C++ compiler. Also, the exact initialization routines can change, so for future readers, the routine names may need adjustment.

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include "test_acc.h"
#include "utils_acc.h"

extern "C"
{
    void __setchk(long*, size_t, size_t);
    void _mp_preinit(void);
    void __pgi_acc_preinit(void);
    void __pgi_uacc_set_link_multicore(void);
    void __pgi_uacc_set_link_cuda(void);
    void __pgi_ctrl_init();
}

BOOL WINAPI DllMain(
    HINSTANCE hinstDLL,  // handle to DLL module
    DWORD fdwReason,     // reason for calling function
    LPVOID lpReserved)   // reserved
{
    // Perform actions based on the reason for calling.
    switch (fdwReason)
    {
    case DLL_PROCESS_ATTACH:
    {
        long n;

        // printf("Calling setchk\n");
        // __setchk(&n+256+128*1024,0,0);
        // printf("Calling acc_preinit\n");
        __pgi_acc_preinit();
        // printf("Calling mp_preinit\n");
        _mp_preinit();
        // __pgi_uacc_set_link_multicore();
        // __pgi_ctrl_init();
        // printf("Calling set link cuda\n");
        __pgi_uacc_set_link_cuda();
        break;
    }

    case DLL_THREAD_ATTACH:
        break;

    case DLL_THREAD_DETACH:
        break;

    case DLL_PROCESS_DETACH:
        break;
    }

    return TRUE;  // Successful DLL_PROCESS_ATTACH.
}

Let me know if you have any issues with this, and we can try to work through them.

-Mat

pinnacleman98 · August 7, 2020, 3:54pm

Hi, Mat:

Thanks for your prompt reply and the sample DllMain function. So it looks like, even with the PGI compiler switches -Bdynamic and -Mmakedll, the compiler doesn’t automatically create this DllMain function when building a DLL.

My question is that my DLL is compiled from Fortran code by the PGI Fortran compiler, and I’m not sure how to effectively mix this C++ DllMain function into my Fortran code. I can think of two possible ways to do it. One is to first create a static library with all my CUDA Fortran and OpenACC code, then create a DLL in Visual Studio that links against this static library. This may require a thin C/C++ wrapper so that my Fortran functions can be called from the C/C++ side. The other is to create a static library in Visual Studio containing only this DllMain function, and link against that static library when creating the final DLL with the PGI Fortran compiler. Based on your experience, which approach do you think is more suitable?

Thanks in advance.

John

MatColgrove · August 7, 2020, 6:31pm

Hi John,

You shouldn’t need to write any wrapper code for use with DllMain. Instead, you should be able to simply compile it with a C++ compiler, like cl, and then include the resulting object when you link your DLL. It doesn’t need to call anything in your program, only the compiler runtime initialization routines, which are written in C.

Now, I am using a lot of “shoulds” here. That’s because I’ve not used cl in this context, so I haven’t tried this. The use case I worked on was for a customer that uses Intel C++ for the majority of the DLL, but PGI OpenACC for several key compute kernels, so I used icl, with xilink as the linker. In theory using cl should work fine as well, but having not used it in this exact context, there may be some issues that need to be worked through.

For reference, here’s the build script I used for my toy example. This is not what you should use, but more just an example of what could be done.

pgcc -c -Mdll -tp=px -m64 -acc -ta=tesla:nordc,cc70 -Minfo=accel test_acc.c utils_acc.c
icl /debug -c myclass.cpp test_dll.cpp
xilink myclass.obj utils_acc.obj test_acc.obj test_dll.obj /out:myclass.dll -nologo -dll -incremental:no "-libpath:C:/Program Files (x86)/Microsoft Visual Studio/2017/Community/VC/Tools/MSVC/14.16.27023/lib/x64" "-libpath:C:/Program Files (x86)/Windows Kits/10/Lib/10.0.17763.0/ucrt/x64" "-libpath:C:/Program Files (x86)/Windows Kits/10/Lib/10.0.17763.0/um/x64" -libpath:C:\PROGRA~1\PGI/win64/19.1/lib -defaultlib:libaccapi -defaultlib:libaccg -defaultlib:libaccn -defaultlib:libaccg2 -defaultlib:libcudadevice -defaultlib:ws2_32.lib -defaultlib:libpgmp -nodefaultlib:libvcruntime -nodefaultlib:libucrt -nodefaultlib:libcmt -defaultlib:msvcrt -defaultlib:pgc -defaultlib:libpgmath -defaultlib:pgmisc -defaultlib:libnspgc -defaultlib:legacy_stdio_definitions -defaultlib:oldnames /DYNAMICBASE:NO
icl /debug main.c /link /DYNAMICBASE:NO

Note “test_dll.cpp” is where the DllMain routine is located.

Also, the above build script was for the 19.1 compilers. It would need adjustment for use with 19.10.

-Mat

pinnacleman98 · August 7, 2020, 10:37pm

Hi, Mat:

Thanks for your help. I compiled your sample DllMain with cl and the invalid thread id problem is gone. But now there is another problem: my code quits at the first call to a cusolver function. So it seems that the cusolver library is not properly initialized. Do you know how to initialize both cublas and cusolver in DllMain? Or it could be that I linked against the wrong cusolver library when creating the DLL. If so, do you know which libraries are the right ones to link against for cublas and cusolver in a DLL?

Thanks,

John

pinnacleman98 · August 9, 2020, 7:53pm

Hi, Mat:

I figured the problem out. I had accidentally commented out the cublasCreate and cusolverDnCreate calls for debugging purposes. After uncommenting these two lines of code, I can call the cusolver functions now.

However, after several calls to the cusolver functions (cusolverDnZgetrf and cusolverDnZgetrs), my code gives me the following error:

0: DEALLOCATE: an illegal memory access was encountered
Failing in Thread:1
call to cuCtxSetCurrent returned error 4: Deinitialized.

Note that when I restart the PC and start over, the first run of my code completes, but it gives a wrong result and also prints the above error at the end of the run. After that, every run gives me this error and no result at all.

I really don’t know what’s going wrong here. Please help.

Thanks,

John

MatColgrove · August 10, 2020, 1:35pm

pinnacleman98 wrote: “I figured the problem out. I accidentally commented the cublasCreate and cusolverDnCreate for debugging purpose.”

Glad you were able to figure it out. I was going to say that unlike the compiler runtime that initializes on load, for libraries you typically need to call the library’s initialization routines, the “Create” calls in this case.

Typically when I see an error like this, the problem is not with the deallocate itself, but with the kernel that was launched just before it. Errors sometimes don’t show up until the next use of the device.

What kernel is called before the failing deallocate? Is it a CUDA Fortran kernel or a call to cublas/cusolver?

-Mat
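One common way to find which launch actually failed is to synchronize and check for errors right after each kernel or library call, so a deferred error is reported where it happened instead of at a later deallocate. The following is a minimal sketch under assumptions that are not from this thread: `sync_and_check` and the `USE_CUDA` macro are hypothetical names, and only `cudaDeviceSynchronize`, `cudaGetLastError`, and `cudaGetErrorString` are CUDA runtime calls. Without `USE_CUDA` the helper compiles as a host-only no-op.

```cpp
#include <cassert>
#include <cstdio>

#ifdef USE_CUDA  // hypothetical macro: define it when building against the CUDA runtime
#include <cuda_runtime_api.h>
#endif

// Waits for all queued device work, then reports any deferred error.
// Returns true when nothing failed. `where` is a label chosen by the caller.
bool sync_and_check(const char* where) {
    assert(where != nullptr);
#ifdef USE_CUDA
    cudaError_t err = cudaDeviceSynchronize();      // blocks until the device is idle
    if (err == cudaSuccess) err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", where, cudaGetErrorString(err));
        return false;
    }
#else
    // Host-only build: there is no device work to wait for.
#endif
    return true;
}
```

Calling `sync_and_check("after kernel A")` after each suspect launch, then removing the calls once the faulty kernel is found, keeps the extra synchronization out of production runs. CUDA Fortran code can make the equivalent `cudaDeviceSynchronize()` call through the `cudafor` module.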

pinnacleman98 · August 11, 2020, 5:06pm

Hi, Mat:

Thanks for your help. I found the problematic kernel. But now the code gives me the following error:

TestFailing in Thread:1
call to cuCtxSetCurrent returned error 4: Deinitialized

I searched the web and didn’t find any information on this error. Do you know what could cause it?

Thanks,

John

MatColgrove · August 11, 2020, 7:17pm

Hi John,

No idea why it’s occurring but I believe this error means that a kernel tried running on a CUDA context after the context was destroyed.

-Mat

pinnacleman98 · August 11, 2020, 8:08pm

Hi, Mat:

I think the problem could be related to thread synchronization: after I added some print statements, the code sometimes gives correct results and sometimes wrong ones, though the error is still there. In OpenMP there is an implicit thread barrier at the end of each parallel loop region, and I assumed it is the same in OpenACC. But I read a webpage saying that there is no implicit barrier at the end of OpenACC parallel loops. If there is no barrier, how can I synchronize the threads?

Thanks,

John

MatColgrove · August 11, 2020, 8:20pm

There’s an implicit barrier at the end of a compute region (parallel or kernels) unless the user has added the “async” clause. Gang loops within a parallel compute region may not add an implicit barrier (though vector and worker loops do), which may be what you’re thinking of. If you have a link to the page, I can check what it says and correct or confirm it.
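The effect of the “async” clause can be sketched in C++ as follows. The function name `scale_async` and the queue number 1 are illustrative choices, not something from this thread. When the file is built without OpenACC support, the pragmas are ignored and the loop simply runs serially on the host.

```cpp
#include <cassert>
#include <vector>

// Scales x by a into y on async queue 1, then blocks until that queue drains.
std::vector<double> scale_async(const std::vector<double>& x, double a) {
    std::vector<double> y(x.size());
    const double* xp = x.data();
    double* yp = y.data();
    const int n = static_cast<int>(x.size());
    assert(n >= 0);

    // "async(1)" removes the implicit barrier at the end of this compute region,
    // so the host continues immediately after queuing the work.
    #pragma acc parallel loop async(1) copyin(xp[0:n]) copyout(yp[0:n])
    for (int i = 0; i < n; ++i) {
        yp[i] = a * xp[i];
    }

    // Without this wait, the host could read yp before the device has finished.
    #pragma acc wait(1)

    return y;
}
```

Dropping the `async(1)` clause restores the implicit barrier and makes the `wait` directive unnecessary, which is usually the simpler choice unless host and device work really need to overlap.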

pinnacleman98 · August 11, 2020, 10:36pm

Hi, Mat:

Thanks for your information. I think the problem is in the cublas and cusolver calls. From this forum I learned that they are non-blocking, which is the cause of my problem. How do I synchronize or let the host code wait until the last cublas/cusolver call finishes?

Thanks,

John

pinnacleman98 · August 12, 2020, 3:53am

Hi, Mat:

I solved the non-blocking issue of cublas and cusolver calls by adding cudaDeviceSynchronize after these calls.

Also, I’m now pretty sure that the error:

TestFailing in Thread:1
call to cuCtxSetCurrent returned error 4: Deinitialized

occurs while the DLL is being unloaded at exit. I called my DLL function several times with the same arguments, and every time it gives the correct result (I printed the results to the screen). The error shows up at the very end of the output, after all the function calls.

I think I may be missing some DLL finalization calls for the GPU.

Anyway thanks for your consistent assistance on my GPU journey.

John
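The thread ends without a confirmed fix for this exit-time error. One thing that could be tried is an explicit cleanup routine that the host calls before unloading the DLL, so the OpenACC runtime releases its device state while the CUDA context still exists. The sketch below is an assumption, not a verified solution: `my_gpu_finalize` is a hypothetical export name, and the code assumes it is compiled by an OpenACC-aware compiler (which defines `_OPENACC`); otherwise it only tracks that it was called. `acc_shutdown` is the standard OpenACC runtime routine.

```cpp
#include <cassert>

#ifdef _OPENACC
#include <openacc.h>
#endif

// Hypothetical cleanup entry point that the DLL would export.
// The host calls it once, before FreeLibrary or process exit.
// Returns 1 if cleanup ran on this call, 0 if it had already been done.
extern "C" int my_gpu_finalize(void) {
    static bool done = false;  // makes repeated calls harmless
    if (done) return 0;
#ifdef _OPENACC
    // Release the OpenACC runtime's device state explicitly.
    acc_shutdown(acc_device_nvidia);
#endif
    done = true;
    return 1;
}
```

Any cublas and cusolver handles would be destroyed (cublasDestroy, cusolverDnDestroy) before this call. An explicit routine is used here rather than the DLL_PROCESS_DETACH case of DllMain because the CUDA driver may already be shutting down by the time that notification arrives.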
