I created a DLL written in CUDA Fortran and OpenACC, and exported the function I want to call from C++/Python using C binding. I tried to test the DLL from Visual Studio 2013 with my existing test program (previously used to test the CPU version of the DLL built with the Intel Fortran compiler), but the test code gave me the following error:
Error: internal error: invalid thread id
I confirmed that this error occurs at the first OpenACC kernel, which initializes a device array. After some research, I found the following thread, which could be related to my problem.
Specifically, what steps are needed to link a VC++ .cpp file against a DLL created with the PGI Fortran compiler using CUDA Fortran and OpenACC?
My development platforms:
PGI Fortran compiler, community version 19.10
CUDA 10.1
Windows 10
Visual Studio 2013
First, you’ll need to compile the CUDA Fortran and OpenACC code without RDC (Relocatable Device Code) support (-ta=tesla:nordc -Mcuda=nordc). RDC requires a device link step, which can only be performed on static objects and libraries. DLLs require runtime linking, which is currently not available. Unfortunately, this means that features that require linking, such as calling device routines or using device module variables found in external files and modules, can’t be used.
Second, you’ll need to add a “DllMain” to your DLL. This routine gets called implicitly when the DLL is loaded, so we can use it to initialize the compiler runtime. The lack of this initialization is why you’re getting the “invalid thread id” error.
Below is an example of the DllMain that I’ve used successfully with 19.10, though it does require some configuration depending on the code. I’ve uncommented the three preinit routines I think you need. Note that since this is C++ code, you’ll want to compile it using the Microsoft C++ compiler. Also, for future readers: the exact initialization routines can change between releases, so the routine names may need adjustment.
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include "test_acc.h"
#include "utils_acc.h"

extern "C"
{
    void __setchk(long*,size_t,size_t);
    void _mp_preinit(void);
    void __pgi_acc_preinit(void);
    void __pgi_uacc_set_link_multicore(void);
    void __pgi_uacc_set_link_cuda(void);
    void __pgi_ctrl_init();
}

BOOL WINAPI DllMain(
    HINSTANCE hinstDLL,  // handle to DLL module
    DWORD fdwReason,     // reason for calling function
    LPVOID lpReserved )  // reserved
{
    // Perform actions based on the reason for calling.
    switch( fdwReason )
    {
    case DLL_PROCESS_ATTACH:
        long n;
        // printf("Calling setchk\n");
        // __setchk(&n+256+128*1024,0,0);
        // printf("Calling acc_preinit\n");
        __pgi_acc_preinit();
        // printf("Calling mp_preinit\n");
        _mp_preinit();
        // __pgi_uacc_set_link_multicore();
        // __pgi_ctrl_init();
        // printf("Calling set link cuda\n");
        __pgi_uacc_set_link_cuda();
        break;
    case DLL_THREAD_ATTACH:
        break;
    case DLL_THREAD_DETACH:
        break;
    case DLL_PROCESS_DETACH:
        break;
    }
    return TRUE;  // Successful DLL_PROCESS_ATTACH.
}
Let me know if you have any issues with this, and we can try to work through them.
Thanks for your prompt reply and the sample DllMain function. So it looks like, even with the PGI compiler switches -Bdynamic and -Mmakedll, the compiler doesn’t automatically create this DllMain function when building a DLL.
My question is this: my DLL is compiled from Fortran code by the PGI Fortran compiler, and I’m not sure how to mix this C++ DllMain function into my Fortran code effectively. I can think of two possible ways to do it. One is to first create a static library with all my CUDA Fortran and OpenACC code, then create a DLL in Visual Studio that links against this static library. This may require me to write a thin C/C++ wrapper so that my Fortran functions can be called from the C/C++ side. The other is to create a static library in Visual Studio containing only this DllMain function, and link against that static library when creating the final DLL with the PGI Fortran compiler. Based on your experience, which approach do you think is more suitable?
You shouldn’t need to write any wrapper code for use with DllMain. Instead, you should be able to simply compile it with a C++ compiler, like cl, and then include the resulting object with your other objects when building the DLL. It doesn’t need to call anything in your program, only the compiler runtime initialization routines, which are written in C.
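To illustrate why no wrapper layer is needed on the calling side either: a Fortran routine exported with bind(C) can be declared directly in C++ with extern "C". The following is only a minimal sketch under assumptions. The routine name `scale_vector`, its signature, and the helper `scaled_copy` are hypothetical, and a C++ stand-in body replaces the Fortran routine so the snippet builds on its own.

```cpp
#include <vector>

// Hypothetical C++ declaration of a Fortran routine exported as
//   subroutine scale_vector(x, n, factor) bind(C, name="scale_vector")
// with n and factor passed by value. The name and signature are made up.
extern "C" void scale_vector(double* x, int n, double factor);

// Stand-in body so this sketch links on its own. In a real build this
// definition is absent and the symbol comes from the DLL's import library.
extern "C" void scale_vector(double* x, int n, double factor) {
    for (int i = 0; i < n; ++i) {
        x[i] *= factor;
    }
}

// The caller uses the routine like any C function; no wrapper layer is needed.
std::vector<double> scaled_copy(std::vector<double> values, double factor) {
    scale_vector(values.data(), static_cast<int>(values.size()), factor);
    return values;
}
```

In an actual project only the declaration and the caller would live on the C++ side; the DllMain object is linked into the DLL separately and is never called by user code.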
Now, I am using a lot of “shoulds” here. That’s because I’ve not used cl in this context, so I haven’t tried this myself. The use case I worked on was for a customer that uses Intel C++ for the majority of the DLL but PGI OpenACC for several key compute kernels, so we used icl, with xilink as the linker. In theory, using cl should work fine as well, but since I haven’t used it in this exact context, there may be some issues that need to be worked through.
For reference, here’s the build script I used for my toy example. This is not necessarily what you should use; it is just an example of what could be done.
Thanks for your help. I compiled your sample DllMain with cl and the “invalid thread id” problem is gone. But now there is another problem: my code quits at the point where it first calls cusolver functions. So it seems that the cusolver library is not properly initialized. Do you know how to initialize both cublas and cusolver in DllMain? Or it could be that I linked against the wrong cusolver library when creating the DLL. If so, do you know which libraries are the right ones to link against for cublas and cusolver in a DLL?
I figured the problem out. I had accidentally commented out the cublasCreate and cusolverDnCreate calls while debugging. After uncommenting these two lines of code, I can call the cusolver functions now.
However, after several calls to the cusolver functions (cusolverDnZgetrf and cusolverDnZgetrs), my code gives me the following error:
0: DEALLOCATE: an illegal memory access was encountered
Failing in Thread:1
call to cuCtxSetCurrent returned error 4: Deinitialized.
Note that when I restart the PC and start over, the first run of my code reaches the end, but it gives a wrong result and also reports the above error at the end of the run. After that, every run gives me this error and no result at all.
I really don’t know what’s going wrong here. Please help.
Glad you were able to figure it out. I was going to say that unlike the compiler runtime that initializes on load, for libraries you typically need to call the library’s initialization routines, the “Create” calls in this case.
Typically, when I see an error like this, the problem is not with the deallocate itself but with the kernel that was launched just before it. Errors sometimes don’t show up until the next use of the device.
What kernel is called before the failing deallocate? Is it a CUDA Fortran kernel or a call to cublas/cusolver?
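The reason an error can surface at a later, unrelated call can be sketched with a toy model. This is plain C++ and not CUDA or PGI code: `ToyDevice`, `Status`, and all method names are invented for illustration. A launch only queues work and returns success; the failure is reported by whichever later operation synchronizes with the device, here a deallocate.

```cpp
#include <functional>
#include <queue>
#include <utility>

// Toy model only: none of these names are CUDA or PGI APIs.
enum class Status { Ok, IllegalAccess };

class ToyDevice {
public:
    // Queue the work and return immediately, like an asynchronous launch.
    // The launch reports success even if the kernel will fail later.
    Status launch(std::function<Status()> kernel) {
        pending_.push(std::move(kernel));
        return Status::Ok;
    }

    // Run all queued work and return the first failure seen so far.
    Status synchronize() {
        while (!pending_.empty()) {
            Status s = pending_.front()();
            pending_.pop();
            if (first_error_ == Status::Ok) {
                first_error_ = s;
            }
        }
        return first_error_;
    }

    // A later device operation synchronizes first, so it is the call
    // that ends up reporting the earlier kernel's failure.
    Status deallocate() { return synchronize(); }

private:
    std::queue<std::function<Status()>> pending_;
    Status first_error_ = Status::Ok;
};
```

In this model the faulty kernel's launch returns `Status::Ok` and the following `deallocate()` returns `Status::IllegalAccess`, which mirrors why the message names DEALLOCATE even though the fault lies in earlier work.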
I think the problem could be related to thread synchronization: after I added some print statements, the code sometimes gives correct results and sometimes wrong results, though the error is still there. In OpenMP there is an implicit thread barrier at the end of each parallel loop region, and I assumed it was the same in OpenACC. But I read a web page saying that there is no implicit thread barrier at the end of OpenACC parallel loops. If there is no barrier, how can I synchronize the threads?
There’s an implicit barrier at the end of a compute region (parallel or kernels) unless the user has added the “async” clause. Gang loops within a parallel compute region may not add an implicit barrier (though vector and worker loops do), which may be what you’re thinking of. If you have a link to the page, I can take a look at what it says and correct or confirm it.
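The effect of the “async” clause can be sketched in OpenACC C++. This is a minimal sketch assuming an OpenACC-capable compiler; without one the pragmas are ignored and the loop simply runs serially on the host. The function name `scale_async` and the queue number 1 are arbitrary choices for illustration.

```cpp
#include <vector>

std::vector<double> scale_async(const std::vector<double>& in, double factor) {
    std::vector<double> out(in.size());
    const double* src = in.data();
    double* dst = out.data();
    const int n = static_cast<int>(in.size());

    // "async(1)" removes the implicit barrier at the end of the region:
    // the host continues while queue 1 runs the loop and the data copies.
    #pragma acc parallel loop copyin(src[0:n]) copyout(dst[0:n]) async(1)
    for (int i = 0; i < n; ++i) {
        dst[i] = factor * src[i];
    }

    // Block the host until everything queued on async queue 1 has finished,
    // including the copy of dst back to host memory.
    #pragma acc wait(1)

    return out;
}
```

If the async clause is dropped, the wait directive becomes unnecessary, because the compute region then blocks the host until it completes.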
Thanks for the information. I think the problem is in the cublas and cusolver calls. From this forum I learned that they are non-blocking, which is the cause of my problem. How do I synchronize, or make the host code wait until the last cublas/cusolver call finishes?
I solved the non-blocking issue with the cublas and cusolver calls by adding cudaDeviceSynchronize after these calls.
Also, I’m now fairly sure that the error
TestFailing in Thread:1
call to cuCtxSetCurrent returned error 4: Deinitialized
occurs while the DLL is being unloaded at exit. I called the function in my DLL several times with the same arguments, and every time it gives the correct result (I printed the results to the screen). The error shows up at the very end of the output, after all the function calls.
I think I may be missing some finalization calls for the GPU in the DLL.
Anyway, thanks for your consistent help on my GPU journey.