cudaOccupancyMaxPotentialBlockSize - invalidDeviceFunction Error in CUDA 10

I am encountering a problem when using the occupancy API in CUDA 10. The call returns an invalid device function error (cudaErrorInvalidDeviceFunction) with the code below:

struct CUDAPotentialBlockSize
{
    public: 
        template< class T>
        CUDAPotentialBlockSize( T func )
        {
            checkCudaError( cudaOccupancyMaxPotentialBlockSize( &m_minGridSize, &m_blockSize, func, 0, 0 );
        }
    
        int m_blockSize;
        int m_minGridSize;
}

This is how I would be using the above structure in my source code.

__global__ void kernel()
{
}

void callKernel( const cudaStream_t stream )
{
    static CUDAPotentialBlockSize kernelParams( kernel );

    // Would normally have parameters here...
    kernel <<< 16, kernelParams.m_blockSize, 0, stream >>>( );
}

Right now I am attempting to run code similar to the above on a GTX 770, which is of compute capability 3.0. No matter what function I pass into the constructor for that struct, the CUDA API function ALWAYS fails on CUDA 10.0. I have also made sure that the compiler is set up to compile device code for that architecture.
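For reference, this is the sort of compile line I mean (the file name here is just a placeholder):

$ nvcc -gencode arch=compute_30,code=sm_30 -rdc=true -c kernels.cu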

This code works properly on the CUDA 8 and 9 toolsets. Is there any reason why this would not work on CUDA 10?

I should also note that the above code is built with separable compilation via CMake. This code did not have any issues before I upgraded from CUDA 8 to CUDA 10, however. I am also on CentOS 7 (similar to Red Hat).

Your posted code (e.g. struct CUDAPotentialBlockSize) contains syntax errors suggesting it could not possibly be compiled. When I fix those and create a test case out of it, I get no errors of any kind, running on CUDA 10.0.130 on Linux:

$ cat t366.cu
#include <helper_cuda.h>
struct CUDAPotentialBlockSize
{
    public:
        template< class T>
        CUDAPotentialBlockSize( T func )
        {
            checkCudaErrors( cudaOccupancyMaxPotentialBlockSize( &m_minGridSize, &m_blockSize, func, 0, 0 ));
        }

        int m_blockSize;
        int m_minGridSize;
};


__global__ void kernel()
{
}

void callKernel( )
{
    static CUDAPotentialBlockSize kernelParams( kernel );

    kernel <<< 16, kernelParams.m_blockSize>>>( );

}

int main(){

  callKernel();
  cudaDeviceSynchronize();
}
$ nvcc -I/usr/local/cuda/samples/common/inc t366.cu -o t366
$ cuda-memcheck ./t366
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$

Sorry about that; my source code is on a non-internet-connected machine, so I had to hand-type some of the code I was using. Some of the functions and macros I am using are in my project’s headers (checkCudaError, for one; I should have provided the source for that).

Anyway, I was able to get it to work when calling it directly from main. However, when I use it within my application’s API, I am encountering the problem. The struct created above lives in a header that I include in each source file that uses it. The problem only appears when I use the struct in my real source code, i.e. not in a simple test.

Most of that source is likely too complex to post here, though… One of the failing cases has the kernel in a namespace, e.g. dewm::cuda::insertTrianglesKernel. That kernel inserts a vector of triangle objects and keys into a hash table I have created which handles templated types. The call to the occupancy function and the kernel launch also appear in a member function of a class (the kernel itself exists outside the class, in the same namespace). The structure looks roughly like the sketch below.
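Here is that sketch, reusing the CUDAPotentialBlockSize struct from above (the real types are much more involved; Triangle, HashTable, and TriangleInserter are stand-ins I made up for illustration):

namespace dewm {
namespace cuda {

// Stand-in for the real templated hash table.
template< class Key, class Value >
struct HashTable
{
    Key*   m_keys;
    Value* m_values;
};

struct Triangle { float3 v0, v1, v2; };

// The kernel lives in the namespace, outside any class.
template< class Key, class Value >
__global__ void insertTrianglesKernel( HashTable< Key, Value > table,
                                       const Key* keys,
                                       const Value* values,
                                       int count )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if( i < count )
    {
        // The real code hashes the key; this just stores by index.
        table.m_keys[ i ]   = keys[ i ];
        table.m_values[ i ] = values[ i ];
    }
}

// The occupancy query and the launch both happen in a member function.
class TriangleInserter
{
    public:
        void insert( HashTable< int, Triangle > table,
                     const int* keys, const Triangle* triangles,
                     int count, cudaStream_t stream )
        {
            static CUDAPotentialBlockSize kernelParams(
                insertTrianglesKernel< int, Triangle > );

            const int gridSize = ( count + kernelParams.m_blockSize - 1 )
                                 / kernelParams.m_blockSize;
            insertTrianglesKernel< int, Triangle >
                <<< gridSize, kernelParams.m_blockSize, 0, stream >>>(
                    table, keys, triangles, count );
        }
};

} // namespace cuda
} // namespace dewm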

I’ve gone through the symbol dumps for these kernels and have seen that the PTX does exist for the architectures in question.
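The kind of check I ran looks roughly like this (the library name is mine; I am quoting the cuobjdump invocation from memory, so treat the exact flags as approximate):

$ cuobjdump -lptx libDEWM_CUDA.so
$ cuobjdump -ptx libDEWM_CUDA.so | grep insertTriangles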

How can I check this visibility to ensure that this is the case? My code has multiple link stages, so it is possible the kernel symbols are getting lost before they reach the test case that calls this kernel on my side.

Right now I have a CUDA library which gets compiled with anything that needs to be compiled with NVCC, and a separate library which contains all of our C++-only code that doesn’t use CUDA (still most of our application). This was done so that the CUDA-dependent code isn’t compiled if someone doesn’t have CUDA on their machine.

CMake currently links all the separate compilation units of the CUDA code into a single intermediate compilation object. That intermediate object is then used to link the project’s CUDA library. The CUDA library gets linked into the main library, and that library is then linked with any tests or applications that use it.

I felt like the comment I made was pretty useless, so I deleted it.

To wit: the code you have now is already telling you that the kernel symbol is not visible. So you have already built a method to check visibility. What you have to do now is solve it.

My suggestion is to provide a self-contained, stand-alone test case that someone else could play with, if you want help.

Resolution of namespaces is occasionally something that trips up the CUDA front end, so this may well be a CUDA issue, and it may also explain the difference between CUDA 9 and CUDA 10.

Any hope of a solution, in my view, still rests on presenting a test case.

I did figure out that the issue is related to the way my code is being linked. The namespace isn’t a problem when I put everything into the same compilation unit. It DOES become a problem when I link from my DEWM_CUDA.so file into DEWM.so, and finally into the TestOccupancyAPI executable.

More than likely this is a CMake problem, so I’ll look into that. Apparently CMake should be doing a device link, according to this: https://devblogs.nvidia.com/building-cuda-applications-cmake/. I am getting an intermediate device link into the library (DEWM.so), but no device link into my executable (TestOccupancyAPI). Could this be the problem?

I’d have to write up a simpler CMake script to post a test case here, though. Either way, it’ll have to wait until tomorrow.

Yes, if you require device symbol visibility across compilation modules, device linking is necessary.

So the kernel is defined in module A, and you are calling the callKernel function in module B?

Because that wouldn’t work without device linking, even apart from this issue. You cannot invoke a kernel in module B, if it is defined in module A, without device linking.
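In other words, the minimal layout looks like this (a sketch; file names are made up), and it is the device link step that stitches the two modules together:

// a.cu -- module A: defines the kernel
#include <cstdio>
__global__ void kernel() { printf("hello from module A\n"); }

// b.cu -- module B: launches the kernel defined in module A
__global__ void kernel();   // declaration only

int main()
{
    kernel<<< 1, 1 >>>( );
    cudaDeviceSynchronize();
}

$ nvcc -rdc=true -c a.cu -o a.o
$ nvcc -rdc=true -c b.cu -o b.o
$ nvcc a.o b.o -o test        # nvcc performs the device link at this step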

Yes, that is what I have been trying to do. So, in CMake, if I add the device-link property (CUDA_RESOLVE_DEVICE_SYMBOLS) to each of my executables, the process should work correctly? I’ll try this tomorrow and see if that fixes my problem. Apparently, setting the CUDA_SEPARABLE_COMPILATION property in CMake should already be setting this up, according to the article I posted earlier (though it may be possible it isn’t propagating the device linking to my second layer of tests and executables).
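Roughly, the plan is something like this (a sketch only; the source file names are placeholders, and the property placement is my best reading of that blog post):

# DEWM_CUDA: everything compiled by NVCC, built as relocatable device code
add_library(DEWM_CUDA SHARED insertTriangles.cu)
set_target_properties(DEWM_CUDA PROPERTIES CUDA_SEPARABLE_COMPILATION ON)

# DEWM: the C++-only main library, which wraps the CUDA library
add_library(DEWM SHARED dewm.cpp)
target_link_libraries(DEWM PRIVATE DEWM_CUDA)

# TestOccupancyAPI: force the final device link at the executable
add_executable(TestOccupancyAPI TestOccupancyAPI.cpp)
set_target_properties(TestOccupancyAPI PROPERTIES CUDA_RESOLVE_DEVICE_SYMBOLS ON)
target_link_libraries(TestOccupancyAPI PRIVATE DEWM)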

I am also curious why this worked under CUDA 8, however. I didn’t enable device linking on my executables, and they would execute the occupancy API functions without error.