Consistency of function pointers

Dear All,

I recently tried to use function pointers to dynamically select several processing stages in my application, running on an sm_30 device.

It would be difficult to post the code here, as there are many different files and functions involved, but basically I started from the function-pointer sample included in the CUDA Toolkit 5.0.

I allocate a device buffer of function pointers into which I copy a device function pointer, defined just as in the sample, using cudaMemcpyFromSymbolAsync with the cudaMemcpyDeviceToDevice copy kind.

My device pointer is defined like this:

typedef void (*func)(struct GpuArgument*);
__device__ func gpuFuncPtr = gpuFunc1;


func* pFuncDevBuffer;
cudaMemcpyFromSymbolAsync(pFuncDevBuffer + i, gpuFuncPtr, sizeof(func), 0,
                          cudaMemcpyDeviceToDevice, stream);

In fact, everything works fine as long as the global kernel that takes the device function buffer as an argument is defined in the same file as the function and its pointer.
The kernel can then print out the address of the function (0x4) and execute its code without problems.
I don’t use separate compilation.
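For reference, a minimal single-file sketch of the pattern I am using looks roughly like this (GpuArgument, gpuFunc1, and the launch configuration are simplified placeholders, not my real code):

```cuda
#include <cstdio>

struct GpuArgument { int value; };

typedef void (*func)(GpuArgument*);

__device__ void gpuFunc1(GpuArgument* arg)
{
    printf("gpuFunc1: %d\n", arg->value);
}

// Device-side symbol holding the address of gpuFunc1. The address must be
// taken on the device: host code cannot take a __device__ function's address.
__device__ func gpuFuncPtr = gpuFunc1;

__global__ void runStages(func* table, GpuArgument* arg)
{
    table[0](arg);  // indirect call through the copied pointer
}

int main()
{
    func* pFuncDevBuffer;
    cudaMalloc(&pFuncDevBuffer, sizeof(func));
    // Device-to-device copy of the pointer value from the symbol to the buffer.
    cudaMemcpyFromSymbol(pFuncDevBuffer, gpuFuncPtr, sizeof(func), 0,
                         cudaMemcpyDeviceToDevice);

    GpuArgument hostArg = { 42 };
    GpuArgument* devArg;
    cudaMalloc(&devArg, sizeof(GpuArgument));
    cudaMemcpy(devArg, &hostArg, sizeof(GpuArgument), cudaMemcpyHostToDevice);

    runStages<<<1, 1>>>(pFuncDevBuffer, devArg);
    cudaDeviceSynchronize();
    return 0;
}
```

As described above, this works when everything lives in one translation unit; the problem only appears once a kernel in a second file uses the buffer.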

When, in the same run of the program, a second kernel defined in another file takes the very same function pointer buffer as an argument, it prints out the very same address for the function pointer (0x4), but if it tries to execute it, it fails with an unspecified launch failure, and every subsequent CUDA API call freezes; I then need to reboot my computer (resetting the device through nvidia-smi isn’t supported on my GPU).

I would like to know if there is a known issue with using function pointers this way, i.e. using a function pointer buffer from a kernel defined in another file that shares the same function pointer definition.

Also, if there is a workaround for resetting a device after a segfault without rebooting the whole system, it would help me save time while debugging my application.

Thank you for your help
function_pointer.txt (2.73 KB)

I managed to reproduce my problem in a dummy example in Nsight, and it seems that all my problems are related to separate compilation.

Without changing anything else, my simple example fails when compiled with the whole-program option but succeeds with separate compilation and relocatable device code.

But I don’t understand much about that feature. I thought that memory wasn’t process-bound on NVIDIA devices and that there was no memory space checking.
So why am I unable to access/load function code unless I use relocatable code, which, in my mind, has only relative addresses?

Moreover, as I am using CMake’s CUDA support to compile my project, I added the separate compilation and relocatable device code options, but now I have tons of undefined references:

undefined references to ‘[…]_cudaRegisterLinkedBinary[…]’
__sti___cudaRegisterAll […]

Does anyone have an idea about the role of separate compilation and relocatable device code in my problem?

You will find my “dummy example” attached to this post; it can be imported into Nsight easily. (117 KB)

I have exactly the same problem and no one has answered me either. I think the problem is that no one in scientific environments wants to write good code, and this is why no one is interested.
Furthermore, I believe this is a problem with NVCC.

I think most people (myself included) have limited experience with function pointers in CUDA because for a long time they were not supported in CUDA, and they incur some potentially large performance overhead when used. Add in separate compilation (an even newer CUDA feature) and you are going to run into even fewer people with experience. On top of that, only a small fraction of CUDA developers read this forum.

The desire for good code in scientific environments has very little to do with the lack of responses. :)

As this thread got bumped recently, I am now able not to answer all of my questions above, but at least to mention a workaround for the CMake issue.

It appears that the feature works in recent releases of the CMake module FindCUDA.cmake (>= 2.8.10, but more recent is better), which made CUDA separate compilation available and properly working in CMake projects when using something like this:


list(APPEND CUDA_NVCC_FLAGS "--relocatable-device-code=true")

OPTIONS -gencode=arch=compute_20,code=sm_20 -gencode=arch=compute_30,code=sm_30

Unfortunately I cannot remember exactly what the problem with those undefined references was; if you have the same problem, check that there are no global variables or things like that, and rebuild the whole project if needed.
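For completeness, the relevant part of the CMakeLists.txt ended up looking roughly like this with FindCUDA (target and file names here are placeholders, not my actual project):

```cmake
find_package(CUDA 5.0 REQUIRED)

# Build all .cu files as relocatable device code so that cross-file
# device calls can be resolved at device-link time.
set(CUDA_SEPARABLE_COMPILATION ON)
list(APPEND CUDA_NVCC_FLAGS "--relocatable-device-code=true")

cuda_add_executable(myapp main.cu kernels.cu
    OPTIONS -gencode=arch=compute_20,code=sm_20
            -gencode=arch=compute_30,code=sm_30)
```

This is just a sketch of what worked for me; depending on your FindCUDA version, CUDA_SEPARABLE_COMPILATION alone may be enough without the explicit flag.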

Each compilation unit in CUDA results in a CUmodule (a Driver API type). The CUDA Runtime hides CUmodules from the developer. CUfunctions (device or global) in separate CUmodules cannot call each other through function pointers. Separate compilation can be used to statically link two separate compilation units.
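As a sketch, statically linking two compilation units with separate compilation looks like this on the command line (file names are placeholders; -dc compiles to relocatable device code):

```shell
nvcc -arch=sm_30 -dc main.cu   -o main.o     # relocatable device code
nvcc -arch=sm_30 -dc stages.cu -o stages.o
nvcc -arch=sm_30 main.o stages.o -o app      # nvcc performs the device link here
```

After the device-link step, the device code from both files ends up in a single module, which is what makes cross-file calls resolvable.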

There are multiple reasons why cross-CUmodule functions cannot be reliably called using function pointers:

  1. Each CUmodule has its own separate constant area. When a kernel is launched, it is configured to reference its own CUmodule’s constants. If you call a device function in a different CUmodule through a function pointer, that device function will read the wrong constants.

  2. Each global function is launched with a specific launch configuration. If the device function references a register higher than the number allocated by the launch configuration, the SM will throw an exception.

  3. Each global function has a shared memory map. If you call a device function in a different module, the compiler will not have allocated its shared memory correctly. Since shared memory is statically mapped, this will cause corruption in shared memory.

There are several other reasons, but I think this list gives you a good sense of why you shouldn’t try this.
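In other words, the supported route is to compile with relocatable device code so that both files end up in the same module after device linking. A sketch of the two-file layout (names are illustrative, following the snippets earlier in this thread):

```cuda
// stages.cu -- defines the device function and its pointer symbol
#include "stages.h"   // declares struct GpuArgument and typedef func

__device__ void gpuFunc1(GpuArgument* arg) { /* ... */ }
__device__ func gpuFuncPtr = gpuFunc1;

// kernels.cu -- a kernel in another file can then use the pointer,
// provided both files are compiled with -dc and device-linked together
#include "stages.h"

extern __device__ func gpuFuncPtr;

__global__ void otherKernel(GpuArgument* arg)
{
    gpuFuncPtr(arg);  // resolved at device-link time, same module
}
```

Under whole-program compilation, by contrast, each .cu file becomes its own CUmodule and the cross-module call fails for the reasons listed above.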

I am not aware of any documentation that lists this restriction.