Invalid device function

Hi, I’m new here, so excuse me if I make any mistakes, but I really need your help; my job depends on it.

Currently, I work on a big project with a lot of calculations that take a few minutes. The idea is to use CUDA to reduce that calculation time. It’s a C# program.

I created a CUDA DLL in C that I import in C#. The call to the DLL function and the linking work; I checked it while debugging. In this DLL, in this file (I stepped into it), I call another function with CUDA using this code:

Distribution_Kernel<<<1, block_size, z_max * sizeof(double)>>>(...);


	cudaStatus = cudaGetLastError();
	if (cudaStatus != cudaSuccess)
		printf("Distribution_Kernel launch failed : %s\n", cudaGetErrorString(cudaStatus));
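The check above catches only the launch error. A fuller sketch (Dummy_Kernel is a placeholder for the real kernel) also synchronizes afterwards, since errors that occur while the kernel runs are only reported asynchronously:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial placeholder kernel (stands in for Distribution_Kernel).
__global__ void Dummy_Kernel(void) {}

int main(void)
{
    Dummy_Kernel<<<1, 1>>>();

    // Error from the launch itself ("invalid device function" shows up here).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    // Error from kernel execution (only visible after synchronizing).
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("kernel execution failed: %s\n", cudaGetErrorString(err));

    return 0;
}
```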

To be clear, the variables passed as arguments were properly created beforehand. So yes, I call a kernel from a kernel; I know it has been possible since compute capability 3.5. I’m so sorry, I would like to give you more, but the code is highly confidential at my company.

Unfortunately, when I execute that, I get cudaStatus 8: cudaErrorInvalidDeviceFunction, which means “The requested device function does not exist or is not compiled for the proper device architecture” according to the documentation. But it does exist. I even tried putting my function in the same file; that doesn’t work either. While building, I tried several architectures: compute_50,sm_50 or compute_35,sm_35 (I can’t go lower, otherwise I can’t call a kernel from another kernel). I have a GeForce GTX 760 with the latest driver installed. I set “Generate Relocatable Device Code” to “Yes (-rdc=true)”.

I’m stuck; I don’t know why I get this “Invalid device function” error. Please, can you help me… Please.

“I call another function with CUDA thanks to that code”

make sure the other function is properly forward declared
it is safe practice to include your device kernels/ functions in a header
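as a sketch (signature truncated; names taken from this thread), such a header would carry the exact kernel signature, included both by the .cu file that defines the kernel and by every .cu file that launches it, so the declarations cannot drift apart:

```cuda
// Kernel.h -- sketch: one forward declaration per __global__ kernel.
// The parameter list is elided here; it must match the definition exactly.
#ifndef KERNEL_H
#define KERNEL_H

__global__ void Distribution_Kernel(double* dev_Alpha,
                                    double* dev_Beta_j
                                    /* ..., remaining parameters elided */);

#endif
```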

“I call a Kernel in a kernel”

can you note the declarations of the kernels/ device functions that you use? make sure not to call a device kernel as a device function, from the device
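to illustrate the distinction (sketch; kernel and helper names are made up): a __global__ kernel is launched with <<<...>>> — from the host, or from the device only with dynamic parallelism on cc 3.5+ — while a __device__ function is called like an ordinary function from device code, never with <<<...>>>:

```cuda
// __device__ helper: called directly from device code on any architecture.
__device__ double helper(double x) { return 2.0 * x; }

__global__ void Child_Kernel(double* data) { data[0] = helper(data[0]); }

__global__ void Parent_Kernel(double* data)
{
    // Device-side launch: requires compute capability 3.5+ and -rdc=true.
    Child_Kernel<<<1, 1>>>(data);

    double y = helper(1.0);   // plain call, fine on any architecture
    (void)y;
}
```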

and i ‘appreciate’ the magnitude of your kernel parameters…

First of all, thanks little_jimmy for your answer.

I include a Kernel.h file where you can find the declaration of my function:

__global__ void Distribution_Kernel(
	double* dev_Alpha,
	double* dev_Beta_j,
	double* dev_Cf,
	double* dev_Dcc,
	double* dev_Dccm1,
	double* dev_Dccm2,
	double*	dev_Dcr,
	double*	dev_Dm,
	double* dev_Dpcc,
	double* dev_Dpcc_temp,
	double* dev_Dpe,
	double* dev_Dpi,
	double* dev_Dpmax,
	double* dev_Dpp,
	double* dev_Dw,
	double* dev_Fac_Nk,
	double* dev_Frc_Nk,
	double* dev_Fon,
	double* dev_Fon_max,
	double* dev_J_max,
	double* dev_L_Am,
	double* dev_L_am_max,
	double* dev_L_da,
	double* dev_Leff,
	double* dev_Mt_Nk,
	int* dev_Nb_R,
	double* dev_Pmax,
	double* dev_Pon,
	double* dev_Pon_extr,
	double* dev_Pon_max,
	int* dev_Pon_maxIndex,
	double* dev_Rg,
	double* dev_Rp,
	double* dev_SROg,
	double* dev_SROp,
	double* dev_TValues,
	int* dev_Type_Cr,
	double* dev_X0r,
	double* dev_Xbpe,
	double* dev_Xm1,
	double* dev_Xm2,
	double* dev_Ybe,
	double* dev_Yccm1,
	double* dev_Yccm2,
	double* dev_Ycr,
	__int64* dev_Z,
	double* dev_Zccm1,
	double* dev_Zccm2,
	double* dev_Zcr,
	double Beta__Nk,
	double da__nk,
	double dr__nk,
	double Eps_eq,
	double Gamma,
	double GammaM,
	int ringType,

	int nk,

	int dim2,
	int dim3,
	int dim4,
	int dim5,
	int dim6);

And I call this function with the code below:

Distribution_Kernel<<<1, block_size, z_max * sizeof(double)>>>(...)

This call is made in one file, and the Distribution_Kernel function is in another file; every function is declared in Kernel.h.

And yeah, I have a lot of kernel parameters, but C doesn’t take classes, so I have to pass all the properties individually.

i am struggling to put the puzzle pieces together
how many source files does your dll contain - one or multiple? it seems multiple
be sure to enable separate compilation in the case of the latter

“Kernel parameters but C don’t take class so I should send all properties”

but c takes structures and arrays, and for initialization these are workable
i wonder how many registers your kernel parameters would consume; it would probably not matter when you are merely taking the device for a walk/ sprint, but it should pop out its head the moment you start to up the pace
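for illustration (a sketch; scalar field names are copied from the declaration earlier in the thread, the body is a placeholder), the scalar parameters could be grouped into one struct passed by value:

```cuda
// Sketch: group the scalar kernel parameters into a struct passed by value,
// instead of ~50 individual arguments.
struct KernelScalars {
    double Beta__Nk, da__nk, dr__nk, Eps_eq, Gamma, GammaM;
    int    ringType, nk, dim2, dim3, dim4, dim5, dim6;
};

__global__ void Distribution_Kernel_v2(double* dev_Alpha,
                                       double* dev_Beta_j,
                                       /* ... other array arguments ... */
                                       KernelScalars s)
{
    int i = threadIdx.x;
    if (i < s.nk)
        dev_Alpha[i] = s.Gamma * dev_Beta_j[i];  // placeholder computation
}
```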

Well, to be honest, it’s not my code. The one who wrote it left our company, and I have to replace him even though I’m a beginner in CUDA. I generate my DLL from several source files: dllmain.c (my entry point), the file it calls (which makes the call with the code above), the file that contains the parallel code, and one that is a toolkit for our code. When I compile, it generates five files: DLL Cuda.dll, DLL Cuda.exp, DLL Cuda.ilk, DLL Cuda.lib and DLL Cuda.pdb. Then I copy those files into the folder that contains the C# exe that uses the DLL.

I took over his PC, so I know CUDA is properly installed, and I heard it worked once. So I don’t really know where this error comes from… If you have any leads…

I know I ask a lot, and again I apologise, but how can I enable separate compilation? I currently have only one compilation that generates my DLL. Just for your information, I use Visual Studio as my IDE.
Is it this => project properties -> C/C++ -> General -> Multi-processor Compilation set to: Yes (/MP)? Because that’s already the case.

“Well, to be honest, it’s not my code.”

well, seems that it is now…

“Just for your information, I use Visual Studio as IDE”

yes, i gathered that much rather early on

i do not think i would be able to inform you where to enable separate compilation for BS VS
i almost want to say that it is a project property alongside the architecture that you wish to compile for, but i then realize it is as good as saying: the sky is occasionally blue
perhaps other vs users can help you out with this one

but, let us also backtrack for a moment
can you confirm that both the dll and the main program utilizing the dll compile fine?
you mentioned using the debugger; can you confirm that the error occurs not in the main program, but in the dll? i.e. the main program actually manages to call and enter the dll, only thereafter do things go sour
can you then also confirm that the error occurs the moment the dll runs the distribution kernel? or, is it at a later point when the distribution kernel calls another kernel?

Well, I googled it, and I found some results confirming that I do have separate compilation enabled.

In fact, there is no error in the compilation; both programs compile fine.
I launched a debug version of my main program, then started debugging the project that creates the DLL, attached to my main program, so I can debug in both applications. When, in my main program, I invoke my external distribution function, execution steps into my DLL and works fine. I reach this line in my DLL code:

Distribution_Kernel<<<1, block_size, z_max * sizeof(double)>>>(...)

Then, I do that :

	cudaStatus = cudaGetLastError();
	if (cudaStatus != cudaSuccess)
		printf("Distribution_Kernel launch failed : %s\n", cudaGetErrorString(cudaStatus));

It enters the if and prints: “Distribution_Kernel launch failed : invalid device function”.
When I put a breakpoint in my Distribution_Kernel, it is never hit. So Distribution_Kernel is not being called.

what device are you using, and what architecture are you compiling for?

perhaps make absolutely sure that you pass the correct kernel parameters, given the kernel declaration, and given the fact that it is ‘lengthy’

I have never seen an instance of “invalid device function” where it did not mean “driver could not find a matching binary or compilable PTX for the architecture of the installed GPU”.

First, you would want to establish that you can successfully build and run CUDA programs in general on the GPU in your system. Any of the simple examples that come with CUDA could serve that purpose.

If simple CUDA programs run fine, compile your code with -arch=sm_30, since the GTX 760 is a device with compute capability 3.0 (sm_30). Since binary code is not compatible between architecture generations, you cannot run code compiled for compute capability 3.5 or 5.0 on a device with compute capability 3.0.
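As a sanity check (a sketch, querying device 0), you can confirm the compute capability of the installed GPU at run time and compare it with the architectures your binary was built for:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: print the compute capability of device 0. For a GTX 760 this
// should report 3.0; a binary built only for sm_35/sm_50 cannot run on it.
int main(void)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}
```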

You can double check the generated machine code by disassembling the executable with cuobjdump --dump-sass. Verify that disassembled code is shown for the kernel in question, compiled for the right architecture. Disassembled code for a kernel is going to look similar to this:

Fatbin elf code:
arch = sm_30
code version = [1,7]
producer = cuda
host = windows
compile_size = 64bit
identifier =

        code for sm_30
                Function : _Z6kernelIjEvPT_PKS0_
        .headerflags    @"EF_CUDA_SM30 EF_CUDA_PTX_SM(EF_CUDA_SM30)"
                                                                              /* 0x2282c28002c28007 */
        /*0008*/                   MOV R1, c[0x0][0x44];                      /* 0x2800400110005de4 */
        /*0010*/                   S2R R0, SR_TID.X;                          /* 0x2c00000084001c04 */
        /*0018*/                   ISETP.GT.U32.AND P0, PT, R0, 0x7f, PT;     /* 0x1a0ec001fc01dc03 */
        /*0020*/               @P0 BRA.U 0x80;                                /* 0x40000001600081e7 */
        /*0028*/              @!P0 MOV32I R5, 0x4;                            /* 0x18000000100161e2 */

[Later:] Going back to read this thread in its entirety I noticed that your goal is to use dynamic parallelism:

I am afraid you won’t be able to use a feature that requires compute capability 3.5 on a device that has compute capability 3.0, since compute capability 3.5 is a superset of compute capability 3.0.

Yes, it seems you’re right. I can’t call a kernel from a kernel because of a hardware restriction.
Once I drop the kernel call, there is no more error for the distribution.
Thank you very much for helping me, I really appreciate it.

i wonder whether in your case, dynamic parallelism, as implemented by your colleague, is efficient or redundant?

the easy option would simply be to upgrade the device
the prudent option may be to re-examine the utility of dynamic parallelism in this case
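for instance (a sketch; kernel names and the data pointer are placeholders standing in for whatever the parent actually launched), a device-side launch can often be replaced by two host-side launches, which works on cc 3.0:

```cuda
// Sketch: instead of a parent kernel launching a child kernel on the device
// (which needs cc 3.5+ and -rdc=true), split the work into two host launches.
__global__ void Parent_Stage(double* data)  { /* work done before the child */ }
__global__ void Child_Kernel(double* data)  { /* the former device-side launch */ }

void run_distribution(double* dev_data, int block_size, size_t shmem_bytes)
{
    Parent_Stage<<<1, block_size, shmem_bytes>>>(dev_data);
    Child_Kernel<<<1, block_size>>>(dev_data);   // host launch: fine on cc 3.0
    cudaDeviceSynchronize();
}
```

whether this is acceptable depends on how tightly the child launch was coupled to per-thread state in the parent; if each parent thread launched its own child grid, the restructuring is more involved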