Adding new kernel via CUmoduleManager

I’m trying to add some processing functionality to the Cuda Video Decoder example (cudaDecodeGL in the Cuda Samples directory of the Cuda 8.0 install). I’m new to the use of ptx files and the module manager and such (I learned how to define and launch kernels with the approach described in the ‘Cuda By Example’ book). Anyhow, I believe the process for creating and launching a new kernel is as follows. (I’m using the existing file to add my code):

  1. Create the cuda kernel in the existing *.cu file. For example, I just added these lines at the bottom:
__global__ void kernel(void) {}

… certainly a NOP kernel, but a kernel nonetheless…

Then, in the videoDecodeGL.cpp file:
  2. Defined a CUfunction handle for my kernel:

CUfunction g_mykernel = 0;
  3. There already is a line of code that defines the module manager:
g_pCudaModule = new CUmoduleManager("NV12ToARGB_drvapi64.ptx", exec_path, 2, 2, 2);
  4. Then added the line:
g_pCudaModule->GetCudaFunction("kernel", &g_mykernel);

However, when I add this last line of code and run the program, it crashes when it gets to this line with the error:

Exception thrown at 0x00007FF613174017 in my_cudaDecodeGL_v2.exe: 0xC0000005: Access violation reading location 0xFFFFFFFFFFFFFFFF.

If I comment out this line, the program runs normally.

I’m working on Windows 10 in Visual Studio 2015. I’m pretty sure (well - I’m actually just assuming) that the *.ptx compiled version of my *.cu file is getting compiled along with the overall code. Is that right? Or do I have to perform some intermediate and specialized nvcc compilation of the *.cu file before compiling/running the main program?

Clearly, I’m not telling the system how to add a new kernel in the correct way. However, I’m having a hard time finding resources online to explain what to do and help me figure what I’m doing wrong here. Can anyone help?

I’d suggest studying some of the simpler driver API sample codes, such as vectorAddDrv.

Thanks for the heads up on where to look. However, are there any samples or is there any literature that describes how to take a *.cu Cuda file and generate the proper .ptx file?

Study the makefile that comes with the vectorAddDrv sample project.

In a nutshell:

nvcc -ptx …

I did some more searching - I suppose my answers lie in the programming guide at

Just wanted to update this since I finally figured out how to do it. @txbob gave good advice: the vectorAddDrv project shows how to use cuModuleLoadDataEx to create a new CUmodule (I named mine ‘my_module’) that is tied to your own *.cu kernel file.

Once you have the ‘my_module’ CUmodule handle, you can get pointers to each individual kernel you want to use with the command:

CUresult my_result = cuModuleGetFunction(&my_cudakernel, my_module, "my_kernel");

where my_cudakernel is a CUfunction variable and ‘my_kernel’ is the name of the kernel in your *.cu file that you want to tie to this my_cudakernel identifier.
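
For anyone landing here later, this is a rough sketch of the load-and-lookup step with error checking, not the exact code from the sample. It uses cuModuleLoad (which reads the PTX from a file path) instead of cuModuleLoadDataEx just to keep it short. The helper names `check` and `load_kernel` are made up for illustration, and it assumes a CUDA context is already current on the calling thread. One thing worth knowing: the lookup is by the name as it appears in the PTX, so declaring the kernel `extern "C"` in the *.cu file keeps the name from being C++-mangled.

```cpp
#include <cuda.h>
#include <cstdio>

// Hypothetical helper: report a failed driver API call and return false.
static bool check(CUresult result, const char *what)
{
    if (result == CUDA_SUCCESS) return true;
    const char *msg = nullptr;
    cuGetErrorString(result, &msg);
    std::fprintf(stderr, "%s failed: %s\n", what, msg ? msg : "unknown error");
    return false;
}

// Hypothetical helper: load a PTX file and look up one kernel by name.
// Assumes a CUDA context is already current on the calling thread.
// The kernel is assumed to be declared in the .cu file as
//   extern "C" __global__ void my_kernel(...)
// so that its name in the PTX is exactly "my_kernel".
static bool load_kernel(const char *ptx_path, const char *kernel_name,
                        CUmodule *my_module, CUfunction *my_cudakernel)
{
    if (!check(cuModuleLoad(my_module, ptx_path), "cuModuleLoad"))
        return false;
    if (!check(cuModuleGetFunction(my_cudakernel, *my_module, kernel_name),
               "cuModuleGetFunction")) {
        cuModuleUnload(*my_module);  // do not leak the module on failure
        return false;
    }
    return true;
}
```

You would call it as something like load_kernel("my_kernels.ptx", "my_kernel", &my_module, &my_cudakernel), where the PTX file name is a placeholder. If the name is not found in the module, cuModuleGetFunction returns CUDA_ERROR_NOT_FOUND, which is much easier to debug than an access violation.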

Armed with that, you can launch a simple kernel with code like:

dim3 block(32, 16, 1);
dim3 grid(16, 16, 1);
CUresult status;
void *args[] = {&d_arg1, &d_arg2};
status = cuLaunchKernel(my_cudakernel, grid.x, grid.y, grid.z, block.x, block.y, block.z, 0, NULL, args, NULL);

… assuming your kernel takes two input parameters (d_arg1 & d_arg2). The Cuda Samples are pretty useful :)
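
The grid above is hard-coded: 16 x 16 blocks of 32 x 16 threads covers 512 x 256 elements. If your data has a different size, you would probably want to derive the grid from it instead. Here is a minimal sketch of that calculation. `Dim3`, `blocks_needed` and `grid_for` are hypothetical names; `Dim3` is only a stand-in for CUDA's dim3 so the snippet compiles without the CUDA headers.

```cpp
#include <cassert>

// Stand-in for CUDA's dim3 (assumption, for illustration only).
struct Dim3 { unsigned x, y, z; };

// Number of blocks of size block_dim needed to cover n elements (rounds up).
inline unsigned blocks_needed(unsigned n, unsigned block_dim)
{
    return (n + block_dim - 1) / block_dim;
}

// Grid that covers a width x height image with the given block shape.
inline Dim3 grid_for(unsigned width, unsigned height, Dim3 block)
{
    assert(block.x > 0 && block.y > 0);  // avoid dividing by zero
    Dim3 grid = { blocks_needed(width, block.x),
                  blocks_needed(height, block.y),
                  1 };
    return grid;
}
```

With a 32 x 16 block, a 1920 x 1080 frame gives a 60 x 68 grid. Since 68 * 16 = 1088 is more than 1080, the kernel then needs a bounds check so the extra threads in the last row of blocks do nothing.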