How can I compile the CUDA part to binary, store it in a C array, and call it later?


Suppose I have several kernels:

__global__ void A1(…)
{ …

__global__ void A10(…)
{ …

and I want to compile them to binary, so that the binary parts can be used from C sources later.

The reason I need this: I have a complicated CUDA algorithm that I do not intend to ship in source form. I want to give my customer the C/C++ code as sources plus the CUDA part as a binary, so that the project can be built on any Linux/Windows 32/64-bit platform.

Can I somehow put the data from a CUBIN or FATBIN file into a C array and use it later? If yes, please suggest how!

Thank you


nvcc has a library mode. I would have thought you could use that to build a library that can be linked against at runtime, much like CUBLAS.

If you mean the -lib option, that is exactly what I want to avoid! Please correct me if you mean something different!

My customer asked for the algorithm for Windows 32-bit, Windows 64-bit, all possible Red Hats (32- and 64-bit), and several SuSE versions. I have no way to build libraries for all of these platforms.

One possibility is to give him the sources, but I do not want to expose all my optimization tricks and the mathematical part of the algorithm. So I need the CUDA part to be compiled, yet incorporated into the sources.



I did mean the -lib option. But don’t you only have to build 32 and 64 bit DLLs and 32 and 64 bit ELF versions of your code?

Can’t you just load your cubin manually ???

That is exactly what I need to know how to do!!!

Just compile your .cu file with the -cubin flag and you'll get a *.cubin file. Store it in your program however you want: as a static DWORD array, as a resource, or load it via fread - your imagination is the limit here. I use the driver API to operate on a cubin. For that purpose I wrote simple wrapper functions to manage kernel launches, i.e. you get something like this:

module = cuda::ModuleLoadData(sn::LoadResource(resourceid, "CUDA"));

fn_transpose = cuda::ModuleGetFunction(module, "_Z9transposeP6float2S0_ii");


cuda::Launch(fn_transpose, cuda::Dim(wx/16, wy/16), cuda::Dim(16, 16), cuda::Params(m_devM1, w, h));
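
For reference, here is roughly what those wrappers boil down to with the raw driver API (a hedged sketch, not sergeyn's actual code - the array name `kernels_cubin` is a placeholder for your embedded image, and error handling is reduced to a single macro). It needs the CUDA toolkit headers and a CUDA-capable device to run:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda.h>   // driver API; link against libcuda.so / nvcuda.dll

#define CU_CHECK(call) do {                                   \
        CUresult _e = (call);                                 \
        if (_e != CUDA_SUCCESS) {                             \
            fprintf(stderr, "%s failed: %d\n", #call, _e);    \
            exit(1);                                          \
        }                                                     \
    } while (0)

// cubin image embedded as a C array (e.g. generated with bin2c);
// the image must end with a zero byte for cuModuleLoadData().
extern const unsigned char kernels_cubin[];

int main()
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    CU_CHECK(cuInit(0));
    CU_CHECK(cuDeviceGet(&dev, 0));
    CU_CHECK(cuCtxCreate(&ctx, 0, dev));

    // load the embedded image instead of a .cubin file on disk
    CU_CHECK(cuModuleLoadData(&mod, kernels_cubin));

    // look the kernel up by its (mangled) name
    CU_CHECK(cuModuleGetFunction(&fn, mod, "_Z9transposeP6float2S0_ii"));

    // ... set up kernel parameters and launch the grid here ...

    CU_CHECK(cuModuleUnload(mod));
    CU_CHECK(cuCtxDestroy(ctx));
    return 0;
}
```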


The good thing about using the driver API is that you don't need to mess around with the CUDA runtime DLLs: you can simply give a compiled library to your client, with no problems supporting different platforms such as x86 or x64.

I could share wrappers I use if you want.

Dear SergeyN,

yes, that is exactly what I want. I have a cubin file and have converted it to an unsigned char array

unsigned char CUDA_KERNELS[] = {…};

and I want to call these kernels as you proposed. However, I cannot find the ModuleGetFunction and Launch functions in cutil.h or cuda.h. Could you tell me where they are, and where I can find documentation or man pages for them? Or, if it is easy for you, may I ask you to provide the wrapper you proposed?

Thank you!


As I said, these are simple wrappers I wrote myself, i.e. ModuleGetFunction simply calls cuModuleGetFunction and checks for errors; Launch is slightly more complex, though. I can mail the source to you if you message me your e-mail address.
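
A minimal sketch of what such a wrapper might look like (my guess at the idea, not sergeyn's actual source - it just forwards to cuModuleGetFunction and turns the error code into an exception):

```cpp
#include <cuda.h>
#include <stdexcept>
#include <string>

namespace cuda {

// Look up a kernel in a loaded module; throw instead of returning an error code.
inline CUfunction ModuleGetFunction(CUmodule module, const char* name)
{
    CUfunction fn;
    CUresult err = cuModuleGetFunction(&fn, module, name);
    if (err != CUDA_SUCCESS)
        throw std::runtime_error(std::string("cuModuleGetFunction(") + name +
                                 ") failed with error " + std::to_string(err));
    return fn;
}

} // namespace cuda
```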

Also keep in mind that if you use this approach, you need to append a zero-terminator byte to your cubin data, or cuModuleLoadData will fail.

Hello sergeyn,

I am trying this approach, i.e. embedding the cubin in the resource section of a DLL and loading it using FindResource, LoadResource, and LockResource. Passing the void* obtained from LockResource() to cuModuleLoadData generates a CUDA_ERROR_INVALID_IMAGE. I also copied the resource to another buffer, added '\0' to the end, and tried cuModuleLoadData again, but it still returns CUDA_ERROR_INVALID_IMAGE.

I embed the cubin by adding a line in the .rc file as

IDR_CUBIN1 CUBIN "path\to\cubin.cubin"

Could you share how you pre-process your cubin to be included in resource?

PS: you should check out bin2c in your cuda/bin directory. That program is great.
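
For example, something like this should turn a cubin into a ready-to-include C header (exact flags may differ between toolkit versions; as far as I remember, -c emits a const array and -n sets its name):

```shell
# requires the CUDA toolkit's bin2c on the PATH
bin2c -c -n cuda_kernels kernels.cubin > kernels_cubin.h
```

This sidesteps the resource-section route entirely: the array is compiled straight into your binary, and you just pass it to cuModuleLoadData.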