How to access a CUDA kernel binary on the GPU?

I’m working on a demo that encrypts a CUDA kernel binary on the CPU side and decrypts it on the GPU side. The purpose is to keep the CUDA kernel confidential on the PCIe channel.
Now the CPU-side things work well; however, on the GPU side there are some issues:

  1. How can I find out where a kernel is located in GPU memory? I found a Sanitizer API, sanitizerGetFunctionPcAndSize, that may solve the problem, but the API call just failed.
  2. After getting the code location, I hope to read and modify the kernel binary on the GPU side through another kernel. Are reads and writes to kernel code allowed? Can I change the code region to RW permission by setting CUmemAccess_flags (see the sketch after this list)?
  3. If the code region cannot be modified, can I allocate a buffer on the GPU side, put the code into the buffer, and modify the CUkernel metadata to point to the new buffer for execution? Is there something similar to the NX bit on x86 that prevents a random buffer from being executed? If not, how can I change the CUkernel metadata to point to the new code?
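
For reference, my understanding is that CUmemAccess_flags is normally used with cuMemSetAccess on memory mapped through the virtual memory management API (cuMemCreate/cuMemMap); whether something like this can be pointed at the module's code region is exactly what I am unsure about. A minimal sketch of that usual usage (the helper name and parameters below are my own):

#include <cuda.h>
#include <cstddef>

// Hypothetical helper (illustration only): grant read/write access for one
// device to a region mapped through the CUDA virtual memory management API.
// ptr/size must describe a range reserved with cuMemAddressReserve and mapped
// with cuMemMap.
void set_rw(CUdeviceptr ptr, size_t size, int device_ordinal)
{
    CUmemAccessDesc desc = {};
    desc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    desc.location.id   = device_ordinal;
    desc.flags         = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;  // a CUmemAccess_flags value
    cuMemSetAccess(ptr, size, &desc, 1);
}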

My code with sanitizerGetFunctionPcAndSize:

#include <cstdint>
#include <iostream>
#include <sanitizer.h>

// Sanitizer callback: on each kernel launch, query the PC and size of the
// launched function via sanitizerGetFunctionPcAndSize.
void SANITIZERAPI
my_callback(void *userdata,
            Sanitizer_CallbackDomain domain,
            Sanitizer_CallbackId cbid,
            const void *cbdata)
{
    if (domain == SANITIZER_CB_DOMAIN_LAUNCH) {
        Sanitizer_LaunchData *ld = (Sanitizer_LaunchData *)cbdata;
        if (cbid == SANITIZER_CBID_LAUNCH_BEGIN) {
            std::cout << "launch begin" << std::endl;
            std::cout << "mod: " << ld->module << ", name: " << ld->functionName << std::endl;
            uint64_t function_pc = 0, function_size = 0;
            auto ret = sanitizerGetFunctionPcAndSize(ld->module, ld->functionName,
                                                     &function_pc, &function_size);
            std::cout << "ret: " << ret << ", function pc: " << function_pc << ", function size: " << function_size << std::endl;
        } else if (cbid == SANITIZER_CBID_LAUNCH_AFTER_SYSCALL_SETUP) {
            std::cout << "launch after syscall setup" << std::endl;
        } else if (cbid == SANITIZER_CBID_LAUNCH_END) {
            std::cout << "launch end" << std::endl;
        }
    }
}
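
For context, the callback above is registered roughly like this (a sketch assuming the standard Compute Sanitizer subscription calls; error checking omitted, and the function name is mine):

static Sanitizer_SubscriberHandle g_handle;

// Subscribe my_callback and enable the launch domain so that
// SANITIZER_CB_DOMAIN_LAUNCH callbacks (launch begin/end) are delivered.
void register_sanitizer_callback()
{
    sanitizerSubscribe(&g_handle, my_callback, nullptr);
    sanitizerEnableDomain(1, g_handle, SANITIZER_CB_DOMAIN_LAUNCH);
}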

The callback is successfully called, but the get-PC call returns an error. The output:

launch begin
mod: 0x56276de445e0, name: _Z9vectorAddPKfS0_Pfi
ret: 1, function pc: 0, function size: 0
launch after syscall setup
launch end

I do not think that normal CUDA functions (compute kernels) run at a privilege level that allows them to modify code. (If others know better, they should answer and correct me.)

What you could do is write an interpreter for your custom language in CUDA, encrypt that interpreted language, transfer it over PCIe, and then decrypt it on-the-fly on the GPU.

If you keep that language close enough to CUDA capabilities, it can run fast enough for many tasks.

I think the interpreter solution has exactly the same issue: how could “decrypt on-the-fly on the GPU” happen? That also requires GPU-side code generation/modification. If that can be done, then decrypting kernel code on the GPU side can be done.

I would appreciate it if you could provide more info about the interpreter solution!

An interpreter reads the code from data memory, not from code memory.

Depending on the command that is read, you would execute an instruction, e.g. read memory, do computations, etc. Those instructions can be high-level or low-level.

In many cases, it would slow down your kernel, but it depends. High-level is faster, low-level is more secure.

You invent your own binary language.

Start with a struct:

Codeword
Source Parameter 1
Source Parameter 2
Target

Then the codeword signifies: addition, multiplication, …
And the source and target could be an address in shared or local memory.
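
For illustration, a minimal sketch of that idea in CUDA (the struct fields follow the list above; the opcode numbering, the operand encoding, and the per-thread data layout are invented for this example):

struct Instruction {
    int codeword;   // 0 = add, 1 = mul, ... (invented numbering)
    int src1;       // index of the first source operand in the data buffer
    int src2;       // index of the second source operand
    int dst;        // index of the target slot
};

__global__ void interpret(const Instruction *program, int n, float *data)
{
    // Each thread walks the same (already decrypted) instruction stream and
    // applies it to its own slice of the data buffer.
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 16;  // hypothetical per-thread region
    for (int i = 0; i < n; ++i) {
        Instruction ins = program[i];
        switch (ins.codeword) {
            case 0: data[base + ins.dst] = data[base + ins.src1] + data[base + ins.src2]; break;
            case 1: data[base + ins.dst] = data[base + ins.src1] * data[base + ins.src2]; break;
            // further codewords: loads/stores, branches, calls into device functions, ...
        }
    }
}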

I got your point: you mean running an interpreter (some CUDA kernels) on the GPU side and executing instructions on behalf of the encrypted kernel.
The interpreter is too heavy for our case, and there seem to be a lot of things to consider: threads, streams, calling another kernel…

A low-level interpreter can be quite short, shorter than the average kernel.
A high-level interpreter could just call five functions with their parameters and be fast.

But it seems you already have quite involved kernels that should be fully encrypted, and encrypted over PCIe.

Perhaps there is a hardware feature on some enterprise datacenter GPUs? Otherwise it will be difficult.

Another theoretical variant is encrypting your results.
E.g. you pseudonymize the data, so the kernel never returns the actual results.
But probably that is not enough in your case, where you want to safeguard algorithms.

You could run them “over the cloud” or in your data centers instead of at an exposed site?

Or if it is a critical industry, perhaps you can make special deals with Nvidia.
Nvidia could provide software or hardware solutions.
I am sure Nvidia can access code in a read-write fashion on the GPU and run special code before the kernel starts. Those decryption routines would not run in standard compute kernels, but in one with a higher privilege level, to have that write access. Your actual kernel would still be a standard compute kernel.
There may be simpler variants.

Confidential computing seems to be a technology that could be relevant. I’m not suggesting I’ll be able to answer detailed questions about it here in this forum. (There is a CC forum.)

It seems to me, from a simple test I ran, that you cannot get a pointer to code space on the CUDA GPU and do an ordinary read or write on that data in any way.

Confidential computing seems to be a technology that could be relevant.

CC mode is only available on Hopper. We want to protect the PCIe channel for legacy GPUs without CC support.

It seems to me, from a simple test I ran, that you cannot get a pointer to code space on the CUDA GPU and do an ordinary read or write on that data in any way.

I succeeded in reading/writing kernel code by:

  1. using NVBit’s API nvbit_get_func_addr to locate the function,
  2. reading/writing the code inside another kernel (sketched below).
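
A rough sketch of step 2, i.e. the kernel that touches another function’s code (the pointer is the address returned by nvbit_get_func_addr cast to a device pointer; the kernel name, launch configuration, and XOR key are only illustrative):

__global__ void patch_code(unsigned char *code, size_t size, unsigned char key)
{
    // Read every byte of the target function's code region and XOR it in place.
    // Launching this once scrambles the code; launching it again with the same
    // key restores the original bytes.
    for (size_t i = threadIdx.x; i < size; i += blockDim.x) {
        code[i] ^= key;
    }
}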

Hope that can help you, though there may be some tricks inside NVBit. Your test is interesting; I’ll also give it a try.

So your question here is answered, then?

I’ve tried the sample test; it works in my environment:

  1. The binary of foo can be printed out, and the output matches the binary I get through cuobjdump. – So the code section is readable (at least in some versions).
  2. If I modify the code by XORing with 0xff, then revert it by XORing with 0xff again: no error is thrown, and the code is still executable. – So the code section is also writable.
  3. If I modify the code by XORing with 0xff without reverting: a CUDA error about an illegal instruction is thrown. – So the code modification takes effect (experiments 2 and 3 are sketched in code after this list).
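
In terms of the patch_code sketch from my earlier post, experiments 2 and 3 boil down to a host-side sequence like this (foo_code and foo_size stand for the address and size of foo’s code region; all names are illustrative):

patch_code<<<1, 256>>>(foo_code, foo_size, 0xff);  // xor 0xff: foo's code is now scrambled
patch_code<<<1, 256>>>(foo_code, foo_size, 0xff);  // xor 0xff again: code restored
cudaDeviceSynchronize();
// Launching foo now works; skipping the second patch_code launch instead makes
// the subsequent foo launch fail with an "illegal instruction" CUDA error.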

About my environment:

nvidia-smi
Wed Jul  9 14:29:08 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090 D      Off |   00000000:B1:00.0 Off |                  Off |
| 32%   41C    P8             19W /  425W |      15MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Fri_Jun_14_16:34:21_PDT_2024
Cuda compilation tools, release 12.6, V12.6.20
Build cuda_12.6.r12.6/compiler.34431801_0