Cannot use shared memory with compile option `compile-as-tools-patch`

mchan · March 10, 2021, 11:17am

I’m using NVBit to develop a instrumentation tool.
But it seems I cannot use shared memory inside the inject device function. For example:

extern "C" __device__ __noinline__ void inject_kernel(int input1, int input2) {
    extern __shared__ long x[];
    ...
}

And this code is compiled with the command:

nvcc -Xptxas -astoolspatch --keep-device-functions -Xcompiler -fPIC -c inject_kernel.cu -o inject_kernel.o

The compiler error:

ptxas error   : Allocating additional shared memory is not allowed when command line option '--compile-as-tools-patch' is specified 
ptxas fatal   : Ptx assembly aborted due to errors

I might understand that a static shared memory is not allowed to use here. But why cannot declare a dynamic shared memory?

Is there any hacking way to use shared memory here?

njuffa · March 10, 2021, 11:30am

What are the indications of that?

striker159 · March 10, 2021, 12:33pm

How about

__device__ void inject_kernel(long* x, int input1, int input2) {
...
}

Then pass the pointer to shared memory from the __global__ function to your __device__ function.

mchan · March 10, 2021, 1:26pm

I’m sorry I forgot to post the compiler error message. I just updated the question, PTAL, thanks.

mchan · March 10, 2021, 1:51pm

I think that won’t work. IMHO, since the compiler does not know which __global__ function would call this __device__ function, the compiler would assume x is from the global memory and use the LDG instruction which would load the data from global memory instead of LDSinstruction that loads data from the shared memory.

Robert_Crovella · March 10, 2021, 2:16pm

How would the following statement:

extern __shared__ long x[];

help if you don’t control the kernel launch or the design of the kernel?

Even if you control the kernel launch (which I’m not sure NVBit has in view) if you don’t control the design of the kernel, and the kernel happens to use dynamically allocated shared memory, you can’t “stack” some additional memory on top of the allocation just by declaring another pointer.

What is your thought process? I guess the only thing you could do if you control the kernel launch is to inspect the original kernel launch, and offset your use of your own shared pointer by the number of bytes in the original kernel launch, and then augment the kernel launch with the number of bytes you need?

In any event, I believe the designers of NVBit anticipated these difficulties (and probably others that I am unaware of) and stated the limitation clearly in the governing paper:

“Shared and constant memory usage: Injected functions may
not use shared and constant memory because that memory can
be used by the application itself. Using it in an instrumentation
tool could cause the instrumented programs to fail. In practice,
programs commonly use all of the shared memory capacity, leaving
nothing for the instrumentation library itself regardless”

mchan · March 11, 2021, 12:16am

Thanks for you explanation. I just noticed that Limitation in their paper.
But what if I just want to read the value in the shared memory used by the global kernel outside instead of allocating a new piece of the shared memory?

mchan · March 12, 2021, 1:32am

I just found a workaround for accessing the shared memory. Just as @striker159 mentioned, I can pass a pointer to the shared memory from the __global__ function. And the compiler will use the generic LD instruction instead of LDG. And it seems I can get the address of shared memory from constant memory c[0x0][0x18] and c[0x0][0x1c]. (BTW, I’m using GV100.)

So the code would be like this:

extern "C" __device__ __noinline__ void inject_kernel(unsigned int r2, unsigned int r3) {
    unsigned long x = ((unsigned long)r3 << 32) | r2;
    long *shared_mem_ptr = (long *)x;
    printf("%ld\n", *shared_mem_ptr);
}

And for host injection code:

    nvbit_insert_call(target, "inject_kernel", IPOINT_BEFORE);
    nvbit_add_call_arg_cbank_val(target, 0, 0x18);
    nvbit_add_call_arg_cbank_val(target, 0, 0x1c);