Invoke cudaFuncGetAttributes in kernel

Hi, when I try to invoke cudaFuncGetAttributes inside a kernel function declared __global__, I get cudaErrorUnknown. Here is my code:

__global__ void foo() {
    return;
}

__global__ void test_kernel() {
    cudaFuncAttributes attr;
    auto ret = cudaFuncGetAttributes(&attr, foo);
    
    printf("%d\n", ret);
    printf("%x\n", foo);
    printf("%d\n", attr.binaryVersion);
    return;
}

int main() {
    test_kernel<<<1,1>>>();
    cudaDeviceSynchronize();

    cudaFuncAttributes attr;
    auto ret = cudaFuncGetAttributes(&attr, foo);
    
    printf("%d\n", ret);
    printf("%x\n", foo);
    printf("%d\n", attr.binaryVersion);
    return 0;
}

On the host side I can get the address and the attributes of foo, but on the device side test_kernel prints 0 for foo's address and cudaFuncGetAttributes returns cudaErrorUnknown.

I am using an A100 with CUDA 11.6.

Using the device runtime API usually requires compiling relocatable device code with device linking. I also recommend explicitly linking against the device runtime library, although that last step may not be necessary depending on the CUDA version in use.

Your code runs with the expected output for me when I add the compilation switches -rdc=true -lcudadevrt.

$ cat t4.cu
#include <cstdio>

__global__ void foo() {
    return;
}

__global__ void test_kernel() {
    cudaFuncAttributes attr;
    auto ret = cudaFuncGetAttributes(&attr, foo);

    printf("%d\n", ret);
    printf("%p\n", foo);
    printf("%d\n", attr.binaryVersion);
    return;
}

int main() {
    test_kernel<<<1,1>>>();
    cudaDeviceSynchronize();

    cudaFuncAttributes attr;
    auto ret = cudaFuncGetAttributes(&attr, foo);

    printf("%d\n", ret);
    printf("%p\n", foo);
    printf("%d\n", attr.binaryVersion);
    return 0;
}
$ nvcc -o t4 t4.cu -rdc=true -lcudadevrt
$ compute-sanitizer ./t4
========= COMPUTE-SANITIZER
0
0x7f93bf079e00
90
0
0x55e8a6667060
90
========= ERROR SUMMARY: 0 errors
$
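
As a variation on the transcript above, here is a minimal sketch that returns the device-side status to the host instead of printing it from the kernel, so the error can be reported by name. The names QueryResult, query_kernel and query_on_device are hypothetical and not part of the original post. It assumes the same build switches as above (-rdc=true -lcudadevrt).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void foo() {}

// Hypothetical record: filled in by the device, read back by the host.
struct QueryResult {
    cudaError_t status;
    int binaryVersion;
    int maxThreadsPerBlock;
};

// Calls the device runtime version of cudaFuncGetAttributes and stores
// the return code plus two attribute fields.
__global__ void query_kernel(QueryResult *out) {
    cudaFuncAttributes attr;
    cudaError_t status = cudaFuncGetAttributes(&attr, foo);
    out->status = status;
    if (status == cudaSuccess) {
        out->binaryVersion = attr.binaryVersion;
        out->maxThreadsPerBlock = attr.maxThreadsPerBlock;
    }
}

// Launches query_kernel and copies the result to the host. Any host-side
// failure (allocation, launch, synchronization) is reported in status.
QueryResult query_on_device() {
    QueryResult h{cudaErrorUnknown, 0, 0};
    QueryResult *d = nullptr;
    cudaError_t err = cudaMalloc(&d, sizeof(QueryResult));
    if (err != cudaSuccess) { h.status = err; return h; }
    cudaMemcpy(d, &h, sizeof(h), cudaMemcpyHostToDevice);
    query_kernel<<<1, 1>>>(d);
    err = cudaGetLastError();                     // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution errors
    if (err == cudaSuccess) {
        cudaMemcpy(&h, d, sizeof(h), cudaMemcpyDeviceToHost);
    } else {
        h.status = err;
    }
    cudaFree(d);
    return h;
}

int main() {
    QueryResult r = query_on_device();
    printf("device-side status: %s\n", cudaGetErrorString(r.status));
    if (r.status == cudaSuccess) {
        printf("binaryVersion: %d\n", r.binaryVersion);
        printf("maxThreadsPerBlock: %d\n", r.maxThreadsPerBlock);
    }
    return r.status == cudaSuccess ? 0 : 1;
}
```

Passing the status back to the host keeps the kernel free of formatting logic and lets the process exit code reflect whether the device-side query succeeded.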

It is not reasonable to expect the foo pointer to print the same value on the host side and the device side, because in CUDA it is generally undefined behavior (UB) to take the address of a device entity in host code.
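
To illustrate the point about addresses with a different kind of device entity, here is a small sketch that uses a __device__ variable rather than a kernel. Host code asks the runtime for the device address with cudaGetSymbolAddress instead of applying & to the symbol directly, and the kernel prints the address it sees. The names dev_value, print_device_address and device_address_from_runtime are hypothetical and chosen only for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical device variable used only for this illustration.
__device__ int dev_value = 42;

// Prints the address of dev_value as seen from device code.
__global__ void print_device_address() {
    printf("device code sees &dev_value = %p\n", (void *)&dev_value);
}

// Asks the runtime for the device address of dev_value.
// Returns nullptr if the lookup fails.
void *device_address_from_runtime() {
    void *p = nullptr;
    if (cudaGetSymbolAddress(&p, dev_value) != cudaSuccess) return nullptr;
    return p;
}

int main() {
    void *runtime_addr = device_address_from_runtime();
    printf("cudaGetSymbolAddress reports  %p\n", runtime_addr);

    print_device_address<<<1, 1>>>();
    cudaDeviceSynchronize();

    // Reading the value from host code also goes through the runtime.
    int host_copy = 0;
    if (cudaMemcpyFromSymbol(&host_copy, dev_value, sizeof(int)) == cudaSuccess) {
        printf("dev_value = %d\n", host_copy);
    }
    return 0;
}
```

The design choice here is that host code never relies on a raw address expression for a device entity and only uses addresses obtained from the runtime.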
