Error: a __device__ function call cannot be configured

Hello, I want to implement CDP for a basic forward function. I will be calling the forward function many times concurrently (also from within a CUDA kernel), which is why I want to use CDP.

Here's the code that I'm trying to run:

__device__ void NNFeedForwardNormalMultiple(double* __restrict__ values, double* __restrict__ weigths, double* result, int inputsize, int outputsize) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int outputidx = idx / outputsize;
    int inputidx = idx % outputsize;

    if (outputidx >= outputsize || inputidx >= inputsize) {
        return;
    }

    atomicAdd(&result[outputidx], values[inputidx] * weigths[outputsize*outputidx + inputidx]);
}

__device__ void NNFeedForwardNormalActivate(double* __restrict__ biases, double* result, int size) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    if (idx >= size) {
        return;
    }

    result[idx] = 1.0 / (1.0 + exp(-(result[idx] + biases[idx])));
}

__global__ void NNFeedForwardNormal(double* __restrict__ values, double* __restrict__ weigths, double* result, double* __restrict__ biases, int inputsize, int outputsize) {
    int blocksize = (inputsize * outputsize + THREADS_PER_BLOCK - 1)/THREADS_PER_BLOCK;
    NNFeedForwardNormalMultiple<<<blocksize, THREADS_PER_BLOCK>>>(values, weigths, result, inputsize, outputsize);
    cudaDeviceSynchronize();
    NNFeedForwardNormalActivate<<<(outputsize + THREADS_PER_BLOCK - 1)/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(biases, result, outputsize);
}

I also tried launching the kernels from a __device__ function wrapped in a __global__ kernel, but that gave me the same error:

__device__ void NNFeedForwardNormalMultiple(double* __restrict__ values, double* __restrict__ weigths, double* result, int inputsize, int outputsize) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int outputidx = idx / outputsize;
    int inputidx = idx % outputsize;

    if (outputidx >= outputsize || inputidx >= inputsize) {
        return;
    }

    atomicAdd(&result[outputidx], values[inputidx] * weigths[outputsize*outputidx + inputidx]);
}

__device__ void NNFeedForwardNormalActivate(double* __restrict__ biases, double* result, int size) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    if (idx >= size) {
        return;
    }

    result[idx] = 1.0 / (1.0 + exp(-(result[idx] + biases[idx])));
}

__device__ void NNFeedForwardNormal(double* __restrict__ values, double* __restrict__ weigths, double* result, double* __restrict__ biases, int inputsize, int outputsize) {
    int blocksize = (inputsize * outputsize + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    
    NNFeedForwardNormalMultiple<<<blocksize, THREADS_PER_BLOCK>>>(values, weigths, result, inputsize, outputsize);
    NNFeedForwardNormalActivate<<<(outputsize + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(biases, result, outputsize);
}

__global__ void NNFeedForwardNormalWrapper(double* __restrict__ values, double* __restrict__ weigths, double* result, double* __restrict__ biases, int inputsize, int outputsize) {
    NNFeedForwardNormal(values, weigths, result, biases, inputsize, outputsize);
}

I also tried the cudaLaunchKernel function, and using __global__ instead of __device__, but neither worked. I'm compiling with the -rdc=true flag, and my architecture is sm_75, which should support CDP.

The error message already names the problem: only __global__ functions can be launched with the <<<...>>> configuration syntax, and your child kernels are declared __device__. Several key points to consider:

  1. CDP usage:
    - CDP lets a kernel launch other kernels from the device, but every kernel launched with <<<...>>> must be declared __global__. A __device__ function is called like an ordinary function, with no launch configuration, which is exactly what "a __device__ function call cannot be configured" is telling you.
    - Declare NNFeedForwardNormalMultiple and NNFeedForwardNormalActivate as __global__ and keep the parent kernel __global__; a sketch of the corrected structure follows this list.
  2. Compiler flag -rdc=true:
    - This flag enables relocatable device code, which CDP requires. If you compile and link in separate steps, you also need to link the device runtime library (-lcudadevrt); nvcc links it automatically when it compiles and links in a single invocation.
  3. Device-side synchronization:
    - Calling cudaDeviceSynchronize() from device code is deprecated since CUDA 11.6 and removed in CUDA 12, so your first version will not work on a recent toolkit even once the declarations are fixed. You do not need it here anyway: device-side launches into the same stream by the same thread execute in order, and a parent grid is not considered complete until all of its child grids have completed.
  4. Error handling:
    - Review all compiler warnings, check launches with cudaGetLastError() (available in device code as well), and use tools such as cuda-gdb or compute-sanitizer (the successor to cuda-memcheck) to pinpoint runtime issues.
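
Here is a minimal sketch of the corrected structure. Assumptions: THREADS_PER_BLOCK is defined elsewhere as in your code, and result is zeroed by the caller before the launch. As a side note, your original index arithmetic divides and mods by outputsize, which only enumerates every (output, input) pair when inputsize == outputsize, so the sketch divides by inputsize instead (assuming the weights are laid out row-major as [output][input]):

// Sketch only. Assumes a definition such as:
// #define THREADS_PER_BLOCK 256
// Build in one invocation (cudadevrt is linked automatically), e.g.:
// nvcc -arch=sm_75 -rdc=true feedforward.cu -o feedforward

__global__ void NNFeedForwardNormalMultiple(double* __restrict__ values, double* __restrict__ weigths, double* result, int inputsize, int outputsize) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // One thread per (output, input) pair; dividing by inputsize covers all
    // pairs even when inputsize != outputsize (assumes [output][input] layout).
    int outputidx = idx / inputsize;
    int inputidx = idx % inputsize;

    if (outputidx >= outputsize || inputidx >= inputsize) {
        return;
    }

    // Accumulate into result, which the caller is assumed to have zeroed.
    atomicAdd(&result[outputidx], values[inputidx] * weigths[inputsize * outputidx + inputidx]);
}

__global__ void NNFeedForwardNormalActivate(double* __restrict__ biases, double* result, int size) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;

    if (idx >= size) {
        return;
    }

    // Sigmoid activation applied to the accumulated sum plus bias.
    result[idx] = 1.0 / (1.0 + exp(-(result[idx] + biases[idx])));
}

__global__ void NNFeedForwardNormal(double* __restrict__ values, double* __restrict__ weigths, double* result, double* __restrict__ biases, int inputsize, int outputsize) {
    // Launch the children from a single thread so each child grid is launched
    // exactly once. Both launches go into the same stream of this thread, so
    // the activation kernel runs only after the accumulation kernel finishes;
    // no device-side cudaDeviceSynchronize() is needed (it is removed in CUDA 12).
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        int blocks = (inputsize * outputsize + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
        NNFeedForwardNormalMultiple<<<blocks, THREADS_PER_BLOCK>>>(values, weigths, result, inputsize, outputsize);
        NNFeedForwardNormalActivate<<<(outputsize + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(biases, result, outputsize);
    }
}

The parent can then be launched from the host, or from another kernel, with e.g. NNFeedForwardNormal<<<1, 1>>>(values, weigths, result, biases, inputsize, outputsize); on the host side, check the launch with cudaGetLastError() afterwards.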