Unspecified launch failure on kernel launch

I’m trying to build a CUDA test sample to debug a module I’m developing for my solution. To narrow down the scope of the problem, I made it as simple as possible.

#include <iostream>
#include <cuda_runtime.h>

using namespace std;

__global__ void fillGPU(unsigned char* input)
{
    for(int i = 0; i < 256; i++)
        input[i] = [i];
}

int main()
{
    unsigned char* d_input;
    unsigned char input[256];

    cudaMalloc(&d_input, 256 * sizeof(unsigned char));
    cudaMemcpy(d_input, &input[0], 256 * sizeof(unsigned char), cudaMemcpyHostToDevice);

    fillGPU<<<1, 256>>>(input);
    cout << cudaGetErrorString(cudaGetLastError()) << endl; // returns "no error"
    cout << cudaGetErrorString(cudaGetLastError()) << endl; // returns "unspecified launch failure"

    cudaMemcpy(&input[0], d_input, 256 * sizeof(unsigned char), cudaMemcpyDeviceToHost);
    cout << int(input[255]) << endl; // expected 255, get 0
    return 0;
}

I’m running an NVIDIA Jetson Nano with JetPack 4.4 (CUDA 10.2).
nvcc sample_cuda.cu -o sample_cuda compiles the code without any problems.
Is there a problem with the code, am I missing compiler flags, or does something have to be done to an out-of-the-box JetPack 4.4 install before CUDA can launch kernels properly?

You are a bit off on two things. Firstly, I’m surprised that this expression compiles:

        input[i] = [i];

I think what you meant to put here was just:

        input[i] = i;

Secondly, note that each of your 256 threads currently fills all 256 slots of your “input” array. I assume you wanted a one-to-one correspondence between threads and slots in “input”. If that’s what you intended (it should be much faster), your kernel should look like this instead:

__global__ void fillGPU(unsigned char* input)
{
  input[threadIdx.x] = static_cast<unsigned char>(threadIdx.x);
}
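
The one-thread-per-slot kernel above assumes the launch uses exactly as many threads as the array has elements. A slightly more defensive variant, sketched below, takes the element count as an extra argument and guards the write. The names fillGPUChecked, fillOnDevice and the parameter n are illustrative assumptions, not part of the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical variant of fillGPU: n is the number of elements in `data`.
__global__ void fillGPUChecked(unsigned char* data, int n)
{
    // Global index, so the kernel also works with more than one block.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)  // threads past the end of the array do nothing
        data[idx] = static_cast<unsigned char>(idx);
}

// Hypothetical host wrapper: fills `host` (n bytes) using the GPU.
cudaError_t fillOnDevice(unsigned char* host, int n)
{
    unsigned char* d_data = nullptr;
    cudaError_t err = cudaMalloc(&d_data, n * sizeof(unsigned char));
    if (err != cudaSuccess)
        return err;

    // Round the block count up so every element is covered.
    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    fillGPUChecked<<<blocks, threads>>>(d_data, n);

    // This blocking copy waits for the kernel to finish; a failure during
    // kernel execution would show up in its return value.
    err = cudaMemcpy(host, d_data, n * sizeof(unsigned char),
                     cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return err;
}

int main()
{
    unsigned char host[256] = {0};
    cudaError_t err = fillOnDevice(host, 256);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("%d\n", static_cast<int>(host[255]));
    return 0;
}
```

With the guard in place, changing the block size or the array length no longer risks writing past the end of the allocation.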

> Firstly, I’m surprised that this expression compiles

Indeed, that is a typo. I retyped the code into the post by hand, because I can’t copy-paste from the Nano to the machine I wrote the post on.

> If that’s what you intended (it should be much faster), your kernel should look like this instead

Thank you, this was indeed what I intended. But even after I changed the code to what you suggested, the output is still 0 and the error check still returns “unspecified launch failure”.

Ah. This is because you’re launching your kernel on input. That variable is a pointer to host memory, so the kernel ends up dereferencing an address that is not valid on the device, which is what produces the unspecified launch failure. You should pass d_input to your kernel instead.
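
For reference, here is a sketch of the sample with the device pointer passed to the kernel. It also checks every runtime call and synchronizes after the launch, since kernel launches are asynchronous and an error raised while the kernel runs is only reported by a later runtime call. The CHECK macro is a hypothetical helper written for this example, not part of the CUDA API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// One thread per slot, as suggested earlier in the thread.
__global__ void fillGPU(unsigned char* input)
{
    input[threadIdx.x] = static_cast<unsigned char>(threadIdx.x);
}

int main()
{
    unsigned char* d_input = nullptr;
    unsigned char input[256] = {0};

    CHECK(cudaMalloc(&d_input, sizeof(input)));

    fillGPU<<<1, 256>>>(d_input);    // device pointer, not the host array
    CHECK(cudaGetLastError());       // errors in the launch itself
    CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CHECK(cudaMemcpy(input, d_input, sizeof(input), cudaMemcpyDeviceToHost));
    std::printf("%d\n", static_cast<int>(input[255]));

    CHECK(cudaFree(d_input));
    return 0;
}
```

The initial host-to-device copy from the original sample is left out here because the kernel overwrites every element anyway.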

Silly me, that was indeed the problem, thank you. In retrospect, that wasn’t worth creating a thread here, but at least if someone else has this issue, they won’t have to do it.
