I'm trying to make a CUDA test sample to debug a module I'm developing for my solution. To narrow down the scope of the problem, I've made it as simple as it gets.
#include <iostream>
#include <cuda_runtime.h>
using namespace std;
__global__ void fillGPU(unsigned char* input)
{
    for(int i = 0; i < 256; i++)
    {
        input[i] = [i];
    }
}
int main()
{
    unsigned char* d_input;
    unsigned char input[256];
    cudaMalloc(&d_input, 256 * sizeof(unsigned char));
    cudaMemcpy(d_input, &input[0], 256 * sizeof(unsigned char), cudaMemcpyHostToDevice);
    fillGPU<<<1, 256>>>(input);
    cout << cudaGetErrorString(cudaGetLastError()) << endl; // returns "no error"
    cudaDeviceSynchronize();
    cout << cudaGetErrorString(cudaGetLastError()) << endl; // returns "unspecified launch failure"
    cudaMemcpy(&input[0], d_input, 256 * sizeof(unsigned char), cudaMemcpyDeviceToHost);
    cout << int(input[255]) << endl; // expected 255, get 0
    cudaFree(d_input);
    return 0;
}
I'm running an NVIDIA Jetson Nano on JetPack 4.4, CUDA version 10.2.
nvcc sample_cuda.cu -o sample_cuda
compiles the code no problem.
Is there a problem with the code, am I missing compiler flags, or is there something that must be done to an out-of-the-box JetPack 4.4 to get CUDA to properly launch kernels?
You are a bit off on two things. Firstly, I’m surprised that this expression compiles:
input[i] = [i];
I think what you meant to put here was just:
input[i] = i;
Secondly, note that you currently have each thread fill all 256 slots of your “input” array. I assume you wanted a one-to-one correspondence between threads and slots in “input”. If that’s what you intended (it should be much faster), your kernel should look like this instead:
__global__ void fillGPU(unsigned char* input)
{
    input[threadIdx.x] = static_cast<unsigned char>(threadIdx.x);
}
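As a side note, once the array is larger than a single block's worth of threads, the usual pattern is to compute a global index from the block and thread indices and guard against overrun. A sketch along those lines (the length parameter n is an addition, not part of your original code):

```cuda
__global__ void fillGPU(unsigned char* input, int n)
{
    // Global index across all launched blocks.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard in case n is not a multiple of the block size.
    if (i < n)
        input[i] = static_cast<unsigned char>(i);
}
```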
Firstly, I’m surprised that this expression compiles
Indeed, I made a typo here as I didn’t copy-paste the code into the post. Can’t do that from a Nano to the machine I typed the post on.
If that’s what you intended (it should be much faster), your kernel should look like this instead
Thank you, this was indeed what I intended, but even after I changed the code to what you suggested, the output is still 0 and the error check still returns "unspecified launch failure".
Ah. This is because you're launching your kernel on input. That variable is a pointer to host memory, so the kernel dereferences it to some nonsensical place in GPU memory. You should pass d_input to your kernel instead.
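Putting both fixes together, the corrected program would look roughly like this (the initial host-to-device copy is dropped, since the kernel overwrites the buffer anyway; output comments assume a healthy device):

```cuda
#include <iostream>
#include <cuda_runtime.h>
using namespace std;

__global__ void fillGPU(unsigned char* input)
{
    // One thread per element: thread x writes slot x.
    input[threadIdx.x] = static_cast<unsigned char>(threadIdx.x);
}

int main()
{
    unsigned char input[256];
    unsigned char* d_input;
    cudaMalloc(&d_input, 256 * sizeof(unsigned char));
    fillGPU<<<1, 256>>>(d_input); // pass the device pointer, not the host array
    cudaDeviceSynchronize();
    cout << cudaGetErrorString(cudaGetLastError()) << endl; // "no error"
    cudaMemcpy(input, d_input, 256 * sizeof(unsigned char), cudaMemcpyDeviceToHost);
    cout << int(input[255]) << endl; // 255
    cudaFree(d_input);
    return 0;
}
```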
Silly me, that was indeed the problem, thank you. In retrospect, that wasn’t worth creating a thread here, but at least if someone else has this issue, they won’t have to do it.