Why is it that when I execute a call to kernel 1, cudaGetLastError reports no error, but when I make a second identical call to the same kernel immediately afterwards I get a segmentation fault? I also get similar segmentation faults with other kernels.
Why is it that in emulation mode the kernel produces the correct initialization (non-zero values) when read from inside the kernel, but when I memcpy that data back and print it out in host code it is all zeros?
Why is it that when I try to memcpy some data back from the first kernel I get an invalid device pointer error? (This may well be the cause of the above problem.)
Why is it that if, immediately after the first call to kernel 1, I cudaFree the memory on the device and immediately reallocate it, I do not get the segmentation fault?
I thought you could allocate memory on the device via cudaMalloc only once and then refer to it in as many subsequent kernel calls as required. It now seems I need to deallocate and then reallocate.
This is why I asked about device variables with multiple GPUs.
Are you calling cudaThreadSynchronize before cudaGetLastError() after the first kernel call?
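For reference, the usual pattern looks something like this (a minimal sketch; the kernel, its argument, and the launch configuration are placeholders, not taken from your code):

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, only here to make the sketch self-contained.
__global__ void myKernel(float *d_data)
{
    d_data[threadIdx.x] = 1.0f;
}

void launchAndCheck(float *d_data)
{
    myKernel<<<1, 32>>>(d_data);

    // Kernel launches are asynchronous: block until the kernel has
    // actually finished, then query the sticky error state.
    cudaThreadSynchronize();
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));
}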
I’m guessing your first kernel call is writing beyond allocated memory, so all bets are off regarding the reproducibility or sanity of any calls made after it. Valgrind can be an invaluable tool in finding these issues. Compile in emulation mode with debug symbols and run through valgrind to help identify the culprit. Also check that you are not passing host pointers to the device: that will work in emulation but will fail with unspecified launch failures (ULFs) on the device.
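To illustrate the host-pointer pitfall, a hedged sketch (initKernel and the array names are made up for the example):

#include <cstdlib>
#include <cuda_runtime.h>

__global__ void initKernel(int *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = 1;
}

void example(int n)
{
    int *h_data = (int *)malloc(n * sizeof(int));

    // WRONG: h_data is a host pointer. In device emulation this "works"
    // because everything runs on the CPU, but on real hardware the kernel
    // dereferences an invalid address and typically dies with a ULF.
    // initKernel<<<(n + 255) / 256, 256>>>(h_data, n);

    // RIGHT: allocate device memory and pass the device pointer instead.
    int *d_data = 0;
    cudaMalloc((void **)&d_data, n * sizeof(int));
    initKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    cudaMemcpy(h_data, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    free(h_data);
}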
Beyond these general debugging principles, what do you expect us to do to help given the information you’ve presented? To really find the problem we will need a complete, minimal sample code that demonstrates it, so we can tell where you are going wrong.
Yes, I am calling cudaThreadSynchronize() before cudaGetLastError().
I have attached some code below: a very simple initialization on the GPU.
What I would like to know is:
Why is it that when I allocate the array d_type using cudaMalloc I get a segmentation fault on the second call to the kernel, but when I declare the same array statically as a __device__ variable I do not get a segmentation fault?
Why is it that in both cases above I always get an invalid device pointer error when I call cudaMemcpy to copy device data back into the host array h_type?
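For reference, the two allocation styles are normally copied back with different runtime calls; a minimal sketch follows, assuming an int array of length N purely for illustration (this may or may not be related to the error in the attached code):

#include <cuda_runtime.h>

#define N 256

// The statically declared case: a __device__ array at file scope.
__device__ int d_type_static[N];

void copyBackExamples(int *h_type)
{
    // Case 1: memory obtained with cudaMalloc is copied back with a
    // plain cudaMemcpy on the returned device pointer.
    int *d_type = 0;
    cudaMalloc((void **)&d_type, N * sizeof(int));
    // ... launch the initialization kernel on d_type here ...
    cudaMemcpy(h_type, d_type, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_type);

    // Case 2: a __device__ variable is addressed by symbol, so it is read
    // back with cudaMemcpyFromSymbol rather than a plain cudaMemcpy.
    cudaMemcpyFromSymbol(h_type, d_type_static, N * sizeof(int), 0,
                         cudaMemcpyDeviceToHost);
}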
I am compiling with the command
nvcc -o Obstacleemu Obstacle.cu -L/home/chrism/CUDA2/lib -lcutil -L/opt/cuda/lib -lcudart -I/home/chrism/CUDA2/common/inc -deviceemu -lcuda -lglut -lpthread (I haven’t bothered to delete some of the pthread and particle setup stuff from the code)
Any suggestions will be welcome because this is driving me crazy!
You are calling cudaThreadExit() after this first kernel call. This shuts down the context and everything you had in it on the GPU, including freeing any allocated memory. So you are getting expected behavior.
You never need to call cudaThreadExit() yourself. The runtime will automatically call it when the current host thread terminates.
Dynamically allocated memory is freed on cudaThreadExit(). Attempting to use a freed pointer results in undefined behavior.
The __device__ variables might end up residing in the same location when the context is shut down and restarted, but it is unsafe to depend on this behavior. It’s certainly unsafe to depend on them retaining their values.
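Putting it together, something along these lines should be safe: allocate once, launch as many times as you like, and skip the explicit cudaThreadExit(). A minimal sketch, with kernel1 and the sizes as placeholders:

#include <cuda_runtime.h>

__global__ void kernel1(int *d_type, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_type[i] = i;
}

int main(void)
{
    const int n = 1024;
    int h_type[n];

    int *d_type = 0;
    cudaMalloc((void **)&d_type, n * sizeof(int));   // allocate once

    kernel1<<<(n + 255) / 256, 256>>>(d_type, n);    // first call
    kernel1<<<(n + 255) / 256, 256>>>(d_type, n);    // second call: the same
                                                     // device pointer is still valid
    cudaThreadSynchronize();

    cudaMemcpy(h_type, d_type, n * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_type);
    // No explicit cudaThreadExit(): the runtime tears the context down
    // when the host thread exits.
    return 0;
}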