Kernel crash when GPU Debug Info is disabled in Visual Studio

Hi!

I’ve been having a problem with my main kernel that started when I tried running it in Release mode in Visual Studio. Some basic info:

  • Visual Studio 2017 v15.2
  • CUDA Toolkit 9.0
  • Nsight 5.6
  • Driver: 388.71
  • GTX 850M

Here is the code that launches the kernel and the kernel itself:

//includes...
static surface<void, cudaSurfaceType2D> cuda_surface;
static const int NUM_SAMPLES = 10;
static const int MAX_BOUNCES = 1;

__global__ void Integrate(World* world, Camera* camera, curandState* curand_states);

void PathTrace(SDL_Window* window, World* world, Camera* camera, curandState* curand_states)
{
    const unsigned int block_x = 32;
    const unsigned int block_y = 32;
    const dim3 block(block_x, block_y);
    const unsigned int grid_x = std::ceil((float)DISPLAY_WIDTH / block_x); //1920
    const unsigned int grid_y = std::ceil((float)DISPLAY_HEIGHT / block_y); //1080
    const dim3 grid(grid_x, grid_y);
    cudaBindSurfaceToArray(cuda_surface, GLCUDAGetCudaArray());
    Integrate <<<grid, block >>>(world, camera, curand_states);
    cudaDeviceSynchronize();
    DisplayUpdate(window, GLCUDAGetTexture());
}

__global__ void Integrate(World* world, Camera* camera, curandState* curand_states)
{
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (y >= camera->lens_height || x >= camera->lens_width) { return; }
    const int idx = y *  camera->lens_width + x;
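    //Flip y vertically for the surface write: texture_y == camera->lens_height - 1 - y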
    const int texture_y = (y + 1 - camera->lens_height) * (-1);

    Ray ray;
    camera->GenerateRay(x, y, &ray);
    Spectrum L(0.0f);
    SurfaceInteraction isect_first;
    bool found_isect_first = world->Intersect(isect_first, ray);

    int num_samples = 0;
    if (found_isect_first)
    {
        for (int samples = 0; samples < NUM_SAMPLES; samples++)
        {
            Spectrum beta(1.0f); //Remember: beta=path throughput weight p.876
            float u_light[2];
            SampleGet2DArray(&curand_states[idx], u_light);
            glm::vec3 wi;
            float light_pdf = 0.0f;
            Spectrum Li = world->point_light->SampleLi(isect_first, u_light, &wi, &light_pdf); //Irradiance emitted by the light
            Spectrum f = isect_first.mesh->lam_ref->f(isect_first.wo, wi) * glm::abs(glm::dot(wi, isect_first.normal)); //BRDF * cos(theta)
            Ray vis_ray(isect_first.point, world->point_light->position - isect_first.point);
            bool light_visible = !world->Intersect(vis_ray, glm::length(world->point_light->position - isect_first.point));
            if (!light_visible)
            {
                Li = Spectrum(0.0f);
            }
            Spectrum Ld = f * Li / light_pdf;
            L += beta * Ld;

            num_samples++;
        }
    }
    else
    {
        //Early exit -> the first ray didn't intersect the scene, so none ever will
        num_samples = 1;
    }

    L /= num_samples;
    L = glm::clamp(L, 0.0f, 1.0f); //Make sure that we stay in the range [0, 255] because of uchar
    /*L.x = curand_uniform(&curand_states[idx]);
    L.y = curand_uniform(&curand_states[idx]);
    L.z = curand_uniform(&curand_states[idx]);*/
    surf2Dwrite(make_uchar4(L.x * 255, L.y * 255, L.z * 255, 255), cuda_surface, x * sizeof(uchar4), texture_y);
}

So, the issue occurs when L’s members are “read” individually and used for some calculation, like the surf2Dwrite call at the end of the kernel. L is basically the amount of light for the current pixel, and each thread has its own instance of it. However, if L’s members are first given new values (say, using the commented-out curand_uniform lines just above surf2Dwrite), reading L’s members doesn’t crash the kernel, and cuda_surface receives correct data. E.g. this also works:

//Generate completely red image
L.x = 1.0f;
L.y = 0.0f;
L.z = 0.0f;

Something simpler than the surf2Dwrite call also fails:

if (x == 0 && y == 0) //Just do this for the first thread - just for testing
{
    printf("%f\n", L.x); //Fails
}

Now, this is in Release mode in Visual Studio, running with /O2 optimizations and no debug info generated for either the CPU or the GPU. When running in Debug mode, the above code doesn’t crash and produces a correct image. Turning off optimizations in Release mode didn’t resolve the issue, so it doesn’t appear to be linked to the optimized code that is produced. What “fixes” the issue is turning on “Generate GPU Debug Info” for Release. The code then runs at the same pace as the Debug version, but it works.

I’ve now spent many hours searching around for any relevant information that could aid me, but have come up empty-handed as far as I can tell…

Any ideas and input would be great! I’ve started getting more comfortable with CUDA now, but I’m still no pro, so I might be missing something completely obvious.

I have no idea what “crash” means.

You don’t appear to be doing proper CUDA error checking.
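Something along these lines is the usual pattern (just a sketch; the macro name is arbitrary): wrap every runtime API call, and after a kernel launch check both cudaGetLastError() and the result of cudaDeviceSynchronize().

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK_CUDA(call)                                                     \
    do {                                                                     \
        cudaError_t err_ = (call);                                           \
        if (err_ != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",                  \
                    cudaGetErrorString(err_), __FILE__, __LINE__);           \
            exit(EXIT_FAILURE);                                              \
        }                                                                    \
    } while (0)

//Usage after a launch:
//Integrate <<<grid, block>>>(world, camera, curand_states);
//CHECK_CUDA(cudaGetLastError());      //catches launch configuration errors
//CHECK_CUDA(cudaDeviceSynchronize()); //catches errors that occur during execution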

On windows I would be sure to rule out the possibility of a WDDM timeout.
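A quick way to see whether the display watchdog applies to the GPU you’re running on (sketch, queries device 0):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) { printf("%s\n", cudaGetErrorString(err)); return 1; }
    //1 means the OS watchdog (WDDM/TDR on Windows) can kill long-running kernels on this device
    printf("kernelExecTimeoutEnabled = %d\n", prop.kernelExecTimeoutEnabled);
    return 0;
}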

To debug kernel execution issues, I would start with cuda-memcheck.

For clarification and elimination purposes, I’ll first answer your points before describing what I’ve found with error checking:

When I have the code as in the long example, the kernel finishes in just 15-16 ms and the image it produces is completely black. That on its own is an indication that it crashes, since it should take considerably longer to run; moreover, if I put a printf statement somewhere before the surf2Dwrite call, nothing is printed. If I instead replace the last lines with the ones in the 2nd code block I added above, the kernel finishes in 60-70 ms, and if a printf statement is present, the correct data is printed.

As for WDDM, I’ve disabled that completely through the Nsight options. When running with Debug Info, the kernel takes 5000 ms and produces the correct results, so I don’t think the issue is related to that.

I just tried running with the memory checker through VS with Nsight, and no breakpoints were triggered.


Now, I tried putting in the following after the call to cudaDeviceSynchronize() in PathTrace():

cudaError_t err = cudaPeekAtLastError();
printf("%s\n", cudaGetErrorString(err));

This printed: “too many resources requested for launch”. So I clearly have some resource issue that running with Debug Info avoids.
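As a sanity check, the compiled resource usage of a kernel can also be queried at runtime with cudaFuncGetAttributes (quick sketch; the helper name below is made up, and it assumes this sits in the same .cu file where Integrate is declared):

void PrintIntegrateResources()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, Integrate);
    if (err != cudaSuccess)
    {
        printf("cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return;
    }
    printf("registers per thread  : %d\n", attr.numRegs);
    printf("local (stack) bytes   : %zu\n", attr.localSizeBytes);
    printf("max threads per block : %d\n", attr.maxThreadsPerBlock);
}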

  • Compiling with -Xptxas="-v" shows that the kernel uses 38 registers per thread, so that shouldn't be a problem.
  • ptxas info: 77696 bytes gmem, 72 bytes cmem[3]
    ptxas info: Used 38 registers, 992 bytes cumulative stack size, 376 bytes cmem[0], 28 bytes cmem[2], 1 surfaces
    
  • There also don't seem to be any other violations that I can see from ptxas
  • I'm running with a block size of 32x32x1=1024 which is at the limit
  • Having 38 registers per thread yields 38*1024=38912 registers per block which is below the limit of 64K for compute 5.0 cards
  • I'm running with a grid size of 60x34x1=2040 which is below the limit
  • The surface bound is 1920x1080 which is below the limit of 65536x32768
  • Upon further error checking, I also get the same error from one of my earlier kernels, which I asked about here: https://devtalk.nvidia.com/default/topic/1028057/cuda-programming-and-performance/curand_init-sequence-number-problem/post/5229247/?offset=9#5229360. ptxas provides the following for said kernel:

    ptxas info: Used 63 registers, 6624 bytes cumulative stack size, 340 bytes cmem[0]
    
  • 63 registers per thread with a 1024-thread block gives 63*1024=64512 registers, and once the hardware's register allocation granularity rounds that up, it runs right into the 64K (65536) register-per-block limit, which matches the "too many resources requested for launch" error
  • After a bit more testing with “HelloCUDA” kernels, I’ve discovered that cudaPeekAtLastError() doesn’t clear the error state, i.e. it keeps reporting the last error that actually occurred even after later calls succeed (small demo at the end of this post). This means there might not be an issue with the kernel I thought was causing the problem… it might be the curand initializer that is the problem when Debug Info isn’t turned on.

Indeed, when commenting out the part of the Integrate kernel that uses the initialized curand_states, the kernel does not crash anymore. Furthermore, when the curand_states are no longer initialized at all, no error is detected after launching the Integrate kernel.

I suppose what remains now is to run the kernel that’s causing the problem multiple times to fix the overuse of registers. I think I’ll manage to get that working, and if not… I’ll report back :)

Thanks for making me ask further questions, txbob; “Not all heroes wear capes”

TLDR; check your errors :)
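P.S. Here’s the small demo (a toy, nothing to do with my actual project) showing the peek/get behaviour mentioned above:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void Dummy() {}

int main()
{
    Dummy<<<1, 2048>>>(); //deliberately invalid: 2048 exceeds the max threads per block
    printf("peek #1: %s\n", cudaGetErrorString(cudaPeekAtLastError())); //launch configuration error
    printf("peek #2: %s\n", cudaGetErrorString(cudaPeekAtLastError())); //still the same error
    printf("get:     %s\n", cudaGetErrorString(cudaGetLastError()));    //same error, but this call clears it
    printf("peek #3: %s\n", cudaGetErrorString(cudaPeekAtLastError())); //no error
    return 0;
}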

Why? Maybe the GPU kernel did not crash, but never ran. “crash” is not synonymous with “any kind of unexpected behavior in code”.

are you compiling with relocatable device code? you might also want to compare the ptxas -v output from the (passing) debug case to the (failing) release case

Well, this is something I’ve concluded from all the tests I’ve run to try and figure out where and what might cause the issue. This includes printing something and then stopping the kernel early, before the point I now know crashed it. What I expected was indeed printed, so to me that says some code executed later in the kernel causes a crash. I see how “crash” is a bit ambiguous, but I wasn’t able to describe it in a better way at the time, as I was really confused as to what was going on.

Neither Debug nor Release is compiling with -rdc=true as far as I can tell.

The ptxas output from Release and Debug is the same. This makes sense, as the curand_state initialization failed in both cases.

Do you have any idea why compiling without GPU Debug Info causes an issue when the kernel that is supposed to initialize curandState* curand_states fails, but compiling with GPU Debug Info works fine? Is there some default initialization that occurs under the hood when Debug Info is enabled?
Here’s the code:

__global__ void CurandSetup(curandState* states, const unsigned long seed, const int width, const int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) { return; }
    int idx = y * width + x;
    //This is the line that causes the kernel to use 63 registers.
    //When it's commented out, the kernel only uses 9 registers and no error is detected when it is run
    //The curandStates are of course then not initialized and cannot be used
    curand_init(seed + idx, 0, 0, &states[idx]);
}

curandState* CurandInit()
{
    curandState* states;
    cudaMallocManaged(&states, DISPLAY_WIDTH * DISPLAY_HEIGHT * sizeof(curandState));

    const int block_x = 32;
    const int block_y = 32;
    const dim3 block(block_x, block_y);
    const int grid_x = std::ceil((float)DISPLAY_WIDTH / block_x);
    const int grid_y = std::ceil((float)DISPLAY_HEIGHT / block_y);
    const dim3 grid(grid_x, grid_y);
    CurandSetup <<<grid, block>>> (states, time(NULL), DISPLAY_WIDTH, DISPLAY_HEIGHT);
    cudaDeviceSynchronize();

    return states;
}
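One workaround I might try first (untested sketch; this swaps my current 32x32 block for a 16x16 one instead of splitting the work across multiple launches, and the function name is just for illustration) is to launch CurandSetup with fewer threads per block, so 63 registers per thread stays well under the per-block register limit, and to check errors on every call:

curandState* CurandInitSmallBlock()
{
    curandState* states = nullptr;
    cudaError_t err = cudaMallocManaged(&states, DISPLAY_WIDTH * DISPLAY_HEIGHT * sizeof(curandState));
    if (err != cudaSuccess) { printf("cudaMallocManaged: %s\n", cudaGetErrorString(err)); return nullptr; }

    //16x16 = 256 threads per block -> 63 * 256 = 16128 registers per block, far below the 64K limit
    const int block_x = 16;
    const int block_y = 16;
    const dim3 block(block_x, block_y);
    const dim3 grid((DISPLAY_WIDTH + block_x - 1) / block_x,
                    (DISPLAY_HEIGHT + block_y - 1) / block_y);
    CurandSetup <<<grid, block>>> (states, time(NULL), DISPLAY_WIDTH, DISPLAY_HEIGHT);

    err = cudaGetLastError();      //catches "too many resources requested for launch"
    if (err != cudaSuccess) { printf("CurandSetup launch: %s\n", cudaGetErrorString(err)); }
    err = cudaDeviceSynchronize(); //catches errors during kernel execution
    if (err != cudaSuccess) { printf("CurandSetup exec: %s\n", cudaGetErrorString(err)); }

    return states;
}

Alternatively, marking CurandSetup with __launch_bounds__(1024) should make ptxas cap the register usage so a 1024-thread launch fits, at the cost of some register spilling.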