Hi!
I’ve been having a problem with my main kernel that started when I tried running it in Release mode in Visual Studio. Some basic info:
- Visual Studio 2017 v15.2
- CUDA Toolkit 9.0
- Nsight 5.6
- Driver: 388.71
- GTX 850M
Here is the code that launches the kernel and the kernel itself:
//includes...
static surface<void, cudaSurfaceType2D> cuda_surface;
static const int NUM_SAMPLES = 10;
static const int MAX_BOUNCES = 1;
__global__ void Integrate(World* world, Camera* camera, curandState* curand_states);
void PathTrace(SDL_Window* window, World* world, Camera* camera, curandState* curand_states)
{
    const unsigned int block_x = 32;
    const unsigned int block_y = 32;
    const dim3 block(block_x, block_y);
    const unsigned int grid_x = std::ceil((float)DISPLAY_WIDTH / block_x); //1920
    const unsigned int grid_y = std::ceil((float)DISPLAY_HEIGHT / block_y); //1080
    const dim3 grid(grid_x, grid_y);

    cudaBindSurfaceToArray(cuda_surface, GLCUDAGetCudaArray());

    Integrate<<<grid, block>>>(world, camera, curand_states);
    cudaDeviceSynchronize();

    DisplayUpdate(window, GLCUDAGetTexture());
}
__global__ void Integrate(World* world, Camera* camera, curandState* curand_states)
{
    const int y = blockIdx.y * blockDim.y + threadIdx.y;
    const int x = blockIdx.x * blockDim.x + threadIdx.x;
    if (y >= camera->lens_height || x >= camera->lens_width) { return; }
    const int idx = y * camera->lens_width + x;
    const int texture_y = (y + 1 - camera->lens_height) * (-1); //Flip y: texture_y = lens_height - 1 - y

    Ray ray;
    camera->GenerateRay(x, y, &ray);

    Spectrum L(0.0f);
    SurfaceInteraction isect_first;
    bool found_isect_first = world->Intersect(isect_first, ray);
    int num_samples = 0;

    if (found_isect_first)
    {
        for (int samples = 0; samples < NUM_SAMPLES; samples++)
        {
            Spectrum beta(1.0f); //Remember: beta=path throughput weight p.876

            float u_light[2];
            SampleGet2DArray(&curand_states[idx], u_light);
            glm::vec3 wi;
            float light_pdf = 0.0f;
            Spectrum Li = world->point_light->SampleLi(isect_first, u_light, &wi, &light_pdf); //Irradiance emitted by light
            Spectrum f = isect_first.mesh->lam_ref->f(isect_first.wo, wi) * glm::abs(glm::dot(wi, isect_first.normal)); //BRDF * cos(theta)

            //Shadow ray: check whether the light is visible from the intersection point
            Ray vis_ray(isect_first.point, world->point_light->position - isect_first.point);
            bool light_visible = !world->Intersect(vis_ray, glm::length(world->point_light->position - isect_first.point));
            if (!light_visible)
            {
                Li = Spectrum(0.0f);
            }

            Spectrum Ld = f * Li / light_pdf;
            L += beta * Ld;
            num_samples++;
        }
    }
    else
    {
        //Early exit -> first ray didn't intersect with the scene, i.e. none ever will
        num_samples = 1;
    }

    L /= num_samples;
    L = glm::clamp(L, 0.0f, 1.0f); //Make sure that we stay in the range [0, 255] because of uchar
    /*L.x = curand_uniform(&curand_states[idx]);
    L.y = curand_uniform(&curand_states[idx]);
    L.z = curand_uniform(&curand_states[idx]);*/
    surf2Dwrite(make_uchar4(L.x * 255, L.y * 255, L.z * 255, 255), cuda_surface, x * sizeof(uchar4), texture_y);
}
So, the issue occurs when L's members are read individually and used in some calculation, like in the surf2Dwrite call at the end of the kernel. L is basically the amount of light for the current pixel, and each thread has its own instance of it. However, if L's members are first assigned new values (say, using the commented-out curand_uniform lines just above the surf2Dwrite), reading them doesn't crash the kernel, and cuda_surface receives correct data. E.g. this also works:
//Generate completely red image
L.x = 1.0f;
L.y = 0.0f;
L.z = 0.0f;
Something simpler than the surf2Dwrite call also fails:
if (x == 0 && y == 0) //Just do this for the first thread - just for testing
{
    printf("%f\n", L.x); //Fails
}
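For reference, here is a minimal sketch of the error checking I can wrap around the launch to get the actual error code out of the failing run (CHECK_CUDA is just a little helper macro of my own, not something from the toolkit):

//Report any CUDA error together with the file/line where it was checked
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess)                                       \
            printf("CUDA error: %s at %s:%d\n",                       \
                   cudaGetErrorString(err), __FILE__, __LINE__);      \
    } while (0)

Integrate<<<grid, block>>>(world, camera, curand_states);
CHECK_CUDA(cudaGetLastError());       //Launch/configuration errors
CHECK_CUDA(cudaDeviceSynchronize());  //Errors raised while the kernel runs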
Now, this is in Release mode in Visual Studio, running with O2 optimizations and no debug info generated for either the CPU or the GPU. When running in Debug mode, the above code doesn't crash and produces a correct image. Turning off optimizations in Release mode didn't resolve the issue, so it doesn't appear to be linked to the optimized code that is produced. What "fixes" the issue is turning on "Generate GPU Debug Info" for Release. The code then runs at the same pace as the Debug version, but it works.
I've now spent many hours searching for any relevant information that could help, but have come up empty-handed as far as I can tell…
Any ideas and input would be great! I’ve started getting more comfortable with CUDA now, but I’m still no pro, so I might be missing something completely obvious.