Memcheck bugs (Windows)

mathijs727 · March 1, 2017, 8:45pm

Hi there,

I’ve been using the memcheck tool while developing a ray tracer from scratch.
This early on, Im already getting stuck because I keep getting random errors from the memcheck tool.
Im trying to ray trace a hardcoded (in kernel code) sphere using a moveable camera…

__global__ void cudaRender(
	float4* screen,
	int pitch,
	const Camera::CameraData* camera,
	MyStruct* list)
{
	// Calculate the global x and y coordinates of this thread + its global (1 dimensional) id.
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	int gid = y * pitch + x;

	// Generate primary ray
	glm::vec2 pixelPosition(x, y);
	Ray ray = Camera::generateRay(*camera, pixelPosition);

	// Setup hardcoded sphere for the demo
	Sphere demoSphere;
	demoSphere.position = glm::vec3(0, 0, -5);
	demoSphere.radius2 = 4.0f;

	// Intersect with the sphere, if we hit we set the output colour to  green
	glm::vec3 outCol = glm::vec3(0);
	if (demoSphere.intersect(ray))
		outCol = glm::vec3(0, 1, 0);

	// Write the output color to the screen buffer
	screen[gid].x = outCol.x;
	screen[gid].y = outCol.y;
	screen[gid].z = outCol.z;
	screen[gid].w = 1.0f;
}

// A GPU function that generates a ray for a given pixel
__device__ Ray Camera::generateRay(const CameraData& data, glm::vec2 pixelPosition)
{
	// The camera's u, v vectors span the virtual screen. Their length is depends
	//  on the aspect ratio and Field of View.
	// Calculate the point on the virtual screen through which the ray passes
	// See slides 17-20 of:
	// http://www.cs.uu.nl/docs/vakken/magr/2016-2017/slides/lecture%2001%20-%20intro%20&%20whitted.pdf
	glm::vec2 uv = (pixelPosition / data.screenSize) * 2.0f - 1.0f;
	glm::vec3 p = data.screenCenter + uv.x * data.u + uv.y * data.v;

	Ray primaryRay;
	primaryRay.origin = data.cameraPosition;
	primaryRay.direction = glm::normalize(data.cameraPosition - p);
	primaryRay.t = 10e+10f;// std::numeric_limits is not supported on device code

	return primaryRay;
}

__device__ bool Sphere::intersect(Ray& ray)
{
	// Efficient ray/sphere intersection code by Jacco Bikker.
	// Presented during the first lecture (slide 21) of Advanced Graphics course (2016-2017)
	//  at the Utrecht University):
	// http://www.cs.uu.nl/docs/vakken/magr/2016-2017/index.html
	// http://www.cs.uu.nl/docs/vakken/magr/2016-2017/slides/lecture%2001%20-%20intro%20&%20whitted.pdf
	glm::vec3 center = position - ray.origin;// Vector from ray origin to sphere center
	float t = glm::dot(center, ray.direction);// Projection of center vector onto the ray direction
	glm::vec3 q = center - t * ray.direction;// Component of center vector orthogonal to the ray direction
	float p2 = glm::dot(q, q);// Length of q square
	if (p2 > radius2)
		return false;
	t -= sqrtf(radius2 - p2);
	if ((t < ray.t) && (t > 0))
	{
		ray.t = t;
		return true;
	}
	return false;
}

struct CameraData
{
	glm::vec3 cameraPosition;// "eye" vector (camera's position in world space)
	glm::vec3 screenCenter;// Center of the screen plane in world space
	glm::vec3 u, v;// Vectors that define the virtual screen plane in world space
	glm::vec2 screenSize;// Screen size in pixels
};

The program runs fine without memcheck enabled (no freezes/system crashes).
But when I enable memcheck it keeps giving me errors in random locations in the math (glm vector math) code (both intersection as well as ray generation).
These errors are not thrown at the first invocation of the kernel but sometimes appear only after a couple of seconds.
The camera data is read only (using const pointers) and is only changed in between kernel invocations using cudaMemcpy.
Im almost a 100% sure this is a bug in the memcheck tool as some of the messages really make no sense at all:

Some of the error messages are:

Memory Checker detected 64 access violations.
error = misaligned load (global memory)
gridid = 1
blockIdx = {20,124,0}
threadIdx = {0,0,0}
address = 0x1b8e1fffc60
accessSize = 4

Memory Checker detected 31 access violations.
error = access violation on store (shared memory)
gridid = 31
blockIdx = {9,160,0}
threadIdx = {1,0,0}
address = 0x1dc00fffc40
accessSize = 4

Both of which make no sense to me at all because 0x1b8e1fffc60 is aligned to 32 bytes (and a glm::vec3 are separate floats anyways).
Second of all, Im not using shared memory anywhere in my code so how can I ever get an access violation in shared memory?
The memcheck tool breaks in the glm/helper_math (tried both) operator overloads.
In the case of the second error, according to VS watch window, all variables in the function are locals (and none are in shared memory).

Even when I changed my code to only upload the camera data once at the start, I got the same errors.
This is really weird since its basically doing the same calculations every frame, which means it should either crash at the first invocation, or not at all.
For math Im using glm although I did switch to float2/3/4 with helper_math.h (from the CUDA samples) but this didnt help either.

I reinstalled the driver, cuda toolkit and nsight visual studio plugin.
Driver: 378.66
Nsight: 5.2.0.16268
CUDA: 8.0.61
OS: Windows 10 Pro
GPU: GTX1050

This problem also occurs on my laptop (GTX1050 as well).

harryz · March 3, 2017, 3:54am

Hi,

Could you use the cuda-memcheck not the memcheck in nsight to check your app if the same issue still exists?

Best Regards

mathijs727 · March 3, 2017, 8:13am

Dear Harry,

Thanks for the reply.
I tested the command line tool and it works as expected (no errors).
To make sure its working correctly, I intentionally added an out-of-bounds indexing and that error got reported like is expected.

So it seems this bug is only present in Visual Studio with Nsight, but not in the command line tool.
Just for reference, I used this code (wrote it for the bug report) to test the problem, which is as short as I could get it (helper_math.h is from the CUDA SDK samples):

#include "cuda_runtime.h"
#include "cuda_runtime_api.h"
#include "device_launch_parameters.h"
#include "helper_math.h"
#include <iostream>

struct Ray
{
	float3 origin;
	float3 direction;
	float t;// Distance to closest hit
};

struct Sphere
{
	float3 position;// Position in world space
	float radius2;// Radius square
};

__device__ bool intersectSphere(Ray& ray, const Sphere& sphere)
{
	// Efficient ray/sphere intersection code by Jacco Bikker.
	// Presented during the first lecture (slide 21) of Advanced Graphics course (2016-2017)
	//  at the Utrecht University):
	// http://www.cs.uu.nl/docs/vakken/magr/2016-2017/index.html
	// http://www.cs.uu.nl/docs/vakken/magr/2016-2017/slides/lecture%2001%20-%20intro%20&%20whitted.pdf
	float3 center = sphere.position - ray.origin;// Vector from ray origin to sphere center
	float t = dot(center, ray.direction);// Projection of center vector onto the ray direction
	float3 q = center - t * ray.direction;// Component of center vector orthogonal to the ray direction
	float p2 = dot(q, q);// Length of q square
	if (p2 > sphere.radius2)
		return false;
	t -= sqrtf(sphere.radius2 - p2);
	if ((t < ray.t) && (t > 0))
	{
		ray.t = t;
		return true;
	}
	return false;
}

struct CameraData
{
	float3 cameraPosition;// "eye" vector (camera's position in world space)
	float3 screenCenter;// Center of the screen plane in world space
	float3 u, v;// Vectors that define the virtual screen plane in world space
	float2 screenSize;// Screen size in pixels
};

__global__ void kernel(
	float4* output,
	const CameraData* input)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x;
	int y = blockIdx.y * blockDim.y + threadIdx.y;
	int gid = y * (gridDim.x * blockDim.x) + x;

	// Construct a ray using all attributes of the input data. These values make no sense but this is just for debugging purposes.
	Ray ray;
	ray.origin = input->u;
	ray.direction = input->v;
	ray.origin -= input->cameraPosition;
	ray.direction += input->screenCenter;
	ray.t = 123 - input->screenSize.x + input->screenSize.y;

	Sphere sphere;
	sphere.radius2 = 1.0f;
	sphere.position = make_float3(0, 0, 3);

	float4 outCol = make_float4(0);
	if (intersectSphere(ray, sphere))
		outCol = make_float4(1);

	output[gid] = outCol;
}

int main()
{
	int width = 1280;
	int height = 720;

	CameraData* inputDevicePtr;
	CameraData inputHostStruct;
	inputHostStruct.cameraPosition = make_float3(0);
	inputHostStruct.screenCenter = make_float3(0, 0, 1);
	inputHostStruct.screenSize = make_float2(width, height);
	inputHostStruct.u = make_float3(1, 0, 0);
	inputHostStruct.v = make_float3(0, 1, 0);
	cudaMalloc(&inputDevicePtr, sizeof(CameraData));
	cudaMemcpy(inputDevicePtr, &inputHostStruct, sizeof(CameraData), cudaMemcpyHostToDevice);

	float4* outputDevicePtr;
	cudaMalloc(&outputDevicePtr, width * height * sizeof(float4));

	float4* outputHostPtr = new float4[width * height];

	int iterations = 10;
	for (int i = 0; i < iterations; i++)
	{
		dim3 blocksPerGrid(width / 32, height / 2);
		dim3 threadsPerBlock(32, 2);
		kernel<<<blocksPerGrid, threadsPerBlock>>>(outputDevicePtr, inputDevicePtr);

		// Do something with the result...
		cudaMemcpy(outputHostPtr, outputDevicePtr, sizeof(float4) * width * height, cudaMemcpyDeviceToHost);

		std::cout << "Succesfully performed a render iteration" << std::endl;
	}

	cudaFree(outputDevicePtr);
	cudaFree(inputDevicePtr);
	delete[] outputHostPtr;
}

Kind regards,
Mathijs Molenaar

harryz · March 6, 2017, 2:50am

Internal bug has be submitted.