Using unified memory causes system crash

Hello

I am a complete newbie at CUDA, and I am trying to create my first CUDA program using Unified Memory.

However, with the code below, I experience a system crash (or lockup) when I try to read or write previously allocated unified memory from the host.
I added error checks on all CUDA calls and none of them report an error, so I assume the memory allocations with the cudaMallocManaged() calls are OK. But when I try to initialize the arrays, the program freezes the PC.

I am using VS2015 Community Edition, with a fresh CUDA 9.1 install. Display driver is fully updated.
GPU is a GTX 960M, compute capability 5.0. I also check for this in the code.

The code is below; uncommenting the for-loop causes a crash:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i] + threadIdx.x;
    printf("%d\n", c[i]);
}

int main()
{
	
	cudaError_t cudaStatus = cudaSuccess;
	
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);

	printf("%s\n", prop.name);
	printf("prop.major = %d\n", prop.major);
	printf("prop.minor = %d\n", prop.minor);
	printf("prop.managedMemory = %d\n", prop.managedMemory);

	int *a, *b, *c;
	cudaStatus = cudaMallocManaged(&a, 10 * sizeof(int));
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMallocManaged failed!");
		return 1;
	}
	
	cudaStatus = cudaMallocManaged(&b, 10 * sizeof(int));
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMallocManaged failed!");
		return 1;
	}

	cudaStatus = cudaMallocManaged(&c, 10 * sizeof(int));
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMallocManaged failed!");
		return 1;
	}
	
	/*
	for (int i = 0; i < 10; i++) {
		a[i] = i;
		b[i] = 10 * i;
		c[i] = -1;
	}
	*/
	
	addKernel<<<1, 10>>>(c, a, b);
	cudaStatus = cudaGetLastError();
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "addKernel launch failed!");
		return 1;
	}

	cudaStatus = cudaFree(a);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaFree failed!");
		return 1;
	}

	cudaStatus = cudaFree(b);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaFree failed!");
		return 1;
	}
	
	cudaStatus = cudaFree(c);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaFree failed!");
		return 1;
	}

    return 0;
}

Am I missing something?

Add the following after the kernel call:

#define CUDACHECK(cmd) do { \
    cudaError_t e = cmd;    \
    if( e != cudaSuccess ) { \
    printf("Failed: Cuda error %s:%d '%s'\n", \
        __FILE__,__LINE__,cudaGetErrorString(e)); \
    exit(EXIT_FAILURE);     \
  } \
} while(0)

CUDACHECK(cudaDeviceSynchronize());
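With both checks in place, the tail of your main() would look roughly like this (a sketch, using the CUDACHECK macro above; the read-back loop is illustrative):

```cuda
addKernel<<<1, 10>>>(c, a, b);
CUDACHECK(cudaGetLastError());       // catches launch-configuration errors
CUDACHECK(cudaDeviceSynchronize());  // waits for the kernel and surfaces execution errors

// Only after the synchronize is it safe, on Windows / pre-Pascal GPUs,
// for the host to touch the managed arrays again:
for (int i = 0; i < 10; i++)
    printf("c[%d] = %d\n", i, c[i]);
```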

Hi, thanks for the reply.

Indeed the cudaDeviceSynchronize() call was missing.
But adding this doesn’t fix my problem.

The crash arises when I try to initialize the arrays previously allocated with cudaMallocManaged().
This happens in the for-loop (line 46 in my code). The program runs without a crash when the loop is commented out, but crashes as soon as I uncomment it.
I get no errors from the allocation calls, or from the kernel call.

I tested a similar program with conventional CUDA memory allocation and memcpy calls, and that runs fine. But not this code, which uses the unified memory model.

Regards
Erik

I have the same issue. Whenever I resume my computer from sleep, if I do a write operation on unified memory from the host, the entire system freezes. I am using Windows 7 with a Xeon E5-2650 v3 CPU and a Quadro P4000 GPU, in Visual Studio 2017 with CUDA 9.2. Here is my test code:

#include <iostream>
#include <string>

#include <cuda_runtime.h>
#include <device_launch_parameters.h>

#include <cstdio>
#define CheckCudaError(err) _checkCudaError(err, __FILE__, __LINE__)

void _checkCudaError(const cudaError_t& err, const char *fileName = __FILE__, const unsigned int lineNum = __LINE__)
{
	if (err != cudaSuccess)
	{
		std::string errStr = cudaGetErrorString(err);
		std::cout << "CUDA error: " << errStr << ", in " << fileName << ", line " << lineNum << "\n";
	}
}

int main()
{
	using namespace std;

	const int VEC_SIZE = 10;
	int* pVec0;

	cout << "Allocating Unified cuda memory.\n";
	CheckCudaError(cudaMallocManaged(&pVec0, VEC_SIZE * sizeof(int)));

	CheckCudaError(cudaDeviceSynchronize());

	cout << "Writing into unified memory from host.\n";
	for (int i = 0; i < VEC_SIZE; ++i)
	{
		pVec0[i] = i; // Cause system freeze!!!!
	}

	cout << "Release unified memory.\n";
	CheckCudaError(cudaFree(pVec0));

	CheckCudaError(cudaDeviceSynchronize());

	CheckCudaError(cudaDeviceReset());

	system("pause");

    return 0;
}

Line 34 causes the system freeze.

I have the same issue - running Windows 10, VS2017, CUDA 9.2, an i7-8750H processor (with Intel UHD Graphics 630) and a GTX 1070. I have added numerous debug points, and I see the same behavior: when the first iteration of the loop is executed, my computer freezes up completely and I have no option but to power down and restart. The freeze is not instantaneous, but happens within a couple of minutes; however, the CUDA app is frozen and the window cannot be closed, nor the process killed via Task Manager.

On a few occasions, when lucky enough to catch the system before it froze, I was able to see through Task Manager that System Interrupts goes to 100%. I presume this would be true in all cases.

int N = 1 << 15;
float *x, *y;

cudaMallocManaged(&x, N * sizeof(float));
cudaMallocManaged(&y, N * sizeof(float));

for (int i = 0; i < N; i++) {
	x[i] = 1.0f;
	y[i] = 2.0f;
}

What happens when you add a:

cudaMemset(your_array, 0, N * sizeof(float)); // or sizeof() the type of your arrays
cudaDeviceSynchronize();

before your for-loops?

I have added your suggestion and have not seen any errors when those commands execute.
One thing to add is that following my original post I continued to search for answers/solutions and made one change to my program moving the assignment loop into a kernel:

__global__ void add(int n, float *x, float *y)
{
	int index = threadIdx.x;
	int stride = blockDim.x;
	printf("a%d, ", index);
	for (int i = index; i < n; i+= stride)
		y[i] = x[i] + y[i];
		
}

This change seemed to resolve this issue. I added your suggestion before I called init<<<>>>

cudaMemset(x, 0, N * sizeof(float));
cudaMemset(y, 0, N & sizeof(float));
cudaDeviceSynchronize();
init <<<1, 256 >>>(N, x, y);
cudaDeviceSynchronize();

However, whenever I access the array pointer memory from host code, I experience the system freeze. For example, after executing add<<<>>>, the next step is to validate the add (you may recognize this program from "An Even Easier Introduction to CUDA").

float maxError = 0.0f;
for (int i = 0; i < N; i++)
	maxError = fmax(maxError, fabs(y[i] - 3.0f));

When entering the for-loop, the application stops responding on the first iteration. This does not happen 100% of the time: sometimes the program completes without issue, other times it freezes. Let me know if you have any thoughts or additional suggestions.

Before you continue, please correct the part where I said sizeof(array_len): change it to sizeof(float), or sizeof() of the type you are using.
I was probably thinking of one thing and wrote another.
Make this change and try again to see if the program runs without any issues. I will also edit the post so people don’t get wrong information.

I was wondering if you meant float. In any event, I have made the change to the cudaMemset to reflect float and the result is the same: the program will run a few times before freezing up at the maxError loop. Checking nvidia-smi when the program executes successfully shows allocation of memory and proper deallocation after cudaFree(). I also have the program execute a cudaDeviceReset() to verify the memory is freed.

Can you post the complete code here, so I can compile it and see what I get?
As for the previous step: we first initialize the arrays with 0, as it is good programming practice. You already see the benefit: no errors on execution. Make sure you always do a cudaMemset/cudaDeviceSynchronize after the cudaMallocManaged call.
Then, filling the array in a kernel function is also good practice, as it prevents an unnecessary copy from host to device when you start working on it.
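A minimal init kernel of that shape might look like this (a sketch; the kernel name and fill values are illustrative):

```cuda
__global__ void kernel_init(float *a, float *b, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    // grid-strided loop: works for any n and any launch configuration
    for (; i < n; i += stride) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }
}
```

Launched as e.g. `kernel_init<<<nBlocks, nThreads>>>(a, b, N);` followed by a cudaDeviceSynchronize(), this fills the managed arrays on the device, so the pages never migrate to the host just for initialization.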

Here is the complete code. Note there are numerous ‘debug’ output points so I can see where it stops (I know they are ugly, but they came about very ad hoc as I started to drill down). And thanks in advance; I appreciate you looking at this. Note: when I run this, I will often alternate between executing the compiled build directly and running it through nvprof to cause the incident to occur. It can take several runs before it finally occurs.

#include <cuda_runtime.h>
#include <device_launch_parameters.h>

// Helper functions and utilities to work with CUDA


#include <iostream>
#include <math.h>
#include <string>

void msgout(std::string mymsg);
__global__ void init(int n, float *x, float *y) {

  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;
printf("i%d, ", index);
  for (int i = index; i < n; i += stride) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
}
__global__ void add(int n, float *x, float *y)
{
	int index = threadIdx.x;
	int stride = blockDim.x;
	printf("a%d, ", index);
	for (int i = index; i < n; i+= stride)
		y[i] = x[i] + y[i];
		
}


int main(void)
{
	int N = 1 << 20;
	float *x, *y;
	std::string mystr = "HELLO";

	cudaError_t cudaStatus;
	cudaStatus = cudaMallocManaged(&x, N * sizeof(float));
std::cout << "Cuda Status = " << cudaStatus << std::endl;
	cudaStatus = cudaMallocManaged(&y, N * sizeof(float));
std::cout << "Cuda Status = " << cudaStatus << std::endl;
std::cout << "The size of x is: " << sizeof(x) << std::endl;
std::cout << "The size of N & sizeof(x) is: " << N*sizeof(x) << std::endl;

cudaMemset(x, 0, N * sizeof(float));
cudaMemset(y, 0, N & sizeof(float));
cudaDeviceSynchronize();

	msgout("Starting array assignment");
	init <<<1, 256 >>>(N, x, y);
	cudaDeviceSynchronize();
msgout("\nCompleted array assignment\n");

	add <<<1, 256 >>> (N, x, y);
		cudaDeviceSynchronize();
msgout("\nReturned from Add!\n");


	float maxError = 0.0f;
	for (int i = 0; i < N; i++)
		maxError = fmax(maxError, fabs(y[i] - 3.0f));
	std::cout << "\nMax error: " << maxError << std::endl;
std::cout << "\nrun SMI utility and hit enter here: \n";
std::cin.get();

	cudaFree(x);
	msgout("Freed X");
	cudaDeviceSynchronize();
	cudaFree(y);
	cudaDeviceSynchronize();
	msgout("Freed Y");
	
std::cout << "run SMI utility and hit enter here: \n";
std::cin.get();

cudaDeviceReset();
std::cout << "CUDA Device RESET: \nrun SMI utility and hit enter here: \n";
std::cin.get();
	return 0;
}

void msgout(std::string mymsg)
{
std::cout << mymsg << std::endl;
return;
}

I don’t get the freeze you mention but there are a few things that I fixed even before I ran it the first time.

  • Lines 14, 15, 24 and 25: why are you calculating the thread index and the stride differently in these functions? Use the form in lines 14 and 15 to ensure you have a grid-strided loop, as it will work for whatever array size and number of blocks/threads you have. Keep your code consistent
  • Line 44: by sizeof(x), did you really mean it, as it will always be 8 bytes for a 64bit machine, or you meant the length of x, which is N, or the allocation size, which is N * sizeof(float)?
  • Line 45: as a consequence, this one doesn’t give the result you expect (it should be half of what it shows)
  • Line 48: you have a & when it should be * (I don’t even know how it doesn’t issue a warning in compilation)
  • Lines 70 and 72: you don’t need cudaDeviceSynchronize() after freeing an array. No synchronization is needed after you deallocate a resource

Change the program, especially lines 48, 70 and 72, and see if it runs without any uncommon interruption.
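Applied to the posted code, those fixes would look roughly like this (a sketch showing only the changed lines):

```cuda
// Kernel index/stride in the grid-strided form used by init():
int index = threadIdx.x + blockIdx.x * blockDim.x;
int stride = blockDim.x * gridDim.x;

// Allocation size is N * sizeof(float): not sizeof(x) (pointer size, 8
// bytes on a 64-bit machine) and not N & sizeof(float) (bitwise AND typo).
cudaMemset(x, 0, N * sizeof(float));
cudaMemset(y, 0, N * sizeof(float));
cudaDeviceSynchronize();

// No cudaDeviceSynchronize() needed after deallocating:
cudaFree(x);
cudaFree(y);
```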

Thanks - so as I mentioned, this came from https://devblogs.nvidia.com/even-easier-introduction-cuda/
the second example, add_block.cu (like the original poster, I am new to CUDA).

The change I made, via a separate NVIDIA dev blog post, was to move the assignment loop into a kernel. I do see what you are talking about with the difference in the stride calculation. However, my original problem came from the code sample with the assignment in host code. I was hoping there was something specific to the unified memory model; in fact, I had read some posts that mentioned potential issues with clock speed differences, and so I have been looking for anything to help get past this issue.

I will apply your suggestions and test again to see if any results change within my environment.

Again - thanks for your help! The program, due to numerous changes in an attempt to fix/identify the error, got a little out of control; thanks for your suggestions. I have fixed the parts you mentioned and actually restored the program to its original state, altering only your suggestions on stride and cudaMemset, a couple of diagnostic outputs, and fixes for some of the typos. I still have the error, but I thank you so much for your input, comments, and help! There must be something specific to my PC build, either hardware or software, that is causing an inconsistency somewhere. All drivers are up to date, so at least it is not a legacy conflict.

I am reposting the cleaned/restored code below just for reference - not with an expectation of you providing any further checks, but just in case you want to see the ‘original’ and in case you happen to see anything I missed. Again, thanks for your time with this!

PS - I forgot to mention one other useful data point. On very rare occasions, the computer will BSOD with VIDEO_TDR_FAILURE (nvlddmkm.sys).

#include <cuda_runtime.h>
#include <device_launch_parameters.h>

#include <iostream>
#include <math.h>
#include <string>

__global__ void add(int n, float *x, float *y)
{
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;
	for (int i = index; i < n; i+= stride)
		y[i] = x[i] + y[i];
}

int main(void)
{
	int N = 1 << 20;
	float *x, *y;

	cudaMallocManaged(&x, N * sizeof(float));
	cudaMallocManaged(&y, N * sizeof(float));

	cudaMemset(x, 0, N * sizeof(float));
	cudaMemset(y, 0, N * sizeof(float));
	cudaDeviceSynchronize();

	std::cout << "Begin populating arrays!\n";
	for (int i = 0; i < N; i++) {
		x[i] = 1.0f;  // THIS IS WHERE I EXPERIENCE A SYSTEM-FREEZE
		y[i] = 2.0f;
	}

	add<<<1, 256>>>(N, x, y);
	cudaDeviceSynchronize();
	std::cout << "Returned from Add!\n";

	float maxError = 0.0f;
	for (int i = 0; i < N; i++)
		maxError = fmax(maxError, fabs(y[i] - 3.0f));   // ANOTHER FREEZE POINT
	std::cout << "\nMax error: " << maxError << std::endl;

	cudaFree(x);
	cudaFree(y);

	cudaDeviceReset();
	return 0;
}

I don’t have a dev environment right now, but I can’t see why the code above would cause trouble.
Look for the macro __CUDA_SAFE_CALL and wrap your API calls with it; if there is any error during the call, it will show you the error and the line. This way you don’t have to write debug code throughout your main program; the macro does it for you.

Also have a look at Njuffa’s macro to catch kernel launch errors:
https://devtalk.nvidia.com/default/topic/545591/cuda-programming-and-performance/how-to-debug-kernel-throwing-an-exception-/2

So you will have two macros: one for CUDA API calls and one for kernel calls. If there is a problem, they will tell you which line. But this BSOD is suspect. You’ve reinstalled drivers and such, I think…
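A minimal sketch of the two-macro pattern (macro names are illustrative; Njuffa's version in the linked thread differs in detail):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// 1) For CUDA API calls: check the returned cudaError_t directly.
#define CUDA_SAFE_CALL(call) do {                                   \
    cudaError_t err = (call);                                       \
    if (err != cudaSuccess) {                                       \
        fprintf(stderr, "CUDA error %s:%d: '%s'\n",                 \
                __FILE__, __LINE__, cudaGetErrorString(err));       \
        exit(EXIT_FAILURE);                                         \
    }                                                               \
} while (0)

// 2) For kernel launches: a launch itself returns no status, so query
// the error state (and synchronize) immediately after the launch.
#define CHECK_LAUNCH_ERROR() do {                                   \
    CUDA_SAFE_CALL(cudaGetLastError());       /* launch errors    */\
    CUDA_SAFE_CALL(cudaDeviceSynchronize());  /* execution errors */\
} while (0)

// Usage:
//   CUDA_SAFE_CALL(cudaMallocManaged(&a, N * sizeof(float)));
//   my_kernel<<<grid, block>>>(a, N);
//   CHECK_LAUNCH_ERROR();
```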

Thanks again for all your help. I have ensured all drivers are up to date and over the weekend I have been rebuilding my laptop numerous times trying to discover where the issue may be coming from as I believe it was either a hardware fault or a driver conflict.

Currently I have a base build (with older drivers, in some cases). I have run the application successfully numerous times from the precompiled .exe with no errors so far. I have not loaded Visual Studio, CUDA, or anything else. Once I am confident that the system is stable as-is, I’ll begin to add the additional components and see where I run into trouble.

Will follow-up with my results.

I’m having the same issue. When I try to access a managed pointer directly, the system crashes (possibly a watchdog timing out, but I’m only allocating a few KB; nothing too complex). Per saulocpp’s suggestion, I replaced my loops that assign values with cudaMemset() and added a cudaDeviceSynchronize() call afterwards. This seems to fix the assigning portion of the code, but then the error-check part causes my system to crash, again when referencing the managed memory. I have tried wrapping references in cudaDeviceSynchronize() calls, to no avail.

I am on VS2017 Community with CUDA 9.2 running a GTX1070 with Windows 10.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <math.h>
#include <time.h>
#include <stdlib.h>  // used for rand()

#include "cuda_err_check.h"

//============================= CPU ===========================================
/*
This function is called by the CPU
*/
/*
 * function: add_vec
 * purpose: add two vectors on CPU
 * PARAMETERS:
 *  a - first array
 *  b - second array
 *  c - output array
 *  N - size of the array
 */
void cpu_add(float *a, float *b, float *c, int N) {
	for (unsigned int i = 0; i < N; i++) {
		c[i] = a[i] + b[i];
	}
}

//============================= GPU Kernel ====================================
/*
 * function: add_vec
 * purpose: add two vectors on GPU
 * PARAMETERS:
 *  a - first array
 *  b - second array
 *  c - output array
 */
__global__ void cuda_add(float *a, float *b, float *c, unsigned long long N) {
	// assign tid by using block id, block dimension, and thread id
	unsigned long long tid = blockIdx.x * blockDim.x + threadIdx.x;

	// stride is for big arrays, i.e. bigger than threads we have
	unsigned long long stride = blockDim.x * gridDim.x;

	// do the operations
	while (tid < N) {
		c[tid] = a[tid] + b[tid];
		tid += stride;
	}
}

int main(){
	// size of the array
	const unsigned long long N = 10;
	printf("Array Size: %llu\n", N);

	// DEVICE PROPERTIES
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0); 

        // print a couple of many properties
	printf("Max Threads per block: %d\n", prop.maxThreadsPerBlock);

	printf("Max Grid Size: %d x %d x %d\n", prop.maxGridSize[0],
		prop.maxGridSize[1], prop.maxGridSize[2]);

	// using max number of threads in the x dim possible
	int nThreads = prop.maxThreadsDim[0];
	printf("nThreads: %d\n", nThreads);

	// calculate number of blocks based on the number of threads
	int nBlocks = (N + nThreads - 1) / nThreads;
	printf("nBlocks: %d\n", nBlocks);

	// allocate memory
	float *a, *b, *c;
	CudaSafeCall( cudaMallocManaged(&a, sizeof(float) * N) );
	CudaSafeCall( cudaMallocManaged(&b, sizeof(float) * N) );
	CudaSafeCall( cudaMallocManaged(&c, sizeof(float) * N) );
	printf("Allocated memory on the Device for a, b, and c . . .\n");
	
	CudaSafeCall(cudaDeviceSynchronize());

	// THIS DOES NOT WORK, CAUSES SYSTEM TO CRASH
	// create vectors.
	for (unsigned long long i = 0; i < N; i++) {
		// actual values don't matter, as long as they're floats.
		a[i] = 1.0f;
		b[i] = 2.0f;
	}
	printf("Done assigning values.\n");

	// THIS DOES WORK, CRASH NOW OCCURS IN ERROR CHECK SECTION
	//CudaSafeCall( cudaMemset(a, 0, N * sizeof(float)) );
	//CudaSafeCall( cudaMemset(b, 0, N * sizeof(float)) );

	CudaSafeCall( cudaDeviceSynchronize() );

	printf("Running on GPU\n");

	// run the kernel
	cuda_add<<<nBlocks, nThreads>>>(a, b, c, N);
	CudaCheckError();

	// wait for device to finish
	CudaSafeCall( cudaDeviceSynchronize() );

	// calculate the error.
	float maxError = 0.0f;
	printf("Testing for errors . . . \n");
	for (unsigned long long i = 0; i < N; i++) {
		maxError = fmaxf(maxError, fabsf(c[i] - a[0] - b[0]));
	}

	printf("Max error: %f\n", maxError);

	printf("Running on CPU\n");

	cpu_add(a, b, c, N); // add the vectors

	printf("Done!\n");
	CudaSafeCall( cudaFree(a) );
	CudaSafeCall( cudaFree(b) );
	CudaSafeCall( cudaFree(c) );
	// ============== END ==================
        return 0;
}

Matt_Hanley, right after you allocate your 3 managed arrays, initialize them in the form:

CudaSafeCall(cudaMemset(array, 0, N * sizeof(float)));

Exactly as you have commented out in the code. This saves you many debugging hours; all sorts of things can happen when we don’t initialize variables.
I don’t have a dev environment right now, but the crash you experience may be due to some other error and not necessarily at the line you point out. That happens a lot.

When using managed arrays, always place a cudaDeviceSynchronize() after any CUDA API call that modifies the array (such as cudaMemset) or any kernel that writes to a managed array.
Your code will look like:

cudaMallocManaged(&a, sizeof(float) * N);
cudaMallocManaged(&b, sizeof(float) * N);
cudaMemset(a, 0, sizeof(float) * N);
cudaMemset(b, 0, sizeof(float) * N);
cudaDeviceSynchronize();

kernel_init_array <<< gridSize, blockSize >>> ();
cudaDeviceSynchronize();

// If initializing the array in CPU (slower), no need to cudaDeviceSynchronize

kernel_add <<< gridSize, blockSize >>> (arguments);
cudaDeviceSynchronize();

// Do something else with the arrays, like printing

cudaFree(a);
cudaFree(b);

Also run cuda-memcheck your_prog, so it spots memory errors, arrays going out of bounds, then see what you get.

Thanks for your response, saulocpp! Sorry for this long post. I’m really hoping to get to the bottom of this issue. Also note that this code runs fine on Linux systems (specifically the Xavier dev kit).

Here is my effort to track down the errors.

This works great, no problems running this to the end:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <math.h>
#include <time.h>
#include <stdlib.h>  // used for rand()

#include "cuda_err_check.h"


//============================= GPU Kernels ====================================
/*
The "__global__" tag tells nvcc that the function will execute on the device
but will be called from the host. Notice that we must use pointers!
*/
/*
 * function: add_vec
 * purpose: add two vectors on GPU
 * PARAMETERS:
 *  a - first array
 *  b - second array
 *  c - output array
 */
__global__ void cuda_add(float *a, float *b, float *c, unsigned long long N) {
	// assign tid by using block id, block dimension, and thread id
	unsigned long long tid = blockIdx.x * blockDim.x + threadIdx.x;

	// stride is for big arrays, i.e. bigger than threads we have
	unsigned long long stride = blockDim.x * gridDim.x;

	// do the operations
	while (tid < N) {
		c[tid] = a[tid] + b[tid];
		tid += stride;
	}
}


int main() {
	// need to "wake up" the API. This absorbs the startup overhead that was
	// biasing my results
	cudaFree(0);

	// size of the array
	const unsigned long long N = 10000;
	printf("Array Size: %llu\n", N);

	// DEVICE PROPERTIES
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);

	// print a couple of many properties
	printf("Max Threads per block: %d\n", prop.maxThreadsPerBlock);

	// using max number of threads in the x dim possible
	int nThreads = prop.maxThreadsDim[0];
	printf("nThreads: %d\n", nThreads);

	// calculate number of blocks based on the number of threads
	int nBlocks = (N + nThreads - 1) / nThreads;
	printf("nBlocks: %d\n", nBlocks);

	float *a, *b, *c;

	// allocate memory once before the iterations
	CudaSafeCall( cudaMallocManaged(&a, sizeof(float) * N) );
	CudaSafeCall( cudaMallocManaged(&b, sizeof(float) * N) );
	CudaSafeCall( cudaMallocManaged(&c, sizeof(float) * N) );

	CudaSafeCall(cudaDeviceSynchronize());



	printf("Running on GPU\n");

	// run the kernel
	cuda_add <<< nBlocks, nThreads >>> (a, b, c, N);

	CudaCheckError();

	// wait for device to finish
	CudaSafeCall(cudaDeviceSynchronize());


	CudaSafeCall( cudaFree(a) );
	CudaSafeCall( cudaFree(b) );
	CudaSafeCall( cudaFree(c) );
	// ============== END ==================
	return 0;
}

However, when I add code in to assign values to variables a and b, the system crashes. This chunk of code is added after the cudaMallocManaged() calls and the cudaDeviceSynchronize() call.

// create vectors.
for (unsigned long long i = 0; i < N; i++) {
	// actual values don't matter, as long as they're floats.
	a[i] = 1.0f;
	b[i] = 2.0f;
}

CudaSafeCall(cudaDeviceSynchronize());

So then, as suggested, I replace the assignments with calls to cudaMemset(). This allows the program to run again:

CudaSafeCall( cudaMemset(a, 0, N * sizeof(float)) );
CudaSafeCall( cudaMemset(b, 0, N * sizeof(float)) );
CudaSafeCall( cudaDeviceSynchronize() );

Now I’d like to check for errors in the adding kernel. Back in the main function, after the kernel call and a subsequent call to cudaDeviceSynchronize(), I add a simple check of the first element that causes the system to crash once again:

printf("Error: %f\n", (c[0] - a[0] + b[0]));

I then became curious about the pointer itself. Here is my attempt to print the pointer address for nBlocks, a, b, and c. It is very possible the rest of this post is irrelevant to diagnosing the problem.

printf("Pointer to nThreads: 0x%p\n", &nThreads);
printf("Pointer to nBlocks: 0x%p\n", &nBlocks);
printf("Pointer to a: 0x%p\n", a);
printf("Pointer to b: 0x%p\n", b);
printf("Pointer to c: 0x%p\n", c);

This yields the following output:

Pointer to nThreads:     0x0000009821EFF944
Pointer to nBlocks:      0x0000009821EFF964
Pointer to a: 		 0x0000000701400000
Pointer to b: 		 0x0000000701410000
Pointer to c: 		 0x0000000701420000

This made me curious about the pointers, so I queried the pointer attributes using the CUDA API function cudaPointerGetAttributes(): https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaPointerAttributes.html#structcudaPointerAttributes.

This gives me the following information:

memoryType      cudaMemoryTypeDevice (2)   cudaMemoryType
device          0                          int
devicePointer   0x0000000705600000         void *
hostPointer     0x0000000705600000         void *
isManaged       1                          int

What this tells me is that the host pointer and the device pointer are the same. From the Unified Addressing documentation: “For these devices there is no distinction between a device pointer and a host pointer -- the same pointer value may be used to access memory from the host program and from a kernel running on the device.” I have verified that unified memory is enabled on my device.

The same documentation says that “All host memory allocated through all devices using cudaMallocHost() and cudaHostAlloc() is always directly accessible from all devices that support unified addressing.” Using this information, I allocated memory with cudaMallocHost() instead of cudaMallocManaged(), and my program then ran fine. Here is the pointer information from cudaPointerGetAttributes():

memoryType      cudaMemoryTypeHost (1)   cudaMemoryType
device          0                        int
devicePointer   0x0000000204a00000       void *
hostPointer     0x0000000204a00000       void *
isManaged       0                        int

This is interesting because the pointer information now says “isManaged = 0,” yet my program runs fine with zero computation errors.

Finally, for completeness, I used cudaMalloc() to get device pointers. cudaPointerGetAttributes() yields:

memoryType      cudaMemoryTypeDevice (2)   cudaMemoryType
device          0                          int
devicePointer   0x0000000705600000         void *
hostPointer     0x0000000000000000         void *
isManaged       0                          int

This is how I would expect it to look. The hostPointer is null and “isManaged=0.”

So now my questions are the following:

  1. Why does cudaMallocManaged() return significantly different host pointers than cudaMallocHost?
  2. Why would accessing a pointer location cause my system to crash?
  3. Any idea how I can diagnose this problem further? I can't really do much debugging when my computer freezes
  4. Does using cudaMallocHost() with a unified memory enabled device work the same as cudaMallocManaged()? I.e. can I bypass using cudaMallocManaged() by using cudaMallocHost()?

One more piece of information: the code will run to completion after a system reboot and fresh build. However, it will not run to completion a second time; it just freezes with the same symptoms mentioned before.

Define “significantly different”. How does it matter to user code whether the pointers returned by the two APIs are “significantly different”?