Whole system freezes when using cudaMallocManaged

Hi,

I am using CUDA 10 with Visual Studio 2017 (15.8.9, latest available now). The GPU is a GTX 1050 Ti Max-Q in a laptop. When I create a new CUDA project I get the simple “addWithCuda” example that adds two vectors together. The example code uses cudaMalloc to allocate memory and then it copies the two vectors from host memory to device memory. If I compile and run that, it executes fine.

However, if I change the calls to cudaMalloc to calls to cudaMallocManaged, and I replace cudaMemcpy with a plain memcpy, my whole system freezes when I run the program and I have to hard reboot it. This doesn’t always happen; e.g., I may be able to run the program twice and then the freeze occurs on the third run (the exact same executable, with no recompilation in between).

Am I doing something wrong here? And even if I am, is it normal that this freezes the whole system to the point that I have to reboot it?

Thanks!

Remove the memcpy calls from the code.
Memory allocated with cudaMallocManaged is automatically migrated between the GPU and the host depending on where you access it from.

I’m using memcpy to actually copy data into that memory, not to move data from host to device and vice versa.

E.g.

int size = 5;
int testVector[] = {1,2,3,4,5};

int* newMem = nullptr;
cudaStatus = cudaMallocManaged((void**)&newMem, size * sizeof(int));
if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMallocManaged failed!");
		goto Error;
}

memcpy(newMem, testVector, size * sizeof(int));

A few observations:

  • You don’t need memcpy in this particular case: you can remove testVector from the code entirely and just use newMem. Since it is allocated with cudaMallocManaged, it can be accessed directly from the host or from the GPU without further copying. After allocating the memory, just assign whatever values you want to this array;
  • Remove the void** cast from the cudaMallocManaged call;
  • Don’t use memcpy from the host to copy data between host and device; memcpy doesn’t know about memory addressing on the device side. For that you have to use cudaMemcpy, and because you are using cudaMallocManaged, the step of copying data between host and device is already taken care of. You can, however, use memcpy from within a kernel function, because then it is the device calling its own memcpy function, which is aware of memory addressing on the device.

Try this instead:

int length = 5, *managed_array;

cudaMallocManaged(&managed_array, length * sizeof(int));    // Allocates managed memory for 5 int elements
cudaMemset(managed_array, 0, length * sizeof(int));         // Initializes the memory with 0

for(int i = 0; i < length; i++)                             // Assigns some values to the array on the host side
    managed_array[i] = i;

// Call your kernel function to do something
// cudaDeviceSynchronize();

// Do something else with managed_array in host side if you want

cudaFree(managed_array);                                    // We are done, deallocate the memory

Add your error checking to the code, I omitted it for simplicity.

Thanks for the input. I have reduced my code to a minimal self-contained example that still exhibits the freezing behavior:

__global__ void testKernel(int* arr)
{
	int i = threadIdx.x;
	arr[i] = arr[i] * 2;
}

int main()
{
	int length = 5, *managed_array;

	cudaCheckError(cudaMallocManaged(&managed_array, length * sizeof(int))); // Allocates managed memory for 5 int elements
	cudaCheckError(cudaMemset(managed_array, 0, length * sizeof(int))); // Initializes the memory with 0

	for (int i = 0; i < length; i++) // Assigns some values to the array in host side
		managed_array[i] = i;

	testKernel<<<1, length>>> (managed_array);

	cudaCheckError(cudaDeviceSynchronize());

	for (int i = 0; i < length; ++i)
	{
		printf("%d ", managed_array[i]);
	}

	cudaCheckError(cudaFree(managed_array));    // We are done, deallocate the memory

	return 0; 
}

The macro cudaCheckError simply exits if the return code is not success.
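
The definition of cudaCheckError was not posted in the thread. For anyone who wants to compile the example above, here is a minimal sketch of what such a macro could look like. The body is an assumption, not the original poster's code; it only uses cudaGetErrorString from the CUDA runtime API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical definition: the original macro was not shown in the thread.
// Prints the failing call with file and line, then exits with a non-zero code.
#define cudaCheckError(call) \
    do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__, __LINE__, \
                    #call, cudaGetErrorString(err_)); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

int main()
{
    int *p = nullptr;

    // Allocate and release a small managed buffer, checking both calls
    cudaCheckError(cudaMallocManaged(&p, 5 * sizeof(int)));
    cudaCheckError(cudaFree(p));

    printf("all calls succeeded\n");
    return 0;
}
```

A kernel launch itself returns no status, so calling cudaCheckError(cudaGetLastError()) right after the launch is one way to catch launch failures as well.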

I’ve been seeing other posts in the forum lately with seemingly the same problem - if the code above should work ok, maybe I should report a bug.

I think reporting a bug is a good idea. If you file a bug report, I’d be interested to know the bug number, for reference.

Thank you. Reported with bug ID #2439924.

Does the program below at least show the name of your card without any error?

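#include <iostream>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    // Assuming you just have 1 card, which is device 0
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        // Report the failure instead of printing an uninitialized name
        std::cerr << "Error: " << cudaGetErrorString(err) << std::endl;
        return 1;
    }
    std::cout << "Device: " << prop.name << std::endl;

    return 0;
}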

Yes, I get the right output:
Device: GeForce GTX 1050 Ti with Max-Q Design

Update: I have tried the same problematic code on a Unix machine with a Volta GPU and the problem doesn’t seem to occur.

Try reinstalling your video driver, if you haven’t already, and see what happens.

I’ve just updated the driver to the latest one released yesterday 11/8, no luck unfortunately.

Can you try the other way around and install the minimum driver version necessary for CUDA 10 to work?

Do you know the best way to determine which older driver started supporting CUDA 10?

Check table 1 here: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
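
To compare what is installed against that table, the runtime can report which CUDA version the current driver supports. The snippet below is a small sketch using cudaDriverGetVersion and cudaRuntimeGetVersion. Note that it prints CUDA versions, not the driver package number shown in the NVIDIA control panel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int driverVersion = 0, runtimeVersion = 0;

    // Latest CUDA version supported by the installed driver (0 if no driver is installed)
    cudaError_t err = cudaDriverGetVersion(&driverVersion);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDriverGetVersion failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // CUDA runtime version the program was built against
    err = cudaRuntimeGetVersion(&runtimeVersion);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaRuntimeGetVersion failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Both values are encoded as 1000 * major + 10 * minor
    printf("Driver supports up to CUDA %d.%d\n", driverVersion / 1000, (driverVersion % 1000) / 10);
    printf("Runtime is CUDA %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 1000) / 10);

    if (driverVersion < runtimeVersion)
        printf("The driver is older than this runtime requires.\n");

    return 0;
}
```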

Hi!
I had the same issue lately, and I think I know what causes your problem. Check whether your device supports concurrentManagedAccess (I guess not, because you wrote that you are using Windows, and it is only supported on Linux). If it doesn’t, the freeze is caused by the cudaMallocManaged() function. If you do the memory allocation with cudaMalloc() and copy the data with cudaMemcpy(), the above program works fine.
I’m a beginner with CUDA and I’m not sure why this is a problem, because according to the docs (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-gpu-exclusive), you are using Unified Memory correctly. So I would also appreciate it if someone could explain this problem.
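
For anyone who wants to run the check described above, here is a minimal sketch that queries the attribute through cudaDeviceGetAttribute. It assumes a single GPU at device index 0. It only reports what the device advertises and does not by itself explain the freeze.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int device = 0;  // Assumption: the GPU of interest is device 0
    int managed = 0, concurrent = 0;

    // Does the device support managed (unified) memory at all?
    cudaError_t err = cudaDeviceGetAttribute(&managed, cudaDevAttrManagedMemory, device);

    // Can the host and the GPU access managed memory concurrently?
    if (err == cudaSuccess)
        err = cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);

    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetAttribute failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("managedMemory: %d\n", managed);
    printf("concurrentManagedAccess: %d\n", concurrent);

    if (managed && !concurrent)
        printf("The host should not touch managed memory while a kernel is running; "
               "call cudaDeviceSynchronize() first.\n");

    return 0;
}
```

The test program earlier in the thread already calls cudaDeviceSynchronize() before reading the array on the host, which is the access pattern the programming guide describes for devices that report 0 here.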

Driver 418.81 was released today and it may help with this issue. You may wish to try it.

With the new driver it works, thank you!

I have a GeForce 1050 Ti like you and was having the same problems with Unified Memory. The solution is to update to the very latest driver (418.81), which was released a week ago. Check out my post at https://cudaeducation.com/cudaunifiedmemorycrash/ to learn more about the issue, with links etc.

Hope this helps!

-Cuda Education
cudaeducation.com