UMA, program crashes when accessing a cudaMallocManaged array from host

I have some code using UMA and it is not working as I expected. The code below shows the .cuh and the .cu:

Array_Ops.cuh

#include <cstddef>	// For size_t

extern "C"
	{
	__declspec(dllexport) size_t array_1D_size = 0, array_1D_length = 0;
	__declspec(dllexport) float *array_1D_1, *array_1D_2;

	__declspec(dllexport) void allocate_arrays_1D(size_t MB);
	__declspec(dllexport) void deallocate_arrays_1D(void);
	__declspec(dllexport) void print_arrays_1D(void);
	}

Array_Ops.cu

#include "Array_Ops.cuh"
#include <cuda_runtime.h>	// For cudaMallocManaged, cudaMemset, cudaFree
#include <sstream>	// For the array printing
#include <fstream>	// For the array printing
#include <string>	// For the array printing

using std::stringstream;
using std::ofstream;
using std::string;

__declspec(dllexport) void allocate_arrays_1D(size_t MB)
    {
    array_1D_length   = (MB * 1024 * 1024) / sizeof(float);	// The arrays hold floats
    array_1D_size     = array_1D_length * sizeof(float);

    cudaMallocManaged(&array_1D_1, array_1D_size);
    cudaMallocManaged(&array_1D_2, array_1D_size);

    cudaMemset(array_1D_1, 0, array_1D_size);
    cudaMemset(array_1D_2, 0, array_1D_size);
    }

__declspec(dllexport) void deallocate_arrays_1D(void)
    {
    cudaFree(array_1D_1);
    cudaFree(array_1D_2);
    array_1D_size = array_1D_length = 0;
    }

__declspec(dllexport) void print_arrays_1D(void)
	{
	stringstream ss_in, ss_out;
	ss_in << "Input_Array_GPU_1D_" << array_1D_length << ".txt";
	ss_out << "Output_Array_GPU_1D_" << array_1D_length << ".txt";

	string file_name = ss_in.str();
	ofstream array_1_out(file_name, ofstream::out);

	for (size_t i = 0; i < array_1D_length; i++)
		array_1_out << array_1D_1[i] << std::endl;
	array_1_out.close();

	file_name = ss_out.str();
	ofstream array_2_out(file_name, ofstream::out);

	for (size_t i = 0; i < array_1D_length; i++)
		array_2_out << array_1D_2[i] << std::endl;
	array_2_out.close();
	}

The header declares a few variables and three functions to allocate, deallocate, and print the arrays. I do some operations on these arrays and need to print their contents to disk for quality control.
In the implementation I allocate the arrays with cudaMallocManaged. While I operate on these arrays from the device, the application works fine, but then I need to write their contents to disk and call the print_arrays_1D() function.

Since it is running just host code, I was expecting UMA to transparently migrate arrays 1 and 2 from device memory to host memory, but the program crashes at the first loop in print_arrays_1D(), that is, when the host tries to read an array which, until that point, has been resident in device memory.

I thought about allocating a temporary array with new, doing a cudaMemcpy from the device array to this temporary array, and seeing what happens, but that would defeat the point of using UMA. As for error handling on the CUDA calls, it is omitted just for simplicity and easier reading.
Do you see any problem with this code?
If you need further clarification, just let me know.
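
Just for illustration, the kind of error check I omitted would wrap each call roughly like this (a sketch, not the exact code I use):

#include <cstdio>
#include <cuda_runtime.h>

// Print the error string, file and line if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                        \
	do {                                                                    \
		cudaError_t err_ = (call);                                      \
		if (err_ != cudaSuccess)                                        \
			fprintf(stderr, "CUDA error %s at %s:%d\n",             \
			        cudaGetErrorString(err_), __FILE__, __LINE__); \
	} while (0)

// Example: CUDA_CHECK(cudaMallocManaged(&array_1D_1, array_1D_size));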

After touching managed data on the device (or indeed launching any kernel at all, and possibly even after certain CUDA runtime API calls that may launch a kernel under the hood, such as cudaMemcpy* with cudaMemcpyDeviceToDevice; it's possible even cudaMemset launches a GPU kernel), on Windows (it's evident you are on Windows) it is necessary to call cudaDeviceSynchronize() before attempting to touch any managed data in host code.

Failure to do so will generally result in an application crash (segmentation fault).
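
In your case the required ordering is just this (a sketch using your names; some_kernel stands in for whatever device work writes the arrays, 64 MB is an arbitrary example size, and error checking is omitted):

allocate_arrays_1D(64);                                 // cudaMallocManaged + cudaMemset

some_kernel<<<blocks, threads>>>(array_1D_1, array_1D_2, array_1D_length);   // device work on the managed arrays

cudaDeviceSynchronize();                                // wait for the device before host code touches managed data

print_arrays_1D();                                      // host-only code can now read the arrays safely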

Good morning, txbob, and thanks for your reply.

Your advice solved the issue, and it is not the first time you have advised me about cudaDeviceSynchronize(). Although I suspected CUDA calls could launch kernel functions “under the hood”, I thought the synchronization was only needed for user kernel launches, func<<<M, N>>>(PARAMS).

In my case, what I didn’t mention is that the managed arrays were being populated by cuRAND. I have now placed a cudaDeviceSynchronize() between curandGenerateUniform() and curandDestroyGenerator(), that is, right after cuRAND writes to the array and just before the generator is released.

Since I will need to scale the values generated by cuRAND, I will write a device function so the arrays remain in the card’s memory, and now I know that a call to cudaDeviceSynchronize() will be needed. What is really strange is that in pretty much all the cuRAND examples I have found, including the one in NVIDIA’s guide, there is no call to cudaDeviceSynchronize() even after the curandGenerateUniform() and cudaMemcpy() calls…
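
To make the plan concrete, this is roughly the sequence I have in mind (a sketch only: scale_array and fill_and_scale are hypothetical names for the code I still have to write, and I am assuming a host-API generator of type CURAND_RNG_PSEUDO_DEFAULT):

#include <cuda_runtime.h>
#include <curand.h>

// Hypothetical scaling kernel: multiplies every element by a factor on the device.
__global__ void scale_array(float *data, size_t length, float factor)
    {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < length)
        data[i] *= factor;
    }

void fill_and_scale(float *managed_array, size_t length, float factor)
    {
    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

    curandGenerateUniform(gen, managed_array, length);      // device writes the managed array

    const unsigned int threads = 256;
    const unsigned int blocks  = (unsigned int)((length + threads - 1) / threads);
    scale_array<<<blocks, threads>>>(managed_array, length, factor);

    cudaDeviceSynchronize();                                 // barrier before any host code reads the array
    curandDestroyGenerator(gen);
    }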

That is presumably because these examples do not make use of managed data (I haven’t checked). In CUDA, host and device code frequently operate asynchronously, that is, independently of each other. But at certain points, such as when accessing a shared resource like managed data, there is a need for the two sides to sync up.

Njuffa, in my particular case I was trying to print the array contents before cuRAND had finished populating them. For small arrays, maybe the generation finished fast enough that no writing was still in progress when the print code executed, so it was easy to miss the problem.
As soon as I increased the array size, it started to crash. Putting the barrier in place totally fixed it.
Thanks for your input.