How to implement a generic array in unified memory?

Here is my implmentation but it seem cannot pass the pointer of this array to the kernel function.
The mananged class of new/delete overloaded.

class Managed {
public:
	void* operator new(size_t len) {
		void* ptr;
		cudaMallocManaged(&ptr, len);
		cudaDeviceSynchronize();
		return ptr;
	}

	void* operator new[](size_t len) {
		void* ptr;
		cudaMallocManaged(&ptr, len);
		cudaDeviceSynchronize();
		return ptr;
	}

	void operator delete(void *ptr) {
		cudaDeviceSynchronize();
		cudaFree(ptr);
	}
	void operator delete[](void *ptr) {
		cudaDeviceSynchronize();
		cudaFree(ptr);
	}
};

The template array class. This array runs well on the host side but cannot be passed to kernel.

template <typename T>
class unifiedArray :public Managed
{
	const size_t length;
	T* begin;
public:
	__host__ __device__ unifiedArray(const size_t _length) :length(_length)
	{
		begin = new T[length];//Perhaps the problem is resulted by the new operator, because this new operator is not inherited from managed. But I don't have any niced idea to control the new/deleted overloaded
	}
	__host__ __device__ ~unifiedArray()
	{
		delete[] begin;
	}
	__host__ __device__ T& operator[] (unsigned int index) {
		return begin[index];//*(begin+index)
	}
	unifiedArray(const unifiedArray& uarray) :length(uarray.length) {
		begin = new T[length];//Because the new operator hasn't allocated any memory on unified memory, the copy constructor is useless.
		memccpy((void*)begin, (void*)uarray.begin, length * sizeof(T));
	}
};

It’s usually better if you provide a complete example. What do you mean by the pointer cannot be passed to the kernel? Runtime error? Compile error? incorrect results?

Note that the C++ overload approach is discussed in this blog:

https://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/

and there are fully worked examples here:

https://github.com/parallel-forall/code-samples/tree/master/posts/unified-memory

That means when I was debugging, I found that I cannot see any data except ‘???’ in the cuda debugger.

Here is the rest part of my codes.

template <typename T>
__global__ void plusToUnifiedArray(unifiedArray<T>* arr) {
	int tid = threadIdx.x;
	(*arr)[tid] = (*arr)[tid] + 1;//Add 1 to each element. I checked the arr pointer in cuda debugger, but it points to something unkown
}

int main()
{
	unifiedArray<int> arr(10);
	for (size_t i = 0; i < 10; i++)
	{
		arr[i] = i;//set the elements to 0,1,2...9
	}

	plusToUnifiedArray << <1, 10 >>> (&arr);//add 1 to each element
	cudaThreadSynchronize();

	for (size_t i = 0; i < 10; i++)
        {
		std::cout << (arr[i]) << std::endl;//the idea outputs should be 1,2,3...10.But the outputs is 0 to 9
	}

    return 0;
}

How about if you printf from a kernel? Do you see the data in the printf printout?

For the debugger, it is necessary to be stopped at a breakpoint in device code before you can see device resident data.

I tried to add one to all elements of the array in the kernel but nothing happened. The printf is similar to the debug result, nothing is printed to the command line through printf.

If you do:

printf(“hello”);

from kernel code, and you don’t get hello printed out, then your kernel is not executing or launching correctly. If your kernel is not launching correctly or hitting a critical fault, you will not be able to set a breakpoint in device code and hit it reliably, which is the first necessary step before you can inspect device data with the debugger

Any time you are having trouble with a CUDA code, you should use proper CUDA error checking and run your code with cuda-memcheck

I personally would not pull out the debugger until both of these methods give clean (no error) reports.

“but I need the debugger to figure out what to fix”

I would say:

you probably don’t.

Compiling your code with -lineinfo means that cuda-memcheck will identify the exact line of kernel code that is triggering the error reported by cuda-memcheck. That plus the type of error is usually enough to figure it out and fix it.

Once the code is running error free, and you like to use debuggers, then go ahead.

I’m sure that the kernel launched successfully.
The cuda-memcheck reviewed that the parameter passed to kernel is rest in the host side.

=========     at 0x00000158 in D:/ProgrammingAndStudy/STM/GPUunifiedclass/TestArray/kernel.cu:29:unifiedArray<int>::__operator_[]__(unsigned int)
=========     by thread (9,0,0) in block (0,0,0)
=========     Address 0x8c6d6ff620 is out of bounds
=========     Device Frame:D:/ProgrammingAndStudy/STM/GPUunifiedclass/TestArray/kernel.cu:42:void plusToUnifiedArray<int>(unifiedArray<int>*) (void plusToUnifiedArray<int>(unifiedArray<int>*) : 0x1b0)
=========     Saved host backtrace up to driver entry point at kernel launch time

As I mentioned in the top level, this error is caused by the new operator of the generic type T is not overloaded by the cudaMallocManaged(). However, I don’t know how to overload the new/delete operator of a generic type because the compiler will complain that T is an incomplete type.

You are correct. The new operator for a basic type cannot be overloaded. You can overload the new operator for an object of a class, but that is not what you are creating with

T* begin;

So you don’t get the overloaded new with that pointer.

This will also be problematic:

plusToUnifiedArray << <1, 10 >>> (&arr);

unless you overload the unary & operator for your class, this is simply going to give you the address of the host class itself, which is somewhere in host memory that is not managed space. Attempting to use that pointer in device code in any way will not work (illegal access). That is the “proximal” reason for the problems in device code.

There are a variety of ways and variations of ways to make something like what you have work. Here is a simple example:

$ cat t350.cu
#include <iostream>

template <typename T>
class unifiedArray
{
        const size_t length;
        T* begin;
public:
        __host__ unifiedArray(const size_t _length) :length(_length)
        {
                cudaMallocManaged(&begin, length*sizeof(T));
                cudaDeviceSynchronize();
        }
        __host__ __device__ ~unifiedArray()
        {
                cudaDeviceSynchronize();
                cudaFree(begin);
        }
        __host__ __device__ T& operator[] (unsigned int index) {
                return begin[index];//*(begin+index)
        }
        unifiedArray(const unifiedArray& uarray) :length(uarray.length) {
                cudaMallocManaged(&begin, length*sizeof(T));
                memcpy((void*)begin, (void*)uarray.begin, length * sizeof(T));
        }
};

template <typename T>
__global__ void plusOne(T * arr) {
        int tid = threadIdx.x;
        arr[tid] = arr[tid] + 1;
}

int main()
{
        unifiedArray<int> arr(10);
        for (size_t i = 0; i < 10; i++)
        {
                arr[i] = i;//set the elements to 0,1,2...9
        }

        plusOne<< <1, 10 >>> (&(arr[0]));//add 1 to each element
        cudaDeviceSynchronize();

        for (size_t i = 0; i < 10; i++)
        {
                std::cout << (arr[i]) << std::endl;//the idea outputs should be 1,2,3...10.But the outputs is 0 to 9
        }

    return 0;
}
$ nvcc -arch=sm_61 -o t350 t350.cu
$ cuda-memcheck ./t350
========= CUDA-MEMCHECK
1
2
3
4
5
6
7
8
9
10
========= ERROR SUMMARY: 0 errors
$