How to implement a generic array in unified memory?

EternalSaga · July 1, 2017, 10:29pm

Here is my implmentation but it seem cannot pass the pointer of this array to the kernel function.
The mananged class of new/delete overloaded.

class Managed {
public:
	void* operator new(size_t len) {
		void* ptr;
		cudaMallocManaged(&ptr, len);
		cudaDeviceSynchronize();
		return ptr;
	}

	void* operator new[](size_t len) {
		void* ptr;
		cudaMallocManaged(&ptr, len);
		cudaDeviceSynchronize();
		return ptr;
	}

	void operator delete(void *ptr) {
		cudaDeviceSynchronize();
		cudaFree(ptr);
	}
	void operator delete[](void *ptr) {
		cudaDeviceSynchronize();
		cudaFree(ptr);
	}
};

The template array class. This array runs well on the host side but cannot be passed to kernel.

template <typename T>
class unifiedArray :public Managed
{
	const size_t length;
	T* begin;
public:
	__host__ __device__ unifiedArray(const size_t _length) :length(_length)
	{
		begin = new T[length];//Perhaps the problem is resulted by the new operator, because this new operator is not inherited from managed. But I don't have any niced idea to control the new/deleted overloaded
	}
	__host__ __device__ ~unifiedArray()
	{
		delete[] begin;
	}
	__host__ __device__ T& operator[] (unsigned int index) {
		return begin[index];//*(begin+index)
	}
	unifiedArray(const unifiedArray& uarray) :length(uarray.length) {
		begin = new T[length];//Because the new operator hasn't allocated any memory on unified memory, the copy constructor is useless.
		memccpy((void*)begin, (void*)uarray.begin, length * sizeof(T));
	}
};

Robert_Crovella · July 1, 2017, 10:44pm

It’s usually better if you provide a complete example. What do you mean by the pointer cannot be passed to the kernel? Runtime error? Compile error? incorrect results?

Note that the C++ overload approach is discussed in this blog:

[url]https://devblogs.nvidia.com/parallelforall/unified-memory-in-cuda-6/[/url]

and there are fully worked examples here:

[url]https://github.com/parallel-forall/code-samples/tree/master/posts/unified-memory[/url]

EternalSaga · July 1, 2017, 10:54pm

That means when I was debugging, I found that I cannot see any data except ‘???’ in the cuda debugger.

Here is the rest part of my codes.

template <typename T>
__global__ void plusToUnifiedArray(unifiedArray<T>* arr) {
	int tid = threadIdx.x;
	(*arr)[tid] = (*arr)[tid] + 1;//Add 1 to each element. I checked the arr pointer in cuda debugger, but it points to something unkown
}

int main()
{
	unifiedArray<int> arr(10);
	for (size_t i = 0; i < 10; i++)
	{
		arr[i] = i;//set the elements to 0,1,2...9
	}

	plusToUnifiedArray << <1, 10 >>> (&arr);//add 1 to each element
	cudaThreadSynchronize();

	for (size_t i = 0; i < 10; i++)
        {
		std::cout << (arr[i]) << std::endl;//the idea outputs should be 1,2,3...10.But the outputs is 0 to 9
	}

    return 0;
}

Robert_Crovella · July 2, 2017, 12:50am

How about if you printf from a kernel? Do you see the data in the printf printout?

For the debugger, it is necessary to be stopped at a breakpoint in device code before you can see device resident data.

EternalSaga · July 2, 2017, 1:15am

I tried to add one to all elements of the array in the kernel but nothing happened. The printf is similar to the debug result, nothing is printed to the command line through printf.

Robert_Crovella · July 2, 2017, 1:39am

If you do:

printf(“hello”);

from kernel code, and you don’t get hello printed out, then your kernel is not executing or launching correctly. If your kernel is not launching correctly or hitting a critical fault, you will not be able to set a breakpoint in device code and hit it reliably, which is the first necessary step before you can inspect device data with the debugger

Any time you are having trouble with a CUDA code, you should use proper CUDA error checking and run your code with cuda-memcheck

I personally would not pull out the debugger until both of these methods give clean (no error) reports.

“but I need the debugger to figure out what to fix”

I would say:

you probably don’t.

Compiling your code with -lineinfo means that cuda-memcheck will identify the exact line of kernel code that is triggering the error reported by cuda-memcheck. That plus the type of error is usually enough to figure it out and fix it.

Once the code is running error free, and you like to use debuggers, then go ahead.

EternalSaga · July 2, 2017, 10:33am

If you do:

printf(“hello”);

from kernel code, and you don’t get hello printed out, then your kernel is not executing or launching correctly. If your kernel is not launching correctly or hitting a critical fault, you will not be able to set a breakpoint in device code and hit it reliably, which is the first necessary step before you can inspect device data with the debugger

Any time you are having trouble with a CUDA code, you should use proper CUDA error checking and run your code with cuda-memcheck

I personally would not pull out the debugger until both of these methods give clean (no error) reports.

“but I need the debugger to figure out what to fix”

I would say:

you probably don’t.

Compiling your code with -lineinfo means that cuda-memcheck will identify the exact line of kernel code that is triggering the error reported by cuda-memcheck. That plus the type of error is usually enough to figure it out and fix it.

Once the code is running error free, and you like to use debuggers, then go ahead.

I’m sure that the kernel launched successfully.
The cuda-memcheck reviewed that the parameter passed to kernel is rest in the host side.

=========     at 0x00000158 in D:/ProgrammingAndStudy/STM/GPUunifiedclass/TestArray/kernel.cu:29:unifiedArray<int>::__operator_[]__(unsigned int)
=========     by thread (9,0,0) in block (0,0,0)
=========     Address 0x8c6d6ff620 is out of bounds
=========     Device Frame:D:/ProgrammingAndStudy/STM/GPUunifiedclass/TestArray/kernel.cu:42:void plusToUnifiedArray<int>(unifiedArray<int>*) (void plusToUnifiedArray<int>(unifiedArray<int>*) : 0x1b0)
=========     Saved host backtrace up to driver entry point at kernel launch time

As I mentioned in the top level, this error is caused by the new operator of the generic type T is not overloaded by the cudaMallocManaged(). However, I don’t know how to overload the new/delete operator of a generic type because the compiler will complain that T is an incomplete type.

Robert_Crovella · July 2, 2017, 1:58pm

You are correct. The new operator for a basic type cannot be overloaded. You can overload the new operator for an object of a class, but that is not what you are creating with

T* begin;

So you don’t get the overloaded new with that pointer.

This will also be problematic:

plusToUnifiedArray << <1, 10 >>> (&arr);

unless you overload the unary & operator for your class, this is simply going to give you the address of the host class itself, which is somewhere in host memory that is not managed space. Attempting to use that pointer in device code in any way will not work (illegal access). That is the “proximal” reason for the problems in device code.

There are a variety of ways and variations of ways to make something like what you have work. Here is a simple example:

$ cat t350.cu
#include <iostream>

template <typename T>
class unifiedArray
{
        const size_t length;
        T* begin;
public:
        __host__ unifiedArray(const size_t _length) :length(_length)
        {
                cudaMallocManaged(&begin, length*sizeof(T));
                cudaDeviceSynchronize();
        }
        __host__ __device__ ~unifiedArray()
        {
                cudaDeviceSynchronize();
                cudaFree(begin);
        }
        __host__ __device__ T& operator[] (unsigned int index) {
                return begin[index];//*(begin+index)
        }
        unifiedArray(const unifiedArray& uarray) :length(uarray.length) {
                cudaMallocManaged(&begin, length*sizeof(T));
                memcpy((void*)begin, (void*)uarray.begin, length * sizeof(T));
        }
};

template <typename T>
__global__ void plusOne(T * arr) {
        int tid = threadIdx.x;
        arr[tid] = arr[tid] + 1;
}

int main()
{
        unifiedArray<int> arr(10);
        for (size_t i = 0; i < 10; i++)
        {
                arr[i] = i;//set the elements to 0,1,2...9
        }

        plusOne<< <1, 10 >>> (&(arr[0]));//add 1 to each element
        cudaDeviceSynchronize();

        for (size_t i = 0; i < 10; i++)
        {
                std::cout << (arr[i]) << std::endl;//the idea outputs should be 1,2,3...10.But the outputs is 0 to 9
        }

    return 0;
}
$ nvcc -arch=sm_61 -o t350 t350.cu
$ cuda-memcheck ./t350
========= CUDA-MEMCHECK
1
2
3
4
5
6
7
8
9
10
========= ERROR SUMMARY: 0 errors
$

Topic		Replies	Views
passing an array to a kenel ? CUDA Programming and Performance	9	13305	June 10, 2009
Segmentation fault when calling virtual function on host CUDA Programming and Performance	9	2476	September 10, 2019
Help with cuda 2d array CUDA Programming and Performance	6	7452	September 29, 2014
Passing a multidimensional array to kernel how to allocate space in host and pass to device? CUDA Programming and Performance	12	16209	November 22, 2014
Is CUDA C or C++ ? CUDA Programming and Performance	12	33647	January 30, 2009
Pass pointer to class as a kernel argument and access class methods CUDA Programming and Performance	1	3454	July 5, 2018
New dynamic shared memory allocation in CUDA 5? CUDA Programming and Performance	5	4281	November 8, 2012
Using unified memory for 2Dim and 3 Dim array CUDA Programming and Performance	2	529	November 18, 2018
Constant memory when having more than one file external does not work CUDA Programming and Performance	24	3212	August 27, 2010
[Beginner] Memory is reseted in the kernel CUDA Programming and Performance	5	1318	October 29, 2010

How to implement a generic array in unified memory?

Related topics