Multidimensional array allocation with CUDA Unified Memory

Hi everyone,

I’m trying to allocate multi-dimensional arrays using CUDA unified memory (UMA). However, I’m running into trouble as the size gets bigger. The code I’m using is below. A size of 24 x 24 x 24 x 5 works fine, but when I increase it to 64 x 64 x 64 x 8 I get a segmentation fault. As far as I know, I should be able to allocate as much memory via UMA as the GPU has physical memory, so I wouldn’t expect any error. If you can point out what I’m missing, I’d appreciate it.

Thanks in advance

// Replace the global array operator new so that every new[] in the program
// allocates managed memory. Note: this must be a plain (non-template) function;
// a template overload is never selected by new-expressions. The return value
// of cudaMallocManaged is not checked here.
void* operator new[] (size_t len) throw(std::bad_alloc) {
    void *ptr;
    cudaMallocManaged(&ptr, len);
    return ptr;
}
// Builds an a x b x c x d structure as a 4-level tree of pointers via nested new[].
template<typename T>
T**** create_4d(int a, int b, int c, int d){
	T**** ary = new T***[a];
	for(int i = 0; i < a; ++i){
		ary[i] = new T**[b];
		for(int j = 0; j < b; ++j){
			ary[i][j] = new T*[c];
			for(int k = 0; k < c; ++k){
				ary[i][j][k] = new T[d];
			}
		}
	}
	return ary;
}

Which CUDA version? If you’re not using CUDA 8, try CUDA 8:

http://stackoverflow.com/questions/38078737/cudamallocmanaged-returns-out-of-memory-despite-enough-free-space

And by the way, you could avoid the seg fault, even in the failing case, if you did proper CUDA error checking.

My suggestion, any time you are having trouble with a CUDA code, is to use proper CUDA error checking, before asking for help on a public forum. If you don’t know what proper CUDA error checking is, google “proper CUDA error checking” and take the first hit.
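
For reference, the pattern looks something like this (a minimal sketch; the first hit has the canonical version):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), __FILE__, __LINE__); \
            exit(1); \
        } \
    } while (0)

// Usage, right after each CUDA call:
//   cudaMallocManaged(&ptr, len);
//   cudaCheckErrors("cudaMallocManaged failed");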

I am using CUDA 8 on a K40.

If you want to provide a complete example that demonstrates the error (one that I can copy, paste, compile, and run, without having to add or change anything), I will take a look.

I’m not going to play 20 questions like “what size type are you allocating” etc.

And as an aside, this approach is not likely to be very good for performance.

Might it be simpler to allocate a 1d array but index it like it’s 4d?

It will not only be simpler but also higher performance, for two reasons (a minimal sketch follows the list below):

  1. Pointer chasing isn’t terribly fast on GPUs (at least not compared to arithmetic index calculation add/mpy)
  2. A bunch of fragmented allocations will create a flood of traffic at kernel launch time: large numbers of relatively small transfers, as opposed to one large transfer.
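
For example (a sketch compiled with nvcc; the helper names create_4d_flat and idx4 are illustrative, not from this thread):

#include <cuda_runtime.h>

// One managed allocation instead of roughly a*b*c + a*b + a + 1 small ones.
template <typename T>
T* create_4d_flat(int a, int b, int c, int d) {
    T *ptr = nullptr;
    cudaMallocManaged(&ptr, sizeof(T) * a * b * c * d);
    return ptr;
}

// Arithmetic index calculation replaces the pointer chasing:
// element (i,j,k,l) of an a x b x c x d array.
__host__ __device__ inline size_t idx4(int i, int j, int k, int l,
                                       int b, int c, int d) {
    return (((size_t)i * b + j) * c + k) * d + l;
}

// Access becomes data[idx4(i,j,k,l,b,c,d)] instead of data[i][j][k][l].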

Hi all, thank you for your answers and help. Please look at the following code block. To reproduce the segmentation fault, you can simply compile it with g++ or clang++ (linking against the CUDA runtime). The first allocation is fine; the second one fails.

BTW, if you can tell me another way to do this, apart from converting all of my code from 4D arrays to 1D arrays, I’d really appreciate it.

Thanks in advance

#include <iostream>
#include <cuda_runtime.h>

// Global replacement of the array operator new: every new[] below allocates
// managed memory. The return value of cudaMallocManaged is not checked.
void* operator new[] (size_t len) throw(std::bad_alloc) {
    void *ptr;
    cudaMallocManaged(&ptr, len);
    return ptr;
}

template<typename T>
T**** create_4d(int a, int b, int c, int d){
	T**** ary = new T***[a];
	for(int i = 0; i < a; ++i)
	{
		ary[i] = new T**[b];
		for(int j = 0; j < b; ++j){
			ary[i][j] = new T*[c];
			for(int k = 0; k < c; ++k){
				ary[i][j][k] = new T[d];
			}
		}
	}
	return ary;
}

int main() {
   double ****data;
   
   std::cout << "allocating..." << std::endl;
   data = create_4d<double>(32,65,65,5);
   std::cout << "Hooreey !!!" << std::endl;

   //segfault here
   std::cout << "allocating..." << std::endl;
   data = create_4d<double>(64,65,65,5);
   std::cout << "Hooreey !!!" << std::endl;   
  
   return 0;
}

Hi all,

Currently my main configuration is Power8 with a Tesla K40, and that is where I am having the issue. However, I tried the code I provided above on an Intel + K40 machine, and surprisingly it worked.

So, is this an NVIDIA bug on Power systems? It is kind of an obstacle for my work at the moment :/

Best Regards

In re: flattening the arrays to 1D, something like this (that one works through 2D and 3D examples, but it is easily extendable to 4D, and this approach is more efficient):

This 2D example shows how you can use a helper function to simplify the code/indexing:
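
Along those lines (a simplified sketch; the idx2 helper is illustrative):

#include <cuda_runtime.h>

// Map a 2D (row, col) index onto a flat allocation that is `width` columns wide.
__host__ __device__ inline size_t idx2(int row, int col, int width) {
    return (size_t)row * width + col;
}

int main() {
    const int rows = 65, cols = 65;
    double *m = nullptr;
    cudaMallocManaged(&m, sizeof(double) * rows * cols);  // one allocation
    m[idx2(3, 4, cols)] = 1.0;  // reads like m[3][4] would
    cudaFree(m);
    return 0;
}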

Hi @vacaloca, thank you. But I’d prefer not to change the array indexing of my entire code for that.

As I already pointed out in a comment on the slightly expanded version of this question that was cross-posted to Stack Overflow, your data structure is not a multi-dimensional array, but rather a tree with four levels. It is just a quirk of C++ syntax that the access looks like access to a multi-dimensional array.

The costs of this approach are quite high. Every time you create such an object, it will require hundreds of thousands of small allocations, likely to fragment available memory. Access to arbitrary submatrices will be complicated and slow, and the data structure will not allow the application of matrix arithmetic libraries such as BLAS.

Most importantly, access to an array element will require pointer chasing through multiple levels of indirection, turning each element access into four dependent memory accesses for the general case. This is very likely to have a significant negative impact on code performance.
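
Schematically (an illustrative pair of kernels, not code from this thread):

// Tree version: four dependent loads; each pointer must arrive from memory
// before the next load can even be issued.
__global__ void read_tree(double ****data, int i, int j, int k, int l, double *out) {
    *out = data[i][j][k][l];
}

// Flat version: one load; the address comes from cheap integer arithmetic.
__global__ void read_flat(const double *data, size_t idx, double *out) {
    *out = data[idx];
}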

Aside from the issue that Norbert mentioned (this is going to be very inefficient on the GPU), your problem is that you are running out of memory.

If you add this function and check the GPU memory usage before/after each call to create_4d:

void report_gpu_mem()
{
    size_t free, total;
    cudaMemGetInfo(&free, &total);
    std::cout << "Free = " << free << " Total = " << total <<std::endl;
}

you will see that the memory consumption is huge (this is due to how UVM works pre-Pascal).
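
For example (a sketch combining report_gpu_mem with the program above):

report_gpu_mem();
data = create_4d<double>(32, 65, 65, 5);
report_gpu_mem();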

Power8 with K40
# ./a.out 
allocating...
Free = 12700286976 Total = 12799574016
Hooreey !!!
Free =  3697213440 Total = 12799574016
allocating...
Segmentation fault

The first allocation, which should use (32*65*65*5)*8 bytes ≈ 5 MB, is taking away 9 GB of device memory.

Note that on Pascal (which has better support for UVM), your code works as expected.

Power8 with Pascal P100
./a.out 
allocating...
Free = 16770727936 Total = 17071669248
Hooreey !!!
Free = 16770727936 Total = 17071669248
allocating...
Hooreey !!!

The fairly coarse granularity of UVM allocation (a minimum of one 4 KB page per allocation?) is a very good point. However, the OP mentioned in their cross-posted question on Stack Overflow that the above program succeeds on their x64 platform but fails on their Power 8 platform, which might be a bug (http://stackoverflow.com/questions/40388242/multidimensional-array-allocation-with-cuda-unified-memory-on-power-8).

It may be related to different page sizes on the two systems.

On x86, the problem is still there but less extreme:

x86 with K80
./a.out
allocating...
Free = 11929124864 Total = 11995578368
Free = 11362893824 Total = 11995578368
Hooreey !!!
allocating...
Free = 11362893824 Total = 11995578368
Free = 10236723200 Total = 11995578368
Hooreey !!!

The memory reserved is still much larger than the storage required for the array.

If you use a standard cudaMalloc or cudaMallocManaged on flat arrays, this is the output:

Standard cudaMalloc...
Free = 11929124864 Total = 11995578368
Free = 11912740864 Total = 11995578368
Flat cudaMallocManaged...
Free = 11929124864 Total = 11995578368
Free = 11912347648 Total = 11995578368
allocating...
Free = 11929124864 Total = 11995578368
Free = 11362893824 Total = 11995578368
Hooreey !!!
allocating...
Free = 11362893824 Total = 11995578368
Free = 10236723200 Total = 11995578368
Hooreey !!!

cudaMalloc memory usage = 82837504
cudaMallocManaged memory usage on flat array = 83230720
your memory usage = 1758855168 (21X more)
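
Presumably the flat comparison amounts to something like this (a sketch; the sizes are illustrative):

size_t n = sizeof(double) * 64 * 65 * 65 * 5;  // same payload as create_4d<double>(64,65,65,5)

double *d_flat = nullptr;
cudaMalloc(&d_flat, n);          // standard device allocation

double *m_flat = nullptr;
cudaMallocManaged(&m_flat, n);   // a single managed allocation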

Hi all,

Thank you for your important answers; they were really helpful to me. I was thinking this was a bug, and you have shed light on the issue.

I made another small test that allocates 1D, 2D, and 3D arrays with the technique I used above (the pointer tree).

x64 with K40
 Allocating -> 80MB Allocated -> 84MB !!!
 Allocating [2D]-> 80MB Allocated -> 130MB !!!
 Allocating [3D]-> 80MB Allocated -> 134MB !!!
Power8 with K40
 Allocating -> 80MB Allocated -> 84MB !!!
 Allocating [2D]-> 80MB Allocated -> 2080MB !!!
 Allocating [3D]-> 80MB Allocated -> 2114MB !!!

As I understand it, with flat arrays both systems work the same way. But if you do pointer chasing, linking pointers to each other like I did, the runtime allocates an excessively huge amount of memory.

The page sizes of the two systems are 4 KB and 64 KB, respectively. Can we interpret this difference as an effect of the page size? With pages 16x larger, does each small allocation reserve roughly 16 times more memory?

And another quick question: on a Pascal GPU everything works as expected. But is this the recommended way on Pascal? I mean, do I lose performance on Pascal because I used multi-dimensional arrays?

Thinking more about the Pascal case: your code was just allocating. It is possible that once you touch the pages, the same overhead may show up even on Pascal.
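
One way to check (a sketch reusing report_gpu_mem and the create_4d from above): touch every element from the host after allocating, then look at the free memory again:

data = create_4d<double>(64, 65, 65, 5);
report_gpu_mem();
for (int i = 0; i < 64; ++i)
    for (int j = 0; j < 65; ++j)
        for (int k = 0; k < 65; ++k)
            for (int l = 0; l < 5; ++l)
                data[i][j][k][l] = 0.0;  // first touch populates the pages
report_gpu_mem();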

Right, because I basically link the pointers together to build the multi-dimensional array on the host side, so the first touch will be by the host. I would like to try it on Pascal when I get the chance. Also, is this the same for OS-based multi-dimensional array allocation on Pascal, without using cudaMallocManaged?

But another question: does the page size make a difference in performance? Because there is a rather huge difference between the x86 and Power page sizes.

By the way, thank you for the quick replies and interest. It was a really helpful conversation for me.