I’m trying to allocate multi-dimensional arrays using CUDA UMA. However, I’m running into a problem as the size gets bigger. The code I’m using is below. A size of 24 x 24 x 24 x 5 works fine, but when I increase it to 64 x 64 x 64 x 8 I get a segmentation fault. As far as I know, I should be able to allocate as much memory via UMA as the GPU device has physical memory, so I would not expect any error. If you could correct me and tell me what I am missing, I’d appreciate it.
Thanks in advance
// Replacement operator new[] that hands out managed (UMA) memory.
// Note: the return status of cudaMallocManaged is not checked here.
void* operator new[] (size_t len) throw(std::bad_alloc) {
    void *ptr;
    cudaMallocManaged(&ptr, len);
    return ptr;
}

template<typename T>
T**** create_4d(int a, int b, int c, int d) {
    T**** ary = new T***[a];                  // a pointers to T***
    for (int i = 0; i < a; ++i) {
        ary[i] = new T**[b];                  // b pointers to T**
        for (int j = 0; j < b; ++j) {
            ary[i][j] = new T*[c];            // c pointers to T*
            for (int k = 0; k < c; ++k) {
                ary[i][j][k] = new T[d];      // d elements of T
            }
        }
    }
    return ary;
}
And by the way, you could avoid the seg fault, even in the failing case, if you did proper CUDA error checking.
My suggestion, any time you are having trouble with a CUDA code, is to use proper CUDA error checking, before asking for help on a public forum. If you don’t know what proper CUDA error checking is, google “proper CUDA error checking” and take the first hit.
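For the code in the question, a minimal sketch of what that means (names and wording are mine, not the canonical write-up): check the status returned by cudaMallocManaged inside the replacement operator new[] and fail loudly, instead of returning an indeterminate pointer that segfaults when dereferenced.

#include <cstdio>
#include <cstddef>
#include <new>
#include <cuda_runtime.h>

// Sketch only: checked version of the replacement operator new[] from the question.
// If cudaMallocManaged fails (e.g. out of memory), report the error and throw
// std::bad_alloc rather than handing back an unset pointer.
void* operator new[] (size_t len) {
    void *ptr = nullptr;
    cudaError_t err = cudaMallocManaged(&ptr, len);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMallocManaged(%zu bytes) failed: %s\n",
                     len, cudaGetErrorString(err));
        throw std::bad_alloc();
    }
    return ptr;
}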
If you want to provide a complete example that demonstrates the error (that I can copy, paste, compile and run, without having to add anything or change anything) I will take a look.
I’m not going to play 20 questions like “what size type are you allocating” etc.
And as an aside, this approach is not likely to be very good for performance. Flattening everything into a single large allocation, indexed arithmetically, will not only be simpler but also give higher performance.
Pointer chasing isn’t terribly fast on GPUs (at least not compared to arithmetic index calculation, i.e. adds and multiplies).
A bunch of fragmented allocations will also create a flood of traffic at kernel launch time: large numbers of relatively small transfers, as opposed to one large transfer.
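As a rough illustration of what that index arithmetic looks like, a hypothetical helper (the name and the row-major layout are just an example) that maps (i,j,k,l) in an a x b x c x d volume to an offset into one flat allocation:

#include <cstddef>

// Hypothetical helper: row-major offset of element (i,j,k,l) in an a x b x c x d volume.
// Only adds and multiplies -- no dependent pointer loads like ary[i][j][k][l].
// (The extent a of the slowest dimension is not needed for the offset.)
// Mark it __host__ __device__ when compiling with nvcc so kernels can use it too.
inline size_t idx4(size_t i, size_t j, size_t k, size_t l,
                   size_t b, size_t c, size_t d) {
    return ((i * b + j) * c + k) * d + l;
}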
Hi all, thank you for your answers and help. Please look at the following code block. To reproduce the segmentation fault, you can simply compile it with g++ or clang++ (linking against the CUDA runtime). The first allocation is fine; the second one fails.
BTW, if you can tell me another way to do this, apart from transforming all my code from 4D arrays to a 1D array, I’d really appreciate it.
Thanks in advance
#include <iostream>
#include <new>
#include <cuda_runtime.h>

void* operator new[] (size_t len) throw(std::bad_alloc) {
    void *ptr;
    cudaMallocManaged(&ptr, len);
    return ptr;
}

template<typename T>
T**** create_4d(int a, int b, int c, int d) {
    T**** ary = new T***[a];
    for (int i = 0; i < a; ++i) {
        ary[i] = new T**[b];
        for (int j = 0; j < b; ++j) {
            ary[i][j] = new T*[c];
            for (int k = 0; k < c; ++k) {
                ary[i][j][k] = new T[d];
            }
        }
    }
    return ary;
}

int main() {
    double ****data;

    std::cout << "allocating..." << std::endl;
    data = create_4d<double>(32, 65, 65, 5);
    std::cout << "Hooreey !!!" << std::endl;

    // segfault here (second, larger allocation)
    std::cout << "allocating..." << std::endl;
    data = create_4d<double>(64, 65, 65, 5);
    std::cout << "Hooreey !!!" << std::endl;

    return 0;
}
Currently my main configuration is a Power 8 with a Tesla K40, and that is where I am having the issue. However, I tried the code piece I provided above on an Intel + K40 machine, and surprisingly it worked.
So, is that an NVIDIA bug on Power systems? It is kind of an obstacle for my work at the moment :/
In re: flattening the arrays to 1D, something like the sketch below (the usual worked examples cover the 2D and 3D cases, but they are easily extendable to 4D, and this is more efficient):
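A minimal 4D version along those lines (variable and helper names are just illustrative, and error handling is kept to the bare minimum): one managed allocation for the whole volume, indexed arithmetically instead of by pointer chasing.

#include <cstdio>
#include <cstddef>
#include <cuda_runtime.h>

// Row-major offset of element (i,j,k,l) in an a x b x c x d volume.
inline size_t idx4(size_t i, size_t j, size_t k, size_t l,
                   size_t b, size_t c, size_t d) {
    return ((i * b + j) * c + k) * d + l;
}

int main() {
    const size_t a = 64, b = 65, c = 65, d = 5;
    double *data = nullptr;

    // One large managed allocation instead of the many small ones made by create_4d.
    cudaError_t err = cudaMallocManaged(&data, a * b * c * d * sizeof(double));
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Element (i,j,k,l) is data[idx4(i,j,k,l,b,c,d)], on the host and in kernels alike.
    data[idx4(63, 64, 64, 4, b, c, d)] = 1.0;
    std::printf("Hooreey !!!\n");

    cudaFree(data);
    return 0;
}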
As I already pointed out in a comment to the slightly expanded version of this question that was cross-posted to Stackoverflow, your data structure is not a multi-dimensional array, but rather a tree with four levels. It is just a quirk of C++ syntax that the access looks like access to a multi-dimensional array.
The costs of this approach are quite high. Every time you create such an object, it will require hundreds of thousands of small allocations, likely to fragment available memory. Access to arbitrary submatrices will be complicated and slow, and the data structure will not allow the application of matrix arithmetic libraries such as BLAS.
Most importantly, access to an array element will require pointer chasing through multiple levels of indirection, turning each element access into four dependent memory accesses for the general case. This is very likely to have a significant negative impact on code performance.
The fairly coarse granularity of UVM allocation (minimum of one 4KB page per allocation?) is a very good point, although the OP mentioned in their cross-posted question on Stack Overflow that the above program succeeds on their x64 platform but fails on their Power 8 platform, which might be a bug (see "Multidimensional array allocation with Cuda Unified Memory on Power 8" on Stack Overflow).
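As a back-of-the-envelope check (assuming each cudaMallocManaged call consumes at least one page): the 64 x 65 x 65 x 5 case makes 1 + 64 + 64*65 + 64*65*65 = 274,625 separate allocations. At the 64 KB page size typical of Power 8 systems, that would be roughly 18 GB, more than the 12 GB on a K40; at 4 KB per allocation it is only about 1.1 GB, which fits comfortably. That would be consistent with the program succeeding on the x86 box and failing on the Power 8 box, but it is only a hypothesis.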
As I understand it, if you are working with flat arrays, both systems behave the same way. But if you are doing pointer chasing, connecting pointers to each other like I did, the runtime allocates an excessively huge amount of memory.
The page sizes of the systems are 4 KB and 64 KB respectively. Can we interpret this difference as being due to the page size? With a page size 16 times bigger, does each allocation take a more or less 16 times bigger portion?
And another quick question: in the case of a Pascal GPU, everything works as expected. But is that the recommended way for Pascal? I mean, do I lose performance on Pascal since I used multi-dimensional arrays?
Thinking more about the case with Pascal, your code was just allocating.
It is possible that once you touch the pages, the same overhead may show up even on Pascal.
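One way to check would be something like this sketch (my own, host-side only) as the body of main in the reproducer above: touch every element so the managed pages actually get populated, then see how the reported memory changes.

    double ****data = create_4d<double>(64, 65, 65, 5);

    // Touch every element so the pages are populated, not just allocated.
    for (int i = 0; i < 64; ++i)
        for (int j = 0; j < 65; ++j)
            for (int k = 0; k < 65; ++k)
                for (int l = 0; l < 5; ++l)
                    data[i][j][k][l] = 0.0;

    // Rough check of device memory consumption. Note that on Pascal, managed pages
    // migrate on demand, so a device-side touch (e.g. from a kernel) gives a better
    // picture of what ends up resident in GPU memory.
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::cout << "device free: " << free_bytes << " of " << total_bytes << " bytes" << std::endl;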
Right, because I basically connect the pointers to build the multi-dimensional array on the host side, so the first touch will be by the host. I would like to try it on Pascal when I have the chance. Also, is this the same for OS-based multi-dimensional array allocation on Pascal, without using cudaMallocManaged?
But another question: does the page size make a difference to performance? Because there is quite a huge difference between x86 and Power page sizes.
By the way, thank you for the quick reply and interest. It has been a really helpful conversation for me.