Hi danxpy,
I have tried using -ta=tesla:managed
and -ta=tesla:deepcopy
but both failed.
Here’s the error:
% nvc++ -acc test.org.cpp -Minfo=accel -gpu=managed ; a.out
main:
53, Generating NVIDIA GPU code
56, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
53, Generating implicit copy(T,m) [if not already present]
ScalarField::ScalarField():
13, Generating enter data copyin(this[:1])
Generating enter data create(_data[:_size])
ScalarField::~ScalarField():
17, Generating exit data delete(this[:1],_data[:1])
ScalarField::size() const:
21, Generating acc routine seq
Generating NVIDIA GPU code
ScalarField::operator[](int) const:
25, Generating acc routine seq
Generating NVIDIA GPU code
ScalarField::update_self():
30, Generating update self(_data[:_size])
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 8.0, threadid=1
Hint: specify 0x800 bit in NV_ACC_DEBUG for verbose info.
host:0x7ffdca894980 device:0x1497212fa400 size:16 presentcount:1+1 line:13 name:this[:1]
host:0x7ffdca8949e8 device:0x1497212fa000 size:16 presentcount:0+1 line:13 name:this[:1]
host:0x7ffdca8949f8 device:0x1497212fa200 size:16 presentcount:0+1 line:13 name:this[:1]
allocated block device:0x1497212fa000 size:512 thread:1
allocated block device:0x1497212fa200 size:512 thread:1
allocated block device:0x1497212fa400 size:512 thread:1
Present table errors:
m lives at 0x7ffdca8949e8 size 32 partially present in
host:0x7ffdca8949e8 device:0x1497212fa000 size:16 presentcount:0+1 line:13 name:this[:1] file:test.org.cpp
host:0x7ffdca8949f8 device:0x1497212fa200 size:16 presentcount:0+1 line:13 name:this[:1] file:test.org.cpp
FATAL ERROR: variable in data clause is partially present on the device: name=m
file: test.org.cpp main line:53
The Matrix “m” is not in a data region, so the compiler needs to implicitly copy it when entering the compute region (as indicated in the compiler feedback messages). However, its two ScalarField data members are already present on the device, since ScalarField’s constructor maps the “this” pointer. Part, but not all, of “m” is already present, hence the error.
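Here’s roughly the pattern I’m assuming test.org.cpp follows, reconstructed from the compiler feedback above (a sketch only; the details of your actual file will differ):

#define Scalar double

class ScalarField {
private:
    const int _size = 256;
    Scalar* _data;
public:
    ScalarField() {
        _data = new Scalar[_size];
        // The ScalarField sub-object maps its own "this" here, before the
        // enclosing Matrix object has ever been mapped.
#pragma acc enter data copyin(this) create(_data[0:_size])
    }
    ~ScalarField() {
#pragma acc exit data delete(_data,this)
        delete[] _data;
    }
#pragma acc routine seq
    int size() const { return _size; }
#pragma acc routine seq
    Scalar& operator [] (int i) const { return _data[i]; }
};

class Matrix {
public:
    ScalarField Su;   // both members are already on the device
    ScalarField cF;   // once a Matrix has been constructed
};

int main() {
    Matrix m;
    ScalarField T;
    // No data clause for "m", so the compiler generates an implicit copy(m).
    // That copy overlaps the two ScalarField sub-objects already mapped by
    // their constructors, which triggers the "partially present" error.
#pragma acc parallel loop
    for(int i = 0; i < T.size(); i++) {
        T[i] = Scalar(i);
        m.Su[i] = Scalar(1.0);
        m.cF[i] = T[i] * 3 + m.Su[i];
    }
    return 0;
}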
CUDA Unified Memory (aka “managed”) only applies to dynamically allocated memory, so only “_data” is being managed. Since the error is caused by the statically declared ScalarField members of Matrix, it still occurs when using managed.
When you remove the data regions, the runtime can do a shallow copy of “m” and its members, with “_data” being a unified memory pointer that’s accessible on both the host and device.
If you changed the code so that “m”, “T”, and m’s two data members were all dynamically allocated, then they would all be placed in unified memory.
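For example, a stripped-down variant along these lines (my assumption of the “no data regions” route, not your actual code) would work when built with “-gpu=managed”:

#include <iostream>

#define Scalar double

class ScalarField {
private:
    const int _size = 256;
    Scalar* _data;
public:
    ScalarField() {
        _data = new Scalar[_size];   // allocation goes to unified memory with -gpu=managed
    }
    ~ScalarField() {
        delete[] _data;
    }
#pragma acc routine seq
    int size() const { return _size; }
#pragma acc routine seq
    Scalar& operator [] (int i) const { return _data[i]; }
};

class Matrix {
public:
    ScalarField Su;
    ScalarField cF;
};

int main() {
    Matrix m;
    ScalarField T;
    // The implicit copy(T,m) is a shallow copy; the _data pointers it carries
    // point at unified memory, so they are valid on both the host and device.
#pragma acc parallel loop
    for(int i = 0; i < T.size(); i++) {
        T[i] = Scalar(i);
        m.Su[i] = Scalar(1.0);
        m.cF[i] = T[i] * 3 + m.Su[i];
    }
    // With unified memory the results are directly readable on the host;
    // no update directives are needed.
    std::cout << T[255] << " " << m.Su[255] << " " << m.cF[255] << std::endl;
    return 0;
}

Compile with the same flag as above, i.e. “nvc++ -acc -gpu=managed”.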
If you wish to manually manage the data, you need to make a few changes. The way you have “ScalarField” written would work if it were not a member of Matrix. The problem is that Matrix also needs to be on the device, and the copy needs to happen from the top down.
To fix this, instead of putting the data region in ScalarField’s constructor, move it to a separate routine. Then add a constructor to Matrix which copies itself to the device and calls this initialization routine so that ScalarField’s _data gets onto the device.
Something like the following:
% cat test.cpp
#include <cstdlib>
#include <iostream>

#define Scalar double

class ScalarField {
private:
    const int _size = 256;
    Scalar* _data;
public:
    ScalarField() {
        _data = new Scalar[_size];
    }
    ~ScalarField() {
        delete[] _data;
#pragma acc exit data delete(_data,this)
        _data = 0;
    }
    void init() {
#pragma acc enter data copyin(this) create(_data[0:_size])
    }
#pragma acc routine seq
    int size() const {
        return _size;
    }
#pragma acc routine seq
    Scalar& operator [] (int i) const {
        return _data[i];
    }
    void update_self() {
#pragma acc update self(_data[0:_size])
    }
    void update_device() {
#pragma acc update device(_data[0:_size])
    }
};

class Matrix {
public:
    ScalarField Su;
    ScalarField cF;
    Matrix () {
#pragma acc enter data copyin(this)
        Su.init();
        cF.init();
    }
    void update_self() {
        Su.update_self();
        cF.update_self();
    }
    void update_device() {
        Su.update_device();
        cF.update_device();
    }
};

int main() {
    Matrix m;
    ScalarField T;
    T.init();
#pragma acc parallel loop present(m,T)
    for(int i = 0; i < T.size(); i++) {
        T[i] = Scalar(i);
        m.Su[i] = Scalar(1.0);
        m.cF[i] = T[i] * 3 + m.Su[i];
    }
    T.update_self();
    m.update_self();
    for(int i = 0; i < T.size(); i++) {
        std::cout << i << " " << T[i] << " " << m.Su[i] << " " << m.cF[i] << std::endl;
    }
    return 0;
}
% nvc++ -acc test.cpp -Minfo=accel ; a.out
main:
66, Generating present(T,m)
Generating NVIDIA GPU code
68, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
ScalarField::~ScalarField():
18, Generating exit data delete(this[:1],_data[:1])
ScalarField::init():
23, Generating enter data copyin(this[:1])
Generating enter data create(_data[:_size])
ScalarField::size() const:
26, Generating acc routine seq
Generating NVIDIA GPU code
ScalarField::operator[](int) const:
30, Generating acc routine seq
Generating NVIDIA GPU code
ScalarField::update_self():
35, Generating update self(_data[:_size])
Matrix::Matrix():
48, Generating enter data copyin(this[:1])
0 0 1 1
1 1 1 4
2 2 1 7
3 3 1 10
... cut due to length ...
253 253 1 760
254 254 1 763
255 255 1 766
Hope this helps,
Mat