Device data not transferring back to host correctly

Hello,

I was hoping someone could help me find my error in pulling data back from the device and updating the host. I read the earlier post (https://forums.developer.nvidia.com/t/openacc-declare-and-update-directives/133314/1) about how you need to initialize data on the GPU after a create data pragma, or else you will only get uninitialized values back. My problem is very similar: I am only getting “zeroed out” values when I update the host after doing calculations on the device. But I am fairly sure the initialization is correct, so I think my problem may be in the pragma pulling the data back to the host.

The two pieces of data are a C++ std::vector< std::vector<double> > p_density_vectors (the host data) and a double ** gpu_p_density_vectors (the device data). The idea is that gpu_p_density_vectors is initialized through pointers to the values of p_density_vectors, gpu_p_density_vectors is then altered on the GPU in the work loop, the host copy of gpu_p_density_vectors is updated, and finally p_density_vectors is updated with the new values from gpu_p_density_vectors.

This is the code for the transfer and initialization:

void Microenvironment::transfer_3D()
{
        // start gpu_p_density_vectors
        const int bin_p_density_vectors = (*p_density_vectors).size();

        gpu_p_density_vectors = new double*[bin_p_density_vectors];

        sizes_p_density_vectors = new int[bin_p_density_vectors];
        #pragma acc enter data copyin(this[0:1])
        #pragma acc enter data create(this->gpu_p_density_vectors[0:bin_p_density_vectors][0:0])

        for (int i = 0; i < bin_p_density_vectors; i++){
                int sze = (*p_density_vectors)[i].size();
                sizes_p_density_vectors[i] = sze;
                gpu_p_density_vectors[i] = (*p_density_vectors)[i].data();
                #pragma acc enter data copyin(this->gpu_p_density_vectors[i:1][:sze])
        }
        #pragma acc enter data copyin(this->sizes_p_density_vectors[:bin_p_density_vectors])
        // end gpu_p_density_vectors
}

Then there is a work loop that alters gpu_p_density_vectors on the GPU.

This is the code to update the host and refresh p_density_vectors with the new values in gpu_p_density_vectors:

void Microenvironment::translate_array_to_vector()
        /* translate_array_to_vector is actually an update between the two versions
         * of p_density_vector (DEVICE = gpu_p_density_vector, HOST = p_density_vector),
         * updating the host with the current values of the 
         * device gpu_p_density_vector
         */
{
        const int bin_p_density_vectors = (*p_density_vectors).size();

        for (int i = 0; i < bin_p_density_vectors; i++){
                int sze = (*p_density_vectors)[i].size();
                #pragma acc update host(this->gpu_p_density_vectors[i:1][:sze])
        }
        #pragma acc update host(this->sizes_p_density_vectors[:bin_p_density_vectors])

        for (int i = 0; i < bin_p_density_vectors; i++) {
                for (int j = 0; j < this->sizes_p_density_vectors[i]; j++) {
                        (*p_density_vectors)[i][j] = this->gpu_p_density_vectors[i][j];
                }
        }
}

But the problem is that in my runs the values are “nothing”, meaning they are still the initial values, as if no work had been done, even though with nvidia-smi I can see the data on the GPU. Would anyone be able to help me by pointing out my error?

Thanks a bunch!

Hi mattstack,

I’m not seeing anything wrong with what you have (at least this is how I’d write this code) so I suspect that something else is going on, such as your compute region. Can you provide a full reproducing example?

Given the complexity of your data structure, you may consider using CUDA Unified Memory (i.e. -ta=tesla:managed) instead of manually managing the data. You’d also be able to use the vector directly on the device rather than using the extra “gpu” array. Just don’t push/pop or allocate the vector on the device, stick to using the data access operator (i.e. operator[]), and it should work fine.

-Mat

Hi Mat,

Thanks for your response! I tried to isolate what I think may be my mistake, could you let me know what you think of this example?

This is a very bare-bones reconstruction of the project. The interesting part is down in main(), where there are two comments, TRY 1 and TRY 2. If you comment them out individually, you get two different results. TRY 1 does not work and results in a segfault, though other times I’ve tried to isolate this it didn’t segfault but gave strange numbers. TRY 2 does work, with no segfault and correct values, but it is the same code, just not in a class function. Very strange. Do you think I might be messing up my pointers somewhere in TRY 1?

Thanks for your help!!

Let me know if it gives errors and I can try to fix it.

Compile command I’ve been using:
pgc++ -acc -ta=tesla,cc70 -o test.o test.cpp

#include <iostream>
#include <vector>
#include "openacc.h"
#include <cmath>

using namespace std;


class Micro {
        public:
                std::vector< std::vector<double> >* p_density_vectors;

                double ** gpu_p_density_vectors;

                int * sizes_p_density_vectors;

                bool data_bool = false;

                void build();

                void redirect();

                void transfer_3D();

                void translate_vector_to_array();
                void translate_array_to_vector();

                void diffusion_gpu();

                void axpy_gpu(double* y);
};
void Micro::build(){
        std::vector<double> v2(10, 1);
        std::vector< std::vector<double> > mt(10, v2);
        p_density_vectors = &mt;
}


// this is "solver"
void solver(Micro &en){
        if (en.data_bool == false){
                std::cout << "transfering to device" << std::endl;
                en.transfer_3D();
                std::cout << "-continuing-" << std::endl;
                en.data_bool = true;
        }
        else {
                //en.translate_vector_to_array();
        }

        en.diffusion_gpu();
}

void Micro::redirect(){
        solver(*this);
}

void Micro::transfer_3D(){
        // start gpu_p_density_vectors
        const int bin_p_density_vectors = (*p_density_vectors).size();
        gpu_p_density_vectors = new double*[bin_p_density_vectors];

        sizes_p_density_vectors = new int[bin_p_density_vectors];
        #pragma acc enter data copyin(this[0:1])
        #pragma acc enter data create(this->gpu_p_density_vectors[0:bin_p_density_vectors][0:0])

        for (int i = 0; i < bin_p_density_vectors; i++){
                int sze = (*p_density_vectors)[i].size();
                sizes_p_density_vectors[i] = sze;
                gpu_p_density_vectors[i] = (*p_density_vectors)[i].data();
                #pragma acc enter data copyin(this->gpu_p_density_vectors[i:1][:sze])
        }
        #pragma acc enter data copyin(this->sizes_p_density_vectors[:bin_p_density_vectors])
        // end gpu_p_density_vectors

}

void Micro::translate_vector_to_array(){
        /* translate_vector_to_array is actually an update between the two versions
         * of p_density_vector (DEVICE = gpu_p_density_vector, HOST = p_density_vector),
         * updating the device with the current values of the 
         * host p_density_vector
         */
        const int bin_p_density_vectors = (*p_density_vectors).size();

        for (int i = 0; i < bin_p_density_vectors; i++){
                int sze = (*p_density_vectors)[i].size();
                sizes_p_density_vectors[i] = sze;
                gpu_p_density_vectors[i] = (*p_density_vectors)[i].data();
                #pragma acc update device(this->gpu_p_density_vectors[i:1][:sze])
        }
        #pragma acc update device(this->sizes_p_density_vectors[:bin_p_density_vectors])

}

void Micro::translate_array_to_vector(){
        /* translate_array_to_vector is actually an update between the two versions
         * of p_density_vector (DEVICE = gpu_p_density_vector, HOST = p_density_vector),
         * updating the host with the current values of the 
         * device gpu_p_density_vector
         */
        const int bin_p_density_vectors = (*p_density_vectors).size();

        for (int i = 0; i < bin_p_density_vectors; i++){
                int sze = (*p_density_vectors)[i].size();
                #pragma acc update host(this->gpu_p_density_vectors[i:1][:sze])
        }
        #pragma acc update host(this->sizes_p_density_vectors[:bin_p_density_vectors])

        for (int i = 0; i < bin_p_density_vectors; i++) {
                for (int j = 0; j < this->sizes_p_density_vectors[i]; j++) {
                        (*p_density_vectors)[i][j] = this->gpu_p_density_vectors[i][j];
                }
        }
}

void Micro::diffusion_gpu(){
        #pragma acc parallel loop present(gpu_p_density_vectors, sizes_p_density_vectors)
        for (int i = 0; i < 10; i++){
                axpy_gpu(gpu_p_density_vectors[i]);
        }
}

#pragma acc routine seq
void Micro::axpy_gpu(double* y){
        for (int j = 0; j < 10; j++){
                y[j] = 2;
        }
}

int main (){
        Micro e1;

	//TRY 1
//        e1.build();
        // VS
        // TRY 2
//        std::vector<double> v2(10, 1);
//       std::vector< std::vector<double> > mt(10, v2);
//        e1.p_density_vectors = &mt;

        bool init = true;

        // work loop
        for (int i = 0; i < 25; i++) {
                if ((i == 24) && (init == false)){
                        std::cout << "updating host" << std::endl;
                        e1.translate_array_to_vector();
                        std::cout << "-continuing from host update-" << std::endl;
                }
        init = false;

        e1.redirect();
        }

        // print out p_density_vector after "work loop"
        for (int i = 0; i < (*e1.p_density_vectors).size(); i ++){
                for (int j = 0; j < (*e1.p_density_vectors)[i].size(); j ++){
                        double u = (*e1.p_density_vectors)[i][j];
                        std::cout << u << " ";
                }
                std::cout << std::endl;
        }
}

Hi Matt,

If you comment them out individually, you get two different results. TRY 1 does not work and results in a segfault, though other times I’ve tried to isolate this it didn’t segfault but gave strange numbers. TRY 2 does work, with no segfault and correct values, but it is the same code, just not in a class function.

They actually aren’t the same code. In build, “mt” is a local variable located on the stack. Once build returns, mt goes out of scope and p_density_vectors is left pointing at a dead stack location.
This explains why it sometimes segfaults and sometimes gives odd values; it just depends on what’s left over on the stack.

I ran the code through gdb and saw that the value returned from “(*p_density_vectors).size();” was -5, which then triggered an allocation error when creating the gpu_p_density_vectors array.

With TRY 2, mt is also on the stack but since it’s in main, it doesn’t go out of scope so is still valid.

Note that this is a problem with the host code and occurs with or without OpenACC enabled.

To fix it, dynamically allocate mt so it’s on the heap:

void Micro::build(){
        std::vector<double> v2(10, 1);
//        std::vector< std::vector<double> > mt(10, v2);
        std::vector< std::vector<double> > * mt;
        mt = new std::vector< std::vector<double> >(10,v2);
        p_density_vectors = mt;
}

Hope this helps,
Mat

Hi Mat,

Ah yes, you are right about that, my mistake! I have been exploring this error a lot, and it is still very strange that my real code is not transferring back correctly from the GPU but works perfectly fine with -ta=multicore. I have exhausted all my branches of investigation to pinpoint the error. Have you ever seen code that runs in parallel and outputs correctly on the CPU but not on the GPU, and could you offer advice on where to look? I don’t think it is in the transfer statements, which you said earlier look right, and it’s not in the function calls at output time, or else the CPU version wouldn’t work either. And it’s not a parallel race overwriting memory, or else the parallel CPU version wouldn’t work either.

Thanks again for your time!

Have you ever seen code that runs in parallel and outputs correctly on the CPU but not on the GPU, and could you offer advice on where to look?

Most likely it’s a data issue where some variable or variables aren’t getting synchronized between the host and device.

Assuming you allocate your data, you can try using CUDA Unified Memory (-ta=tesla:managed) to have the CUDA runtime manage the data movement. If that works, then you know definitely it’s a data management issue.

Typically when I add data directives (i.e. manually managing the data and not using “managed”), I’ll start by adding unstructured data regions (#pragma acc enter data create(arr[:size])) just after I allocate the host data. This way the device copy of the variable has the same lifetime and scope as the host variable. Put a “#pragma acc exit data delete(arr)” just before deallocating the variable.

Next, add “#pragma acc update device(arr[:size])” just before each of the compute regions and a “#pragma acc update self(arr[:size])” just after. This should ensure that the data is synchronized correctly, albeit not good for performance.

Next, start moving the “update” directives outward over wider sections of code. If you start getting wrong answers, look to see if the section of code that uses the variable can be offloaded, even if it’s run sequentially, so you can save the cost of copying data. Eventually you want to offload all sections of code that use these variables (sans I/O) so that no data movement is needed.
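The workflow above can be sketched as follows. This is an illustrative pattern under assumed names (make_array, scale, free_array), not code from the project: device lifetime matches host lifetime, with update directives bracketing the compute region. Compiled without OpenACC, the pragmas are ignored and the host result is identical, which is what makes this verbose starting point easy to validate.

```cpp
double* make_array(int size) {
    double* arr = new double[size];
    for (int i = 0; i < size; i++) arr[i] = 1.0;
    #pragma acc enter data copyin(arr[:size])   // device copy created right after host allocation
    return arr;
}

void scale(double* arr, int size, double a) {
    #pragma acc update device(arr[:size])       // host -> device just before the compute region
    #pragma acc parallel loop present(arr[:size])
    for (int i = 0; i < size; i++) arr[i] *= a;
    #pragma acc update self(arr[:size])         // device -> host just after
}

void free_array(double* arr, int size) {
    #pragma acc exit data delete(arr[:size])    // device copy removed just before host deallocation
    delete[] arr;
}
```

Once this version produces correct results, the update pairs in scale can be hoisted outward over wider sections of code, as described in the steps above.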

Hope this helps,
Mat

Hi Mat,

Thank you for your help with this! I found the error thanks to your suggestion of using -ta=tesla:managed, which produced the correct output, so I knew it was a data movement problem. It turns out, after a lot of testing and searching, that I had an extra “pragma acc copyin” where I shouldn’t have, on a 1-D double array, because I was so used to needing it for the 2-D double arrays.

Thanks!

Great! Glad you found the issue.