Invalid results when using multiple GPUs with OpenMP threads

I’ve been working extensively on optimizing and debugging the MCX kernel for multi-GPU execution using OpenACC, but I’m still encountering incorrect results. The issue seems specific to the multi-GPU setup, as the kernel runs correctly on a single GPU.

Observations:

  • When running with multiple GPUs, the detected photon data (detdata.detphotondata) is consistently zero, even though energy escape and absorption appear normal.
  • The same kernel, when run on a single GPU, provides expected results (non-zero detected photons).
  • Even enforcing explicit atomic updates in OpenACC for detphotonbuffer didn’t help.

Possible issue:

One potential culprit is detphotonbuffer, which may be getting overwritten across GPUs or not being correctly assigned per photon/thread. However, I haven't been able to pinpoint why this only happens with multiple GPUs.

If anyone has insights into potential causes or best practices for handling device-local buffers in OpenACC multi-GPU environments, I would really appreciate any guidance.

template<const bool isreflect, const bool issavedet>
double MCX_kernel(json& cfg, const MCX_param& gcfg, MCX_volume<int>& inputvol, MCX_volume<float>& outputvol, float4* detpos, MCX_medium* prop, MCX_detect& detdata) {

    double total_energyescape = 0.0;
    std::srand(!cfg["Session"].contains("RNGSeed") ? 1648335518 : (cfg["Session"]["RNGSeed"].get<int>() > 0 ? cfg["Session"]["RNGSeed"].get<int>() : std::time(0)));

    const uint64_t nphoton = cfg["Session"].value("Photons", 1000000);
    const dim4 seeds = {(uint32_t)std::rand(), (uint32_t)std::rand(), (uint32_t)std::rand(), (uint32_t)std::rand()};
    const float4 pos = {cfg["Optode"]["Source"]["Pos"][0].get<float>(), cfg["Optode"]["Source"]["Pos"][1].get<float>(), cfg["Optode"]["Source"]["Pos"][2].get<float>(), 1.f};
    const float4 dir = {cfg["Optode"]["Source"]["Dir"][0].get<float>(), cfg["Optode"]["Source"]["Dir"][1].get<float>(), cfg["Optode"]["Source"]["Dir"][2].get<float>(), 0.f};

    // Get number of available GPUs
    int num_devices = 1;
#ifdef _OPENACC
    num_devices = acc_get_num_devices(acc_get_device_type());
#endif

    // Split photons across GPUs
    const uint64_t photons_per_gpu = nphoton / num_devices;

    // Create GPU buffers once (persistent memory)
    std::vector<MCX_volume<float>> gpu_outputs(num_devices, outputvol);
    std::vector<MCX_detect> gpu_detdata(num_devices, detdata);

#ifdef _OPENACC
    #pragma acc enter data copyin(pos, dir, seeds, gcfg, inputvol, prop[0:gcfg.mediumnum], detpos[0:gcfg.detnum], inputvol.vol[0:inputvol.dimxyzt])
    #pragma acc enter data create(outputvol.vol[0:outputvol.dimxyzt], detdata.detphotondata[0:detdata.maxdetphotons*detdata.detphotondatalen])
#endif

    #pragma omp parallel for reduction(+:total_energyescape) num_threads(num_devices)
    for(int dev = 0; dev < num_devices; ++dev) {
        double device_energy = 0.0;
        MCX_rand ran(seeds.x, seeds.y, seeds.z, seeds.w);
        MCX_photon p(pos, dir, ran, gcfg);

#ifdef GPU_OFFLOAD
        #pragma acc set device_num(dev)
#endif

        const uint64_t start = dev * photons_per_gpu;
        const uint64_t end = (dev == num_devices-1) ? nphoton : start + photons_per_gpu;

#ifdef _OPENACC
        float* detphotonbuffer = nullptr; // Declare before OpenACC region

        #pragma acc enter data create(detphotonbuffer[0:detdata.ppathlen * issavedet])
#endif

#ifdef GPU_OFFLOAD
        #pragma acc parallel loop gang vector collapse(1) reduction(+:device_energy) firstprivate(ran, p)
#endif
        for(uint64_t i = start; i < end; ++i) {
#ifdef _OPENACC
            #pragma acc data create(detphotonbuffer[0:detdata.ppathlen * issavedet])
            {
                #pragma acc parallel loop
                for (int j = 0; j < detdata.ppathlen * issavedet; j++) {
                    detphotonbuffer[j] = 0.0f;
                }

                ran.reseed(seeds.x ^ i, seeds.y | i, seeds.z ^ i, seeds.w | i);
                p.launch(pos, dir, ran, gcfg);
                p.run<isreflect, issavedet>(inputvol, gpu_outputs[dev], prop, detpos,
                                            gpu_detdata[dev], detphotonbuffer, ran, gcfg);
                device_energy += p.pos.w;

                if(issavedet){
                    for(int j = 0; j < detdata.ppathlen * issavedet; j++){
                        #pragma acc atomic update
                        gpu_detdata[dev].detphotondata[j + i * detdata.ppathlen * issavedet] += detphotonbuffer[j];
                    }
                }
            } // End of OpenACC data region
#endif
        }

#ifdef _OPENACC
        #pragma acc exit data delete(detphotonbuffer[0:detdata.ppathlen * issavedet])
#endif

        total_energyescape += device_energy;
    }

#ifdef _OPENACC
    // Copy back device results only once
    #pragma acc update self(outputvol.vol[0:outputvol.dimxyzt], detdata.detphotondata[0:detdata.maxdetphotons*detdata.detphotondatalen])

    // Cleanup GPU memory
    #pragma acc exit data delete(pos, dir, seeds, gcfg, inputvol, prop[0:gcfg.mediumnum], detpos[0:gcfg.detnum], inputvol.vol[0:inputvol.dimxyzt])
    #pragma acc exit data delete(outputvol.vol[0:outputvol.dimxyzt], detdata.detphotondata[0:detdata.maxdetphotons*detdata.detphotondatalen])
#endif

    // Merge results from all GPUs (if necessary)
    for(int dev = 0; dev < num_devices; ++dev) {
        for(size_t i = 0; i < outputvol.dimxyzt; ++i) {
            outputvol.vol[i] += gpu_outputs[dev].vol[i];
        }
        if(issavedet) {
            for(size_t i = 0; i < detdata.maxdetphotons*detdata.detphotondatalen; ++i) {
                detdata.detphotondata[i] += gpu_detdata[dev].detphotondata[i];
            }
        }
    }

    return total_energyescape;
}

Hi mti6319 and welcome!

It’s difficult to tell for sure without a reproducer, but here are a few things in the code that could be problematic.

I’m presuming “GPU_OFFLOAD” isn’t defined, given you should get a compilation error if it were: you can’t include a data directive inside a compute region, nor do we support nested parallelism (i.e. a compute region inside another compute region).
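
As an illustration only of the restructuring (not a drop-in fix, and it doesn’t address the other issues below), the per-photon buffer zeroing would need to live inside the single compute region as an inner sequential loop rather than a nested data region plus parallel loop:

        #pragma acc parallel loop gang vector reduction(+:device_energy) firstprivate(ran, p)
        for (uint64_t i = start; i < end; ++i) {
            // zero the per-photon buffer inside the same compute region
            #pragma acc loop seq
            for (int j = 0; j < detdata.ppathlen * issavedet; j++) {
                detphotonbuffer[j] = 0.0f;
            }
            // ... launch and run the photon as before ...
        }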

However, if “GPU_OFFLOAD” isn’t defined, the device setting isn’t used and all OpenMP threads end up using the same device.

But more importantly, only the initialization of detphotonbuffer is actually getting offloaded; the rest of the code would be executing on the host. Given detphotonbuffer is NULL, I’m not sure how this code is running successfully at all. I’d think it would seg fault, so something is off between how I’m reading the code and what you describe.

If for some reason it’s not seg faulting due to the NULL pointer, detphotonbuffer isn’t initialized on the host, so you’d be getting random values. It may happen to be zero in the single-thread case, but not in the multithread case.

Now let’s presume “GPU_OFFLOAD” is defined, the compiler is ignoring the inner data region and parallel loop, and the code is using multiple devices. I’d look towards the data directives before the OpenMP loop: these only get applied to the default device, so the other devices won’t have copies of the data. I’d probably remove the scalars from the enter data directive so they become firstprivate to the parallel loop, while the arrays should be moved to each device after setting the device number. I’m also not seeing “outputvol” used in the compute region, so I’m not sure if it’s needed or not.
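
Roughly, the per-thread ordering I’d expect (a sketch using the names from your code, not a drop-in replacement):

    #pragma omp parallel num_threads(num_devices) reduction(+:total_energyescape)
    {
        int dev = omp_get_thread_num();
#ifdef _OPENACC
        // select this thread's GPU first, then create/copy the arrays so each device gets its own copies
        acc_set_device_num(dev, acc_get_device_type());
        #pragma acc enter data copyin(inputvol, inputvol.vol[0:inputvol.dimxyzt], \
                                      prop[0:gcfg.mediumnum], detpos[0:gcfg.detnum])
#endif
        // ... compute region for this device's photon range ...
#ifdef _OPENACC
        #pragma acc exit data delete(inputvol.vol, inputvol, prop, detpos)
#endif
    }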

For “detphotonbuffer”, it looks like you’re wanting this to be a device-only array, but variables used in a data directive must have the same size and shape on both the host and device, and here the buffer isn’t allocated on the host. It may “work”, but it’s technically illegal and would likely cause seg faults if you were ever to target a multicore CPU.

The better way to do this is to allocate the data via “acc_malloc” and then add the “deviceptr” clause on the structured data region or parallel loop. acc_malloc maps to cudaMalloc when targeting a GPU and to malloc when targeting a multicore CPU, so it works in both cases.

Something like:

#ifdef _OPENACC
#include "openacc.h"
#endif
...
        // allocate the device-only buffer before the OpenACC region
        float* detphotonbuffer = (float*) acc_malloc(sizeof(float) * detdata.ppathlen * issavedet);
        for (uint64_t i = start; i < end; ++i) {
#ifdef _OPENACC
            #pragma acc data deviceptr(detphotonbuffer)
            {
                #pragma acc parallel loop
...
        acc_free(detphotonbuffer);
...

Sorry if this answer is a bit rambling, but I’m really not sure how this code works at all. From what I see, it shouldn’t. Though if you can provide a minimal reproducing example, I can look at it again to get you a more definitive answer.

-Mat

Hi Mat,

Thank you for your valuable feedback and recommendations. As a beginner in GPU programming, I deeply appreciate your guidance as I work to refine and understand these concepts.

To provide clarity, here’s what the multi-GPU version of the code is designed to do:

  • Objective: The goal of this multi-GPU implementation is to distribute the total number of photons across multiple GPUs, process them in parallel, and aggregate the results to calculate energyescape while updating outputvol and (if enabled) detdata.detphotondata. Each GPU handles an equal share of photons, with the last GPU processing any remaining photons.
  • Workflow:
    1. Initialization: Device-specific memory is allocated for outputvol, detphotonbuffer, and, if issavedet is enabled, detdata.detphotondata. These buffers are initialized to zero on each GPU.
    2. Parallel processing: Each GPU runs its allocated photons independently, simulating photon transport and interaction. Intermediate results, such as photon paths in detphotonbuffer, are computed locally per GPU.
    3. Reduction and aggregation: After computation, the results from each GPU are aggregated on the host. For outputvol and detdata.detphotondata, the buffers are summed across GPUs to produce the final result.
    4. Cleanup: Device memory is freed once the computation is complete.

I incorporated all your recommendations into this implementation:

  • Used acc_malloc for allocating device-only memory for detphotonbuffer, d_vol, and d_detphotondata to ensure compatibility with GPUs and multicore CPUs.
  • Leveraged deviceptr to correctly attach device pointers to the compute regions.
  • Removed scalars from the enter data directives and ensured arrays are explicitly moved to the device after setting the GPU number.
  • Implemented thread-private allocation for detphotonbuffer to avoid race conditions and used #pragma acc atomic for updates to shared memory buffers.
  • Ensured proper initialization and cleanup of all device buffers to avoid undefined behavior.

Despite these changes, the multi-GPU version is still producing incorrect results. It compiles and runs without errors, but the output deviates from expectations when multiple GPUs are involved. The single-GPU version works perfectly, producing the correct results.

Below is the updated code for the multi-GPU version with your suggestions applied:

template<const bool isreflect, const bool issavedet>
double MCX_kernel(json& cfg, const MCX_param& gcfg, MCX_volume<int>& inputvol, MCX_volume<float>& outputvol, float4* detpos, MCX_medium* prop, MCX_detect& detdata) {
    double energyescape = 0.0;
    std::srand(cfg["Session"].value("RNGSeed", std::time(0)));

    const uint64_t nphoton = cfg["Session"].value("Photons", 1000000);
    const dim4 seeds = {(uint32_t)std::rand(), (uint32_t)std::rand(), (uint32_t)std::rand(), (uint32_t)std::rand()};
    const float4 pos = {cfg["Optode"]["Source"]["Pos"][0].get<float>(), cfg["Optode"]["Source"]["Pos"][1].get<float>(), cfg["Optode"]["Source"]["Pos"][2].get<float>(), 1.f};
    const float4 dir = {cfg["Optode"]["Source"]["Dir"][0].get<float>(), cfg["Optode"]["Source"]["Dir"][1].get<float>(), cfg["Optode"]["Source"]["Dir"][2].get<float>(), 0.f};
    const int num_gpus = cfg["Session"].value("NumGPUs", acc_get_num_devices(acc_device_nvidia));
    const uint64_t photons_per_gpu = nphoton / num_gpus;
    const uint64_t photons_remainder = nphoton % num_gpus;

    #pragma omp parallel num_threads(num_gpus) reduction(+:energyescape)
    {
        int dev = omp_get_thread_num();
        uint64_t start_idx = dev * photons_per_gpu;
        uint64_t end_idx = (dev + 1) * photons_per_gpu;
        if (dev == num_gpus - 1) end_idx += photons_remainder; // Handle remainder

#ifdef _OPENACC
        acc_set_device_num(dev, acc_device_nvidia);

        // Allocate device-specific buffers
        float* detphotonbuffer = nullptr;
        if (issavedet) {
            detphotonbuffer = (float*)acc_malloc(sizeof(float) * detdata.ppathlen);
        }

        MCX_volume<float> local_outputvol = outputvol;
        MCX_detect local_detdata = detdata;

        float* d_vol = (float*)acc_malloc(sizeof(float) * outputvol.dimxyzt);
        local_outputvol.vol = d_vol;

        if (issavedet) {
            float* d_detphotondata = (float*)acc_malloc(sizeof(float) * detdata.maxdetphotons * detdata.detphotondatalen);
            local_detdata.detphotondata = d_detphotondata;

            #pragma acc enter data copyin(local_detdata.detphotondata[:detdata.maxdetphotons * detdata.detphotondatalen])
        }

        // Initialize volumes to zero
        #pragma acc parallel loop deviceptr(d_vol)
        for (size_t idx = 0; idx < outputvol.dimxyzt; ++idx) {
            d_vol[idx] = 0.f;
        }

        if (issavedet) {
            #pragma acc parallel loop deviceptr(detphotonbuffer)
            for (size_t idx = 0; idx < detdata.ppathlen; ++idx) {
                detphotonbuffer[idx] = 0.f;
            }
        }

        #pragma acc enter data create(local_outputvol.vol[:outputvol.dimxyzt])

        #pragma acc data copyin(inputvol.vol[:inputvol.dimxyzt], \
                               prop[:gcfg.mediumnum], detpos[:gcfg.detnum]) \
                     deviceptr(local_outputvol.vol, detphotonbuffer)
        {
            #pragma acc parallel loop reduction(+:energyescape) firstprivate(seeds, pos, dir, gcfg)
            for (uint64_t i = start_idx; i < end_idx; ++i) {
                MCX_rand ran(seeds.x ^ i, seeds.y | i, seeds.z ^ i, seeds.w | i);
                MCX_photon p(pos, dir, ran, gcfg);

                if (issavedet) {
                    #pragma acc loop seq
                    for (int idx = 0; idx < detdata.ppathlen; ++idx) {
                        detphotonbuffer[idx] = 0.f;
                    }
                }

                p.launch(pos, dir, ran, gcfg);
                p.run<isreflect, issavedet>(inputvol, local_outputvol, prop, detpos, local_detdata, detphotonbuffer, ran, gcfg);

                energyescape += p.pos.w;

                if (issavedet) {
                    #pragma acc loop seq
                    for (int j = 0; j < detdata.ppathlen; ++j) {
                        const size_t det_idx = j + (i - start_idx) * detdata.ppathlen;
                        #pragma acc atomic
                        local_detdata.detphotondata[det_idx] += detphotonbuffer[j];
                    }
                }
            }
        }

        // Copy data back to host
        #pragma acc update self(local_outputvol.vol[:outputvol.dimxyzt])
        if (issavedet) {
            #pragma acc update self(local_detdata.detphotondata[:detdata.maxdetphotons * detdata.detphotondatalen])
        }

        // Accumulate results on the host
        for (size_t idx = 0; idx < outputvol.dimxyzt; ++idx) {
            outputvol.vol[idx] += local_outputvol.vol[idx];
        }

        if (issavedet) {
            for (size_t idx = 0; idx < (end_idx - start_idx) * detdata.ppathlen; ++idx) {
                detdata.detphotondata[idx] += local_detdata.detphotondata[idx];
            }
        }

        // Cleanup
        acc_free(d_vol);
        if (issavedet) {
            acc_free(local_detdata.detphotondata);
            acc_free(detphotonbuffer);
        }
#endif
    }

    return energyescape;
}

And here is the single GPU version with the correct behavior:

template<const bool isreflect, const bool issavedet>
double MCX_kernel(json& cfg, const MCX_param& gcfg, MCX_volume<int>& inputvol, MCX_volume<float>& outputvol, float4* detpos, MCX_medium* prop, MCX_detect& detdata) {
    double energyescape = 0.0;
    std::srand(!(cfg["Session"].contains("RNGSeed")) ? 1648335518 : (cfg["Session"]["RNGSeed"].get<int>() > 0 ? cfg["Session"]["RNGSeed"].get<int>() : std::time(0)));
    const uint64_t nphoton = cfg["Session"].value("Photons", 1000000);
    const dim4 seeds = {(uint32_t)std::rand(), (uint32_t)std::rand(), (uint32_t)std::rand(), (uint32_t)std::rand()};  //< TODO: need to implement per-thread ran object
    const float4 pos = {cfg["Optode"]["Source"]["Pos"][0].get<float>(), cfg["Optode"]["Source"]["Pos"][1].get<float>(), cfg["Optode"]["Source"]["Pos"][2].get<float>(), 1.f};
    const float4 dir = {cfg["Optode"]["Source"]["Dir"][0].get<float>(), cfg["Optode"]["Source"]["Dir"][1].get<float>(), cfg["Optode"]["Source"]["Dir"][2].get<float>(), 0.f};
    MCX_rand ran(seeds.x, seeds.y, seeds.z, seeds.w);
    MCX_photon p(pos, dir, ran, gcfg);
#ifdef _OPENACC
    int ppathlen = detdata.ppathlen;
    float* detphotonbuffer = (float*)calloc(sizeof(float), detdata.ppathlen);
#endif
#ifdef GPU_OFFLOAD
    const int totaldetphotondatalen = issavedet ? detdata.maxdetphotons * detdata.detphotondatalen : 1;
    const int deviceid = cfg["Session"].value("DeviceID", 1) - 1, gridsize = cfg["Session"].value("ThreadNum", 100000) / cfg["Session"].value("BlockSize", 64);
#ifdef _LIBGOMP_OMP_LOCK_DEFINED
    const int blocksize = cfg["Session"].value("BlockSize", 64) / 32;  // gcc nvptx offloading uses {32,teams_thread_limit,1} as blockdim
#else
    const int blocksize = cfg["Session"].value("BlockSize", 64); // nvc uses {num_teams,1,1} as griddim and {teams_thread_limit,1,1} as blockdim
#endif
    _PRAGMA_OMPACC_COPYIN(pos, dir, seeds, gcfg, inputvol) _PRAGMA_OMPACC_COPYIN(prop[0:gcfg.mediumnum], detpos[0:gcfg.detnum], inputvol.vol[0:inputvol.dimxyzt])
    _PRAGMA_OMPACC_COPY(outputvol, detdata) _PRAGMA_OMPACC_COPY(outputvol.vol[0:outputvol.dimxyzt], detdata.detphotondata[0:totaldetphotondatalen])
    _PRAGMA_OMPACC_GPU_LOOP(gridsize, blocksize, deviceid, firstprivate(detphotonbuffer[0:ppathlen]), reduction(+ : energyescape) firstprivate(ran, p))
#else  // GPU_OFFLOAD
    _PRAGMA_OMPACC_HOST_LOOP(reduction(+ : energyescape) firstprivate(ran, p))
#endif

    for (uint64_t i = 0; i < nphoton; i++) {
#ifndef _OPENACC
#ifdef USE_MALLOC
        float* detphotonbuffer = (float*)malloc(sizeof(float) * detdata.ppathlen * issavedet);
        memset(detphotonbuffer, 0, sizeof(float) * detdata.ppathlen * issavedet);
#else
        float detphotonbuffer[issavedet ? 10 : 1] = {};   // TODO: if changing 10 to detdata.ppathlen, speed of nvc++ built binary drops by 5x to 10x
#endif
#else
        memset(detphotonbuffer, 0, sizeof(float) * detdata.ppathlen * issavedet);
#endif
        ran.reseed(seeds.x ^ i, seeds.y | i, seeds.z ^ i, seeds.w | i);
        p.launch(pos, dir, ran, gcfg);
        p.run<isreflect, issavedet>(inputvol, outputvol, prop, detpos, detdata, detphotonbuffer, ran, gcfg);
        energyescape += p.pos.w;
#ifndef _OPENACC
#ifdef USE_MALLOC
        free(detphotonbuffer);
#endif
#endif
    }

#ifdef _OPENACC
    free(detphotonbuffer);
#endif
    return energyescape;
}

Thanks again,

Again, I don’t know if these are causing your issue, but I do see more potential problems. Note, I’m assuming you’re not using Unified Memory.

        if (issavedet) {
            float* d_detphotondata = (float*)acc_malloc(sizeof(float) * detdata.maxdetphotons * detdata.detphotondatalen);
            local_detdata.detphotondata = d_detphotondata;

            #pragma acc enter data copyin(local_detdata.detphotondata[:detdata.maxdetphotons * detdata.detphotondatalen])
        }

This should be changed to just:

#pragma acc enter data copyin(local_detdata)

This will do a shallow copy of the parent, and given “detphotondata” holds a device address, the shallow copy is fine. By putting the array section itself in the copyin clause, the code isn’t using the address from acc_malloc, but rather allocating a separate new device copy.

#pragma acc enter data create(local_outputvol.vol[:outputvol.dimxyzt])

For a deep copy, the parent object needs to be copied to the device as well prior to the dynamic data members. Then when the dynamic data member is created on the device, the compiler performs an implicit “attach”, which fills in the correct device address for the device copy of the member.

i.e. add:

#pragma acc enter data create(local_outputvol, local_outputvol.vol[:outputvol.dimxyzt])

Similar for “inputvol”. Also, I’d then move “local_outputvol” to a “present” clause. Again, the shallow copy will move the device address over as part of the object, so there’s no need to use deviceptr.

#pragma acc data copyin(inputvol, inputvol.vol[:inputvol.dimxyzt], \
                     prop[:gcfg.mediumnum], detpos[:gcfg.detnum]) \
                     present(local_outputvol), deviceptr(detphotonbuffer)

For this code:

        if (issavedet) {
            #pragma acc update self(local_detdata.detphotondata[:detdata.maxdetphotons * detdata.detphotondatalen])
        }

Above, “detphotondata” is a device pointer, so it doesn’t have a corresponding host array associated with it. It probably “works” because of the additional acc copyin of the array, but this really should seg fault since who knows what host address is being used. It might just be luck.

However, you do need this array back on the host to perform the final reduction, so I’ll revise my earlier suggestion: don’t use acc_malloc for this one; malloc the memory instead. Then the update will work as intended.

        if (issavedet) {
            float* d_detphotondata = (float*)malloc(sizeof(float) * detdata.maxdetphotons * detdata.detphotondatalen);
            local_detdata.detphotondata = d_detphotondata;

            #pragma acc enter data copyin(local_detdata, local_detdata.detphotondata[:detdata.maxdetphotons * detdata.detphotondatalen])
        }

Then later, instead of calling acc_free on local_detdata.detphotondata, use an “exit data” directive:

        // Cleanup
        acc_free(d_vol);
        if (issavedet) {
            acc_free(detphotonbuffer);
#pragma acc exit data delete(local_detdata.detphotondata, local_detdata)
        }

You’ll also want an exit data directive for local_outputvol to avoid a memory leak on the device.
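
For example (a minimal sketch mirroring the enter data create suggested above):

        #pragma acc exit data delete(local_outputvol.vol[:outputvol.dimxyzt], local_outputvol)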

For this bit of code:

        if (issavedet) {
            for (size_t idx = 0; idx < (end_idx - start_idx) * detdata.ppathlen; ++idx) {
                detdata.detphotondata[idx] += local_detdata.detphotondata[idx];
            }
        }

First, I would have expected this to segv due to the device pointer having no associated host address, but assuming I’m missing something there, the code as written has a potential race condition given multiple OpenMP threads are writing to the same memory locations. You’ll want to add an atomic update.
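
Something like (a sketch, reusing the names from your code):

        if (issavedet) {
            for (size_t idx = 0; idx < (end_idx - start_idx) * detdata.ppathlen; ++idx) {
                // serialize concurrent writes from the OpenMP threads
                #pragma omp atomic update
                detdata.detphotondata[idx] += local_detdata.detphotondata[idx];
            }
        }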

Note, I personally don’t like using OpenMP for multi-GPU programming. I much prefer MPI+OpenACC since it gives a one-to-one mapping between process and GPU and also allows for multi-node runs. With multiple threads, it’s tricky to get the memory management correct given the multiple discrete memories involved.
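
For reference, a minimal sketch of that MPI+OpenACC device binding (assuming NVIDIA GPUs and that MPI_Init has already been called):

#include <mpi.h>
#include <openacc.h>

    // one rank per GPU: bind each MPI rank to a device at startup
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int ngpus = acc_get_num_devices(acc_device_nvidia);
    acc_set_device_num(rank % ngpus, acc_device_nvidia);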