MGPU Library

Over the last few weeks I have been working on a multi-GPU library. It was originally designed to speed up MRI image reconstruction, but I extracted the generic parts and put together a library:

The MGPU library strives to simplify the implementation of high-performance applications and algorithms on multi-GPU systems. Its main goals are to abstract platform-dependent functions and vendor-specific APIs, and to simplify communication between different compute elements. The library is currently an alpha release, containing limited but already useful functionality.

It is a C++ library, so you need a reasonable compiler. It is developed for Ubuntu x64 and is not yet tested on any Windows platform (due to lack of hardware), though I managed to compile it in a virtual machine running Windows 7 x64.

Here’s the code:
And some documentation:
And some archives:

Let me know what you think, and whether you encounter problems or find any bugs.


Here’s an example of the features the library provides. It shows how a simple axpy kernel can be run on multiple GPUs in a system in a convenient way:

#include <stdlib.h>
#include <algorithm>
#include <vector>

#include <mgpu/container/seg_dev_vector.hpp>
#include <mgpu/transfer/copy.hpp>
#include <mgpu/invoke_kernel.hpp>
#include <mgpu/synchronization.hpp>

using namespace mgpu;

// generate a random number between 0 and 1
float random_number() { return ((float)(rand() % 100) / 100); }

// axpy CUDA kernel code
__global__ void axpy_kernel(
  float const a, float * X, float * Y, std::size_t size)
{
  unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < size) Y[i] = (a * X[i]) + Y[i];
}

// axpy CUDA kernel launcher
void axpy(float const a, dev_range<float> X, dev_range<float> Y)
{
  int threads = 256;
  int blocks = (X.size() + threads - 1) / threads;
  axpy_kernel<<< blocks, threads >>>(
    a, X.get_raw_pointer(), Y.get_raw_pointer(), Y.size());
}

int main(void)
{
  const std::size_t size = 1024;
  environment e;
  {
    std::vector<float> X(size), Y(size);
    float const a = 0.42f;
    std::generate(X.begin(), X.end(), random_number);
    std::generate(Y.begin(), Y.end(), random_number);

    // allocate segmented vectors distributed across all devices
    seg_dev_vector<float> X_dev(size), Y_dev(size);

    // scatter host data to the devices
    copy(X, X_dev.begin()); copy(Y, Y_dev.begin());

    // calculate on devices
    invoke_kernel_all(axpy, a, X_dev, Y_dev);

    // gather the distributed result back into one host vector
    copy(Y_dev, Y.begin());

    // result is now in Y
  }
}
The seg_dev_vector is a container that is distributed across all GPUs in the system. invoke_kernel_all invokes the kernel caller axpy once for each device; the distributed vector is translated to a local device range the kernel caller can work with. The first copy in this example is equivalent to a scatter (the host vector is scattered across all devices), and the second copy is a gather: the distributed data is gathered into one host vector.