Xtream: A speedy CUDA C++ productivity framework

I’m making our CUDA C++ productivity framework, Xtream, publicly available. It’s essentially a thin layer of C++11 features on top of CUDA, simplifying kernel writing and memory management.

A brief introduction can be found in these slides: https://gitlab.com/jipe4153/xtream/blob/master/doc/xtream_intro.pdf

The source code can be found here:

# Make sure you have a gitlab user with SSH keys added!
git clone git@gitlab.com:jipe4153/xtream.git
cd xtream
# build tests & samples
./build.sh
# doxygen doc
cd build
make doc

The code is completely header-only, so compilation is only needed for the tests and sample applications.

Xtream has been of great use to a lot of people internally over the years and I think a community approach is the way to go.

Some code snippets:

Vector addition:

int N = 1024*256;
// Vectors A,B,C
DeviceBuffer<int> A(N); // Create 1xN vector
DeviceBuffer<int> B(N);
DeviceBuffer<int> C(N);
// Set every element of B to 1 and of C to 2 using a for_each lambda call
B.for_each([=]__device__(int& val){val=1;});
C.for_each([=]__device__(int& val){val=2;});
//
// Set up the vector addition kernel:
auto vectorAdditionKernel = xpl_device_lambda()
{
    // Index in X-dimension
    int j = device::getX(); // same as threadIdx.x + blockIdx.x*blockDim.x
    if( j < A.cols() )
    {
        A(j) = B(j) + C(j);
    }
};
// Execute the kernel, create grid mapped to output buffer 'A'
device::execute(A.size(), vectorAdditionKernel);
// Test:
// We expect 'A' to be 1+2
bool ok = true;
for(int j = 0; j < N; j++)
{
    // Read 'A' values directly from host side code
    if( A(j) != 3)
    {
        ok = false;
        break;
    }
}
std::cout << "\n Status: " << (ok ? "PASS" : "FAIL") << "\n";
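For readers new to CUDA, the index arithmetic noted in the comment on device::getX() and the bounds check in the kernel can be emulated on the host. The following is a minimal, host-only C++ sketch, not Xtream code: blocksFor and emulatedVectorAdd are hypothetical names introduced for illustration, and how device::execute actually chooses its block size is not shown in this post.

```cpp
#include <cassert>
#include <vector>

// Number of blocks needed to cover n elements (ceiling division).
inline int blocksFor(int n, int blockSize)
{
    return (n + blockSize - 1) / blockSize;
}

// Host-only emulation of a 1D launch (hypothetical helper, not Xtream API).
// Every (block, thread) pair computes the global index a CUDA thread would
// compute, and threads past the end of the buffer do nothing.
// Returns the total number of emulated threads.
inline int emulatedVectorAdd(std::vector<int>& A,
                             const std::vector<int>& B,
                             const std::vector<int>& C,
                             int blockSize)
{
    assert(A.size() == B.size() && A.size() == C.size());
    const int n = static_cast<int>(A.size());
    const int gridSize = blocksFor(n, blockSize);
    int launched = 0;
    for (int blockIdx = 0; blockIdx < gridSize; ++blockIdx)
    {
        for (int threadIdx = 0; threadIdx < blockSize; ++threadIdx)
        {
            // Same formula as threadIdx.x + blockIdx.x*blockDim.x
            const int j = threadIdx + blockIdx * blockSize;
            ++launched;
            if (j < n) // bounds guard: the last block may overshoot
            {
                A[j] = B[j] + C[j];
            }
        }
    }
    return launched;
}
```

With 1000 elements and a block size of 256, four blocks are needed and the last 24 threads fall outside the buffer, which is why the kernel above checks j < A.cols() before writing.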

Mean filter using shared memory:

DeviceBuffer<float> in(rows,cols);
DeviceBuffer<float> out(in.size());

in.for_each([=]__device__(float& val){val=1.0f;});
out.for_each([=]__device__(float& val){val=0.0f;});

// define grid, map to output size
auto grid = DeviceGrid(out.size());
// Set up shared memory; the block stencil defines access to neighboring elements:
Smem<float> smem( grid, BlockStencil(1,1));
//
// Set up the mean filter kernel:
auto meanFilter = xpl_device_lambda()
{
    // Row/col indices (i,j)
    int i = device::getY();
    int j = device::getX();
    // Cache input data in shared memory
    smem.cache(in);
    //
    // Compute mean in 3x3 neighborhood
    float sum = 0.0f;
    for(int y : {-1,0,1})
        for(int x : {-1,0,1})
            sum += smem(threadIdx.y + y, threadIdx.x + x);

    float mean = sum/9.0f;
    // output
    if( i < out.rows() && j < out.cols())
        out(i,j) = mean;

};
// Execute the kernel on the grid mapped to output buffer 'out'
device::execute(grid, meanFilter);
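To check the output of a kernel like the one above, a plain CPU reference is handy. The following is a minimal host-only C++ sketch under stated assumptions: meanFilter3x3Reference is a hypothetical name, the image is stored row-major in a std::vector, and out-of-image neighbors are clamped to the nearest edge pixel. How Xtream's Smem and BlockStencil treat the border is not stated in this post, so the border values may differ from what the kernel produces.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only reference for a 3x3 mean filter on a row-major rows x cols image
// (hypothetical helper, not part of Xtream).
// ASSUMPTION: neighbors outside the image are clamped to the nearest edge
// pixel; the framework's real border handling may be different.
inline std::vector<float> meanFilter3x3Reference(const std::vector<float>& in,
                                                 int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    assert(in.size() == static_cast<std::size_t>(rows) * cols);
    std::vector<float> out(in.size(), 0.0f);
    for (int i = 0; i < rows; ++i)
    {
        for (int j = 0; j < cols; ++j)
        {
            float sum = 0.0f;
            for (int y = -1; y <= 1; ++y)
            {
                for (int x = -1; x <= 1; ++x)
                {
                    // Clamp the neighbor coordinates into the image
                    const int r = std::min(std::max(i + y, 0), rows - 1);
                    const int c = std::min(std::max(j + x, 0), cols - 1);
                    sum += in[r * cols + c];
                }
            }
            out[i * cols + j] = sum / 9.0f;
        }
    }
    return out;
}
```

For the all-ones input used in the snippet, this reference returns 1.0f everywhere, so interior pixels of the GPU result can be compared against it directly.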

FAQ

  • Isn’t this just like Thrust?

See the comparisons with Thrust and SYCL (OpenCL) in the intro PDF. There is some feature overlap, but the frameworks are not designed for the same purpose, and the list of differences is long. Basically:

  • Xtream is aimed at people who need to write CUDA kernels and more customizable code.
  • Use Thrust whenever your problem fits Thrust’s STL-like iterators and algorithms.
  • Integration between Xtream/Thrust/CUDA is simple.
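As an illustration of a problem that fits the STL-like style, the vector addition from earlier can be written as a single algorithm call. This is a minimal host-only sketch using std::transform from the C++ standard library; addWithTransform is a hypothetical name. thrust::transform accepts the same (first1, last1, first2, result, op) argument shape, so the device version would look much the same.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Element-wise A = B + C expressed as one STL algorithm call
// (hypothetical helper for illustration, runs on the host only).
inline std::vector<int> addWithTransform(const std::vector<int>& B,
                                         const std::vector<int>& C)
{
    assert(B.size() == C.size());
    std::vector<int> A(B.size());
    std::transform(B.begin(), B.end(), C.begin(), A.begin(), std::plus<int>());
    return A;
}
```

Once a computation needs shared memory, neighbor access, or custom indexing, as in the mean filter above, it no longer maps onto a single algorithm call this neatly, and that is where writing the kernel yourself pays off.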

Sounds interesting!

  • Isn’t this just like Thrust?

What about a comparison to CUB (which is more low level than Thrust)?

I do not know why it is hosted on GitLab (Microsoft?), but at least the mirror on GitHub will definitely bring more followers to this project.

As you’ve noted, CUB’s scope is generally lower level (block- and warp-level primitives). The device-wide primitives of CUB do seem to overlap slightly with Thrust (e.g. reduce, scan, sort).

I frequently use CUB code as part of my Xtream kernels for block and warp scoped operations.

I was unaware of there being a GitHub mirror? Going with GitLab had several advantages (price and features) at the time of that decision… I do think GitHub (which is now owned by Microsoft) might be catching up again.

Thanks for the feedback!

PS
There are a lot of updates in the pipeline, including an improved build system and new features.