Xtream: A speedy CUDA C++ productivity framework

Jimmy_Pettersson · October 11, 2018, 11:37am

I’m making our CUDA C++ productivity framework, Xtream, publicly available. It’s essentially a thin layer of C++11 features on top of CUDA, simplifying kernel writing and memory management.

A brief introduction can be found in these slides: https://gitlab.com/jipe4153/xtream/blob/master/doc/xtream_intro.pdf

The source code can be found here:

# Make sure you have a GitLab account with SSH keys added!
git clone git@gitlab.com:jipe4153/xtream.git
cd xtream
# build tests & samples
./build.sh
# doxygen doc
cd build
make doc

The code is header-only, so compilation is only needed for the tests and sample applications.

Xtream has been of great use to a lot of people internally over the years and I think a community approach is the way to go.

Some code snippets:

Vector addition:

int N = 1024*256;
// Vectors A,B,C
DeviceBuffer<int> A(N); // Create 1xN vector
DeviceBuffer<int> B(N);
DeviceBuffer<int> C(N);
// Set B to 1 and C to 2 using a for_each lambda call
B.for_each([=]__device__(int& val){val=1;});
C.for_each([=]__device__(int& val){val=2;});
//
// Setup vector addition kernel:
auto vectorAdditionKernel = xpl_device_lambda()
{
    // Index in X-dimension
    int j = device::getX(); // same as threadIdx.x + blockIdx.x*blockDim.x
    if( j < A.cols() )
    {
        A(j) = B(j) + C(j);
    }
};
// Execute the kernel, create grid mapped to output buffer 'A'
device::execute(A.size(), vectorAdditionKernel);
// Test:
// We expect 'A' to be 1+2
bool ok = true;
for(int j = 0; j < N; j++)
{
    // Read 'A' values directly from host side code
    if( A(j) != 3)
    {
        ok = false;
        break;
    }
}
std::cout << "\n Status: " << (ok ? "PASS" : "FAIL") << "\n";
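For readers less familiar with CUDA indexing, here is a minimal host-only sketch of what the index computation and the bounds guard in the kernel above amount to. It is plain C++ that emulates a 1D launch on the CPU and does not use the Xtream API. The helper names (blocksFor, emulateVectorAdd) and the threadsPerBlock parameter are assumptions made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Xtream): number of blocks needed to
// cover n elements with threadsPerBlock threads each (ceiling division).
inline std::size_t blocksFor(std::size_t n, std::size_t threadsPerBlock)
{
    assert(threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// CPU emulation of a 1D launch computing a[j] = b[j] + c[j].
// Returns the number of emulated threads, including those that the
// bounds guard turns into no-ops.
inline std::size_t emulateVectorAdd(std::vector<int>& a,
                                    const std::vector<int>& b,
                                    const std::vector<int>& c,
                                    std::size_t threadsPerBlock)
{
    assert(a.size() == b.size() && a.size() == c.size());
    const std::size_t n = a.size();
    const std::size_t blocks = blocksFor(n, threadsPerBlock);
    std::size_t launched = 0;
    for (std::size_t block = 0; block < blocks; ++block)
    {
        for (std::size_t thread = 0; thread < threadsPerBlock; ++thread)
        {
            ++launched;
            // Same formula as threadIdx.x + blockIdx.x*blockDim.x
            const std::size_t j = thread + block * threadsPerBlock;
            if (j < n) // guard: the last block may overshoot the buffer
            {
                a[j] = b[j] + c[j];
            }
        }
    }
    return launched;
}
```

With 1000 elements and 256 threads per block, four blocks are needed, so 1024 threads run and the last 24 are masked out by the guard. That is why the kernel checks j against the buffer width before writing.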

Mean filter using shared memory:

DeviceBuffer<float> in(rows,cols);
DeviceBuffer<float> out(in.size());

in.for_each([=]__device__(float& val){val=1.0f;});
out.for_each([=]__device__(float& val){val=0.0f;});

// define grid, map to output size
auto grid = DeviceGrid(out.size());
// Set up shared memory; the block stencil defines access to neighboring elements:
Smem<float> smem( grid, BlockStencil(1,1));
//
// Set up mean filter kernel:
auto meanFilter = xpl_device_lambda()
{
    // Row/col indices (i,j)
    int i = device::getY();
    int j = device::getX();
    // Cache input data in shared memory
    smem.cache(in);
    //
    // Compute mean in 3x3 neighborhood
    float sum = 0.0f;
    for(int y : {-1,0,1})
        for(int x : {-1,0,1})
            sum += smem(threadIdx.y + y, threadIdx.x + x);

    float mean = sum/9.0f;
    // output
    if( i < out.rows() && j < out.cols())
        out(i,j) = mean;

};
// Execute the kernel on the grid mapped to output buffer 'out'
device::execute(grid, meanFilter);
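To check the filter output, a host-side reference can be handy. The following is a minimal sketch in plain C++, not part of Xtream. The function name meanFilter3x3Reference and the row-major std::vector layout are assumptions for illustration. The border handling (clamping to the nearest valid pixel) is also an assumption, since the snippet above does not say how the shared-memory halo is filled at the image edges.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <vector>

// Hypothetical CPU reference for a 3x3 mean filter on a row-major image.
// Assumption: out-of-range neighbors are clamped to the nearest valid pixel.
inline std::vector<float> meanFilter3x3Reference(const std::vector<float>& in,
                                                 int rows, int cols)
{
    assert(rows > 0 && cols > 0);
    assert(in.size() == static_cast<std::size_t>(rows) * cols);

    std::vector<float> out(in.size());
    for (int i = 0; i < rows; ++i)
    {
        for (int j = 0; j < cols; ++j)
        {
            float sum = 0.0f;
            for (int y : {-1, 0, 1})
            {
                for (int x : {-1, 0, 1})
                {
                    // Clamp the neighbor coordinates into the image
                    const int ii = std::min(std::max(i + y, 0), rows - 1);
                    const int jj = std::min(std::max(j + x, 0), cols - 1);
                    sum += in[static_cast<std::size_t>(ii) * cols + jj];
                }
            }
            out[static_cast<std::size_t>(i) * cols + j] = sum / 9.0f;
        }
    }
    return out;
}
```

For an input of all ones, as in the snippet above, every output value of this reference is 1.0f, so it gives a quick sanity check for the interior of the GPU result. Edge values may differ if the device code treats borders differently.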

FAQ

Isn’t this just like Thrust?

See the comparisons with Thrust and SYCL (OpenCL) in the intro PDF. While there is some feature overlap, they are not designed for the same purpose, and there is a long list of differing features. Basically:

Xtream is aimed at people who need to write CUDA kernels and more customizable code.
Use Thrust whenever your problem fits Thrust's STL-like iterators and algorithms.
Integration between Xtream/Thrust/CUDA is simple.
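As a rough illustration of what "STL-like" means here, the vector addition from above can be written as a single algorithm call. This sketch uses std::transform on the host as a stand-in, so that it compiles without CUDA. Thrust offers thrust::transform with the same iterator-based shape, typically used with thrust::device_vector and thrust::plus. The function name addVectors is an assumption for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Host-only stand-in: element-wise a = b + c as one algorithm call.
// thrust::transform takes the same (first1, last1, first2, result, op)
// arguments when used with Thrust containers and functors.
inline std::vector<int> addVectors(const std::vector<int>& b,
                                   const std::vector<int>& c)
{
    assert(b.size() == c.size());
    std::vector<int> a(b.size());
    std::transform(b.begin(), b.end(), c.begin(), a.begin(),
                   std::plus<int>());
    return a;
}
```

When a problem fits this shape, there is no kernel to write at all. The kernel-style snippets earlier in the post are aimed at cases that need explicit indexing or shared memory.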

slovak194 · January 17, 2019, 9:26am

Sounds interesting!

Isn’t this just like Thrust?

What about a comparison to CUB (which is more low level than Thrust)?

I do not know why it is hosted on GitLab (Microsoft?), but at least a mirror on GitHub would definitely bring more followers to this project.

Jimmy_Pettersson · January 17, 2019, 2:44pm

As you’ve noted, CUB’s scope is generally lower level (block- and warp-level primitives). The device-wide primitives of CUB do seem to overlap slightly with Thrust (e.g. reduce, scan, sort).

I frequently use CUB code as part of my Xtream kernels for block and warp scoped operations.

I was unaware of there being a GitHub mirror? There were several advantages to going with GitLab (price and features) at the time of that decision… I do think GitHub (which is now owned by M$) might be catching up again.

Thanks for the feedback!

PS
There are a lot of updates in the pipeline with regard to an improved build system and new features.
