I’m making our CUDA C++ productivity framework, Xtream, publicly available. It’s essentially a thin layer of C++11 features on top of CUDA, simplifying kernel writing and memory management.
A brief introduction can be found in these slides https://gitlab.com/jipe4153/xtream/blob/master/doc/xtream_intro.pdf
The source code can be found here:
# Make sure you have a gitlab user with SSH keys added!
git clone git@gitlab.com:jipe4153/xtream.git
cd xtream
# build tests & samples
./build.sh
# doxygen doc
cd build
make doc
The code is entirely header-only, so compilation is only needed for the tests and sample applications.
Xtream has proven very useful to a lot of people internally over the years, and I think a community-driven approach is the way forward.
Some code snippets:
Vector addition:
int N = 1024*256;
// Vectors A,B,C
DeviceBuffer<int> A(N); // Create 1xN vector
DeviceBuffer<int> B(N);
DeviceBuffer<int> C(N);
// Set B to 1 and C to 2 using a for_each lambda call
B.for_each([=]__device__(int& val){val=1;});
C.for_each([=]__device__(int& val){val=2;});
//
// Set up the vector addition kernel:
auto vectorAdditionKernel = xpl_device_lambda()
{
// Index in X-dimension
int j = device::getX(); // same as threadIdx.x + blockIdx.x*blockDim.x
if( j < A.cols() )
{
A(j) = B(j) + C(j);
}
};
// Execute the kernel, create grid mapped to output buffer 'A'
device::execute(A.size(), vectorAdditionKernel);
// Test:
// We expect every element of 'A' to equal 1+2 = 3
bool ok = true;
for(int j = 0; j < N; j++)
{
// Read 'A' values directly from host side code
if( A(j) != 3)
{
ok = false;
break;
}
}
std::cout << "\n Status: " << (ok ? "PASS" : "FAIL") << "\n";
Mean filter using shared memory:
int rows = 1024, cols = 1024; // image dimensions
DeviceBuffer<float> in(rows,cols);
DeviceBuffer<float> out(in.size());
in.for_each([=]__device__(float& val){val=1.0f;});
out.for_each([=]__device__(float& val){val=0.0f;});
// define grid, map to output size
auto grid = DeviceGrid(out.size());
// Set up shared memory; the block stencil defines access to neighboring elements:
Smem<float> smem( grid, BlockStencil(1,1));
//
// Set up the mean filter kernel:
auto meanFilter = xpl_device_lambda()
{
// Row/col indices (i,j)
int i = device::getY();
int j = device::getX();
// Cache input data in shared memory
smem.cache(in);
//
// Compute mean over the 3x3 neighborhood
float sum = 0.0f;
for(int y : {-1,0,1})
for(int x : {-1,0,1})
sum += smem(threadIdx.y + y, threadIdx.x + x);
float mean = sum/9.0f;
// output
if( i < out.rows() && j < out.cols())
out(i,j) = mean;
};
// Execute the kernel on the grid mapped to output buffer 'out'
device::execute(grid, meanFilter);
FAQ
- Isn’t this just like Thrust?
The intro PDF compares Xtream with both Thrust and SYCL (OpenCL). While there is some feature overlap, they are not designed for the same purpose and differ in many respects. In short:
- Xtream is aimed at developers who need to write their own CUDA kernels and more customizable code.
- Use Thrust whenever your problem fits Thrust's STL-like iterators and algorithms.
- Integration between Xtream/Thrust/CUDA is simple.