Have written C++-ish wrappers for the CUDA Runtime API - wanna try them out?

Hello forum denizens,

I’ve just published a repository on GitHub:

cuda-api-wrappers: Thin C++-flavored wrappers for the CUDA runtime API

It’s a small header-mostly library, intended to allow for more C++-ish use of the CUDA Runtime API (which is very C-ish). Main features/design goals are:

  • Exceptions-with-status-data instead of returning status codes (see the sketch after this list).
  • Proxy objects for streams, devices, etc.
  • Judicious use of namespacing.
  • Straightforwardness and clarity in naming and semantics.
  • Thin and very lightweight - mostly vanishes when compiling.
  • Doesn't force any rich abstractions on you - merely exposes the runtime API in a different way.
  • No work done behind your back.
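
To give a taste of the first point: with the raw API you check a status code after every call, while with the wrappers a failure throws an exception carrying that status. Roughly like this (take the exact names as illustrative and check the headers for the real signatures):

// Raw runtime API - every call returns a status code you must remember to check:
float* raw_ptr;
cudaError_t result = cudaMalloc((void**)&raw_ptr, 1000 * sizeof(float));
if (result != cudaSuccess) { /* handle or propagate the error... */ }

// With the wrappers - a failure throws a cuda::runtime_error carrying the status:
try {
	auto buffer = cuda::memory::device::make_unique<float[]>(1000);
}
catch (const cuda::runtime_error& e) {
	std::cerr << e.what() << '\n';
}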

The library is not specific to my own research (which involves the use of GPUs in DBMSes) but is basically general-purpose; I haven’t seen anything similar so far, so I’m hoping more than a few people might find it useful.

Now, besides the shameless pitch here, I would also very much appreciate feedback, both on the design decisions and on potential invalid/buggy behavior. I have been using this code for a while now, but I don’t actually need all of the Runtime API, so some of the code is not well-tested.

Hi!

I like the use of C++11 (auto, unique_ptr, etc.) in the modified samples.

I was wondering if it would make sense to come up with specialized smart pointers that could deal with special CUDA memory allocations, such as those returned by cuda::memory::host::allocate. This could be done by wrapping a std::shared_ptr and providing a special deallocator.
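
Something along these lines, perhaps (just a sketch with a hypothetical helper name; error handling omitted):

template <typename T>
std::shared_ptr<T> make_shared_pinned(size_t num_elements)
{
	T* raw = nullptr;
	cudaMallocHost((void**)&raw, num_elements * sizeof(T)); // pinned host allocation
	return std::shared_ptr<T>(raw, [](T* p) { cudaFreeHost(p); }); // special deallocator
}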

I will surely keep an eye on this project.

Have an upvote… ah darn this isn’t reddit.

Christian

cbuchner1:
Well, I do have variants of std::unique_ptr for pinned host memory and for global device memory (but not for mapped region pairs) - it’s under cuda/api/unique_ptr.hpp. It’s not as elegant as it might be, but it works for trivially-constructible types. You can allocate either a single element or specify the number of elements. You can see one kind of these in “action” here.
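
Usage looks roughly like this - the allocations are freed automatically when the pointers go out of scope (see the header for the exact signatures):

auto host_buf = cuda::memory::host::make_unique<float[]>(n); // pinned host memory
auto device_buf = cuda::memory::device::make_unique<float[]>(n); // global device memory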

As for the use of C++11 - I actually use C++11 features rather sparingly; not for lack of desire, there’s just not that much you can do with them for API wrappers. Other parts of my code - like a collection of primitives, kind of a wider and deeper expansion of what cub offers at the warp level - have much more C++11.

Finally - instead of upvoting here, you can star me on GitHub :-)

cbuchner1:
I should probably add that there is the potential to use C++ lambdas, or anything which fits an std::function, as callbacks. So, for example, this works:

auto my_stream = cuda::device::current::get().create_stream(
	cuda::stream::default_priority + 1,
	cuda::stream::no_implicit_synchronization_with_default_stream);
my_stream.enqueue_callback(
	[&foo](cuda::stream::id_t stream_id, cuda::status_t status) {
		std::cout << "Hello " << foo << " world!\n";
	}
);

Could I get an example of how to use Managed Memory and a non-default stream? I am using a Jetson Xavier, which has shared GPU and CPU memory. I was initially using zero-copy, but switched to using Managed Memory per the recommendation here:

Then I have a pipeline of kernels based on this recommendation:

Overall, I want the cuda-api-wrappers equivalent of:

cudaMallocManaged( (void**)&input_raw_ptr, num_bytes, cudaMemAttachHost );
cudaMalloc( (void**)&immed_data1_raw_ptr, num_immed_bytes );
cudaMalloc( (void**)&immed_data2_raw_ptr, num_immed_bytes );
cudaMallocManaged( (void**)&output_raw_ptr, num_output_bytes );

// No cudaMemcpy HtoD needed.

my_kernel1<<< blocks_per_grid, threads_per_block, 0, stream >>>( immed_data1_raw_ptr, input_raw_ptr, num_vals );
my_kernel2<<< blocks_per_grid, threads_per_block, 0, stream >>>( immed_data2_raw_ptr, immed_data1_raw_ptr, num_immed_vals );
my_kernel3<<< blocks_per_grid, threads_per_block, 0, stream >>>( output_raw_ptr, immed_data2_raw_ptr, num_output_vals );

cudaStreamAttachMemAsync(stream, output_raw_ptr, 0, cudaMemAttachHost);
cudaStreamSynchronize(stream);

// No cudaMemcpy DtoH needed.

// Use output_raw_ptr data in host code

Hello Hazelnutv04,

I think this merits a separate forum thread, but - I’ll oblige. Let’s assume your data all has type float. Now, you’d write:

namespace cm = cuda::memory;
auto input = cm::managed::make_unique<float[]>(num_vals);
auto intermediate_1 = cm::device::make_unique<float[]>(num_intermediate_vals);
auto intermediate_2 = cm::device::make_unique<float[]>(num_intermediate_vals);
auto output = cm::managed::make_unique<float[]>(num_output_vals);

auto launch_config = cuda::make_launch_config(blocks_per_grid, threads_per_block, cuda::no_shared_memory);
stream.enqueue.kernel_launch(my_kernel1, launch_config, intermediate_1.get(), input.get(), num_vals);
stream.enqueue.kernel_launch(my_kernel2, launch_config, intermediate_2.get(), intermediate_1.get(), num_intermediate_vals);
stream.enqueue.kernel_launch(my_kernel3, launch_config, output.get(), intermediate_2.get(), num_output_vals);

// Whoops! This will only be supported in v0.4.0:
stream.enqueue.memory_attachment(output.get(), cm::managed::attachment_t::host);
// and for now, I only have a method for cudaMemAttachSingle :-( ... so you would need to write:
// cudaStreamAttachMemAsync(stream.id(), output.get(), 0, cudaMemAttachHost);
stream.synchronize();

See this resolved issue about the missing kind of attachment.

Note that, unlike in your example, here you won’t have to free the memory or destroy the stream explicitly. The price is the annoying .get() suffixes when using the unique pointers.
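
Once stream.synchronize() returns, the host can read the managed output buffer directly - no copy-back needed:

auto first_result = output.get()[0]; // managed memory is directly accessible from host code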