Have written C++-ish wrappers for the CUDA Runtime API - wanna try them out?

Hello forum denizens,

I’ve just published a repository on GitHub:

cuda-api-wrappers: Thin C++-flavored wrappers for the CUDA runtime API

It’s a small header-mostly library, intended to allow for more C++-ish use of the CUDA Runtime API (which is very C-ish). Main features/design goals are:

  • Exceptions-with-status-data instead of returning status codes (see the sketch after this list).
  • Proxy objects for streams, devices, etc.
  • Judicious use of namespacing.
  • Straightforwardness and clarity in naming and semantics.
  • Thin and very lightweight - mostly vanishes when compiling.
  • Doesn't force any rich abstractions on you - merely exposes the runtime API in a different way.
  • No work done behind your back.
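
To give a taste of the first point: with the raw API you check a status code after every call, while with the wrappers a failure throws an exception carrying that status. Roughly like this (take the exact names as illustrative and check the headers for the real signatures):

// Raw runtime API - every call returns a status code you must remember to check:
float* raw_ptr;
cudaError_t result = cudaMalloc((void**)&raw_ptr, 1000 * sizeof(float));
if (result != cudaSuccess) { /* handle or propagate the error... */ }

// With the wrappers - a failure throws a cuda::runtime_error carrying the status:
try {
	auto buffer = cuda::memory::device::make_unique<float[]>(1000);
}
catch (const cuda::runtime_error& e) {
	std::cerr << e.what() << '\n';
}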

The library is not specific to my own research (which involves the use of GPUs in DBMSes) but is basically general-purpose; I haven’t seen anything similar so far, so I’m hoping more than a few people might find it useful.

Now, besides the shameless pitch here, I would also very much appreciate feedback, both on the design decisions and on potential invalid/buggy behavior. I have been using this code for a while now, but I don’t actually need all of the Runtime API, so some of the code is not well-tested.

Hi!

I like the use of C++11 (auto, unique_ptr, etc.) in the modified samples.

I was wondering if it would make sense to come up with specialized smart pointers that could deal with special CUDA memory allocations, such as those returned by cuda::memory::host::allocate. This could be done by wrapping a std::shared_ptr and providing a special deallocator.
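
Something along these lines, perhaps (just a sketch with a hypothetical helper name; error handling omitted):

template <typename T>
std::shared_ptr<T> make_shared_pinned(size_t num_elements)
{
	T* raw = nullptr;
	cudaMallocHost((void**)&raw, num_elements * sizeof(T)); // pinned host allocation
	return std::shared_ptr<T>(raw, [](T* p) { cudaFreeHost(p); }); // special deallocator
}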

I will surely keep an eye on this project.

Have an upvote… ah darn this isn’t reddit.

Christian

cbuchner1:
Well, I do have variants of std::unique_ptr for pinned host memory and for global device memory (but not for mapped region pairs) - it’s under cuda/api/unique_ptr.hpp. It’s not as elegant as it might be, but it works for trivially-constructible types. You can allocate either a single element or specify the number of elements. You can see one kind of these in “action” here.
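
Usage looks roughly like this - the allocations are freed automatically when the pointers go out of scope (see the header for the exact signatures):

auto host_buf = cuda::memory::host::make_unique<float[]>(n); // pinned host memory
auto device_buf = cuda::memory::device::make_unique<float[]>(n); // global device memory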

As for the use of C++11 - I actually use C++11 features rather sparingly; not for lack of desire, there’s just not that much you can do with them for API wrappers. Other parts of my code - like a collection of primitives, kind of a wider and deeper expansion of what cub offers at the warp level - have much more C++11.

Finally - instead of upvoting here, you can star me on GitHub :-)

cbuchner1:
I should probably add that there is the potential to use C++ lambdas, or anything which fits an std::function, as callbacks. So, for example, this works:

auto my_stream = cuda::device::current::get().create_stream(
	cuda::stream::default_priority + 1,
	cuda::stream::no_implicit_synchronization_with_default_stream);
my_stream.enqueue_callback(
	[&foo](cuda::stream::id_t stream_id, cuda::status_t status) {
		std::cout << "Hello " << foo << " world!\n";
	}
);

Could I get an example of how to use Managed Memory and a non-default stream? I am using a Jetson Xavier, which has shared GPU and CPU memory. I was initially using zero-copy, but switched to using Managed Memory per the recommendation here:

Then I have a pipeline of kernels based on this recommendation:

Overall, I want the cuda-api-wrappers equivalent of:

cudaMallocManaged( (void**)&input_raw_ptr, num_bytes, cudaMemAttachHost );
cudaMalloc( (void**)&immed_data1_raw_ptr, num_immed_bytes );
cudaMalloc( (void**)&immed_data2_raw_ptr, num_immed_bytes );
cudaMallocManaged( (void**)&output_raw_ptr, num_output_bytes );

// No cudaMemcpy HtoD needed.

my_kernel1<<< blocks_per_grid, threads_per_block, 0, stream >>>( immed_data1_raw_ptr, input_raw_ptr, num_vals );
my_kernel2<<< blocks_per_grid, threads_per_block, 0, stream >>>( immed_data2_raw_ptr, immed_data1_raw_ptr, num_immed_vals );
my_kernel3<<< blocks_per_grid, threads_per_block, 0, stream >>>( output_raw_ptr, immed_data2_raw_ptr, num_output_vals );

cudaStreamAttachMemAsync(stream, output_raw_ptr, 0, cudaMemAttachHost);
cudaStreamSynchronize(stream);

// No cudaMemcpy DtoH needed.

// Use output_raw_ptr data in host code

Hello Hazelnutv04,

I think this merits a separate forum thread, but - I’ll oblige. Let’s assume your data all has type float. Now, you’d write:

namespace cm = cuda::memory;
auto input = cm::managed::make_unique<float[]>(num_vals);
auto intermediate_1 = cm::device::make_unique<float[]>(num_intermediate_vals);
auto intermediate_2 = cm::device::make_unique<float[]>(num_intermediate_vals);
auto output = cm::managed::make_unique<float[]>(num_output_vals);

auto launch_config = cuda::make_launch_config(blocks_per_grid, threads_per_block, cuda::no_shared_memory);
stream.enqueue.kernel_launch(my_kernel1, launch_config, intermediate_1.get(), input.get(), num_vals);
stream.enqueue.kernel_launch(my_kernel2, launch_config, intermediate_2.get(), intermediate_1.get(), num_intermediate_vals);
stream.enqueue.kernel_launch(my_kernel3, launch_config, output.get(), intermediate_2.get(), num_output_vals);

// Whoops! This will only be supported in v0.4.0:
stream.enqueue.memory_attachment(output.get(), cm::managed::attachment_t::host);
// and for now, I only have a method for cudaMemAttachSingle :-( ... so you would need to write:
// cudaStreamAttachMemAsync(stream.id(), output.get(), 0, cudaMemAttachHost);
stream.synchronize();

See this resolved issue about the missing kind of attachment.

Note that, unlike in your example, here you won’t have to free the memory or destroy the stream explicitly. The price is the annoying .get() suffixes when using the unique pointers.
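
Once stream.synchronize() returns, the host can read the managed output buffer directly - no copy-back needed:

auto first_result = output.get()[0]; // managed memory is directly accessible from host code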