About Using Tuples in Device Code

I am building a generic interface of standardized parallel algorithms within our own group, including stream compaction and reordering. People may ask why I want to do this when thrust already exists. To clarify: my understanding is that thrust uses cudaMalloc(Async), which is not ideal for product development where throughput matters. In each GPU project, a memory pool has to be implemented to avoid repeated round trips to the GPU driver, and I would like to use our own memory pool when implementing this interface. Please feel free to correct me if my understanding of thrust is incorrect.

It is common sense (I hope it is indeed common sense) that in GPU projects we prefer a structure of arrays (SoA) over an array of structures (AoS). With that layout, we need to pass a tuple of pointers into a CUDA kernel, which is why I need a tuple data structure in my implementation. This is where I got confused: can I use std::tuple directly in a CUDA kernel/device function? I know there is a cuda::std library, which also has a tuple implementation. I had been treating cuda::std as a mirror of std, until my compiler complained the following:

../../Isrc/lithas/cuda/cuda_mrc_util.h(479): error: namespace "cuda::std" has no member "tuple_cat"

../../Isrc/lithas/cuda/cuda_mrc_util.h(479): error: namespace "cuda::std" has no member "make_tuple"

from the following function:

template <typename InsertionType, typename TupleType>
__host__ __device__ auto append_to_end(const InsertionType& elem, const TupleType& tuple)
{
  return cuda::std::tuple_cat(tuple, cuda::std::make_tuple(elem));
}

Then I noticed that by replacing cuda::std with std, as follows:

template <typename InsertionType, typename TupleType>
__host__ __device__ auto append_to_end(const InsertionType& elem, const TupleType& tuple)
{
  return std::tuple_cat(tuple, std::make_tuple(elem));
}

I was able to build successfully. Here is why I am confused: how much can I trust std for CUDA operations on C++ containers? When should I use cuda::std, and what is the difference between cuda::std and std?

I don’t have any trouble using cuda::std::make_tuple() or cuda::std::tuple_cat() in device code in CUDA 11.2 or newer.
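For reference, a minimal self-contained test along these lines builds and runs for me (this is just a sketch of the kind of test I mean, not necessarily the exact code behind the godbolt link):

#include <cuda/std/tuple>
#include <cstdio>

template <typename InsertionType, typename TupleType>
__host__ __device__ auto append_to_end(const InsertionType& elem, const TupleType& tuple)
{
  return cuda::std::tuple_cat(tuple, cuda::std::make_tuple(elem));
}

__global__ void test_kernel()
{
  // build a tuple<int, double>, then append a float, all in device code
  auto t = append_to_end(3.0f, cuda::std::make_tuple(1, 2.0));
  printf("%d %f %f\n", cuda::std::get<0>(t), cuda::std::get<1>(t), cuda::std::get<2>(t));
}

int main()
{
  test_kernel<<<1, 1>>>();
  return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}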

NVIDIA doesn’t really provide a list of “what seems to work” from the C++ standard library. The general statement is given here (the standard library is not supported in device code unless there is a stated exception, and I don’t think you’ll find a stated exception for std::tuple), but obviously various things “seem to work”. It’s not limited to tuples; other aspects of the standard library “seem to work” in CUDA device code. But with respect to std::tuple specifically, see here.


Whether or not a standard library item works in device code is probably going to depend to some degree on how the standard library headers look and what sort of code they result in. If it is all templated code that makes no calls into precompiled libraries, it's more likely to “just work” (because it results in ordinary C++ code that the device compiler can handle). If it makes a call into a precompiled library, I think there is no chance of it working. And since that sort of thing is all implementation detail, it cannot be relied on to remain the same from one version of the library to the next. So there is probably little point in spelling out what “seems to work” with respect to the standard library.

Even that description is not air-tight: things that you might think surely must call into a compiled library sometimes end up working.
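As a rough illustration of that distinction (my own sketch, not an official statement of what is supported; note that the std calls below are constexpr host functions, so using them in device code also relies on nvcc's --expt-relaxed-constexpr flag):

#include <tuple>
#include <string>

// compile with: nvcc --expt-relaxed-constexpr ...
__global__ void illustrate()
{
  // Header-only, templated, constexpr code: the device compiler sees
  // ordinary C++ it can handle, so this tends to "just work" even
  // though it is not officially supported.
  auto t = std::tuple_cat(std::make_tuple(1), std::make_tuple(2.0f));
  (void)t;

  // Something like the line below ultimately calls into the precompiled
  // host standard library, so there is no realistic chance of it
  // working in device code:
  // std::string s = std::to_string(42);
}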

It is your choice what to use, of course.

This is what I got:

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0

Great, did you try compiling the code I put in the godbolt link, verbatim, in your setup? If so, what result did you get?

If you got a failure, you may have a corrupted or invalid environment.

I would argue that thrust is perfectly suited for product development. It is part of the official CUDA Toolkit, tested, and actively maintained. Its algorithms are fast, and it is used in other NVIDIA products such as cuDF, to name just one.

Plain thrust can have some limitations regarding unwanted synchronization. However, there are workarounds which come with thrust directly, or which are provided by other nvidia libraries.

  • Unnecessary synchronization of algorithms can be avoided by using the thrust::cuda::par_nosync execution policy
  • Unnecessary synchronization from device vector construction can be avoided by using a different container library
  • (Synchronization-) overhead caused by memory allocation can be tackled by using a custom allocator (see the sketch after the RMM example below)

Specifically, RMM provides stream-ordered memory resources and a stream-ordered uninitialized device vector.

rmm::mr::device_memory_resource* mr = ... //could be for example a binned allocator which uses different memory resources for different allocation sizes
rmm::device_uvector<int> d_vec(1024, stream, mr); //elements are uninitialized, no extra kernel, no stream / device sync
...
thrust::sort(
   rmm::exec_policy_nosync(stream, mr), // the equivalent of thrust::cuda::par_nosync with rmm memory
   d_vec.begin(), d_vec.end());
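On the custom-allocator point: even without RMM, plain thrust lets you route its internal temporary allocations (e.g. the sort's scratch space) through your own pool by passing an allocator into the execution policy, following the same pattern as thrust's custom_temporary_allocation example. A minimal sketch, where get_from_pool / return_to_pool stand in for a hypothetical in-house pool API:

#include <thrust/system/cuda/execution_policy.h>
#include <thrust/sort.h>
#include <cstddef>

// Hypothetical in-house pool API (placeholders, not a real library).
void* get_from_pool(std::size_t num_bytes);
void  return_to_pool(void* ptr);

// Minimal allocator that hands thrust its temporary storage from the pool
// instead of letting it call cudaMalloc internally.
struct pool_allocator
{
  using value_type = char;

  char* allocate(std::ptrdiff_t num_bytes)
  {
    return static_cast<char*>(get_from_pool(static_cast<std::size_t>(num_bytes)));
  }

  void deallocate(char* ptr, std::size_t)
  {
    return_to_pool(ptr);
  }
};

void sort_with_pool(int* d_data, int n, cudaStream_t stream)
{
  pool_allocator alloc;
  // The allocator passed via the execution policy is used for thrust's
  // internal temporary allocations; .on(stream) selects the stream
  // (available in reasonably recent thrust versions).
  thrust::sort(thrust::cuda::par(alloc).on(stream), d_data, d_data + n);
}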


Thanks for your input. I am glad that people are correcting my view of thrust, which is what I wanted.

My only concern with thrust is its memory allocator: all I want to avoid is allocating memory through calls into the GPU driver. In my past experience we used a pre-allocated memory pool, and every allocation through our API simply hands back a chunk of that existing pool. In that sense, I personally like cub, as it is very flexible about memory allocation: the first call to a device routine returns a required buffer size, and it is the user's responsibility to provide a pointer to a buffer of at least that size.
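For context, this is the cub pattern I mean, sketched with cub::DeviceScan as an example; pool_alloc / pool_free stand in for our in-house pool API and are hypothetical names:

#include <cub/cub.cuh>
#include <cstddef>

// Hypothetical in-house pool API (placeholders).
void* pool_alloc(std::size_t num_bytes);
void  pool_free(void* ptr);

void exclusive_sum_with_pool(const int* d_in, int* d_out, int num_items, cudaStream_t stream)
{
  // First call: d_temp_storage == nullptr, so cub only reports the
  // required temporary buffer size in temp_storage_bytes.
  void*       d_temp_storage     = nullptr;
  std::size_t temp_storage_bytes = 0;
  cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes,
                                d_in, d_out, num_items, stream);

  // Take the buffer from the pre-allocated pool instead of cudaMalloc.
  d_temp_storage = pool_alloc(temp_storage_bytes);

  // Second call: does the actual work using the caller-provided buffer.
  cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes,
                                d_in, d_out, num_items, stream);

  pool_free(d_temp_storage);
}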

Is rmm::mr::device_memory_resource* mr the way to plug in an existing allocator that uses a memory pool? If that is the case, then presumably there is an allocate member function that gets called inside the thrust implementation to return a device pointer? To be honest, I really doubt this. The purpose of providing a stream is to have the memory ready before kernel launch; there is no need for such a sync if the memory has already been allocated. In that sense, I really doubt that thrust supports a user-provided memory pool.

From my profiling experience, when kernels are highly optimized and a good GPU infrastructure is in place (to improve overall GPU utilization), cudaMalloc(Async) is the most time-consuming step of an application, and repeated calls to it can destroy the overall throughput of a heterogeneous system (think of software systems running in a fab: throughput directly impacts yield).

I hope my opinion of thrust is wrong, because then I won’t have to write a whole interface of my own…


Somehow, after reading the doc, I may need to give thrust a second thought. Let me do some tests and come back with my conclusion.

Just verified that our in-house CUDA version is 11.0. Asking for an upgrade. Thanks!