Performance Implications of Passing Structs/Classes to a Kernel by Value/Reference

Hi, this seems like an old (and naive?) question, but I cannot find a satisfactory answer online.

Suppose I have a class/struct. Its primary goal is to serve as a utility class handling various operations inside the kernel. It has both host-side and device-side functions. Its members are fairly simple (a couple of integers), and its constructor also takes very simple arguments (say, integers).

My question is whether there are any performance implications between:

  1. Instantiating the object on the host and passing it to the kernel by value/reference.
  2. Instantiating the object on the device, passing the constructor’s arguments to the kernel by value/reference.
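
For concreteness, here is a rough sketch of the two options I mean (the struct name and members are made up for illustration):

```cpp
// Hypothetical utility struct: names and members are illustrative only.
struct Scaler {
    int offset;
    int factor;

    __host__ __device__ Scaler(int o, int f) : offset(o), factor(f) {}
    __host__ __device__ int apply(int x) const { return (x + offset) * factor; }
};

// Option 1: construct on the host, pass the object by value.
__global__ void kernelByValue(Scaler s, int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = s.apply(data[i]);
}

// Option 2: pass the constructor arguments, construct inside the kernel.
__global__ void kernelByArgs(int offset, int factor, int* data, int n) {
    Scaler s(offset, factor);
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = s.apply(data[i]);
}
```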

Unless your kernels are trivially small, there shouldn’t be any important performance difference between choices 1 and 2. Do whichever seems more sensible for your code.

If you have virtual methods, the question takes on a different character, but that is not really related to performance per se.
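
For context, a rough sketch of the distinction (hypothetical class names): an object with virtual functions carries a vtable pointer, and that pointer is only valid on the device if the object was constructed in device code, e.g. inside the kernel itself.

```cpp
// Hypothetical example: classes with virtual methods should be constructed
// on the device if their virtual methods are called from device code.
struct Op {
    __device__ virtual ~Op() {}
    __device__ virtual int apply(int x) const = 0;
};

struct AddOne : Op {
    __device__ int apply(int x) const override { return x + 1; }
};

__global__ void kernelWithVirtual(int* data, int n) {
    AddOne op;  // constructed on the device, so its vtable is valid here
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = op.apply(data[i]);
}
```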

Thank you for the quick response!

As a general rule of thumb, are there any important considerations when passing arbitrary objects to the kernel (aside from virtual methods)?

I’ve never run across any.

It’s understandable to want to head off obvious performance issues before they manifest, such as insufficient GPU occupancy or poor memory coalescing. However, I don’t think there is any such obvious concern here.

In that case, my usual recommendation is to write code that makes sense to you and is maintainable, and then wait to tackle performance until you have something to test and a profiler at the ready.

Understood, thanks again for the helpful answers!

You should take special care if the member variables contain pointers, references, or other special values like system handles.

Could you elaborate on what “special care” we should take?

Simply speaking, pointers, references, etc., to host memory are not magically translated into references to device memory, and accessing them from within the kernel will fault. You need to perform device-accessible deep copies manually if required.
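
As a minimal sketch (struct and variable names are made up), a deep copy typically means allocating device memory for the pointed-to data, copying it over, and patching the struct’s pointer before passing the struct by value:

```cpp
#include <cuda_runtime.h>
#include <vector>

// Hypothetical struct with a pointer member. Passing an instance by value
// copies only the pointer itself, not the data it points to, so the pointer
// must already refer to device-accessible memory when the kernel runs.
struct Table {
    int* values;   // must point to device memory when used in a kernel
    int  count;
};

__global__ void sumKernel(Table t, int* result) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        int s = 0;
        for (int i = 0; i < t.count; ++i) s += t.values[i];
        *result = s;
    }
}

int main() {
    std::vector<int> hostValues = {1, 2, 3, 4};

    // Manual "deep copy": allocate device memory for the pointed-to data,
    // copy it over, and fix up the struct's pointer before launching.
    Table t;
    t.count = static_cast<int>(hostValues.size());
    cudaMalloc(&t.values, t.count * sizeof(int));
    cudaMemcpy(t.values, hostValues.data(), t.count * sizeof(int),
               cudaMemcpyHostToDevice);

    int* dResult;
    cudaMalloc(&dResult, sizeof(int));
    sumKernel<<<1, 1>>>(t, dResult);

    int result = 0;
    cudaMemcpy(&result, dResult, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(t.values);
    cudaFree(dResult);
    return 0;
}
```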

There is also Section 14.5.10.3.1 (__global__ Function Argument Processing) in the CUDA C Programming Guide, which explains some limitations regarding the processing of kernel parameters on the host.

Special system handles, for example mutexes created on the host side, also won’t work on the device side.

It’s best to change your data structure so that it does not use pointers and references. Either store copies of the values directly, or store indices into tables holding all the values. For example, a graph data structure would store an adjacency list or an adjacency matrix using indices, as in the sketch below. Indices keep identity information intact and save memory; copies do not.
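
A rough sketch of that idea, assuming a CSR-style layout with made-up names: all node-to-node references are plain indices into flat arrays, and the only remaining pointers are the ones to those flat device arrays, so the struct itself can be passed by value.

```cpp
// Hypothetical index-based adjacency layout (CSR-style). Instead of storing
// per-node pointers to neighbor lists, store offsets/indices into flat arrays
// that live in device memory.
struct Graph {
    const int* rowStart;   // rowStart[v]..rowStart[v+1] delimits v's neighbors
    const int* neighbors;  // flat list of neighbor indices for all nodes
    int numNodes;
};

__global__ void degreeKernel(Graph g, int* degree) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < g.numNodes) {
        degree[v] = g.rowStart[v + 1] - g.rowStart[v];
    }
}
```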

Got it, thanks both for the explanations!
