Performance of passing structs to a kernel by value / by reference

Hi all,

I recently started to use unified memory, because it seems to offer the cleanest way to pass any struct containing device pointers to a kernel by reference.
I wrote a small program (later described in more detail) to replicate my current project in the simplest way possible and to measure the performance using unified memory.

The result was that the performance of my test kernel drops by about 20-25% if the struct is passed by reference, more or less independently of the data size. The same happens if I pass the struct via a pointer.
Even though the test kernel may not represent a real scenario, 25% seems like a lot to me, and so far I could not find a flaw in my testing. Of course, passing structs containing pointers by value would end up in a big mess, so I would like to avoid it. I'm using a Pascal Quadro P1000M.

So my question is why that happens, and perhaps how to avoid it, since I could not find anything on this topic.

This is how I measured the performance:
I implemented two structs: one called buffer, which contains a pointer to unified memory, and the other called parent, which contains two buffer objects.

I used a kernel that, for each element in one buffer, assigns the output of a small test function to the corresponding element in the other.

Finally, I compared the performance of passing the struct by value / by reference against the same kernel taking just two raw float arrays of the same size.
Copy constructor and destructor are commented out on purpose.

Many thanks and all the best from Munich!

Here is my code (I just changed back and forth between value, reference and pointer):

#include "cuda_runtime.h"
#include <iostream>
#include <chrono>
class Managed {
public:
  void *operator new(size_t len) {
    void *ptr;
    cudaMallocManaged(&ptr, len);
    cudaDeviceSynchronize();
    return ptr;
  }

  void operator delete(void *ptr) {
    cudaDeviceSynchronize();
    cudaFree(ptr);
  }
};

#define N  100000000
constexpr int block_size = 256;
constexpr int grid_size = (N + block_size -1 ) / block_size;
constexpr int num_iter = 10;

struct buffer : Managed
{

    int m_size;
    float* m_pointer;
    buffer(): m_size(0), m_pointer(nullptr) {}
    buffer(int size): m_size(size), m_pointer(nullptr)
    {
        cudaMallocManaged(&m_pointer, m_size * sizeof(float));
    }

    // buffer (const buffer &rhs) {
    //     m_size = rhs.m_size;
    //     cudaMallocManaged(&m_pointer, m_size*sizeof(float));
    //     memcpy(m_pointer, rhs.m_pointer, m_size*sizeof(float));
    //   }

    ~buffer()
    {
       // cudaFree(m_pointer);
    }
};

struct parent : Managed
{
    buffer buffer1;
    buffer buffer2;
    parent(int size): buffer1(size),buffer2(size)  {}
};

__device__ __forceinline__ float test_func(float x)
{
    return sqrtf(x) / expf(x) * 10 + 100;
}

__global__ void kernel(float *arr, float *arr_dst, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) { return; }
    arr_dst[idx] = test_func(arr[idx]);
}

__global__ void kernel(parent& instance)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x ;
    if (idx >= N){return;}

    instance.buffer1.m_pointer[idx] = test_func(instance.buffer2.m_pointer[idx]) ;
}

int main()
{

    parent* buf = new parent(N);
    float *arr, *arr_dst;
    cudaMallocManaged(&arr, N * sizeof(float));
    cudaMallocManaged(&arr_dst, N * sizeof(float));

    kernel<<<grid_size,block_size>>>(*buf);
    kernel<<<grid_size,block_size>>>(arr, arr_dst, N);
    cudaDeviceSynchronize();

    auto start = std::chrono::high_resolution_clock::now();
    for ( int i = 0; i< num_iter; ++i){
         kernel<<<grid_size,block_size>>>(arr,arr_dst, N);
         cudaDeviceSynchronize();
    }
    
    auto stop = std::chrono::high_resolution_clock::now();
    auto elapsed =  std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();

    start = std::chrono::high_resolution_clock::now();
   for ( int i = 0; i< num_iter; ++i){
       kernel<<<grid_size,block_size>>>(*buf);
       cudaDeviceSynchronize();
   }
    stop = std::chrono::high_resolution_clock::now();
   auto elapsed1 =  std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();

std::cout << "struct by value/reference "<< elapsed1 <<  " arrays "<< elapsed << " " << (double)elapsed1 / (double)elapsed << std::endl;
    cudaFree(arr);
    cudaFree(arr_dst);
    cudaFree(buf->buffer1.m_pointer);
    cudaFree(buf->buffer2.m_pointer);
    cudaFree(buf);
    return 0;
}

A side note on performance measurement: you should change how you synchronize in the loop to:

for (int i = 0; i < num_iter; ++i){
    kernel<<<grid_size,block_size>>>(*buf);
}
cudaDeviceSynchronize();

Otherwise you might/will be measuring synchronization overhead as well, which is not the point.

You should also consider adding CUDA event timers for actually timing the kernel execution time. I'm guessing that with your current timers you will capture [kernel time] + [host/device transfer time].
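
Roughly something like this (just a sketch, reusing the names from your code above):

cudaEvent_t start_ev, stop_ev;
cudaEventCreate(&start_ev);
cudaEventCreate(&stop_ev);

cudaEventRecord(start_ev);
for (int i = 0; i < num_iter; ++i){
    kernel<<<grid_size,block_size>>>(*buf);
}
cudaEventRecord(stop_ev);
cudaEventSynchronize(stop_ev);            // wait once, after the last launch

float ms = 0.0f;
cudaEventElapsedTime(&ms, start_ev, stop_ev);   // elapsed GPU time in milliseconds

cudaEventDestroy(start_ev);
cudaEventDestroy(stop_ev);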

Hi Jimmy,

thanks for your fast reply, you are totally right.
I put the synchronization outside the loop and used nvprof, with 1000 iterations and 10 million threads.
The result is still the same:

pass by value:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 50.01% 1.77032s 1001 1.7686ms 1.7575ms 6.6504ms kernel(float*, float*, int)
GPU activities: 49.99% 1.76977s 1001 1.7680ms 1.7542ms 10.032ms kernel(parent)

by reference:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 54.96% 2.15780s 1001 2.1556ms 2.1377ms 12.535ms kernel(parent&)
GPU activities: 45.04% 1.76839s 1001 1.7666ms 1.7511ms 6.3279ms kernel(float*, float*, int)

via pointer:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 54.90% 2.15301s 1001 2.1509ms 2.1368ms 7.9893ms kernel(parent*)
GPU activities: 45.10% 1.76837s 1001 1.7666ms 1.7554ms 6.9761ms kernel(float*, float*, int)

In the general case, when you switch code from cudaMalloc/cudaMemcpy to just cudaMallocManaged, in a Pascal-or-later UM regime, you have switched to demand-paged transfer of data (as opposed to bulk or en masse transfer of data).

This pushes the cost of transferring the data into the kernel duration. It also can become a less efficient way to transfer the data, depending on code specifics.

So your observations are typical/not surprising.

This blog article discusses this and also talks about possible ways to adjust behavior:

https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/

The key takeaway is the use of cudaMemPrefetchAsync to “restore” the original behavior/efficiency of cudaMalloc/cudaMemcpy.

Note that in the case where you overload new and potentially have a bunch of UM-allocated objects, the process of attempting to move everything to the GPU to avoid the inefficiencies associated with demand-paged transfer may be less practical.
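
For example, a minimal sketch against the allocations in the code above (current device assumed):

int dev = 0;
cudaGetDevice(&dev);
// prefetch the managed allocations to the GPU before the timed launches,
// so the transfer cost is not paid via demand paging inside the kernel
cudaMemPrefetchAsync(arr,     N * sizeof(float), dev);
cudaMemPrefetchAsync(arr_dst, N * sizeof(float), dev);
cudaMemPrefetchAsync(buf, sizeof(parent), dev);                        // the struct itself
cudaMemPrefetchAsync(buf->buffer1.m_pointer, N * sizeof(float), dev);
cudaMemPrefetchAsync(buf->buffer2.m_pointer, N * sizeof(float), dev);
cudaDeviceSynchronize();   // make sure the prefetches have completed before timing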

Thanks for your answer!
I tried to compensate for the memory transfer by launching the kernel once before I start to measure the time. I thought this would prevent the data transfer for all subsequent launches?
Sadly I'm bound to Windows on this project, which gave me an error when I tried to use cudaMemPrefetchAsync(), which apparently is only available on Linux right now?

I did not see an increase in execution time when I switched from “normal” to unified memory (the float arrays are also unified), only when I switched from passing by value to passing by reference/pointer.

all the best,
Glenn

OK, I was assuming Linux. Windows does not support demand paging in CUDA 9/10.

So even though you have a Pascal device, you are in a pre-Pascal UM “regime”. That means data transfer will be en masse, at the point of the kernel call. If you have previously called cudaDeviceSynchronize(), for example after a previous kernel call, then the data transfer will still be incurred at the point of the next kernel call.

The only way to use data that has already been migrated would be to call your kernel as a warm-up, then call the kernel again, without any intervening cudaDeviceSynchronize()

It’s more difficult to use host-based timing to see what is going on here, so the usual suggestion would be to explore this with a profiler.

“The only way to use data that has already been migrated would be to call your kernel as a warm-up, then call the kernel again, without any intervening cudaDeviceSynchronize()” that is exactly what I did.

This is roughly the execution order, when I last measured the time:

kernel<<<grid_size,block_size>>>(*buf);
kernel<<<grid_size,block_size>>>(arr, arr_dst, N);
cudaDeviceSynchronize();

for ( int i = 0; i < num_iter; ++i){
    kernel<<<grid_size,block_size>>>(arr, arr_dst, N);
}
cudaDeviceSynchronize();

"It’s more difficult to use host-based timing to see what is going on here, so the usual suggestion would be to explore this with a profiler. "

The timings I provided in my second post were created by nvprof:
pass by value:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 50.01% 1.77032s 1001 1.7686ms 1.7575ms 6.6504ms kernel(float*, float*, int)
GPU activities: 49.99% 1.76977s 1001 1.7680ms 1.7542ms 10.032ms kernel(parent)

by reference:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 54.96% 2.15780s 1001 2.1556ms 2.1377ms 12.535ms kernel(parent&)
GPU activities: 45.04% 1.76839s 1001 1.7666ms 1.7511ms 6.3279ms kernel(float*, float*, int)

At the moment I'm not able to explain the discrepancy. As the profiler shows, the performance difference is in the kernel execution itself. I'm able to reproduce the result when running on a K20; I see about a 20% difference in kernel execution time between the two cases, which is approximately the difference depicted above.

I’ve done a bit of study of the SASS code. Overall the SASS code for both kernels looks quite similar. I’m not able to discover what the key difference may be.

A curious observation I made is that if I run the code on a V100, the performance difference between the two kernels drops to about 2-3%, where the struct-by-reference kernel takes slightly longer than the raw-pointer kernel. This seems plausible to me, as the struct-by-reference case may introduce an extra pointer lookup/dereference.

It's possible that there is an architecture-dependent code generation issue here, but I'm not able to spot it with a few minutes of study of the SASS code. It's evident that the floating point routines for sqrtf and expf are the same, as these are broken out by the compiler as subroutines with a call/return structure (albeit “inlined”).

As an aside, the code posted in the first posting in this thread doesn’t compile unless some trivial changes are made.

Interesting profiling data, there may be a clue here:

$ CUDA_VISIBLE_DEVICES="1" nvprof --metrics gld_transactions,gst_transactions ./t1507
==25550== NVPROF is profiling process 25550, command: ./t1507
==25550== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "kernel(parent&, int)" (done)
Replaying kernel "kernel(float*, float*, int)" (done)
Replaying kernel "kernel(float*, float*, int)" (done)
Replaying kernel "kernel(float*, float*, int)" (done)
Replaying kernel "kernel(parent&, int)" (done)
Replaying kernel "kernel(parent&, int)" (done)
struct by value/reference 538669 arrays 528877 1.01851
==25550== Profiling application: ./t1507
==25550== Profiling result:
==25550== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla K20Xm (0)"
    Kernel: kernel(parent&, int)
          3                          gld_transactions                  Global Load Transactions     9375000     9375000     9375000
          3                          gst_transactions                 Global Store Transactions     3125000     3125000     3125000
    Kernel: kernel(float*, float*, int)
          3                          gld_transactions                  Global Load Transactions     3125000     3125000     3125000
          3                          gst_transactions                 Global Store Transactions     3125000     3125000     3125000
$ CUDA_VISIBLE_DEVICES="0" nvprof --metrics gld_transactions,gst_transactions ./t1507
==25571== NVPROF is profiling process 25571, command: ./t1507
==25571== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "kernel(parent&, int)" (done)
Replaying kernel "kernel(float*, float*, int)" (done)
Replaying kernel "kernel(float*, float*, int)" (done)
Replaying kernel "kernel(float*, float*, int)" (done)
Replaying kernel "kernel(parent&, int)" (done)
Replaying kernel "kernel(parent&, int)" (done)
struct by value/reference 62084 arrays 63050 0.984679
==25571== Profiling application: ./t1507
==25571== Profiling result:
==25571== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Tesla V100-PCIE-32GB (0)"
    Kernel: kernel(parent&, int)
          3                          gld_transactions                  Global Load Transactions    18750000    18750000    18750000
          3                          gst_transactions                 Global Store Transactions    12500000    12500000    12500000
    Kernel: kernel(float*, float*, int)
          3                          gld_transactions                  Global Load Transactions    12500000    12500000    12500000
          3                          gst_transactions                 Global Store Transactions    12500000    12500000    12500000
$

Thank you for the detailed answer!

OK, if anything, my assumption was also that dereferencing might play a role, but I thought this would somehow be dealt with by the compiler.

“A curious observation I made is that if I run the code on a V100, the performance difference between the two kernels drops to about 2-3%” - The V100 also has a much higher memory bandwidth. Could that also lead to a reduced impact?

" Interesting profiling data, there may be a clue here:"
So you are suggesting that the additional gld operations are due to loading the pointer from global memory?

I think my main problem is that I cannot estimate how much performance loss can be expected in a “real world” kernel. As you said, if the kernel is more complex the difference vanishes.
However, in my case (I'm doing something similar to gvdb, where I might have to traverse down a tree and fetch a node from global memory at each step) I don't know how that impacts performance.
Do you know whether the dereferencing-induced load has to be done every time I access the buffer, or just once?
Moreover, what happens if I have more than just two “layers” of classes? That should not lead to longer run times, right?

What would you do in terms of using by reference/ by value?

" As an aside, the code posted in the first posting in this thread doesn’t compile unless some trivial changes are made. " - sorry for that by the way

All the best and many thanks from Munich,
Glenn

For me the ratio of gld's is even higher (5:1, compared to 3:1). I'm really confused as to why there are so many more load operations. Naively, I would think that maybe one additional load operation is needed, or maybe even fewer:

Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Quadro P1000 (0)"
    Kernel: kernel(parent&)
          1                          gld_transactions                  Global Load Transactions   250000002   250000002   250000002
          1                          gst_transactions                 Global Store Transactions    12500000    12500000    12500000
    Kernel: kernel(buffer&, buffer&)
          1                          gld_transactions                  Global Load Transactions   250000002   250000002   250000002
          1                          gst_transactions                 Global Store Transactions    12500000    12500000    12500000
    Kernel: kernel(float*, float*, int)
          1                          gld_transactions                  Global Load Transactions    50000002    50000002    50000002
          1                          gst_transactions                 Global Store Transactions    12500000    12500000    12500000

I set up another kernel, where N_load consecutive elements are summed up:

__global__ void kernel(float *arr,float* arr_dst, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x ;
    if (idx >= N-N_load){return;}

    float sum = 0;
    for(int i = 0; i< N_load; ++i){
        sum +=test_func(arr[idx+i]) ;
    }
    arr_dst[idx] = sum;

}

__global__ void kernel(parent& instance)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x ;
    if (idx >= N-N_load){return;}

    float sum = 0;
    for(int i = 0; i< N_load; ++i){
        sum += test_func(instance.buffer2.m_pointer[idx+i]) ;

    }
    instance.buffer1.m_pointer[idx] = sum;
}

__global__ void kernel(buffer& instance, buffer& instance_dst)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x ;
    if (idx >= N-N_load){return;}

    float sum = 0;
    for(int i = 0; i< N_load; ++i){
       sum+= test_func(instance.m_pointer[idx+i]) ;
    }
    instance_dst.m_pointer[idx] = sum;
    
}

Now the effect diminishes (N_load = 5):

Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "Quadro P1000 (0)"
    Kernel: kernel(parent&)
          1                          gld_transactions                  Global Load Transactions    45000002    45000002    45000002
          1                          gst_transactions                 Global Store Transactions     1250000     1250000     1250000
    Kernel: kernel(float*, float*, int)
          1                          gld_transactions                  Global Load Transactions    25000002    25000002    25000002
          1                          gst_transactions                 Global Store Transactions     1250000     1250000     1250000

and the timings are:

Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   50.02%  6.56994s      1000  6.5699ms  6.5615ms  6.7048ms  kernel(float*, float*, int)
                   49.98%  6.56597s      1000  6.5660ms  6.5603ms  6.5721ms  kernel(parent&)

If the kernel characteristics don't change, I would expect memory bandwidth differences to have a ratiometric effect, i.e. the 20% difference would be preserved. But the profiling data suggests to me that the kernel characteristics (number of load operations per element of data) DO change. I suspect this is the proximal reason for the performance difference, but I have not proven that.

No, I'm suggesting that the reason for the performance difference may be due to some fundamental kernel characteristic in the compiled code that causes multiple load operations per element of input data in the “slower” (non-V100) case. I don't have any data that connects this with pointer chasing yet; that would be purely speculation.

I understand you would like a crisp answer, but I don't have it at this point. It's possible that the compiler is doing something unexpected, and/or it's possible that I simply haven't recognized what is going on and it is to be expected. If the compiler is doing something unexpected, it might be considered a bug. I wouldn't normally expect that the simple act of wrapping kernel parameters in a struct would have a 20% performance impact, and the V100 test suggests to me it doesn't have to be that way.

The options at this point from what I can see are:

  1. continue the analysis
  2. file a bug, and hope someone else will do the analysis for you

I’m not able to provide reliable advice at this point to explain how to navigate this situation, with the exception that passing raw pointers doesn’t seem to be burdened by this variability, so if that is an option, and you are interested in not having to worry about this kind of variability, then that would appear to be a solution at this juncture. It’s not immediately obvious to me why you would need to wrap kernel arguments in a struct, but I think that usually arises as a matter of programming convenience.

OK, now I get it. Interesting, and again many thanks for your commitment :)

Since a V100 is used for benchmarking my routines anyway, I might just be able to ignore this for now and file a bug.

“It’s not immediately obvious to me why you would need to wrap kernel arguments in a struct, but I think that usually arises as a matter of programming convenience.”
So far I’ve just passed the pointers to the actual data as arguments.

In my case I have many layers of abstraction: Topology -> level -> node -> childid.
If I launch a kernel passing the actual data, I have to jump down all layers of abstraction, and within the kernel I have to use free functions to handle the data.
I'm afraid that no one will ever be able to understand, let alone maintain, this ;).
To me it seems much more convenient to pass the topology to the kernel and, e.g., descend level by level to get the corresponding nodes, etc.

After some discussion with a colleague, it may be that the architectural differences in V100 mean the extra cost of referencing the kernel “parameters” in managed memory (by-reference case) vs. bare pointers in constant memory (by-value case) is simply smaller there, i.e. the extra cost is less of an issue (latency and L1 cache differences) on V100 vs. older architectures, and the performance difference on V100 can be explained that way.

I don’t know how to connect that with the profiler observation that differing ratios of global loads to global stores are generated in the non-V100 case (3:1) vs. the V100 case (1.5:1). It seems that should be discoverable by inspection of the SASS, but I’ve not had time to drill into it. It may be a red herring.
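
One thing that might be worth trying (just a sketch, untested here, and the compiler may already be doing this on its own) is to read the pointers out of the managed struct once at the top of the kernel, so that all subsequent accesses go through plain local pointers:

__global__ void kernel(parent& instance)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) { return; }

    // one dereference of the managed-memory struct per thread...
    float* src = instance.buffer2.m_pointer;
    float* dst = instance.buffer1.m_pointer;

    // ...then only raw-pointer accesses, as in the float*/float* kernel
    dst[idx] = test_func(src[idx]);
}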

Very interesting!
Do you know if this only happens with unified memory or could the same situation also be happening with non-unified memory?

I have filed a bug in the meantime, hoping that someone might have a clue as to what is happening.
I'm also surprised that this seems to be the first time someone has come across this issue.

That's interesting, because I am also using structs to store kernel parameters for readability purposes (sometimes I have to pass 3 or 4 images to a kernel, with width / height / pitch and a pointer to memory for each).

I am using cudaMalloc and not cudaMallocManaged in the struct constructor, though.

Currently the project is not mature enough to start this kind of optimization test, but I will do that in the future and let you know the results here.

I am working with CUDA 10 on linux with RTX 5000 / RTX 2080Ti and T4 GPUs.

Just a basic question: I am using a separate function to free the device memory of my structs, because if I do it in the destructor, the destructor also ends up being called in device code. Is there a way to avoid the call of a struct's destructor in device functions (like declaring it with a host qualifier)? Letting the CPU handle the lifetime of those structs would have been pretty neat.

Hi Romain,

that's a good question, and I was also looking into it. As far as I know this does not work, the reason being that the destructor of the struct is called on the host side anyway at the end of the kernel call. Please correct me if I'm mistaken.

all the best,
Glenn

Hello Clarence, in C++ the destructor of a class/struct is automatically called at the end of the object's lifetime.

There is an internal reference counting system that determines whether an object is no longer used.

To “keep an object alive” across scopes you can use smart pointers, which also handle the object's lifetime correctly before calling the destructor at the end.

Passing a struct by ref or const ref to a GPU kernel function should not trigger its destructor, because it comes from an outer scope.

Passing a struct by value should trigger its destructor at the end of the kernel scope, because a copy of the struct is made for the kernel and belongs to its scope.

[EDIT] What I observed is that passing a struct by value does not trigger its constructor, but it does trigger its destructor. I didn't test passing by ref, because I read a note about not passing pointers by ref to kernels (my struct has a pointer to allocated device memory).
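
One common way around this (just a sketch with hypothetical names, not tested against your code) is to keep the owning, RAII-style struct host-only and pass a separate, trivially copyable “view” struct to the kernel, so there is no constructor/destructor involved on the device side at all:

#include "cuda_runtime.h"

struct image_view {            // trivially copyable: no constructor/destructor
    float* data;
    int width, height;
    size_t pitch;              // row size in bytes
};

struct image_owner {           // host-side RAII owner, never passed to a kernel
    image_view view;
    image_owner(int w, int h) {
        view.width  = w;
        view.height = h;
        view.pitch  = w * sizeof(float);
        cudaMalloc(&view.data, h * view.pitch);
    }
    ~image_owner() { cudaFree(view.data); }
};

__global__ void process(image_view img)   // by value: cheap copy, no destructor concerns
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < img.width * img.height)
        img.data[idx] *= 2.0f;
}

// usage on the host: image_owner img(1024, 768);  process<<<grid, block>>>(img.view);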