__CUDA_ARCH__ in object methods not working

Hello everyone,

I am trying to write a C++ wrapper for a global device memory buffer. In the constructor I allocate memory using cudaMalloc, and in the destructor I want to free the memory only if the destructor is called on the host side. I read about the __CUDA_ARCH__ macro and tried to apply it to my code. In free functions the macro works perfectly fine, and the #warning output during compilation shows that both branches are compiled as expected.
But when the program is executed, it behaves as if the preprocessor did something wrong when an object method is called.

Here is the code I am compiling:

#include <cstdio>

class cuda_buffer {
public:
  size_t n = 0;
  float* dev_buffer = nullptr;

  cuda_buffer(size_t _n);
  __host__ __device__ ~cuda_buffer();
};

/// allocate memory on gpu
cuda_buffer::cuda_buffer(size_t _n) : n{_n} {
  cudaMalloc(&dev_buffer, sizeof(float) * n);
}

/// free memory when object is deleted on host but do not free when on device
/// FIXME __CUDA_ARCH__ seems not to work in methods of classes
__host__ __device__ cuda_buffer::~cuda_buffer() {
#if defined(__CUDA_ARCH__)
#warning compiling device destructor
  printf("do not free on device %i\n", __CUDA_ARCH__);
#else
#warning compiling host destructor
  printf("free on host\n");
  cudaFree(dev_buffer);
  dev_buffer = nullptr;
#endif
}

/// __CUDA_ARCH__ works fine in free functions
__host__ __device__ void foo(int i) {
#if defined(__CUDA_ARCH__)
#warning compiling device foo
  printf("foo on device %i\n", __CUDA_ARCH__);
#else
#warning compiling host foo
  printf("foo on host %i\n", i);
#endif
}

__global__ void kernel(cuda_buffer copy) {
  foo(0);
  // copy gets destroyed here; host code is executed on device
}

int main() {
  cuda_buffer original(8);
  kernel<<<3, 1>>>(original); // foo gets called three times and correctly uses device code
  cudaDeviceSynchronize();
  foo(2); // foo gets called on host and uses host code correctly
  // original gets destroyed here; host code is executed as wished
}

This is the compiler output:

cuda_arch.cu:22:2: warning: #warning compiling device destructor [-Wcpp]
   #warning compiling device destructor
cuda_arch.cu:35:2: warning: #warning compiling device foo [-Wcpp]
   #warning compiling device foo
cuda_arch.cu:25:2: warning: #warning compiling host destructor [-Wcpp]
   #warning compiling host destructor
cuda_arch.cu:38:2: warning: #warning compiling host foo [-Wcpp]
   #warning compiling host foo

This is what the program prints:

free on host
foo on device 300
foo on device 300
foo on device 300
foo on host 2
free on host

The first line should be “do not free on device 300”.

And some information about my system:

nvcc --version:
  nvcc: NVIDIA (R) Cuda compiler driver
  Copyright (c) 2005-2019 NVIDIA Corporation
  Built on Sun_Jul_28_19:07:16_PDT_2019
  Cuda compilation tools, release 10.1, V10.1.243

uname -r

compiled with the command
  nvcc cuda_arch.cu -O0

Is this a known issue or am I doing something wrong?

In C++, when you pass an object by-value to a function, a copy of that object is made for use by the function. This copy is made before the function body begins executing. Therefore, from a scoping perspective, the copy is made outside the function body. When the function completes/returns (after the function body completes, and the scope for that function body is terminated), the destructor for that object-copy is called. This is the way C++ chooses to implement pass-by-value for an object.

In your code, when you call the kernel, a copy of the object original is made for use by the kernel function. This object-copy is constructed using the object’s copy-constructor. Since you have not provided an explicit copy-constructor, a default copy-constructor is used (which does not contain cudaMalloc, of course). When the kernel function completes, the destructor is called on this object-copy. This object-copy is made in host code (outside the kernel function body), and therefore the destructor called on it is called from host code, and follows the host-code path in your explicit destructor. That path includes a call to cudaFree on the pointer that was originally returned by cudaMalloc when the constructor was called for original. Therefore that pointer (contained in the object-copy) is freed, at the point of kernel completion. This invocation of the destructor is what is printing the “free on host” message that appears first in your output.

Later in your code, the object original goes out of scope. At that point, your destructor is called on that object, and it prints the final “free on host” message that appears in your output. However there is a problem, which will be evident if you run your code with cuda-memcheck. The final invocation of the destructor is calling cudaFree on the same pointer that was returned by cudaMalloc (the pointer values are numerically the same between original and the object-copy, due to the behavior of the copy-constructor). This is illegal, since that pointer has already been freed, and cudaFree returns an error code at that point, which you are ignoring.

In terms of the behavior here, it is essentially all determined by C++ object behavior. CUDA is really just following C++ behavior, treating the kernel as a function call.

I always recommend that anytime someone is having trouble with a CUDA code, that they should implement proper CUDA error checking, and also run their code with cuda-memcheck, preferably before asking others for help. The error returned by cudaFree may have been a clue for you. Even if it is not, providing that information may be useful for those trying to help you.

Hi Robert,

Thanks for your response.

Maybe I should add that this is just a minimal example of a larger code base. I am aware of that double free. To overcome that issue I tried to use the #ifdef in the destructor. The double free would of course also occur if one created more “host copies” that got freed after main has finished.

It is intended that I use the default copy constructor, because I want the same pointer in the device function and not a deep copy: in device code the same buffer must be writable and copyable back to the host. If one created a deep copy, the “output” of the kernel (the changed buffer) would never be visible, because the deep copy would be destroyed right after the kernel has finished.

What I want to achieve is some sort of smart RAII behavior:
for the host: allocate memory on the device and free it when the scope of the object ends (and also make deep copies, but for this minimal example that does not matter);
for the device: create flat copies and do not free when the scope ends.

In my understanding, the copy object in my minimal example would be created on the device, since copy is only visible in the kernel function.

Is there a way to accomplish the behavior I want, i.e. to create a device copy instead of a host copy?

The object copy is made in host code.

One possible method: you can pass the object by pointer. Allocate space for the object on the device using cudaMalloc, copy the object from host to device using cudaMemcpy, then pass the pointer to the device object copy to your kernel and use that in your kernel code.
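A minimal CUDA sketch of this pass-by-pointer method (illustrative only; error checking is abbreviated, and the constructor/destructor are stripped from cuda_buffer to keep the focus on the pointer passing). Because the kernel receives a pointer, no object copy is constructed or destroyed by the launch itself:

```cpp
#include <cstdio>

class cuda_buffer {  // minimal version of the class above, without ctor/dtor
public:
  size_t n = 0;
  float* dev_buffer = nullptr;
};

__global__ void kernel(cuda_buffer* buf) {
  buf->dev_buffer[threadIdx.x] = 1.0f;  // use the buffer through the pointer
}

int main() {
  cuda_buffer host_obj;
  host_obj.n = 8;
  cudaMalloc(&host_obj.dev_buffer, sizeof(float) * host_obj.n);

  cuda_buffer* dev_obj = nullptr;       // device-side copy of the object itself
  cudaMalloc(&dev_obj, sizeof(cuda_buffer));
  cudaMemcpy(dev_obj, &host_obj, sizeof(cuda_buffer), cudaMemcpyHostToDevice);

  kernel<<<1, 8>>>(dev_obj);            // no pass-by-value copy, no destructor call
  cudaDeviceSynchronize();

  cudaFree(dev_obj);
  cudaFree(host_obj.dev_buffer);        // freed exactly once, on the host
  return 0;
}
```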

I think you could also create the object on the host using managed memory, and pass the object by reference to the device.

At the risk of repeating myself, this same problem would occur in ordinary CPU-only code, if you created a similar realization (an object copy would be made, and you would get a double free(), for example. Right?) If you solve the problem for that usage, that may instruct you how to solve it here.