Why can't I get the expected result from __nv_aligned_device_malloc?

Hi folks,

I’m working through the samples in the programming guide and wrote a simple kernel to verify that the function in the title returns aligned addresses, but I don’t get the result I expected.
__global__ void kernel_aligned_malloc(uint8_t *ptr, size_t size, size_t align, int loop) {
  uint8_t *dptr;
  auto is_aligned = [&](uint8_t *p) {return ((uint64_t)(p) & (align - 1)) == 0;};
  for (int l = 1; l <= loop; l ++) {
    dptr = (uint8_t*)__nv_aligned_device_malloc(size, align);
    memset(dptr, (l & 0xff), size);
    if (!is_aligned(dptr)) {
      ptr[0] = uint8_t((uint64_t)(dptr) & 0xff);
      free(dptr);
      return;
    }
    memcpy(ptr, dptr, size);
    free(dptr);
  }
}
Then I ran it under cuda-memcheck, which reported the errors below. Did I do something wrong?
========= Malloc/Free error encountered : Double free
========= at 0x00000d40 in __cuda_syscall_mc_dyn_globallock_free
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x7ff8b2dff920

========= Program hit cudaErrorLaunchFailure (error 719) due to “unspecified launch failure” on CUDA API call to cudaMemcpy.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x34fb13]
========= Host Frame:./build/test_heap_memory_aligned_malloc [0x9a845]
========= Host Frame:./build/test_heap_memory_aligned_malloc [0xeb45]
========= Host Frame:./build/test_heap_memory_aligned_malloc [0xdf07]
========= Host Frame:./build/test_heap_memory_aligned_malloc [0xe0a0]
========= Host Frame:./build/test_heap_memory_aligned_malloc [0x3506c]
========= Host Frame:./build/test_heap_memory_aligned_malloc [0x2fe21]
========= Host Frame:./build/test_heap_memory_aligned_malloc [0x14fd6]

Please show the full code.

Thanks, man. By the way, I ran it again today and it passed :)

#include <iostream>
#include <stdint.h>
using namespace std;
__global__ void kernel_aligned_malloc(uint8_t *ptr, size_t size, size_t align, int loop) {
  uint8_t *dptr;
  auto is_aligned = [&](uint8_t *p) {return ((uint64_t)(p) & (align - 1)) == 0;};
  for (int l = 1; l <= loop; l ++) {
    dptr = (uint8_t*)__nv_aligned_device_malloc(size, align);
    memset(dptr, (l & 0xff), size);
    if (!is_aligned(dptr)) {
      ptr[0] = uint8_t((uint64_t)(dptr) & 0xff);
      free(dptr);
      return;
    }
    memcpy(ptr, dptr, size);
    free(dptr);
  }
}

int main() {
  uint8_t *dp;
  cudaMalloc((void**)&dp, sizeof(uint8_t));
  kernel_aligned_malloc<<<1, 1>>>(dp, 16, 16, 2);
  uint8_t h;
  cudaMemcpy(&h, dp, sizeof(uint8_t), cudaMemcpyDeviceToHost);
  cout << "Result = " << (uint32_t)h << endl;
  return 0;
}

You have illegal behavior here:

memcpy(ptr, dptr, size);

You are copying 16 bytes to ptr but you have only allocated 1 byte:

cudaMalloc((void**)&dp, sizeof(uint8_t));

When I fix that issue, your code runs with no runtime errors for me on a cc 7.5 device with CUDA 12.0.
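For reference, here is a minimal sketch of what the fix could look like: the output buffer is allocated with the same number of bytes that the kernel copies into it. The separate `status` flag, the `is_aligned` helper as a free function, the null check on the device allocation, and the host-side error handling are additions for illustration and were not part of the posted code.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// True when p is a multiple of align (align is assumed to be a power of two).
__host__ __device__ inline bool is_aligned(const void *p, size_t align) {
  return (reinterpret_cast<uintptr_t>(p) & (align - 1)) == 0;
}

// status codes (hypothetical, for this sketch only):
// 0 = ok, 1 = device allocation failed, 2 = pointer not aligned
__global__ void kernel_aligned_malloc(uint8_t *out, int *status,
                                      size_t size, size_t align, int loop) {
  *status = 0;
  for (int l = 1; l <= loop; ++l) {
    uint8_t *dptr =
        static_cast<uint8_t *>(__nv_aligned_device_malloc(size, align));
    if (dptr == nullptr) { *status = 1; return; }  // device heap exhausted
    if (!is_aligned(dptr, align)) { *status = 2; free(dptr); return; }
    memset(dptr, l & 0xff, size);
    memcpy(out, dptr, size);  // out must hold at least `size` bytes
    free(dptr);
  }
}

int main() {
  const size_t size = 16, align = 16;
  const int loop = 2;
  uint8_t *d_out = nullptr;
  int *d_status = nullptr;

  // Allocate `size` bytes for the output, not sizeof(uint8_t).
  if (cudaMalloc(&d_out, size) != cudaSuccess ||
      cudaMalloc(&d_status, sizeof(int)) != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed\n");
    return 1;
  }

  kernel_aligned_malloc<<<1, 1>>>(d_out, d_status, size, align, loop);
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  uint8_t h_out[16];
  int h_status = -1;
  cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);
  cudaMemcpy(&h_status, d_status, sizeof(int), cudaMemcpyDeviceToHost);
  printf("status = %d, first byte = %u\n", h_status, (unsigned)h_out[0]);

  cudaFree(d_out);
  cudaFree(d_status);
  return 0;
}
```

Keeping the status in its own variable avoids mixing a diagnostic value with the copied data in `ptr[0]`, which makes a misaligned result easier to tell apart from a normal run.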

If you’re still having trouble after fixing that issue, my first suggestion is to update your CUDA install to 12.0. If you still observe problems after updating to CUDA 12.0, please identify the actual GPU you are running this on, and the compile command line you are using.
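If it helps with collecting that information, a small program along these lines should print the GPU name, its compute capability, and the CUDA runtime and driver versions. This is only a sketch; the `version_major` and `version_minor` helpers are names made up here, and they assume the usual encoding of CUDA version numbers as 1000 * major + 10 * minor.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// CUDA reports versions as 1000 * major + 10 * minor (e.g. 12000 for 12.0).
inline int version_major(int v) { return v / 1000; }
inline int version_minor(int v) { return (v % 1000) / 10; }

int main() {
  int runtime_version = 0, driver_version = 0;
  cudaRuntimeGetVersion(&runtime_version);
  cudaDriverGetVersion(&driver_version);
  printf("CUDA runtime %d.%d, driver supports up to %d.%d\n",
         version_major(runtime_version), version_minor(runtime_version),
         version_major(driver_version), version_minor(driver_version));

  int count = 0;
  cudaError_t err = cudaGetDeviceCount(&count);
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  // Print the name and compute capability of every visible device.
  for (int i = 0; i < count; ++i) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, i) == cudaSuccess) {
      printf("device %d: %s, cc %d.%d\n", i, prop.name, prop.major, prop.minor);
    }
  }
  return 0;
}
```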

Great! Sorry I missed that! Thanks so much!