Overwriting device-side malloc()

Hi!

I want to wrap the device-side malloc() function with a function of the same name to enable additional functionality (like logging) without changing the calling code. However, I have no idea how to reach the original malloc.

For host code, I found solutions like this [1]:

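#define _GNU_SOURCE   /* must come before the includes to expose RTLD_NEXT */
#include <dlfcn.h>
#include <stdio.h>

static void* (*real_malloc)(size_t) = NULL;

/* Look up the next "malloc" in library search order (normally libc's). */
void saveMalloc(){
  real_malloc = (void* (*)(size_t))dlsym(RTLD_NEXT, "malloc");
}

void* malloc(size_t size)
{
  if (real_malloc == NULL) saveMalloc();  /* resolve lazily on first use */
  printf("log");  /* note: printf may itself allocate, so a real hook needs a recursion guard */
  return real_malloc(size);
}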

Is there any way to do something similar for the device-side malloc(size_t)?
I tried something like this:

#include <stdio.h>

__device__ static void* (*real_malloc)(size_t) = malloc;

__device__ void* malloc(size_t size) throw()
{
  printf("Allocating memory on device\n");
  return real_malloc(size);
}

__global__ void test(){
  malloc(32);
}

int main()
{
  test<<<1,1>>>();
  return 0;
}

However, this seems to result in a recursive call:
ptxas warning : Stack size for entry function '_Z4testv' cannot be statically determined

Hi slizzered, the easiest way to hook malloc() on the GPU is to wrap it in a function with a different name, e.g.:

__device__ void* myMalloc(size_t size)
{
   printf("Wrapped malloc\n");
   return malloc(size);
}

__global__ void test()
{
   myMalloc(32);
}

There is currently no support for dynamic linking on the GPU, so the device-side libc functions (malloc(), free(), printf()) are linked differently from how it is done on the CPU. The GPU linking mechanism gives the user no way to intercept the call without changing the call site.
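
If editing every call site by hand is the sticking point, one possible workaround is to let the preprocessor change the call sites for you. The following is only a sketch under a few assumptions: loggedMalloc is a made-up name, the macro is defined after the wrapper so that the malloc() call inside the wrapper is left alone, and the translation unit makes no host-side malloc() calls after the #define (they would be rewritten too and fail to compile, since the wrapper is device-only). It also does nothing for new, which does not go through the macro.

```cuda
#include <cstdio>
#include <cstdlib>

// Hypothetical wrapper: logs the request, then forwards to the built-in
// device-side malloc(). The macro below is not defined yet at this point,
// so this call still reaches the original function.
__device__ void* loggedMalloc(size_t size)
{
    printf("Allocating %llu bytes on device\n", (unsigned long long)size);
    return malloc(size);
}

// From here on, every malloc(...) call in this translation unit is
// rewritten by the preprocessor to call the wrapper instead.
#define malloc(size) loggedMalloc(size)

__global__ void test()
{
    void* p = malloc(32);   // expands to loggedMalloc(32)
    free(p);                // free() is untouched
}

int main()
{
    test<<<1, 1>>>();
    cudaDeviceSynchronize();   // wait so the device printf output is flushed
    return 0;
}
```

The calling code keeps the name malloc, but this is still a source-level change: it only affects code compiled after the #define, and redefining a standard library name as a macro can clash with headers included later.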

What kind of information are you trying to capture with this interception layer? If you are trying to debug something in the device-side malloc() calls, have you tried cuda-memcheck?

Hi vyas!
Unfortunately, debugging/cuda-memcheck is not what this is about. I’m trying to build an abstraction for malloc, so that it becomes possible to switch the underlying memory allocator to a different implementation (e.g. a pool-based allocator). At the same time, there should be a way to use the original malloc() internally.

Thanks for your code! This approach works well, but I had hoped to call my wrapper function “malloc”, so that the calling code doesn’t need to be changed. Also, as I understood it from my testing, overwriting malloc() automagically overwrites operator new and operator new[] as well. Choosing a new name like “myMalloc” will make it impossible for the user of this library to use new in order to get memory from this allocator.

Well, I guess there is not really a more elegant way for now.
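
For the allocator abstraction itself, one option that leaves malloc() untouched is to pass the allocator as a template parameter, so the backend is picked at compile time and the built-in malloc() stays reachable from inside any backend. This is a rough sketch, not a finished design: DeviceHeapAllocator, BumpAllocator, g_pool, g_poolOffset and kPoolBytes are all names invented for illustration, and the pool backend is a simple bump allocator that never reclaims memory. It does not solve the new problem, because objects would still have to be constructed explicitly in memory obtained from the allocator.

```cuda
#include <cstdio>
#include <cstdlib>

// Backend 1: forwards to the built-in device heap.
struct DeviceHeapAllocator {
    __device__ static void* allocate(size_t size) { return malloc(size); }
    __device__ static void  release(void* p)      { free(p); }
};

// Backend 2 (hypothetical): bump allocator over a fixed global buffer.
constexpr unsigned long long kPoolBytes = 1 << 20;
__device__ __align__(16) unsigned char g_pool[kPoolBytes];
__device__ unsigned long long          g_poolOffset = 0;

struct BumpAllocator {
    __device__ static void* allocate(size_t size)
    {
        // Round the request up to 16 bytes so returned pointers stay aligned.
        unsigned long long bytes = (size + 15) & ~15ull;
        // Atomically reserve a slice of the pool for this thread.
        unsigned long long start = atomicAdd(&g_poolOffset, bytes);
        return (start + bytes <= kPoolBytes) ? g_pool + start : nullptr;
    }
    // Individual blocks are never returned to the pool in this sketch.
    __device__ static void release(void*) {}
};

// The calling code names the allocator only through the template parameter.
template <class Alloc>
__global__ void test()
{
    int* p = static_cast<int*>(Alloc::allocate(32));
    if (p != nullptr) {
        p[0] = 42;
        printf("value = %d\n", p[0]);
        Alloc::release(p);
    }
}

int main()
{
    test<DeviceHeapAllocator><<<1, 1>>>();
    test<BumpAllocator><<<1, 1>>>();
    cudaDeviceSynchronize();   // wait so the device printf output is flushed
    return 0;
}
```

Switching the backend is then a one-line change at the kernel launch, at the cost of having to write Alloc::allocate() instead of malloc() in the device code.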