malloc isn't found when used in a header file

Hello there,

I’ve been trying to port a parallel search algorithm that requires dynamic memory allocation to CUDA. Thus I wanted to apply the device malloc function in CUDA 3.2 (explained in B.15.1). I am right now developing on a macbook pro that doesn’t have the required Fermi hardware, but on my desktop I will be able to run it. At any rate, the example malloc code in the programming manual does compile with the following command line on my laptop:

sirius:test malfunct$ nvcc -I/sw/include -arch compute_20 malloc_test.cu -o malloc_test

And I assume it will work on the GTX 480. However, there are two problems, first the printf function (explained in B.14 of the C programming manual) isn’t found, so I had to comment it out:

sirius:test malfunct$ cat malloc_test.cu 

//-*-c++-*-x

__device__ __host__ void mallocTest2()

{

  char* ptr = (char*)malloc(123); 

}

__global__ void mallocTest()

{

  char* ptr = (char*)malloc(123);

  mallocTest2();

  //printf("Thread %d got pointer: %p\n", threadIdx.x, ptr);

}

int main()

{

  // Set a heap size of 128 megabytes. Note that this must

  // be done before any kernel is launched.

  cudaThreadSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);

  mallocTest<<<1, 5>>>();

  cudaThreadSynchronize();

  return 0;

}

By the way, the version of the nvcc I am using is:

sirius:test malfunct$ nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2010 NVIDIA Corporation

Built on Thu_Nov_11_15:26:50_PST_2010

Cuda compilation tools, release 3.2, V0.2.1221

The second, and bigger problem is, when I use the malloc function in a header file called dynarray.h, which is itself included from yet other header files, I get the following errors:

nvcc -I/sw/include -arch compute_20 syntax_tree.cu -o test_syntaxtree

./dynarray.h(175): error: identifier "malloc" is undefined

./dynarray.h(164): error: identifier "malloc" is undefined

          detected during:

            instantiation of "void Dynarray<A>::init_vec(int) [with A=char]" 

(24): here

            instantiation of "Dynarray<A>::Dynarray() [with A=char]" 

./string.h(16): here

2 errors detected in the compilation of "/tmp/tmpxft_00014dd8_00000000-4_syntax_tree.cpp1.ii".

make: *** [test_syntaxtree] Error 2

which is unexpected behavior to me. The functions in question were qualified as device host, however the mallocTest2() function I added to the example malloc code is also qualified like that and it does not give rise to any problems. When I copy that function to dynarray.h, however, the compiler complains that malloc is undefined. Here is the function defined, with the malloc call in line 175 of dynarray.h:

__device__ __host__ void mallocTest3()

{

  char* ptr = (char*)malloc(123); 

}

The verbose output of the failing compiler driver invocation:

sirius:cuda malfunct$ nvcc -I/sw/include -arch compute_20 syntax_tree.cu -o test_syntaxtree --verbose

#$ _SPACE_= 

#$ _CUDART_=cudart

#$ _HERE_=/usr/local/cuda/bin

#$ _THERE_=/usr/local/cuda/bin

#$ _TARGET_SIZE_=

#$ TOP=/usr/local/cuda/bin/..

#$ PATH=/usr/local/cuda/bin/../open64/bin:/usr/local/cuda/bin:/Library/Frameworks/Python.framework/Versions/2.6/bin:/usr/local/cuda/bin:/sw/bin:/sw/sbin:/usr/local/bin:/Users/malfunct/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:/usr/X11R6/bin

#$ INCLUDES="-I/usr/local/cuda/bin/../include"  

#$ LIBRARIES=  "-L/usr/local/cuda/bin/../lib" -lcudart

#$ CUDAFE_FLAGS=

#$ OPENCC_FLAGS=

#$ PTXAS_FLAGS=

#$ gcc -D__CUDA_ARCH__=200 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS  "-I/usr/local/cuda/bin/../include"   -I. -D__CUDACC__ -C  -I"/sw/include" -include "cuda_runtime.h" -m32 -malign-double -o "/tmp/tmpxft_00014e5e_00000000-4_syntax_tree.cpp1.ii" "syntax_tree.cu" 

#$ cudafe --m32 --gnu_version=40201 -tused --no_remove_unneeded_entities  --gen_c_file_name "/tmp/tmpxft_00014e5e_00000000-1_syntax_tree.cudafe1.c" --stub_file_name "/tmp/tmpxft_00014e5e_00000000-1_syntax_tree.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_00014e5e_00000000-1_syntax_tree.cudafe1.gpu" --include_file_name "/tmp/tmpxft_00014e5e_00000000-3_syntax_tree.fatbin.c" "/tmp/tmpxft_00014e5e_00000000-4_syntax_tree.cpp1.ii" 

./dynarray.h(175): error: identifier "malloc" is undefined

./dynarray.h(164): error: identifier "malloc" is undefined

          detected during:

            instantiation of "void Dynarray<A>::init_vec(int) [with A=char]" 

(24): here

            instantiation of "Dynarray<A>::Dynarray() [with A=char]" 

./string.h(16): here

2 errors detected in the compilation of "/tmp/tmpxft_00014e5e_00000000-4_syntax_tree.cpp1.ii".

# --error 0x2 --

Could you please help me resolve these problems? I am hoping perhaps this is due to my inexperience with the nvcc toolkit. Thanks in advance!

Best Regards,

Eray Ozkural

PS: I didn’t consider it an OS X issue, but given that I haven’t tried it on linux yet, it might as well be. If you feel that is the case please remove this post or move it to the OS X section.

Use of malloc() requires a

#include <stdlib.h>

at the top of the file. Similarly, use of printf() requires a

#include <stdio.h>

Thank you for your quick reply. Much appreciated.

Ah ok so their use inside a kernel function isn’t enabled by nvcc specific include’s. This isn’t clear in the relevant manual section though. And when I do that, that is uncomment the stdlib include in my code, the malloc example does compile, however my program fails me in this manner:

sirius:cuda malfunct$ nvcc -I/sw/include -arch compute_20 syntax_tree.cu -o test_syntaxtree

./dynarray.h(176): error: calling a host function from a __device__/__global__ function is not allowed

1 error detected in the compilation of "/tmp/tmpxft_00017caf_00000000-4_syntax_tree.cpp1.ii".

this error seems incorrect, though, because the called function is malloc, which the compiler isn’t able to detect, it seems. line 176 is the one with malloc call, and stdlib.h is included now in the beginning of the same dynarray.h file.

__device__ __host__ void mallocTest3()

{

  char* ptr = (char*)malloc(123); 

}

Which was what led me to report this bug in the first place. Can you please suggest me to try out a few things so we can diagnose the source of this erroneous behavior? It’s sort of critical that I can run this, I have gone through hoops trying to adapt thrust and g++'s stdlib implementation for this purpose. :)

Best Regards,

Eray Ozkural

I am afraid this is getting quickly into areas where I lack the necessary expertise. With my suggested fix, the previously posted sample program compiles and runs fine on a C2050/M2050 (see below).

RHEL 64 bit


~/[…]/r3.2 $ nvcc -arch compute_20 malloc_test.cu -o malloc_test

~/[…]/r3.2 $ ./malloc_test

Thread 0 got pointer: 0xf20831fe20

Thread 1 got pointer: 0xf20831fec0

Thread 2 got pointer: 0xf20831ff60

Thread 3 got pointer: 0xf2083197c0

Thread 4 got pointer: 0xf208319720

WinXP64


C:[…]\apps>nvcc -arch compute_20 malloc_test.cu -o malloc_test

C:[…]\apps>malloc_test

Thread 0 got pointer: 000000020841FE20

Thread 1 got pointer: 000000020841FEC0

Thread 2 got pointer: 000000020841FF60

Thread 3 got pointer: 00000002084197C0

Thread 4 got pointer: 0000000208419720

#include <stdio.h>

#include <stdlib.h>

__device__ __host__ void mallocTest2()

{  

    char* ptr = (char*)malloc(123); 

}

__global__ void mallocTest()

{  

    char* ptr = (char*)malloc(123);  

    mallocTest2();  

    printf("Thread %d got pointer: %p\n", threadIdx.x, ptr);

}

int main(void)

{  

    // Set a heap size of 128 megabytes. Note that this must  

    // be done before any kernel is launched.  

    cudaThreadSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);  

    mallocTest<<<1, 5>>>();  

    cudaThreadSynchronize();  

    return 0;

}

Hi there,

As I said, the said malloc test program does compile, it compiles with or without including stdlib.h! (It compiles correctly with no error messages in both cases)

On the other hand, the program that I wrote, that includes the very same mallocTest2 function as the test program fails with a nonsensical error message, regardless of whether I include stdlib.h or not. (It fails in both cases with different wrong error messages) Can you suggest me some compiler switches or how to track down this error? Or should I make a minimal test case and post it here? (I unfortunately can’t post the entire code right now)

Where exactly is the device malloc defined for instance? Could this be a simple C preprocessor error? In a similar compiler for an accel. that I was helping to develop, such errors could occur at any stage of the compiler driver so I wouldn’t be surprised :) However, I honestly couldn’t find my way around this one, and I do need some guidance.

Best,

sirius:cuda malfunct$ grep malloc -r /usr/local/cuda/include/

/usr/local/cuda/include/common_functions.h:extern _CRTIMP __host__ __device__ void*   __cdecl malloc(size_t) __THROW;

Please note that this is where the malloc free and printf functions are overridden.

sirius:cuda malfunct$ grep common\_functions -r /usr/local/cuda/include/

/usr/local/cuda/include/crt/host_runtime.h:#include "common_functions.h"

/usr/local/cuda/include/cuda_runtime.h:#include "common_functions.h"

Now, since the compiler driver automatically includes “cuda_runtime.h” (or seems to in --verbose mode) I don’t think I should have to #include anything, right?

#if defined(__cplusplus) && defined(__CUDACC__)

....

#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 200

#include <stdio.h>

#include <stdlib.h>

extern "C"

{

/*DEVICE_BUILTIN*/

extern _CRTIMP __host__ __device__ int     __cdecl printf(const char*, ...);

extern _CRTIMP __host__ __device__ void*   __cdecl malloc(size_t) __THROW;

extern _CRTIMP __host__ __device__ void    __cdecl free(void*) __THROW;

}

#endif /* __CUDA_ARCH__ && __CUDA_ARCH__ >= 200 */

#endif /* __cplusplus && __CUDACC__ */

Is it possible that any of these switches would be inactive in a header file that is indirectly loaded (like my dynarray.h) causing the compiler to blurt the error messages? It seems that the “cuda_runtime.h” file is included for some files, and not for some others. Can this be possible? If it is, I think it’s a bug, the expected behavior is that the same NVCC includes must be present in every file that is processed in a uniform fashion.

Best Regards,

Eray Ozkural

Actually, you do; CUDA_ARCH is only defined in device code. In general, the best practice (as you’d expect in C code just as in CUDA C) is to explicitly include the system headers you need, even if some other header file might in some cases include them on your behalf.

For what it’s worth, I tested several variations of your example code here (including a modified version that calls malloc from a .h file that is include’d by this .cu file), and it worked as expected on my Linux box, the same as it did for Norbert, as long as I include’d stdio.h and stdlib.h. It also compiled successfully on my Mac, but I can’t test it on that box since it doesn’t have a Fermi GPU in it at the moment.

If you could provide us with a small example that reproduces the problem you’re seeing, that would be very helpful. Meanwhile, we have filed a documentation bug here to better explain the expectations for system headers and their interactions with device-code syscalls.

Thanks,

Cliff
foo6.h (117 Bytes)
foo6.cu (372 Bytes)

My problem may have something to do with leftover files or with some kind of leftover state when two .cu sources for different executables reside on the same directory. Now after I separated the build of malloc_test.cu elsewhere and carefully cleaned up the source dirs, I can compile my second test code by including the standard C headers. Which I would expect to be mentioned in the documentation as you suggest. If I can make a testcase for the leftover-state bug, I will try to report it separately.

Thanks for your help! It’s much appreciated.

No problem. Do let us know if you manage to reproduce the issue. All of the intermediate files are normally deleted between invocations of nvcc, so I’ll be very curious to see what’s going on here. Ideally, you could file the bug yourself from our registered developer site and then let me know what the bug # is so that I can help you track it.

Thanks,

Cliff

I have reproduced the issue this time with a printf call in another header file of the same project (this time a file called string.h), but not in an isolated test case yet. However, when I give the --keep flag, the preprocessed file reveals that the common_functions.h include is included after string.h. This looks like a bug? And I can’t resolve it even by trying to include <cuda_runtime.h> before the problematic inline function definition. I think that the nvcc preprocessor is trying to determine where C standard includes are included so that it can override them with defs from cuda_runtime.h, or otherwise it is fiddling with the include order somehow, which is not generally working. If I can make a test case for it, I will file a bug.

I guess this might also be related to enabling separate compilation in nvcc. Somehow the preprocessing behavior diverges from gcc behavior, that is the problem.

Best,

Eray