malloc isn't found when used in a header file

examachine · December 15, 2010, 7:17pm

Hello there,

I’ve been trying to port a parallel search algorithm that requires dynamic memory allocation to CUDA. Thus I wanted to apply the device malloc function in CUDA 3.2 (explained in B.15.1). I am right now developing on a macbook pro that doesn’t have the required Fermi hardware, but on my desktop I will be able to run it. At any rate, the example malloc code in the programming manual does compile with the following command line on my laptop:

sirius:test malfunct$ nvcc -I/sw/include -arch compute_20 malloc_test.cu -o malloc_test

And I assume it will work on the GTX 480. However, there are two problems, first the printf function (explained in B.14 of the C programming manual) isn’t found, so I had to comment it out:

sirius:test malfunct$ cat malloc_test.cu 

//-*-c++-*-x

__device__ __host__ void mallocTest2()

{

  char* ptr = (char*)malloc(123); 

}

__global__ void mallocTest()

{

  char* ptr = (char*)malloc(123);

  mallocTest2();

  //printf("Thread %d got pointer: %p\n", threadIdx.x, ptr);

}

int main()

{

  // Set a heap size of 128 megabytes. Note that this must

  // be done before any kernel is launched.

  cudaThreadSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);

  mallocTest<<<1, 5>>>();

  cudaThreadSynchronize();

  return 0;

}

By the way, the version of the nvcc I am using is:

sirius:test malfunct$ nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2010 NVIDIA Corporation

Built on Thu_Nov_11_15:26:50_PST_2010

Cuda compilation tools, release 3.2, V0.2.1221

The second, and bigger problem is, when I use the malloc function in a header file called dynarray.h, which is itself included from yet other header files, I get the following errors:

nvcc -I/sw/include -arch compute_20 syntax_tree.cu -o test_syntaxtree

./dynarray.h(175): error: identifier "malloc" is undefined

./dynarray.h(164): error: identifier "malloc" is undefined

          detected during:

            instantiation of "void Dynarray<A>::init_vec(int) [with A=char]" 

(24): here

            instantiation of "Dynarray<A>::Dynarray() [with A=char]" 

./string.h(16): here

2 errors detected in the compilation of "/tmp/tmpxft_00014dd8_00000000-4_syntax_tree.cpp1.ii".

make: *** [test_syntaxtree] Error 2

which is unexpected behavior to me. The functions in question were qualified as device host, however the mallocTest2() function I added to the example malloc code is also qualified like that and it does not give rise to any problems. When I copy that function to dynarray.h, however, the compiler complains that malloc is undefined. Here is the function defined, with the malloc call in line 175 of dynarray.h:

__device__ __host__ void mallocTest3()

{

  char* ptr = (char*)malloc(123); 

}

The verbose output of the failing compiler driver invocation:

sirius:cuda malfunct$ nvcc -I/sw/include -arch compute_20 syntax_tree.cu -o test_syntaxtree --verbose

#$ _SPACE_= 

#$ _CUDART_=cudart

#$ _HERE_=/usr/local/cuda/bin

#$ _THERE_=/usr/local/cuda/bin

#$ _TARGET_SIZE_=

#$ TOP=/usr/local/cuda/bin/..

#$ PATH=/usr/local/cuda/bin/../open64/bin:/usr/local/cuda/bin:/Library/Frameworks/Python.framework/Versions/2.6/bin:/usr/local/cuda/bin:/sw/bin:/sw/sbin:/usr/local/bin:/Users/malfunct/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:/usr/X11R6/bin

#$ INCLUDES="-I/usr/local/cuda/bin/../include"  

#$ LIBRARIES=  "-L/usr/local/cuda/bin/../lib" -lcudart

#$ CUDAFE_FLAGS=

#$ OPENCC_FLAGS=

#$ PTXAS_FLAGS=

#$ gcc -D__CUDA_ARCH__=200 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS  "-I/usr/local/cuda/bin/../include"   -I. -D__CUDACC__ -C  -I"/sw/include" -include "cuda_runtime.h" -m32 -malign-double -o "/tmp/tmpxft_00014e5e_00000000-4_syntax_tree.cpp1.ii" "syntax_tree.cu" 

#$ cudafe --m32 --gnu_version=40201 -tused --no_remove_unneeded_entities  --gen_c_file_name "/tmp/tmpxft_00014e5e_00000000-1_syntax_tree.cudafe1.c" --stub_file_name "/tmp/tmpxft_00014e5e_00000000-1_syntax_tree.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_00014e5e_00000000-1_syntax_tree.cudafe1.gpu" --include_file_name "/tmp/tmpxft_00014e5e_00000000-3_syntax_tree.fatbin.c" "/tmp/tmpxft_00014e5e_00000000-4_syntax_tree.cpp1.ii" 

./dynarray.h(175): error: identifier "malloc" is undefined

./dynarray.h(164): error: identifier "malloc" is undefined

          detected during:

            instantiation of "void Dynarray<A>::init_vec(int) [with A=char]" 

(24): here

            instantiation of "Dynarray<A>::Dynarray() [with A=char]" 

./string.h(16): here

2 errors detected in the compilation of "/tmp/tmpxft_00014e5e_00000000-4_syntax_tree.cpp1.ii".

# --error 0x2 --

Could you please help me resolve these problems? I am hoping perhaps this is due to my inexperience with the nvcc toolkit. Thanks in advance!

Best Regards,

–

Eray Ozkural

PS: I didn’t consider it an OS X issue, but given that I haven’t tried it on linux yet, it might as well be. If you feel that is the case please remove this post or move it to the OS X section.

njuffa · December 15, 2010, 9:03pm

Use of malloc() requires a

#include <stdlib.h>

at the top of the file. Similarly, use of printf() requires a

#include <stdio.h>

examachine · December 16, 2010, 2:27am

Thank you for your quick reply. Much appreciated.

Ah ok so their use inside a kernel function isn’t enabled by nvcc specific include’s. This isn’t clear in the relevant manual section though. And when I do that, that is uncomment the stdlib include in my code, the malloc example does compile, however my program fails me in this manner:

sirius:cuda malfunct$ nvcc -I/sw/include -arch compute_20 syntax_tree.cu -o test_syntaxtree

./dynarray.h(176): error: calling a host function from a __device__/__global__ function is not allowed

1 error detected in the compilation of "/tmp/tmpxft_00017caf_00000000-4_syntax_tree.cpp1.ii".

this error seems incorrect, though, because the called function is malloc, which the compiler isn’t able to detect, it seems. line 176 is the one with malloc call, and stdlib.h is included now in the beginning of the same dynarray.h file.

__device__ __host__ void mallocTest3()

{

  char* ptr = (char*)malloc(123); 

}

Which was what led me to report this bug in the first place. Can you please suggest me to try out a few things so we can diagnose the source of this erroneous behavior? It’s sort of critical that I can run this, I have gone through hoops trying to adapt thrust and g++'s stdlib implementation for this purpose. :)

Best Regards,

–

Eray Ozkural

njuffa · December 16, 2010, 3:54am

I am afraid this is getting quickly into areas where I lack the necessary expertise. With my suggested fix, the previously posted sample program compiles and runs fine on a C2050/M2050 (see below).

RHEL 64 bit

~/[…]/r3.2 $ nvcc -arch compute_20 malloc_test.cu -o malloc_test

~/[…]/r3.2 $ ./malloc_test

Thread 0 got pointer: 0xf20831fe20

Thread 1 got pointer: 0xf20831fec0

Thread 2 got pointer: 0xf20831ff60

Thread 3 got pointer: 0xf2083197c0

Thread 4 got pointer: 0xf208319720

WinXP64

C:[…]\apps>nvcc -arch compute_20 malloc_test.cu -o malloc_test

C:[…]\apps>malloc_test

Thread 0 got pointer: 000000020841FE20

Thread 1 got pointer: 000000020841FEC0

Thread 2 got pointer: 000000020841FF60

Thread 3 got pointer: 00000002084197C0

Thread 4 got pointer: 0000000208419720

#include <stdio.h>

#include <stdlib.h>

__device__ __host__ void mallocTest2()

{  

    char* ptr = (char*)malloc(123); 

}

__global__ void mallocTest()

{  

    char* ptr = (char*)malloc(123);  

    mallocTest2();  

    printf("Thread %d got pointer: %p\n", threadIdx.x, ptr);

}

int main(void)

{  

    // Set a heap size of 128 megabytes. Note that this must  

    // be done before any kernel is launched.  

    cudaThreadSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);  

    mallocTest<<<1, 5>>>();  

    cudaThreadSynchronize();  

    return 0;

}

examachine · December 16, 2010, 10:03am

Hi there,

As I said, the said malloc test program does compile, it compiles with or without including stdlib.h! (It compiles correctly with no error messages in both cases)

On the other hand, the program that I wrote, that includes the very same mallocTest2 function as the test program fails with a nonsensical error message, regardless of whether I include stdlib.h or not. (It fails in both cases with different wrong error messages) Can you suggest me some compiler switches or how to track down this error? Or should I make a minimal test case and post it here? (I unfortunately can’t post the entire code right now)

Where exactly is the device malloc defined for instance? Could this be a simple C preprocessor error? In a similar compiler for an accel. that I was helping to develop, such errors could occur at any stage of the compiler driver so I wouldn’t be surprised :) However, I honestly couldn’t find my way around this one, and I do need some guidance.

Best,

I am afraid this is getting quickly into areas where I lack the necessary expertise. With my suggested fix, the previously posted sample program compiles and runs fine on a C2050/M2050 (see below).

RHEL 64 bit

~/[…]/r3.2 $ nvcc -arch compute_20 malloc_test.cu -o malloc_test

~/[…]/r3.2 $ ./malloc_test

Thread 0 got pointer: 0xf20831fe20

Thread 1 got pointer: 0xf20831fec0

Thread 2 got pointer: 0xf20831ff60

Thread 3 got pointer: 0xf2083197c0

Thread 4 got pointer: 0xf208319720

WinXP64

C:[…]\apps>nvcc -arch compute_20 malloc_test.cu -o malloc_test

C:[…]\apps>malloc_test

Thread 0 got pointer: 000000020841FE20

Thread 1 got pointer: 000000020841FEC0

Thread 2 got pointer: 000000020841FF60

Thread 3 got pointer: 00000002084197C0

Thread 4 got pointer: 0000000208419720
#include <stdio.h>

#include <stdlib.h>

__device__ __host__ void mallocTest2()

{  

    char* ptr = (char*)malloc(123); 

}

__global__ void mallocTest()

{  

    char* ptr = (char*)malloc(123);  

    mallocTest2();  

    printf("Thread %d got pointer: %p\n", threadIdx.x, ptr);

}

int main(void)

{  

    // Set a heap size of 128 megabytes. Note that this must  

    // be done before any kernel is launched.  

    cudaThreadSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);  

    mallocTest<<<1, 5>>>();  

    cudaThreadSynchronize();  

    return 0;

}

examachine · December 16, 2010, 10:17am

sirius:cuda malfunct$ grep malloc -r /usr/local/cuda/include/

/usr/local/cuda/include/common_functions.h:extern _CRTIMP __host__ __device__ void*   __cdecl malloc(size_t) __THROW;

Please note that this is where the malloc free and printf functions are overridden.

sirius:cuda malfunct$ grep common\_functions -r /usr/local/cuda/include/

/usr/local/cuda/include/crt/host_runtime.h:#include "common_functions.h"

/usr/local/cuda/include/cuda_runtime.h:#include "common_functions.h"

Now, since the compiler driver automatically includes “cuda_runtime.h” (or seems to in --verbose mode) I don’t think I should have to #include anything, right?

#if defined(__cplusplus) && defined(__CUDACC__)

....

#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 200

#include <stdio.h>

#include <stdlib.h>

extern "C"

{

/*DEVICE_BUILTIN*/

extern _CRTIMP __host__ __device__ int     __cdecl printf(const char*, ...);

extern _CRTIMP __host__ __device__ void*   __cdecl malloc(size_t) __THROW;

extern _CRTIMP __host__ __device__ void    __cdecl free(void*) __THROW;

}

#endif /* __CUDA_ARCH__ && __CUDA_ARCH__ >= 200 */

#endif /* __cplusplus && __CUDACC__ */

Is it possible that any of these switches would be inactive in a header file that is indirectly loaded (like my dynarray.h) causing the compiler to blurt the error messages? It seems that the “cuda_runtime.h” file is included for some files, and not for some others. Can this be possible? If it is, I think it’s a bug, the expected behavior is that the same NVCC includes must be present in every file that is processed in a uniform fashion.

Best Regards,

–

Eray Ozkural

Cliff_Woolley · December 16, 2010, 9:28pm

Now, since the compiler driver automatically includes “cuda_runtime.h” (or seems to in --verbose mode) I don’t think I should have to include anything, right?
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 200

#include <stdio.h>

#include <stdlib.h>

Actually, you do; CUDA_ARCH is only defined in device code. In general, the best practice (as you’d expect in C code just as in CUDA C) is to explicitly include the system headers you need, even if some other header file might in some cases include them on your behalf.

For what it’s worth, I tested several variations of your example code here (including a modified version that calls malloc from a .h file that is include’d by this .cu file), and it worked as expected on my Linux box, the same as it did for Norbert, as long as I include’d stdio.h and stdlib.h. It also compiled successfully on my Mac, but I can’t test it on that box since it doesn’t have a Fermi GPU in it at the moment.

If you could provide us with a small example that reproduces the problem you’re seeing, that would be very helpful. Meanwhile, we have filed a documentation bug here to better explain the expectations for system headers and their interactions with device-code syscalls.

Thanks,

Cliff
foo6.h (117 Bytes)
foo6.cu (372 Bytes)

examachine · December 16, 2010, 9:52pm

My problem may have something to do with leftover files or with some kind of leftover state when two .cu sources for different executables reside on the same directory. Now after I separated the build of malloc_test.cu elsewhere and carefully cleaned up the source dirs, I can compile my second test code by including the standard C headers. Which I would expect to be mentioned in the documentation as you suggest. If I can make a testcase for the leftover-state bug, I will try to report it separately.

Thanks for your help! It’s much appreciated.

Cliff_Woolley · December 16, 2010, 9:54pm

No problem. Do let us know if you manage to reproduce the issue. All of the intermediate files are normally deleted between invocations of nvcc, so I’ll be very curious to see what’s going on here. Ideally, you could file the bug yourself from our registered developer site and then let me know what the bug # is so that I can help you track it.

Thanks,

Cliff

examachine · December 19, 2010, 8:48pm

I have reproduced the issue this time with a printf call in another header file of the same project (this time a file called string.h), but not in an isolated test case yet. However, when I give the --keep flag, the preprocessed file reveals that the common_functions.h include is included after string.h. This looks like a bug? And I can’t resolve it even by trying to include <cuda_runtime.h> before the problematic inline function definition. I think that the nvcc preprocessor is trying to determine where C standard includes are included so that it can override them with defs from cuda_runtime.h, or otherwise it is fiddling with the include order somehow, which is not generally working. If I can make a test case for it, I will file a bug.

I guess this might also be related to enabling separate compilation in nvcc. Somehow the preprocessing behavior diverges from gcc behavior, that is the problem.

Best,

Eray

Topic		Replies	Views
Can we do malloc inside a __global__ function CUDA Programming and Performance	26	9628	February 21, 2010
NVCC forces c++ compilation of .cu files CUDA Programming and Performance	11	25452	December 11, 2011
Nvcc compiler options on Eclipse CUDA Programming and Performance	12	25904	June 2, 2010
Ubuntu 20.04, GCC 9.3, Cuda Toolkit 11.3 - not a supported combination? CUDA Programming and Performance	11	8676	November 4, 2021
CUDA compile trouble CUDA Programming and Performance	47	5104	November 8, 2010
Compiling error: undefined reference to `cudaMalloc' CUDA Programming and Performance	13	39183	November 2, 2022
CUDA and autotools configure.ac, makefile.am, ... CUDA Programming and Performance	20	21506	November 29, 2012
Build Error MSB3721 When calling object method within kernel, using compiler directives CUDA Programming and Performance	9	5713	November 18, 2015
malloc memory in kernel linked via in/out variable CUDA Programming and Performance	10	1918	October 17, 2015
Is it thread-safe to malloc in threads of a kernel function? CUDA Programming and Performance	7	2265	December 8, 2017

malloc isn't found when used in a header file

Related topics