I call malloc in each thread of a kernel function, but I notice that a buffer allocated earlier by malloc becomes corrupted after a new allocation by malloc. The pattern written over the old buffer is distinctive, which makes me suspect the corruption is caused by malloc itself: a value is written at the head of every 80 bytes, and these values increase by 0x10000000.
I set a breakpoint at the malloc call. While the program is paused at the breakpoint, the data is still correct. After I single-step over the malloc, the data is corrupted (please see the second image).
Is it safe to call malloc in each thread of a kernel? The new malloc returns a reasonable address (not 0), so I think the heap is not full. I am sure the size of each buffer is small (<100 B) and the number of active threads is few (<30).
I tried setting a hardware watchpoint to catch writes to that address, but nothing happened.
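For reference, a minimal sketch of the setup described above (the buffer size, the grid shape, and the tag pattern are assumptions for illustration, not the actual code):

```cuda
#include <cstdio>

// Illustrative sketch only: each thread allocates a small buffer with
// device-side malloc and writes a recognizable value to its head, so that
// later corruption of that buffer is easy to spot in the debugger.
__global__ void allocPattern()
{
    const size_t bufSize = 80;  // assumed per-thread size (<100 B as stated)
    unsigned int *buf = (unsigned int *)malloc(bufSize);
    if (buf == NULL) return;    // always check a device-side malloc result
    buf[0] = 0xABCD0000u + threadIdx.x;  // assumed tag value
    // ... use the buffer, possibly across further malloc calls ...
    free(buf);
}

int main()
{
    allocPattern<<<1, 30>>>();  // few active threads, as in the report
    cudaDeviceSynchronize();
    return 0;
}
```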
It should be safe to use in-kernel malloc in a multi-threaded way. If you have a short, complete example code that demonstrates the corruption, others may be able to help you. It may be the case that you have some other defect in your code giving rise to the corruption.
You should also make sure you are not running out of device heap space, but in that case malloc will return a NULL pointer.
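A short sketch of both checks mentioned above: enlarging the device heap before any kernel launch, and testing the pointer returned by device-side malloc (the sizes here are arbitrary examples).

```cuda
#include <cstdio>

__global__ void useHeap()
{
    void *p = malloc(64);
    if (p == NULL) {            // out of device heap space
        printf("thread %d: malloc failed\n", threadIdx.x);
        return;
    }
    free(p);
}

int main()
{
    // The device heap defaults to 8 MB; raise the limit before the first
    // kernel launch if the aggregate of per-thread allocations could exceed it.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 32 * 1024 * 1024);
    useHeap<<<1, 30>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Note that cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...) must be called before any kernel that uses the device heap has been launched.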
Hi txbob,
Thanks. I will try to simplify my program.
I provide some screenshots to describe the issue.
I use breakpoints to monitor the sequence, so no other code should have executed in between that could cause the corruption, right?
The kernel fail() tries to allocate much more memory than is available on the GPU (each thread allocates 128 KB, and there are 2**26 threads, for a total in the TB range). Therefore many of the device-side malloc() calls will fail. But the pointer returned by malloc() is used without checking. That can’t be good. What do you expect to happen?
Your figure for the number of simultaneously executing threads is incorrect. But to make the example more direct, the following version decreases the allocation size and checks the pointers - same result.
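Since the reduced version itself is not shown here, this is a hedged sketch of what it might look like: a much smaller per-thread allocation, with the returned pointer checked before use. fail() is the kernel name from the memcheck log below; the body and launch configuration are assumptions.

```cuda
#include <cstdio>

// Sketch of the reduced repro: small allocation, pointer checked before use.
__global__ void fail()
{
    char *buf = (char *)malloc(128);   // reduced from 128 KB
    if (buf == NULL) return;           // bail on allocation failure
    buf[0] = 1;                        // touch the buffer
    free(buf);
}

int main()
{
    fail<<<1024, 256>>>();             // launch configuration is an assumption
    cudaDeviceSynchronize();
    return 0;
}
```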
Given that it works on the K20, might it depend on the compute capability? I can’t see how the second version of my code could generate any error at all: no matter the heap size, it should just bail on NULLs now.
cuda-memcheck:
========= CUDA-MEMCHECK
========= Out-of-range Shared or Local Address
========= at 0x00000450
========= by thread (15,11,0) in block (90,0,0)
========= Device Frame:fail(void) (fail(void) : 0x70)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x23cd7d]
========= Host Frame:./example [0x1961b]
========= Host Frame:./example [0x36d7e]
========= Host Frame:./example [0x3308]
========= Host Frame:./example [0x3237]
========= Host Frame:./example [0x3242]
========= Host Frame:./example [0x31c4]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
========= Host Frame:./example [0x3009]
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 [0x32f6a3]
========= Host Frame:./example [0x37160]
========= Host Frame:./example [0x31c9]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
========= Host Frame:./example [0x3009]