Is it thread-safe to malloc in threads of a kernel function?

I call malloc in each thread of a kernel function, however I notice a buffer allocated earlier by malloc is corrupted after a new allocation by malloc. The patterns written to the old buffer are special, so I doubt the corruptation is caused by malloc. A value is written in the head of each 80 bytes, and these value increase 0x10000000.

I set a breakpoint at the malloc. When the program is paused at the breakpoint, the data is right. After I execute single step to cross the malloc, then the data is corrupted.(please look the second image).

Is it safe to call malloc in each thread of a kernel? The new malloc return a reasonable address(not 0), so I think the heap is not full. Actually, I am sure the sizes of each buffer is small(<100B) and the number of active threads is few(<30).

I tried set address breakpoint to catch write operation at the address, but nothing happen.

memory before corruption

memory after corruption


Screenshot_20171026_002252.png
Screenshot_20171026_002727.png
Screenshot_20171026_004323.png

It should be safe to use in-kernel malloc in a multi-threaded way. If you have a short, complete example code that demonstrates the corruption, others may be able to help you. It may be the case that you have some other defect in your code giving rise to the corruption.

You should also make sure you are not running out of device heap space, but in that case malloc will return a NULL pointer.

Hi txbob,
Thanks. I will try to simplify my program.
I provide some screenshots to describe the issue.
I use breakpoints to monitor the case, so other defects should not be executed to cause the situation, right?

Hi

I just submitted a bug for this, but wanted to follow up here too if anyone else is running into something similar.

There is definitely a bug in the heap management.

Simple example, run on Tesla K80, Cuda 8.0 and Cuda 9.0:

example.cu:
#include <stdio.h>

#define LEN 16384

global void fail()
{
float p1 = (float)malloc(LENsizeof(float));
float p2 = (float)malloc(LEN
sizeof(float));
for (int i = 0; i < LEN; i++) {
p1[i] = 3.14;
}
free(p1);
free(p2);
}

int main(void)
{
size_t lim;
cudaDeviceGetLimit(&lim, cudaLimitMallocHeapSize);
printf("%ld\n", lim);
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 610241024*1024UL)
cudaDeviceGetLimit(&lim, cudaLimitMallocHeapSize);
printf("%ld\n", lim);

dim3 threads(16, 16, 1);
dim3 grid(512, 512, 1);
fail<<<grid, threads>>>();
cudaDeviceSynchronize();

}

nvcc -arch sm_37 -o example example.cu

cuda-memcheck ./example

The kernel fail() tries to allocate much more memory than is available on the GPU (each thread allocates 128 KB, and there are 2**26 threads, for a total in the TB range). So many of the device-side malloc() calls will fail. But the pointer returned by malloc() is used without checking. That can’t be good. What do you expect to happen?

Your number of simultaneous threads executing is incorrect. But to make the example provided more direct, the following decreases the allocation size, and checks the pointers - same result.

#include <stdio.h>

#define LEN 512

global void fail()
{
float p1 = (float)malloc(LENsizeof(float));
if (p1 == NULL) {
return;
}
float p2 = (float)malloc(LEN
sizeof(float));
if (p2 == NULL) {
free(p1);
return;
}
for (int i = 0; i < LEN; i++) {
p1[i] = 3.14;
}
free(p1);
free(p2);
}

int main(void)
{
size_t lim;
cudaDeviceGetLimit(&lim, cudaLimitMallocHeapSize);
printf("%ld\n", lim);
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 610241024*1024UL);
cudaDeviceGetLimit(&lim, cudaLimitMallocHeapSize);
printf("%ld\n", lim);

dim3 threads(16, 16, 1);
dim3 grid(512, 512, 1);
fail<<<grid, threads>>>();
cudaDeviceSynchronize();

}

when I modify this:

cudaDeviceSetLimit(cudaLimitMallocHeapSize, 610241024*1024UL);

to this:

cudaDeviceSetLimit(cudaLimitMallocHeapSize, 310241024*1024UL);

and run it on a K20X on CUDA 8, it works fine, whether I run it using cuda-memcheck or not.

You say “same result” but I don’t see anywhere that you’ve given a description of what the failing result is.

Given it works on the K20 it might depend on the compute capability? I can’t see how the second version of my code should generate any error at all - no matter the heap size it should just bail on NULLs now.

cuda-memcheck:

========= CUDA-MEMCHECK
========= Out-of-range Shared or Local Address
========= at 0x00000450
========= by thread (15,11,0) in block (90,0,0)
========= Device Frame:fail(void) (fail(void) : 0x70)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x23cd7d]
========= Host Frame:./example [0x1961b]
========= Host Frame:./example [0x36d7e]
========= Host Frame:./example [0x3308]
========= Host Frame:./example [0x3237]
========= Host Frame:./example [0x3242]
========= Host Frame:./example [0x31c4]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
========= Host Frame:./example [0x3009]

========= Malloc/Free error encountered : Heap corruption
========= at 0x000005f0
========= by thread (11,12,0) in block (39,0,0)
========= Address 0x234054f720 size 0x00000800
========= Device Frame:fail(void) (fail(void) : 0x28)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x23cd7d]
========= Host Frame:./example [0x1961b]
========= Host Frame:./example [0x36d7e]
========= Host Frame:./example [0x3308]
========= Host Frame:./example [0x3237]
========= Host Frame:./example [0x3242]
========= Host Frame:./example [0x31c4]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
========= Host Frame:./example [0x3009]

========= Malloc/Free error encountered : Heap corruption
========= at 0x000005f0
========= by thread (10,12,0) in block (39,0,0)
========= Address 0x234054dd40 size 0x00000800
========= Device Frame:fail(void) (fail(void) : 0x28)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x23cd7d]
========= Host Frame:./example [0x1961b]
========= Host Frame:./example [0x36d7e]
========= Host Frame:./example [0x3308]
========= Host Frame:./example [0x3237]
========= Host Frame:./example [0x3242]
========= Host Frame:./example [0x31c4]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
========= Host Frame:./example [0x3009]

========= Malloc/Free error encountered : Heap corruption
========= at 0x000005f0
========= by thread (14,11,0) in block (79,0,0)
========= Address 0x233fa55820 size 0x00000800
========= Device Frame:fail(void) (fail(void) : 0x28)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x23cd7d]
========= Host Frame:./example [0x1961b]
========= Host Frame:./example [0x36d7e]
========= Host Frame:./example [0x3308]
========= Host Frame:./example [0x3237]
========= Host Frame:./example [0x3242]
========= Host Frame:./example [0x31c4]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]

========= Malloc/Free error encountered : Heap corruption
========= at 0x000005f0
========= by thread (2,2,0) in block (25,0,0)
========= Address 0x234104bc00 size 0x00000800
========= Device Frame:fail(void) (fail(void) : 0x70)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x23cd7d]
========= Host Frame:./example [0x1961b]
========= Host Frame:./example [0x36d7e]
========= Host Frame:./example [0x3308]
========= Host Frame:./example [0x3237]
========= Host Frame:./example [0x3242]
========= Host Frame:./example [0x31c4]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
========= Host Frame:./example [0x3009]

========= Malloc/Free error encountered : Heap corruption
========= at 0x000005f0
========= by thread (1,2,0) in block (25,0,0)
========= Address 0x234104b360 size 0x00000800
========= Device Frame:fail(void) (fail(void) : 0x70)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x23cd7d]
========= Host Frame:./example [0x1961b]
========= Host Frame:./example [0x36d7e]
========= Host Frame:./example [0x3308]
========= Host Frame:./example [0x3237]
========= Host Frame:./example [0x3242]
========= Host Frame:./example [0x31c4]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
========= Host Frame:./example [0x3009]

========= Malloc/Free error encountered : Heap corruption
========= at 0x000005f0
========= by thread (14,13,0) in block (12,0,0)
========= Address 0x234104e220 size 0x00000800
========= Device Frame:fail(void) (fail(void) : 0x28)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x23cd7d]
========= Host Frame:./example [0x1961b]
========= Host Frame:./example [0x36d7e]
========= Host Frame:./example [0x3308]
========= Host Frame:./example [0x3237]
========= Host Frame:./example [0x3242]
========= Host Frame:./example [0x31c4]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
========= Host Frame:./example [0x3009]

========= Program hit cudaErrorLaunchFailure (error 4) due to “unspecified launch failure” on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 [0x32f6a3]
========= Host Frame:./example [0x37160]
========= Host Frame:./example [0x31c9]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
========= Host Frame:./example [0x3009]

========= ERROR SUMMARY: 407 errors