Is it thread-safe to malloc in threads of a kernel function?

Lennoxwu · October 25, 2017, 3:17am

I call malloc in each thread of a kernel function, however I notice a buffer allocated earlier by malloc is corrupted after a new allocation by malloc. The patterns written to the old buffer are special, so I doubt the corruptation is caused by malloc. A value is written in the head of each 80 bytes, and these value increase 0x10000000.

I set a breakpoint at the malloc. When the program is paused at the breakpoint, the data is right. After I execute single step to cross the malloc, then the data is corrupted.(please look the second image).

Is it safe to call malloc in each thread of a kernel? The new malloc return a reasonable address(not 0), so I think the heap is not full. Actually, I am sure the sizes of each buffer is small(<100B) and the number of active threads is few(<30).

I tried set address breakpoint to catch write operation at the address, but nothing happen.

memory before corruption
External Media

memory after corruption
External Media

External Media

Robert_Crovella · October 25, 2017, 2:27pm

It should be safe to use in-kernel malloc in a multi-threaded way. If you have a short, complete example code that demonstrates the corruption, others may be able to help you. It may be the case that you have some other defect in your code giving rise to the corruption.

You should also make sure you are not running out of device heap space, but in that case malloc will return a NULL pointer.

Lennoxwu · October 25, 2017, 4:51pm

Hi txbob,
Thanks. I will try to simplify my program.
I provide some screenshots to describe the issue.
I use breakpoints to monitor the case, so other defects should not be executed to cause the situation, right?

just_another_dev · December 7, 2017, 4:46pm

Hi

I just submitted a bug for this, but wanted to follow up here too if anyone else is running into something similar.

There is definitely a bug in the heap management.

Simple example, run on Tesla K80, Cuda 8.0 and Cuda 9.0:

example.cu:
#include <stdio.h>

#define LEN 16384

global void fail()
{
float p1 = (float)malloc(LENsizeof(float));
float p2 = (float)malloc(LENsizeof(float));
for (int i = 0; i < LEN; i++) {
p1[i] = 3.14;
}
free(p1);
free(p2);
}

int main(void)
{
size_t lim;
cudaDeviceGetLimit(&lim, cudaLimitMallocHeapSize);
printf(“%ld\n”, lim);
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 610241024*1024UL)
cudaDeviceGetLimit(&lim, cudaLimitMallocHeapSize);
printf(“%ld\n”, lim);

dim3 threads(16, 16, 1);
dim3 grid(512, 512, 1);
fail<<<grid, threads>>>();
cudaDeviceSynchronize();

}

nvcc -arch sm_37 -o example example.cu

cuda-memcheck ./example

njuffa · December 7, 2017, 7:57pm

The kernel fail() tries to allocate much more memory than is available on the GPU (each thread allocates 128 KB, and there are 2**26 threads, for a total in the TB range). So many of the device-side malloc() calls will fail. But the pointer returned by malloc() is used without checking. That can’t be good. What do you expect to happen?

just_another_dev · December 7, 2017, 8:10pm

Your number of simultaneous threads executing is incorrect. But to make the example provided more direct, the following decreases the allocation size, and checks the pointers - same result.

#include <stdio.h>

#define LEN 512

global void fail()
{
float p1 = (float)malloc(LENsizeof(float));
if (p1 == NULL) {
return;
}
float p2 = (float)malloc(LENsizeof(float));
if (p2 == NULL) {
free(p1);
return;
}
for (int i = 0; i < LEN; i++) {
p1[i] = 3.14;
}
free(p1);
free(p2);
}

int main(void)
{
size_t lim;
cudaDeviceGetLimit(&lim, cudaLimitMallocHeapSize);
printf(“%ld\n”, lim);
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 610241024*1024UL);
cudaDeviceGetLimit(&lim, cudaLimitMallocHeapSize);
printf(“%ld\n”, lim);

dim3 threads(16, 16, 1);
dim3 grid(512, 512, 1);
fail<<<grid, threads>>>();
cudaDeviceSynchronize();

}

Robert_Crovella · December 7, 2017, 11:36pm

when I modify this:

cudaDeviceSetLimit(cudaLimitMallocHeapSize, 610241024*1024UL);

to this:

cudaDeviceSetLimit(cudaLimitMallocHeapSize, 310241024*1024UL);

and run it on a K20X on CUDA 8, it works fine, whether I run it using cuda-memcheck or not.

You say “same result” but I don’t see anywhere that you’ve given a description of what the failing result is.

just_another_dev · December 8, 2017, 12:32am

Given it works on the K20 it might depend on the compute capability? I can’t see how the second version of my code should generate any error at all - no matter the heap size it should just bail on NULLs now.

cuda-memcheck:

========= CUDA-MEMCHECK
========= Out-of-range Shared or Local Address
========= at 0x00000450
========= by thread (15,11,0) in block (90,0,0)
========= Device Frame:fail(void) (fail(void) : 0x70)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x23cd7d]
========= Host Frame:./example [0x1961b]
========= Host Frame:./example [0x36d7e]
========= Host Frame:./example [0x3308]
========= Host Frame:./example [0x3237]
========= Host Frame:./example [0x3242]
========= Host Frame:./example [0x31c4]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
========= Host Frame:./example [0x3009]

========= Malloc/Free error encountered : Heap corruption
========= at 0x000005f0
========= by thread (11,12,0) in block (39,0,0)
========= Address 0x234054f720 size 0x00000800
========= Device Frame:fail(void) (fail(void) : 0x28)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x23cd7d]
========= Host Frame:./example [0x1961b]
========= Host Frame:./example [0x36d7e]
========= Host Frame:./example [0x3308]
========= Host Frame:./example [0x3237]
========= Host Frame:./example [0x3242]
========= Host Frame:./example [0x31c4]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
========= Host Frame:./example [0x3009]

========= Malloc/Free error encountered : Heap corruption
========= at 0x000005f0
========= by thread (10,12,0) in block (39,0,0)
========= Address 0x234054dd40 size 0x00000800
========= Device Frame:fail(void) (fail(void) : 0x28)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x23cd7d]
========= Host Frame:./example [0x1961b]
========= Host Frame:./example [0x36d7e]
========= Host Frame:./example [0x3308]
========= Host Frame:./example [0x3237]
========= Host Frame:./example [0x3242]
========= Host Frame:./example [0x31c4]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
========= Host Frame:./example [0x3009]

========= Malloc/Free error encountered : Heap corruption
========= at 0x000005f0
========= by thread (14,11,0) in block (79,0,0)
========= Address 0x233fa55820 size 0x00000800
========= Device Frame:fail(void) (fail(void) : 0x28)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x23cd7d]
========= Host Frame:./example [0x1961b]
========= Host Frame:./example [0x36d7e]
========= Host Frame:./example [0x3308]
========= Host Frame:./example [0x3237]
========= Host Frame:./example [0x3242]
========= Host Frame:./example [0x31c4]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]

…

========= Malloc/Free error encountered : Heap corruption
========= at 0x000005f0
========= by thread (2,2,0) in block (25,0,0)
========= Address 0x234104bc00 size 0x00000800
========= Device Frame:fail(void) (fail(void) : 0x70)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x23cd7d]
========= Host Frame:./example [0x1961b]
========= Host Frame:./example [0x36d7e]
========= Host Frame:./example [0x3308]
========= Host Frame:./example [0x3237]
========= Host Frame:./example [0x3242]
========= Host Frame:./example [0x31c4]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
========= Host Frame:./example [0x3009]

========= Malloc/Free error encountered : Heap corruption
========= at 0x000005f0
========= by thread (1,2,0) in block (25,0,0)
========= Address 0x234104b360 size 0x00000800
========= Device Frame:fail(void) (fail(void) : 0x70)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x23cd7d]
========= Host Frame:./example [0x1961b]
========= Host Frame:./example [0x36d7e]
========= Host Frame:./example [0x3308]
========= Host Frame:./example [0x3237]
========= Host Frame:./example [0x3242]
========= Host Frame:./example [0x31c4]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
========= Host Frame:./example [0x3009]

========= Malloc/Free error encountered : Heap corruption
========= at 0x000005f0
========= by thread (14,13,0) in block (12,0,0)
========= Address 0x234104e220 size 0x00000800
========= Device Frame:fail(void) (fail(void) : 0x28)
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x23cd7d]
========= Host Frame:./example [0x1961b]
========= Host Frame:./example [0x36d7e]
========= Host Frame:./example [0x3308]
========= Host Frame:./example [0x3237]
========= Host Frame:./example [0x3242]
========= Host Frame:./example [0x31c4]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
========= Host Frame:./example [0x3009]

========= Program hit cudaErrorLaunchFailure (error 4) due to “unspecified launch failure” on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib64/nvidia/libcuda.so.1 [0x32f6a3]
========= Host Frame:./example [0x37160]
========= Host Frame:./example [0x31c9]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21c05]
========= Host Frame:./example [0x3009]

========= ERROR SUMMARY: 407 errors

Topic		Replies	Views
[SOLVED]memory leak issues CUDA Programming and Performance	12	3623	October 12, 2015
cudaHostRegister on multiple threads CUDA Programming and Performance	15	279	June 18, 2024
malloc isn't found when used in a header file CUDA Programming and Performance	9	5086	December 19, 2010
Can we do malloc inside a __global__ function CUDA Programming and Performance	26	9691	February 21, 2010
malloc memory in kernel linked via in/out variable CUDA Programming and Performance	10	1935	October 17, 2015
using cudaMalloc and cudaFree within a loop unspecified launch failure! CUDA Programming and Performance	21	37700	April 23, 2009
Stack overflow error with compute-sanitizer CUDA Programming and Performance ubuntu	6	1026	October 4, 2023
CUDA + CPU threads CUDA Programming and Performance	5	11663	August 20, 2008
Thread safety of cudaMalloc and cudaFree with multiple GPUs CUDA Programming and Performance	3	1887	December 4, 2020
The result of device code is wrong when without a "printf" CUDA Programming and Performance	8	2013	October 22, 2016

Is it thread-safe to malloc in threads of a kernel function?

Related topics