Problem with cudaMalloc

bugBot · October 28, 2008, 7:36am

Hi,

I have a problem with cudaMalloc in my kernel wrapper. I have the following piece of code:

main() {
    …
    for (i = 0; i < TIMES; i++) {
        kernel_wrapper();
    }
    …
    …
}
// End main

// Start kernel wrapper.
kernel_wrapper() {
    …
    …
    // par_output is an (N+1) x (N+1) host array which is initialized.
    CUT_SAFE_CALL(cudaMalloc((void**)&output, sizeof(float)*(N+1)*(N+1)));
    CUT_SAFE_CALL(cudaMalloc((void**)&op1, sizeof(float)*(N+1)*(N+1)));
    // ** SEGFAULTING HERE. BECAUSE THE MALLOC DID NOT SUCCEED I BELIEVE.
    CUT_SAFE_CALL(cudaMemcpy(output, par_output, sizeof(float)*(N+1)*(N+1), cudaMemcpyHostToDevice));
    CUT_SAFE_CALL(cudaMemcpy(op1, par_output, sizeof(float)*(N+1)*(N+1), cudaMemcpyHostToDevice));

    for (int k = 0; k < M; k++) {
        actualKernel<<< grid, threads >>>(N, M, output, op1, h, blx, bly, TBx, TBy);
        // I am passing output and op1, which are the (N+1) x (N+1) arrays, to this kernel for some computation.
        cudaMemcpy(op1, output, sizeof(float)*(N+1)*(N+1), cudaMemcpyDeviceToDevice);
    }
    …
    CUT_SAFE_CALL(cudaFree(output));
    CUT_SAFE_CALL(cudaFree(op1));
}

Note that I am calling kernel_wrapper multiple times. There is no problem at all when kernel_wrapper executes the first two times. But when it executes the third time, I get a segfault at ** (please see the code comment above). Some other points to note:

Both cudaMalloc calls return 0 for the first two executions of kernel_wrapper. For the third execution they return a value of 4. What does this mean?
This problem happens only for values of N >= 256 (the previous value, 128, and smaller values work fine for any number of calls to kernel_wrapper).
If I comment out the call to actualKernel (the kernel itself), there is no segfault.
I have not posted the entire code because it is not tidy and is tough to understand.
I am using CUDA 2.0 on a Linux machine.

Any help will be appreciated. Thanks!

QD4_33 · October 28, 2008, 11:52am

I use a function like…

void checkCudaError(const char *msg)
{
	cudaError_t err = cudaGetLastError();
	if( cudaSuccess != err)
	{
		fprintf(stderr, "CUDA error> %s %s.\n", msg, cudaGetErrorString( err) );
		exit(EXIT_FAILURE);
	}
}

… to find out the last error. Call this function at different places in your code, because it only shows the most recent error, and that error may have been caused by an earlier call.

After a first read of your code, I think you are running out of memory. The host thread does not wait for your kernel: the kernel launch returns immediately to the host thread, before the kernel has finished. Perhaps you allocate too much memory while earlier kernels are still running. You can use cudaThreadSynchronize to avoid this behavior.

You can check your free memory with cuMemGetInfo.
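As a rough illustration of the advice above, here is a minimal sketch that checks every call, synchronizes after the launch, and prints the free device memory first. Some assumptions: scaleKernel and the check helper are made-up names for this example, cudaMemGetInfo is the runtime-API counterpart of cuMemGetInfo, and cudaDeviceSynchronize is the newer name for cudaThreadSynchronize, so older toolkits may need the original calls instead.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: same idea as checkCudaError, but it takes the
// error code returned by the call instead of asking for the last error.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error> %s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Hypothetical stand-in for the real kernel.
__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard against out-of-range threads
        data[i] *= 2.0f;
}

int main()
{
    const int n = 257 * 257;        // (N+1)*(N+1) elements for N = 256

    // Report how much device memory is free before allocating.
    size_t freeBytes = 0, totalBytes = 0;
    check(cudaMemGetInfo(&freeBytes, &totalBytes), "cudaMemGetInfo");
    printf("free %zu of %zu bytes\n", freeBytes, totalBytes);

    float *d = NULL;
    check(cudaMalloc((void **)&d, n * sizeof(float)), "cudaMalloc");
    check(cudaMemset(d, 0, n * sizeof(float)), "cudaMemset");

    scaleKernel<<<(n + 255) / 256, 256>>>(d, n);
    check(cudaGetLastError(), "kernel launch");          // launch-time errors
    check(cudaDeviceSynchronize(), "kernel execution");  // errors while running

    check(cudaFree(d), "cudaFree");
    return 0;
}
```

Checking right after the synchronize call should make the error show up next to the launch that caused it, rather than at some later cudaMalloc or cudaMemcpy.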

bugBot · October 29, 2008, 1:28am

Thanks for the tip!

I do have a cudaThreadSynchronize() in the kernel_wrapper function as below:

for (int k = 0; k < M; k++) {
    actualKernel<<< grid, threads >>>(N, M, output, op1, h, blx, bly, TBx, TBy);
    // I am passing output and op1, which are the (N+1) x (N+1) arrays, to this kernel for some computation.
    cudaMemcpy(op1, output, sizeof(float)*(N+1)*(N+1), cudaMemcpyDeviceToDevice);

    cudaThreadSynchronize();
}

I used checkCudaError() and I was able to pin down the problem. I get an “unspecified launch failure” just after the call to cudaThreadSynchronize(). Is this related to this thread? http://forums.nvidia.com/index.php?showtopic=42785

I also tried removing the cudaThreadSynchronize() from my code, as it is not necessary. I get the same “unspecified launch failure” after the device-to-host copy, as below (see the “*** FAILURE AFTER THIS FUNCTION***” comment towards the end of the code):

main() {
    …
    for (i = 0; i < TIMES; i++) {
        kernel_wrapper();
    }
    …
    …
}
// End main

// Start kernel wrapper.
kernel_wrapper() {
    …
    …
    // par_output is an (N+1) x (N+1) host array which is initialized.
    CUT_SAFE_CALL(cudaMalloc((void**)&output, sizeof(float)*(N+1)*(N+1)));
    CUT_SAFE_CALL(cudaMalloc((void**)&op1, sizeof(float)*(N+1)*(N+1)));
    // ** SEGFAULTING HERE. BECAUSE THE MALLOC DID NOT SUCCEED I BELIEVE.
    CUT_SAFE_CALL(cudaMemcpy(output, par_output, sizeof(float)*(N+1)*(N+1), cudaMemcpyHostToDevice));
    CUT_SAFE_CALL(cudaMemcpy(op1, par_output, sizeof(float)*(N+1)*(N+1), cudaMemcpyHostToDevice));

    for (int k = 0; k < M; k++) {
        actualKernel<<< grid, threads >>>(N, M, output, op1, h, blx, bly, TBx, TBy);
        // I am passing output and op1, which are the (N+1) x (N+1) arrays, to this kernel for some computation.
        cudaMemcpy(op1, output, sizeof(float)*(N+1)*(N+1), cudaMemcpyDeviceToDevice);
    }

    cudaMemcpy(par_output, output, sizeof(float)*(N+1)*(N+1), cudaMemcpyDeviceToHost); // *** FAILURE AFTER THIS FUNCTION***

    …
    CUT_SAFE_CALL(cudaFree(output));
    CUT_SAFE_CALL(cudaFree(op1));
}

Any help will be greatly appreciated. Thanks!

tmurray · October 29, 2008, 1:43am

Unspecified launch error is a fancy way of saying “you have a segfault in your kernel somewhere.” Compile with -deviceemu and run in valgrind.

bugBot · October 29, 2008, 6:15am

Found the problem! It was indeed because of a seg fault in the kernel. Thanks a lot! Am wondering why I am not getting a seg fault though. I thought I used to get a seg fault earlier for such errors.
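On the question of why there is no host segfault: as far as I understand it, a bad memory access inside a kernel does not crash the host process. It is reported as an error code by a later runtime call, and calls after that in the same context keep failing too, which would fit cudaMalloc returning a nonzero value on a later pass. A minimal sketch to observe this, assuming a made-up kernel badKernel that deliberately writes through a null pointer (the exact error strings depend on the toolkit version, so none are shown here):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that writes through an invalid pointer on purpose.
__global__ void badKernel(float *p)
{
    p[threadIdx.x] = 1.0f;
}

int main()
{
    // Launch with a null device pointer; the host code keeps running.
    badKernel<<<1, 32>>>(NULL);

    // Usually only launch-configuration problems are visible at this point.
    cudaError_t launchErr = cudaGetLastError();
    printf("after launch:     %s\n", cudaGetErrorString(launchErr));

    // The fault inside the kernel is normally reported once the host waits.
    cudaError_t syncErr = cudaDeviceSynchronize();
    printf("after sync:       %s\n", cudaGetErrorString(syncErr));

    // Later, unrelated calls can report the earlier kernel failure as well.
    float *d = NULL;
    cudaError_t mallocErr = cudaMalloc((void **)&d, 1024);
    printf("later cudaMalloc: %s\n", cudaGetErrorString(mallocErr));

    return 0;
}
```

If the error codes are never checked, the host usually only crashes later, when it uses a pointer or buffer that the failed calls never filled in.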
