Problem with cudaMalloc

Hi,

I have a problem with cudaMalloc in the host code that wraps my kernel. Here is the relevant code:

main() {

    for(i=0; i<TIMES; i++) {
        kernel_wrapper();
    }

}
// End main

// Start kernel wrapper.
kernel_wrapper() {

    // par_output is an (N+1) * (N+1) host array which is initialized.
    CUT_SAFE_CALL(cudaMalloc((void**)&output, sizeof(float)*(N+1)*(N+1)));
    CUT_SAFE_CALL(cudaMalloc((void**)&op1, sizeof(float)*(N+1)*(N+1)));
    // ** SEGFAULTING HERE. BECAUSE THE MALLOC DID NOT SUCCEED I BELIEVE.
    CUT_SAFE_CALL(cudaMemcpy(output, par_output, sizeof(float)*(N+1)*(N+1), cudaMemcpyHostToDevice));
    CUT_SAFE_CALL(cudaMemcpy(op1, par_output, sizeof(float)*(N+1)*(N+1), cudaMemcpyHostToDevice));

    for(int k=0; k<M; k++) {
        actualKernel<<< grid, threads >>>(N, M, output, op1, h, blx, bly, TBx, TBy);
        // I am passing output and op1, which are the (N+1) * (N+1) arrays, to this kernel for some computation.
        cudaMemcpy(op1, output, sizeof(float)*(N+1)*(N+1), cudaMemcpyDeviceToDevice);
    }

    CUT_SAFE_CALL(cudaFree(output));
    CUT_SAFE_CALL(cudaFree(op1));
}

Note that I am calling kernel_wrapper multiple times. There is no problem at all when kernel_wrapper executes the first two times. But when it executes the third time, I get a segfault at ** (please see the code comment above). Some other points to note:

  • Both cudaMalloc calls return 0 for the first two executions of kernel_wrapper. For the third execution they return a value of 4. What does this mean?
  • This problem happens only for values of N >= 256 (the previous value, 128, and smaller values work fine for any number of calls to kernel_wrapper).
  • If I comment out the call to actualKernel (the kernel itself), there is no segfault.
  • I have not posted the entire code because it is not tidy and is tough to understand.
  • I am using CUDA 2.0 on a Linux machine.

Any help will be appreciated. Thanks!

I use a function like…

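#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Prints the most recent CUDA runtime error (if any) and exits.
void checkCudaError(const char *msg)
{
	cudaError_t err = cudaGetLastError();
	if( cudaSuccess != err)
	{
		fprintf(stderr, "CUDA error> %s %s.\n", msg, cudaGetErrorString( err) );
		exit(EXIT_FAILURE);
	}
}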

… to find out the last error. Call this function at several places in your code: it only reports the most recent error, and that error may itself have been caused by an earlier one…

After a first read of your code, I think you are running out of memory. The host thread does not wait for your kernel: the kernel launch returns immediately to the host thread, before any result is available. Perhaps you allocate too much memory while earlier work is still running. You can use cudaThreadSynchronize to avoid this behavior.
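To make the point about asynchronous launches concrete, here is a minimal sketch of checking for errors both right after a launch and after waiting for the kernel. The kernel doubleKernel and the helpers blocksFor and launchAndCheck are hypothetical names made up for this illustration, not part of the original code. It uses cudaDeviceSynchronize, the name newer toolkits use for the operation that cudaThreadSynchronize performs in CUDA 2.0.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Number of blocks needed to cover n elements (rounds up).
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical kernel for illustration: doubles each element.
__global__ void doubleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // threads past the end of the array do nothing
        data[i] *= 2.0f;
}

// Returns true only if allocation, launch, and execution all succeeded.
bool launchAndCheck(int n)
{
    float *d = NULL;
    size_t bytes = (size_t)n * sizeof(float);
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess)
        return false;
    cudaMemset(d, 0, bytes);

    const int threads = 128;
    doubleKernel<<<blocksFor(n, threads), threads>>>(d, n);

    // Errors detected at launch time (for example, a bad configuration).
    cudaError_t launchErr = cudaGetLastError();

    // Errors raised while the kernel ran only show up once the host waits.
    cudaError_t syncErr = cudaDeviceSynchronize();

    if (launchErr != cudaSuccess)
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(launchErr));
    if (syncErr != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(syncErr));

    cudaFree(d);
    return launchErr == cudaSuccess && syncErr == cudaSuccess;
}
```

Without the synchronize, a failure inside the kernel would typically be reported by whichever runtime call comes next, such as a later cudaMemcpy or cudaMalloc, which makes the failing call look like the culprit.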

You can check your free memory with cuMemGetInfo.
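As a rough sketch of that check, the following uses cudaMemGetInfo, the runtime API counterpart of cuMemGetInfo, on the assumption that the installed toolkit provides it (it may not exist in CUDA 2.0). The helper names gridBytes and reportDeviceMemory are hypothetical. It also computes how much memory the two (N+1) * (N+1) float arrays from the question need.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Bytes needed for one (n+1) x (n+1) array of floats, as in the question.
size_t gridBytes(size_t n)
{
    return sizeof(float) * (n + 1) * (n + 1);
}

// Prints free and total device memory next to what two arrays would need.
// Returns false if the memory query itself fails.
bool reportDeviceMemory(size_t n)
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    printf("free: %zu bytes, total: %zu bytes, needed for two arrays: %zu bytes\n",
           freeBytes, totalBytes, 2 * gridBytes(n));
    return true;
}
```

For N = 256, one array is 4 * 257 * 257 = 264196 bytes, so the two arrays together need roughly half a megabyte. Calling reportDeviceMemory(256) at the top of kernel_wrapper on each iteration would show whether free memory is actually shrinking between calls.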

Thanks for the tip!

I do have a cudaThreadSynchronize() in the kernel_wrapper function, as below:

for(int k=0; k<M; k++) {
    actualKernel<<< grid, threads >>>(N, M, output, op1, h, blx, bly, TBx, TBy);
    // I am passing output and op1, which are the (N+1) * (N+1) arrays, to this kernel for some computation.
    cudaMemcpy(op1, output, sizeof(float)*(N+1)*(N+1), cudaMemcpyDeviceToDevice);

    cudaThreadSynchronize();
}

I used checkCudaError() and I was able to pin down the problem. I get an “unspecified launch failure” just after the call to cudaThreadSynchronize(). Is this related to this thread? http://forums.nvidia.com/index.php?showtopic=42785

I also tried removing the cudaThreadSynchronize() from my code (as it is not necessary). I get the same “unspecified launch failure” after the device-to-host copy, as shown below (see the “*** FAILURE AFTER THIS FUNCTION ***” comment towards the end of the code).

main() {

    for(i=0; i<TIMES; i++) {
        kernel_wrapper();
    }

}
// End main

// Start kernel wrapper.
kernel_wrapper() {

    // par_output is an (N+1) * (N+1) host array which is initialized.
    CUT_SAFE_CALL(cudaMalloc((void**)&output, sizeof(float)*(N+1)*(N+1)));
    CUT_SAFE_CALL(cudaMalloc((void**)&op1, sizeof(float)*(N+1)*(N+1)));
    // ** SEGFAULTING HERE. BECAUSE THE MALLOC DID NOT SUCCEED I BELIEVE.
    CUT_SAFE_CALL(cudaMemcpy(output, par_output, sizeof(float)*(N+1)*(N+1), cudaMemcpyHostToDevice));
    CUT_SAFE_CALL(cudaMemcpy(op1, par_output, sizeof(float)*(N+1)*(N+1), cudaMemcpyHostToDevice));

    for(int k=0; k<M; k++) {
        actualKernel<<< grid, threads >>>(N, M, output, op1, h, blx, bly, TBx, TBy);
        // I am passing output and op1, which are the (N+1) * (N+1) arrays, to this kernel for some computation.
        cudaMemcpy(op1, output, sizeof(float)*(N+1)*(N+1), cudaMemcpyDeviceToDevice);
    }

    cudaMemcpy(par_output, output, sizeof(float)*(N+1)*(N+1), cudaMemcpyDeviceToHost); // *** FAILURE AFTER THIS FUNCTION ***

    CUT_SAFE_CALL(cudaFree(output));
    CUT_SAFE_CALL(cudaFree(op1));
}

Any help will be greatly appreciated. Thanks!

Unspecified launch error is a fancy way of saying “you have a segfault in your kernel somewhere.” Compile with -deviceemu and run in valgrind.
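One common way a kernel ends up reading or writing out of range is a grid that launches more threads than there are array elements, with no guard on the index. The real actualKernel was not posted, so the following is only a sketch of that pattern under that assumption. guardedCopy and inBounds are hypothetical names made up for this illustration.

```cuda
#include <cuda_runtime.h>

// True when (row, col) lies inside a side x side array.
__host__ __device__ inline bool inBounds(int row, int col, int side)
{
    return row >= 0 && row < side && col >= 0 && col < side;
}

// Hypothetical stand-in for actualKernel: copies in[] to out[] element-wise.
// Both arrays hold side * side floats, with side = N + 1 in the question.
__global__ void guardedCopy(const float *in, float *out, int side)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    // Surplus threads in the last row and column of blocks must not touch memory.
    if (!inBounds(row, col, side))
        return;

    out[row * side + col] = in[row * side + col];
}
```

For example, with side = 257 and 16 x 16 thread blocks, a 17 x 17 grid launches 272 x 272 threads, so the last 15 rows and columns of threads fall outside the array and rely on the guard.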

Found the problem! It was indeed because of a segfault in the kernel. Thanks a lot! I am wondering why I did not get a segfault on the host, though. I thought I used to get one for such errors.