cuda kernel call within for loop gets slow, crashes

Hi-

I’m fairly inexperienced with CUDA, so I hope this is an easy fix; however, I really can’t figure it out. I have a kernel that I call from a for loop. It works fine for small data sets, but for larger data sets it gets slower over time and eventually freezes my computer completely.

In brief, here is a sketch of the host side of things with many of the repetitive parameters removed:

//The function that calls the kernel, called iteratively after a few other changes are made each cycle in the main program

void runTimeStep(const float* h_T1, const float* h_Qr, const char* h_Domain, ..., float* h_T0) {

    float* d_kappa, ...;
    ...
    char* d_Domain;

    // allocate parameters
    cutilSafeCall( cudaMalloc((void**) &d_kappa, sizeof(float)*kappa.length()) );
    // ...and some more parameters

    // allocate arrays as 1D
    cutilSafeCall( cudaMalloc((void**) &d_T1, sizeof(float)*(Nx*Ny*Nz)) );
    ...

    // copy params
    cutilSafeCall( cudaMemcpy(d_kappa, h_kappa, sizeof(float)*kappa.length(), cudaMemcpyHostToDevice) );
    ...

    // copy 1D arrays
    cutilSafeCall( cudaMemcpy(d_T1, h_T1, sizeof(float)*(Nx*Ny*Nz), cudaMemcpyHostToDevice) );
    ...

    // output
    float* d_T0;
    cutilSafeCall( cudaMalloc((void**) &d_T0, sizeof(float)*(Nx*Ny*Nz)) );
    cutilSafeCall( cudaMemcpy(d_T0, h_T0, sizeof(float)*(Nx*Ny*Nz), cudaMemcpyHostToDevice) );

    // set up execution parameters
    dim3 grid( (Nx/16)+1, (Ny/28)+1, 1 );
    dim3 threads( 16, 28, 1 );

    onetimestep<<<grid, threads>>>(d_T1, d_Qr, d_Domain, d_dh2, d_kappa, ..., d_T0);

    cudaMemcpy(h_T0, d_T0, sizeof(float)*(Nx*Ny*Nz), cudaMemcpyDeviceToHost); // result!

    cudaFree(d_kappa);
    cudaFree(d_T1);
    ...
    cudaFree(d_T0);

    cutilDeviceReset();
}

…and here is the kernel itself, again modified for brevity:

#include <stdio.h>

__global__ void onetimestep( const float* T1, ..., float* T2 ) {

    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    int ind, ...;
    ...
    float GradPlusLap;
    float Perfusion;

    for (int k = 0; k < (*Nz); k++) {

        ind = k*Nxy + j*(*Ny) + i;

        if (i > 1 && j > 1 && k > 1 && i < (*Nx)-2 && j < (*Ny)-2 && k < (*Nz)-2) {

            GradPlusLap = ...;
            Perfusion = ...;

            T2[ind] = T1[ind] + *dt * (GradPlusLap - Perfusion + (...)) /
                      (...);
        }
        else {
            T2[ind] = T1[ind];
        } // End if
    } // End k
}

I’m confident that the kernel itself works as it gives me the proper result for small data sets. Do I have a memory leak? I know I cudaFree() all my device variables, and do a cutilDeviceReset() following each call.

Any help would be much appreciated.

I’ve read that kernel launches are asynchronous, so I would expect to have to do a device synchronization after the kernel launch to be sure all the device threads are done before copying, freeing, and resetting.
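Something along these lines is what I mean; this is only a generic sketch with a dummy kernel, not your actual code:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scaleKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // stand-in for the real work
}

int main() {
    const int n = 1 << 20;
    float* d_data = NULL;
    cudaMalloc((void**)&d_data, n * sizeof(float));

    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    // The launch returns immediately; wait for the kernel to finish (and
    // surface any error) before copying back, freeing, or resetting.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return 0;
}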

Take the memory allocation and freeing out of the loop, and especially the device reset. These are expensive operations.
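The overall shape would be something like this; just a sketch with a dummy kernel and made-up sizes, but the structure is the point:

#include <cuda_runtime.h>

__global__ void oneStep(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;   // stand-in for one time step
}

int main() {
    const int n = 1 << 20, nSteps = 1000;
    float *d_in = NULL, *d_out = NULL;

    // Allocate (and copy the initial data) ONCE, before the time loop.
    cudaMalloc((void**)&d_in,  n * sizeof(float));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    for (int step = 0; step < nSteps; ++step) {
        oneStep<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        float* tmp = d_in; d_in = d_out; d_out = tmp;   // next step reads the previous result
    }
    // (copy the final result, which ends up in d_in after the swaps, back to the host here)

    // Free ONCE, after the loop; reset, if at all, once at program exit.
    cudaFree(d_in);
    cudaFree(d_out);
    cudaDeviceReset();
    return 0;
}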

Thanks, all. I eliminated the crashing and the progressive slowdown by removing the cutilSafeCall() wrappers, but my attempts to move the memory allocation out of the loop have so far failed. I get errors when I try to use the memory (much of which remains unchanged) from a different kernel call. Is there a resource or example anyone knows of for accessing previously transferred memory?

Thanks again.

Removing the cutilSafeCall() probably just means you have stopped checking the return values of the CUDA functions, so you are not finding the errors returned by CUDA until they cause a crash later. I would not consider that a fix just yet.
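If you want to get rid of the cutil wrappers rather than just deleting the checks, you can do the same thing by hand. A minimal sketch of the idea (checkCuda here is just an illustrative helper, not part of the CUDA API):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Rough stand-in for what cutilSafeCall() does: report and abort if a
// runtime call returns an error instead of silently ignoring it.
static void checkCuda(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main() {
    float* d_buf = NULL;
    checkCuda(cudaMalloc((void**)&d_buf, 1024 * sizeof(float)), "cudaMalloc");
    checkCuda(cudaMemset(d_buf, 0, 1024 * sizeof(float)), "cudaMemset");
    checkCuda(cudaFree(d_buf), "cudaFree");
    return 0;
}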

As long as you do not cudaFree a device pointer before passing it to another kernel, the memory region should continue to be valid. (Assuming the pointer was valid to begin with. This is why you have to check the return values from cudaMalloc().) No special steps are required to continue using a device pointer.
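A minimal sketch of what that looks like (dummy kernels, not your code): allocate and copy once, then pass the same pointer to as many launches, of whatever kernels, as you like.

#include <cuda_runtime.h>

__global__ void kernelA(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

__global__ void kernelB(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 0.5f;
}

int main() {
    const int n = 1 << 20;
    float* d_data = NULL;
    cudaMalloc((void**)&d_data, n * sizeof(float));   // check the return value in real code
    cudaMemset(d_data, 0, n * sizeof(float));

    // The same device pointer stays valid across any number of launches,
    // as long as it has not been cudaFree'd and the device has not been reset.
    kernelA<<<(n + 255) / 256, 256>>>(d_data, n);
    kernelB<<<(n + 255) / 256, 256>>>(d_data, n);

    cudaFree(d_data);
    return 0;
}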

cudaMemcpy is a blocking function: both the device and the host are blocked until the copy is complete. Unless you are using OpenMP or MPI, there is no need to add extra synchronisation.