Distance Transform CUDA problem

I’m trying to perform the distance transform of an image on GPU my code is:

//This kernel is just to prepare the matrix containing the distance transform, I put 0 (in distTransform matrix) to the points from where I want to compute the distances

__global__ void kernelPrepareDistTransf(u_char *matrix, float *distTransform, int row, int col)

{

	int idxI = blockIdx.y * blockDim.y + threadIdx.y;

	int idxJ = blockIdx.x * blockDim.x + threadIdx.x;

	while(idxI < row)

	{

		while(idxJ < col)

		{

			if ( matrix[idxI*col+idxJ] == 255 )

			{

				distTransform[idxI*col+idxJ] = 0;

			}

			else

			{

				distTransform[idxI*col+idxJ] = 255;

			}

			idxJ += blockDim.x*gridDim.x;

		}

		idxI += blockDim.y*gridDim.y;

		idxJ = blockIdx.x * blockDim.x + threadIdx.x;

	}

}

//Here is where the real distance is computed, for each pixel in the image that isn't equal to 0

__global__ void kernelDistanceTransform(u_char *matrix, float *distTransform, int row, int col)

{

	int idxI = blockIdx.y * blockDim.y + threadIdx.y;

	int idxJ = blockIdx.x * blockDim.x + threadIdx.x;

	int idxK, idxL;

	float distance;

	while(idxI < row)

	{

		while(idxJ < col)

		{

			if ( distTransform[idxI*col+idxJ] == 0 )

			{

				for (idxK = 0; idxK < row; idxK++)

				{

					for (idxL = 0; idxL < col; idxL++)

					{

						if ( distTransform[idxI*col+idxJ] != 0 ){

							distance = abs(idxK-idxI) + abs(idxL-idxJ);

							if ( distTransform[idxK*col+idxL] > distance )

							{

								distTransform[idxK*col+idxL] = distance;

							}

						}

					}

				}

			}

			idxJ += blockDim.x*gridDim.x;

		}

		idxI += blockDim.y*gridDim.y;

		idxJ = blockIdx.x * blockDim.x + threadIdx.x;

	}

}

void pprDistanceTransform(pprMatrix *matrix, pprMatrixf *distTransform, u_char type)

{

	u_char *d_matrix;

	float *d_transform;

	cudaEvent_t start, stop;

	float elapsedTime;

	dim3 blocks(3,3);

	dim3 threads(10,10);

	

	distTransform->row = matrix->row;

	distTransform->col = matrix->col;

	pprMatrixfMem(distTransform);

	

	//Create start, stop events.

	CHECK_ERROR( cudaEventCreate( &start ) );

	CHECK_ERROR( cudaEventCreate( &stop ) );

	

	//Launch the start event.

	CHECK_ERROR( cudaEventRecord( start, 0 ) );

	

	//Allocate memory on the GPU.

	CHECK_ERROR( cudaMalloc((void**)&d_matrix, sizeof(u_char)*matrix->row*matrix->col) );

	CHECK_ERROR( cudaMalloc((void**)&d_transform, sizeof(float)*matrix->row*matrix->col) );

	

	//Copy information from CPU(host) to GPU(device).

	CHECK_ERROR( cudaMemcpy( d_matrix, matrix->data, sizeof(u_char)*matrix->row*matrix->col, cudaMemcpyHostToDevice ) );

	

	//Launch the kernel computation to create the distance transform.

	kernelPrepareDistTransf<<<blocks,threads>>>(d_matrix, d_transform, matrix->row, matrix->col);

	CHECK_ERROR( cudaThreadSynchronize() );

	kernelDistanceTransform<<<blocks,threads>>>(d_matrix, d_transform, matrix->row, matrix->col);

	CHECK_ERROR( cudaThreadSynchronize() );	

	//Copy information from GPU(device) to CPU(host).

	CHECK_ERROR( cudaMemcpy( distTransform->data, d_transform, sizeof(float)*matrix->row*matrix->col, cudaMemcpyDeviceToHost ) );	

	

	//Launch the stop event.

	CHECK_ERROR( cudaEventRecord( stop, 0 ) );

	CHECK_ERROR( cudaEventSynchronize( stop ) );

	

	//Print the elapsed time.

	CHECK_ERROR( cudaEventElapsedTime( &elapsedTime, start, stop ) );

	printf( "%3.1f ms\n", elapsedTime );

	

	//Destroy the events.

	CHECK_ERROR( cudaEventDestroy( start ) );

	CHECK_ERROR( cudaEventDestroy( stop ) );

	

	//Free the allocated memory on GPU.

	cudaFree(d_matrix);

	cudaFree(d_transform);

}

But the problem is that the kernel kernelDistanceTransform seems not to be working as the result I have is the same as if I was launching just the kernelPrepareDistTransf

I need to compute the distance transform on GPU since on CPU its taking too much time … on images of 2 MP it takes 20 min so I hope with GPU’s help I can achieved it faster.

Well I find out the problem it had something to do with my test code sorry…

And in fact the code is running faster than CPU but there’s another problem, when I try to run the algorithm on images bigger or equal to 2 Megapixels these errors occurs

––CUDA error: cudaThreadSynchronize() returned “the launch timed out and was terminated”

––CUDA error: cudaMemcpy( distTransform->data, d_transform, sizeof(float)matrix->rowmatrix->col, cudaMemcpyDeviceToHost ) returned “the launch timed out and was terminated”

––CUDA error: cudaEventRecord( stop, 0 ) returned “the launch timed out and was terminated” ––CUDA error: cudaEventSynchronize( stop ) returned “the launch timed out and was terminated” ––CUDA error: cudaEventElapsedTime( &elapsedTime, start, stop ) returned “the launch timed out and was terminated” 0.0 ms

––CUDA error: cudaEventDestroy( start ) returned “the launch timed out and was terminated”

––CUDA error: cudaEventDestroy( stop ) returned “the launch timed out and was terminated”

I read something about that Nvidia prevents the access to global memory many times and indeed this algorithm takes 1 pixel and then computes the distance to all other pixels so I don’t know if I’ll be able to compute this over big images

Ideas?

Thanks !

Hi.

I still have the “launch time out” error message. I had another idea on how can I solve the algorithm but still that error.

My first attempt was like follows:

    [*]Each thread launched with the kernel check if its value in distTransf is zero

    [*]if it is then compute the distance from this pixel to all the other pixels that are not zero.

in this approach I had 2-fors inside a thread to go all over the matrix and measure the distance, too much work for every single thread, isn’t it?

Then my second approach is as follows:

    [*]On the host-side I check if the value is zero

    [*]if it is then launch a kernel where every thread measures the distance between the pixel gotten from host-side to the pixel of the thread.

in this aprroach I had 2-fors outside the device-code so I think this would be much faster and would avoid time out error but NOPE External Image here I put my code to see if someone can give me any ideas

__global__ void kernelPrepareDistTransf(u_char *matrix, float *distTransform, int row, int col)

{

	int idxI = blockIdx.y * blockDim.y + threadIdx.y;

	int idxJ = blockIdx.x * blockDim.x + threadIdx.x;

	while(idxI < row)

	{

		while(idxJ < col)

		{

			if ( matrix[idxI*col+idxJ] == 255 )

			{

				distTransform[idxI*col+idxJ] = 0;

			}

			else

			{

				distTransform[idxI*col+idxJ] = 125;

			}

			idxJ += blockDim.x*gridDim.x;

		}

		idxI += blockDim.y*gridDim.y;

		idxJ = blockIdx.x * blockDim.x + threadIdx.x;

	}

}

__global__ void kernelDistanceTransform(float *distTransform, int row, int col, int idxI, int idxJ)

{

	int idxK = blockIdx.y * blockDim.y + threadIdx.y;

	int idxL = blockIdx.x * blockDim.x + threadIdx.x;

	float distance;

	while(idxK < row)

	{

		while(idxL < col)

		{

			if ( distTransform[idxK*col+idxL] != 0 ){

				distance = abs(idxK-idxI) + abs(idxL-idxJ);

				if ( distTransform[idxK*col+idxL] > distance )

					distTransform[idxK*col+idxL] = distance;

			}

			idxJ += blockDim.x*gridDim.x;

		}

		idxI += blockDim.y*gridDim.y;

		idxJ = blockIdx.x * blockDim.x + threadIdx.x;

	}

}

void pprDistanceTransform(pprMatrix *matrix, pprMatrixf *distTransform, u_char type)

{

	u_char *d_matrix;

	float *d_transform, elapsedTime;

	int idxI, idxJ;

	cudaEvent_t start, stop;

	dim3 blocks(2,3);

	dim3 threads(16,16);

	

	distTransform->row = matrix->row;

	distTransform->col = matrix->col;

	pprMatrixfMem(distTransform);

	//Create start, stop events.

	CHECK_ERROR( cudaEventCreate( &start ) );

	CHECK_ERROR( cudaEventCreate( &stop ) );

	

	//Launch the start event.

	CHECK_ERROR( cudaEventRecord( start, 0 ) );

	

	//Allocate memory on the GPU.

	CHECK_ERROR( cudaMalloc((void**)&d_matrix, sizeof(u_char)*matrix->row*matrix->col) );

	CHECK_ERROR( cudaMalloc((void**)&d_transform, sizeof(float)*matrix->row*matrix->col) );

	

	//Copy information from CPU(host) to GPU(device).

	CHECK_ERROR( cudaMemcpy( d_matrix, matrix->data, sizeof(u_char)*matrix->row*matrix->col, cudaMemcpyHostToDevice ) );

	

	//Mark the spots from where we are going to measure the distances

	kernelPrepareDistTransf<<<blocks,threads>>>(d_matrix, d_transform, matrix->row, matrix->col);

	CHECK_ERROR( cudaThreadSynchronize() );

	//Computate the distance transform.

	for (idxI = 0; idxI < matrix->row; idxI++){

		for (idxJ = 0; idxJ < matrix->col; idxJ++){

			if ( matrix->data[idxI*matrix->col+idxJ] == 255 ){

				kernelDistanceTransform<<<blocks,threads>>>(d_transform, matrix->row, matrix->col, idxI, idxJ);

				CHECK_ERROR( cudaThreadSynchronize() );	

			}

		}

	}

	//Copy information from GPU(device) to CPU(host).

	CHECK_ERROR( cudaMemcpy( distTransform->data, d_transform, sizeof(float)*matrix->row*matrix->col, cudaMemcpyDeviceToHost ) );

	//Launch the stop event.

	CHECK_ERROR( cudaEventRecord( stop, 0 ) );

	CHECK_ERROR( cudaEventSynchronize( stop ) );

	

	//Print the elapsed time.

	CHECK_ERROR( cudaEventElapsedTime( &elapsedTime, start, stop ) );

	printf( "%3.1f ms\n", elapsedTime );

	

	//Destroy the events.

	CHECK_ERROR( cudaEventDestroy( start ) );

	CHECK_ERROR( cudaEventDestroy( stop ) );

	

	//Free the allocated memory on GPU.

	cudaFree(d_matrix);

	cudaFree(d_transform);

}

This kernel will be slow on any device, as it contains infinite loops (you check [font=“Courier New”]idxK[/font] and [font=“Courier New”]idxL[/font] but increment [font=“Courier New”]idxI[/font] and [font=“Courier New”]idxJ[/font]).

Using [font=“Courier New”]for[/font] loops would not only make it much more easy to spot this, but also helps the compiler to optimize the code as it is the standard idiom that about everybody else would use in this case.

This whole kernel does not do anything as you have two contradicting [font=“Courier New”]if[/font] conditionals.

OH THANK YOU !!! I didn’t see that…now is working.

I am interested in the DT. Do you have your code publicly available e.g. in Github?

fast CPU implementations of 2D distance transform are highly serial, and therefore not really amenable for a direct GPU port. In fact, a fast 2D distance transform implementaiton in CUDA could be quite complicated to think out and implement.

There is some work - see paper’Generalized Distance Transforms and Skeletons in Graphics’ by Strodzka and Tela.

There is a pretty simple and GPU friendly approach described here: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.8568&rep=rep1&type=pdf

Nice, seems to be a useful computational paradigm on the GPU.

I am also interested in the DT. Do you have your code publicly available e.g. in Github?