How to structure the code using CUDA C++

Hello,

I would like to accelerate a C++ program with CUDA, but I don’t really know how to structure this. I’ll explain:

//main.cpp
void fnAlgoChain()
{
	//**********************************************************
	// Clean Image
	m_FluoroFilter->fnCleanImage(m_FluoroFilter->getCimagefDst());
	m_FluoroFilter->fnXrayRegul(m_FluoroFilter->getCimagefDst());

	//**********************************************************
	// pre filtering

The fnAlgoChain function is called in the main function. First of all, I would like to create a kernel which is equivalent to the method “fnCleanImage” and include this kernel in the fnAlgoChain function. It would look something like this:

//main.cpp 
void fnAlgoChain()
{
	//**********************************************************
	// Clean Image
	m_FluoroFilter->LAUNCH_KERNELfnCleanImage(m_FluoroFilter->getCimagefDst());

Do you think that will work?

The LAUNCH_KERNELfnCleanImage will look like this:

//kernel.cu
bool LAUNCH_KERNELfnCleanImage(CImg<float>& d_fpIn)
{
	
	dim3 Db(..,..), Dg(..,..); // don't know yet
	KERNEL_offset<<<Dg,Db>>>(d_fpIn, d_ucOff);

	return TestLastError("LAUNCH_KERNELfnCleanImage");
}
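(For reference, since CImg stores its pixels in host memory, a launcher like this generally also has to allocate device buffers and copy the data across before the kernel can touch it. A minimal sketch, assuming the kernel takes raw device pointers and a simple 1D launch; the buffer names and block size here are hypothetical, and TestLastError is the error-check helper from the snippet above:)

```cpp
//kernel.cu -- hedged sketch: allocate, copy, launch, copy back
bool LAUNCH_KERNELfnCleanImage(CImg<float>& fpIn, CImg<unsigned short>& ucOff)
{
	int n = fpIn.width() * fpIn.height();
	float* d_in = 0;
	unsigned short* d_off = 0;
	cudaMalloc(&d_in,  n * sizeof(float));
	cudaMalloc(&d_off, n * sizeof(unsigned short));
	cudaMemcpy(d_in,  fpIn.data(),  n * sizeof(float),          cudaMemcpyHostToDevice);
	cudaMemcpy(d_off, ucOff.data(), n * sizeof(unsigned short), cudaMemcpyHostToDevice);

	dim3 Db(256), Dg((n + 255) / 256);          // 1D launch covering every pixel
	KERNEL_offset<<<Dg, Db>>>(d_in, d_off, n);  // kernel receives raw pointers

	cudaMemcpy(fpIn.data(), d_in, n * sizeof(float), cudaMemcpyDeviceToHost);
	cudaFree(d_in);
	cudaFree(d_off);
	return TestLastError("LAUNCH_KERNELfnCleanImage");
}
```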

Here is the content of fnCleanImage:

//Filter.cpp
void CADFGB1_filter::fnCleanImage(CImg<float>& fpIn)
{
	_offset(fpIn,m_CimagefRefOffset);
	_gain(fpIn,m_CimagefRefCompressedGain);
	_defect(fpIn,m_CimagefRefDM);
}

So this method calls 3 other methods. I’d like to accelerate them too; here is just the content of the _offset method:

//Filter.cpp
void CADFGB1_filter::_offset(CImg<float>& fpIn, CImg<unsigned short>& ucOff)
{
	int nY = fpIn.height();
	int nX = fpIn.width();
	float* imIn  = static_cast<float*>(fpIn.data());
	unsigned short* imRef = static_cast<unsigned short*>(ucOff.data());
	for (int y=0; y<nY; y++) {
		int niY = y*nX;
		for (int x=0; x<nX; x++) {
			imIn[x+niY] = imIn[x+niY] - static_cast<float>(imRef[x+niY]);
		}
	}
}

I wrote an offset kernel. Is it necessary to put __device__ in front of int nY, int nX, float* imIn, and unsigned short* imRef (because the kernel will be executed on the device)?

//kernel.cu
__global__ void KERNEL_offset(CImg<float>& Input, CImg<unsigned short>& Off)
{
	int nY = Input.height(); //__device__  ??
	int nX = Input.width();  //__device__  ??
	float* imIn  = static_cast<float*>(Input.data());  //__device__  ??
	unsigned short* imRef = static_cast<unsigned short*>(Off.data());  //__device__  ??


	int i = blockDim.y*blockIdx.y + threadIdx.y; // row
	int j = blockDim.x*blockIdx.x + threadIdx.x; // column

	if ((i >= nY) || (j >= nX)) return;

	int niY = i*nX;  //__device__  ??
	imIn[j+niY] = imIn[j+niY] - static_cast<float>(imRef[j+niY]);
}
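(For comparison: local variables declared inside a __global__ function already live on the device, so no __device__ qualifier is needed on them. The bigger issue is that the CImg objects themselves sit in host memory, so a kernel usually receives raw device pointers plus the dimensions instead of CImg references. A sketch under that assumption:)

```cpp
//kernel.cu -- sketch: kernel over raw device pointers, 2D thread grid
__global__ void KERNEL_offset(float* imIn, const unsigned short* imRef,
                              int nX, int nY)
{
	int x = blockIdx.x * blockDim.x + threadIdx.x; // column index
	int y = blockIdx.y * blockDim.y + threadIdx.y; // row index
	if (x >= nX || y >= nY) return;                // guard partial blocks

	int idx = y * nX + x;                          // row-major offset
	imIn[idx] -= static_cast<float>(imRef[idx]);
}
```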

data() will be called in the kernel, whereas it is a host function defined in CImg.h. Is it a good idea to put __host__ __device__ in front of the function (in CImg.h)?

Also, is it possible to create a kernel which calls 3 others?

//kernel.cu
__global__ void KERNELfnCleanImage(CImg<float>& fpIn, CImg<unsigned short>& ucOff)
{
	KERNEL_offset<<<Dg,Db>>>(d_fpIn, d_m_CimagefRefOffset);
	KERNEL_gain<<<Dg,Db>>>(d_fpIn, d_m_CimagefRefCompressedGain);
	KERNEL_defect<<<Dg,Db>>>(d_fpIn, d_m_CimagefRefDM);
}

Thanks !

The only question I understood above was this one:

Also, is it possible to create a kernel which calls 3 others?

Yes, this is possible using CUDA Dynamic Parallelism. There is documentation, and there are also sample projects covering this.

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-dynamic-parallelism

Look for any of the sample codes that start with cdp.
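A minimal sketch of what the parent kernel could look like (signatures hypothetical; dynamic parallelism requires a device of compute capability 3.5+ and compilation with -rdc=true). Child kernels launched by the same parent thread into the same stream execute sequentially, which preserves the offset → gain → defect ordering of the CPU version:

```cpp
//kernel.cu -- hedged sketch of a parent kernel using dynamic parallelism
__global__ void KERNEL_offset(float* d_in, const unsigned short* d_off, int n);
__global__ void KERNEL_gain(float* d_in, const float* d_gain, int n);
__global__ void KERNEL_defect(float* d_in, const unsigned short* d_dm, int n);

__global__ void KERNELfnCleanImage(float* d_in,
                                   const unsigned short* d_off,
                                   const float* d_gain,
                                   const unsigned short* d_dm,
                                   int n)
{
	// Launch the children from a single parent thread only,
	// so each child grid is launched exactly once.
	if (blockIdx.x == 0 && threadIdx.x == 0) {
		dim3 Db(256), Dg((n + 255) / 256);
		KERNEL_offset<<<Dg, Db>>>(d_in, d_off, n);
		KERNEL_gain<<<Dg, Db>>>(d_in, d_gain, n);
		KERNEL_defect<<<Dg, Db>>>(d_in, d_dm, n);
	}
}
```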