Hello,
I would like to accelerate a C++ program with CUDA, but I don’t really know how to structure this. I’ll explain:
//main.cpp
void fnAlgoChain()
{
//**********************************************************
// Clean Image
m_FluoroFilter->fnCleanImage(m_FluoroFilter->getCimagefDst());
m_FluoroFilter->fnXrayRegul(m_FluoroFilter->getCimagefDst());
//**********************************************************
// pre filtering
The fnAlgoChain function is called from the main function. First of all, I would like to create a kernel that is equivalent to the method “fnCleanImage” and include this kernel in the fnAlgoChain function. It would look something like this:
//main.cpp
void fnAlgoChain()
{
//**********************************************************
// Clean Image
m_FluoroFilter->LAUNCH_KERNELfnCleanImage(m_FluoroFilter->getCimagefDst());
Do you think that will work?
LAUNCH_KERNELfnCleanImage would look like this:
//kernel.cu
bool LAUNCH_KERNELfnCleanImage(CImg<float>& d_fpIn)
{
dim3 Db(..,..), Dg(..,..); // don't know yet
KERNEL_offset<<<Dg,Db>>>(d_fpIn, d_ucOff);
return TestLastError("LAUNCH_KERNELfnCleanImage");
}
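For the block and grid sizes marked "don't know yet" above, a common choice is a fixed 2D block and a grid rounded up so that it covers the whole image. A minimal sketch, assuming a 16x16 block (an arbitrary but typical starting point) and a hypothetical helper named gridFor that is not part of my code:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not in the original code): smallest grid of
// block-sized tiles that covers an nX-by-nY image.
inline dim3 gridFor(int nX, int nY, dim3 block)
{
    // Round up so a partially filled tile at the right or bottom edge
    // still gets its own block.
    unsigned int gx = (nX + block.x - 1) / block.x;
    unsigned int gy = (nY + block.y - 1) / block.y;
    return dim3(gx, gy);
}
```

With dim3 Db(16, 16); dim3 Dg = gridFor(1000, 700, Db); a 1000x700 image gets a 63x44 grid. Because the grid overshoots the image, the kernel needs a bounds check before it touches memory.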
Here is the content of fnCleanImage:
//Filter.cpp
void CADFGB1_filter::fnCleanImage(CImg<float>& fpIn)
{
_offset(fpIn,m_CimagefRefOffset);
_gain(fpIn,m_CimagefRefCompressedGain);
_defect(fpIn,m_CimagefRefDM);
}
So this method calls three other methods. I’d like to accelerate them too; here is the content of just one of them, _offset:
//Filter.cpp
void CADFGB1_filter::_offset(CImg<float>& fpIn, CImg<unsigned short>& ucOff)
{
int nY = fpIn.height();
int nX = fpIn.width();
float* imIn = static_cast<float*>(fpIn.data());
unsigned short* imRef = static_cast<unsigned short*>(ucOff.data());
for (int y=0; y<nY; y++) {
int niY = y*nX;
for (int x=0; x<nX; x++) {
imIn[x+niY] = imIn[x+niY] - static_cast<float>(imRef[x+niY]);
}
}
}
I wrote an offset kernel. Is it necessary to put __device__ in front of int nY, int nX, float* imIn and unsigned short* imRef (because the kernel will be executed on the device)?
//kernel.cu
__global__ void KERNEL_offset(CImg<float>& Input, CImg<unsigned short>& Off)
{
int nY = Input.height(); //__device__ ??
int nX = Input.width(); //__device__ ??
float* imIn = static_cast<float*>(Input.data()); //__device__ ??
unsigned short* imRef = static_cast<unsigned short*>(Off.data()); //__device__ ??
int i=blockDim.y*blockIdx.y+threadIdx.y; // row index
int j=blockDim.x*blockIdx.x+threadIdx.x; // column index
if((i>=nY) || (j>=nX)) return; // thread is outside the image
int niY=i*nX; //__device__ ??
imIn[j+niY] = imIn[j+niY] - static_cast<float>(imRef[j+niY]);
}
data() will be called in the kernel, but it is a host function defined in CImg.h. Is it a good idea to put __host__ __device__ in front of the function (in CImg.h)?
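One way to sidestep both questions, as far as I can tell, is to call data(), width() and height() on the host and hand the kernel raw device pointers plus the sizes by value. Plain local variables inside a kernel are already per-thread and need no qualifier. A minimal sketch, assuming the images are copied to the device for each call; the names KERNEL_offsetRaw and offsetOnDevice are hypothetical:

```cuda
#include <cuda_runtime.h>

// Kernel working on raw device pointers; sizes are passed by value.
__global__ void KERNEL_offsetRaw(float* imIn, const unsigned short* imRef,
                                 int nX, int nY)
{
    int x = blockDim.x * blockIdx.x + threadIdx.x;  // column
    int y = blockDim.y * blockIdx.y + threadIdx.y;  // row
    if (x >= nX || y >= nY) return;                 // outside the image
    int idx = x + y * nX;
    imIn[idx] -= static_cast<float>(imRef[idx]);
}

// Hypothetical host wrapper: copies both images to the device, runs the
// kernel, and copies the result back into hostIn.
// Returns false if any CUDA call fails.
bool offsetOnDevice(float* hostIn, const unsigned short* hostRef, int nX, int nY)
{
    size_t n = static_cast<size_t>(nX) * nY;
    float* d_in = nullptr;
    unsigned short* d_ref = nullptr;
    bool ok = cudaMalloc(&d_in, n * sizeof(float)) == cudaSuccess
           && cudaMalloc(&d_ref, n * sizeof(unsigned short)) == cudaSuccess
           && cudaMemcpy(d_in, hostIn, n * sizeof(float),
                         cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(d_ref, hostRef, n * sizeof(unsigned short),
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        dim3 Db(16, 16);  // assumed block size
        dim3 Dg((nX + Db.x - 1) / Db.x, (nY + Db.y - 1) / Db.y);
        KERNEL_offsetRaw<<<Dg, Db>>>(d_in, d_ref, nX, nY);
        // The blocking copy back also waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(hostIn, d_in, n * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_in);   // cudaFree on a null pointer does nothing
    cudaFree(d_ref);
    return ok;
}
```

With CImg the call would be offsetOnDevice(fpIn.data(), ucOff.data(), fpIn.width(), fpIn.height()), so CImg.h itself stays untouched. In a real chain the reference image would presumably be uploaded once and kept on the device rather than copied on every call.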
Also, is it possible to create a kernel that calls three others?
//kernel.cu
__global__ void KERNELfnCleanImage(CImg<float>& fpIn, CImg<unsigned short>& ucOff)
{
KERNEL_offset<<<Dg,Db>>>(d_fpIn,d_m_CimagefRefOffset);
KERNEL_gain<<<Dg,Db>>>(d_fpIn,d_m_CimagefRefCompressedGain);
KERNEL_defect<<<Dg,Db>>>(d_fpIn,d_m_CimagefRefDM);
}
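From what I have read, launching kernels from inside a kernel needs CUDA dynamic parallelism (compute capability 3.5 or later, compiled with relocatable device code). The simpler alternative seems to be an ordinary host function that launches the stages one after another: launches issued to the same stream run in issue order. A minimal sketch, with two toy kernels standing in for the real stages (their names and bodies are placeholders, not my actual _offset and _gain):

```cuda
#include <cuda_runtime.h>

// Toy stage 1 (placeholder): subtract a constant from every element.
__global__ void stageSubtract(float* data, int n, float value)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) data[i] -= value;
}

// Toy stage 2 (placeholder): multiply every element by a constant.
__global__ void stageScale(float* data, int n, float factor)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Host-side chain on a buffer that already lives on the device.
// Both launches go to the default stream, so stageScale starts only
// after stageSubtract has finished.
bool runChain(float* d_data, int n)
{
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    stageSubtract<<<blocks, threads>>>(d_data, n, 1.0f);
    stageScale<<<blocks, threads>>>(d_data, n, 2.0f);
    // Synchronize so that execution errors are reported here.
    return cudaGetLastError() == cudaSuccess
        && cudaDeviceSynchronize() == cudaSuccess;
}
```

The image stays on the device between the stages, so only one upload before the chain and one download after it are needed.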
Thanks!