Hello all,
I am having a hard time (>80 hours so far) figuring out a solution to a problem. I am trying to speed up a non-linear optimization routine with CUDA. I've already verified that the CUDA kernel works (very fast), and that the NL optimization also works when passed a CPU-based function.
What I need to do is execute the kernel from within the .cpp code. I actually have this working using a wrapper function. For example, in my .cpp file, I have a function that is used by the optimizer to compute a fitness value:
double f(vnl_vector<double> const &x)
{
    hpoint[0] = (float)x(0);
    hpoint[1] = (float)x(1);
    ExecuteCostfunction(hpoint, hcost, dpoint, dcost);
    return (double)hcost[0];
}
ExecuteCostfunction is defined in the .cu file and is a wrapper around the kernel:
extern "C"
float ExecuteCostfunction(float * hpoint, float* hcost, float* dpoint, float* dcost)
{
cudaError_t cudaStatus;
cudaStatus = cudaMemcpy(dpoint, hpoint, 2*sizeof(float), cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess){
fprintf(stderr, "Copying the point to the device failed!");
return -1;
}
CostFunction<<<1,1>>>(dpoint, dcost);
cudaStatus = cudaDeviceSynchronize();
if (cudaStatus != cudaSuccess)
{
fprintf(stderr, "cudaDeviceSynchronize Error in cost function!");
return -1;
}
cudaStatus = cudaMemcpy(hcost, dcost, 1*sizeof(float), cudaMemcpyDeviceToHost);
if (cudaStatus != cudaSuccess)
{
fprintf(stderr, "Copying the cost to the host failed!");
return -1;
}
return hcost[0];
}
__global__ void CostFunction(float* point, float* cost)
{
    float x = point[0];
    float y = point[1];

    // Move to the center of the image
    x = x + ((512-1)/2);
    y = y + ((512-1)/2);

    // Normalize coordinates
    x = x / 512;
    y = y / 512;

    cost[0] = tex2D(texI, x, y);
}
This code does not work: it fails when I try to copy the host data to the device in ExecuteCostfunction. However, I can get it to work if I allocate the device memory inside the wrapper on every call. In the end, though, that would be a huge waste of time. I'm wondering how to keep all of the variables persisting between the .cpp and .cu code. In theory, I should be able to write one wrapper function that copies all of the variables to the device (including a 1024*1024 image), and then a separate wrapper that simply "crunches the data" (runs the kernel) over and over until the optimization is complete. However, when I do it this way, the data doesn't seem to persist.
For example, here is some more code. In this wrapper function, I transfer host variables (including an image) to the GPU:
.cpp file:
int Initialize()
{
    himage = (float*)malloc(iW*iH*sizeof(float));
    hpoint = (float*)malloc(2*sizeof(float));
    hcost  = (float*)malloc(1*sizeof(float));
    int status = InitializeGPU(himage, dimage, dcost, dpoint, iW, iH);
    fprintf(stderr, "CXX - Host Address - %p : Device Address - %p \n", himage, dimage);
    return status;
}
.cu file:
extern "C"
int InitializeGPU(float* himage, cudaArray* dimage, float* dcost, float* dpoint, int W, int H)
{
cudaError_t cudaStatus;
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
//Allocate the image on the GPU
cudaStatus = cudaMallocArray(&dimage,&channelDesc,W,H);
if (cudaStatus != cudaSuccess)
{
fprintf(stderr, "Allocating the image array failed!");
return -1;
}
// Set texture parameters
texI.addressMode[0] = cudaAddressModeClamp;
texI.addressMode[1] = cudaAddressModeClamp;
texI.filterMode = cudaFilterModeLinear;
texI.normalized = true; // access with normalized texture coordinates
//Bind the texture to the array
cudaStatus = cudaBindTextureToArray(texI, dimage, channelDesc);
if (cudaStatus != cudaSuccess)
{
fprintf(stderr, "Binding the texture failed!");
return -1;
}
cudaStatus = cudaMalloc((void **) &dcost, 1*sizeof(float));
if (cudaStatus != cudaSuccess){
fprintf(stderr, "Creating a device cost failed!");
return -1;
}
cudaStatus = cudaMalloc((void **) &dpoint, 2*sizeof(float));
if (cudaStatus != cudaSuccess){
fprintf(stderr, "Creating a device cost failed!");
return -1;
}
//Send the image to the GPU
fprintf(stderr, "KERNEL Host Address - %p : Device Address - %p \n", himage, dimage);
cudaStatus = cudaMemcpyToArray(dimage, 0,0,himage, W*H*sizeof(float), cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess)
{
fprintf(stderr, "Memcpy to image array failed!");
return -1;
}
return 0;
}
Notice that each function contains a line that prints the pointer addresses. I check the device pointer address during execution of the wrapper function (InitializeGPU, in the .cu file) and again right afterward in the .cpp code. What I notice is that the device address does not persist: it changes once the wrapper returns.
Any advice on how to solve my problem would be appreciated.