Calling CUDA functions from cpp files - scope of variables

Hello all,

I am having a hard time (>80 hours so far) figuring out a solution to a problem. I am trying to speed up a non-linear optimization routine with CUDA. I’ve already verified that the CUDA kernel works (very fast), and that the NL optimization also works when passed a CPU-based cost function.

What I need to do is execute the kernel from within the .cpp code. I actually have this working using a wrapper function. For example, in my .cpp file, I have a function that the optimizer calls to compute a fitness value:

double f(vnl_vector<double> const &x)
	{
		hpoint[0] = (float)x(0);
		hpoint[1] = (float)x(1);

		ExecuteCostfunction(hpoint, hcost, dpoint, dcost);

		return (double)hcost[0];				

	}
ExecuteCostfunction is in the .cu file and is a wrapper around the kernel.

extern "C"
float ExecuteCostfunction(float * hpoint, float* hcost, float* dpoint, float* dcost)
{
	cudaError_t cudaStatus;

	cudaStatus = cudaMemcpy(dpoint, hpoint, 2*sizeof(float), cudaMemcpyHostToDevice);
	if (cudaStatus != cudaSuccess){
		fprintf(stderr, "Copying the point to the device failed!");
		return -1;
	}

	CostFunction<<<1,1>>>(dpoint, dcost);

	cudaStatus = cudaDeviceSynchronize();
	if (cudaStatus != cudaSuccess) 
	{
		fprintf(stderr, "cudaDeviceSynchronize Error in cost function!");
		return -1;
		
	}

	cudaStatus = cudaMemcpy(hcost, dcost, 1*sizeof(float), cudaMemcpyDeviceToHost);
	if (cudaStatus != cudaSuccess) 
	{
		fprintf(stderr, "Copying the cost to the host failed!");
		return -1;
	}
	
	return hcost[0];
}
__global__ void CostFunction(float* point, float* cost)
{

	float x = point[0];
	float y = point[1];

	//Move to center of image
	x = x+((512-1)/2);
	y = y+((512-1)/2);

	//Normalize Coordinates
	x = x / 512;
    y = y / 512;

	cost[0] = tex2D(texI,x,y);

}

This code does not work: it fails when I try to copy the host data to the device in ExecuteCostfunction. However, I can get it to work if I allocate the memory for all of the device pointers inside that function on every call. In the end, though, that is going to be a huge waste of time. I’m wondering how to keep all of the variables persisting between the .cpp and .cu code. In theory, I should be able to write one wrapper function that copies all the variables to the device (including a 1024*1024 image), and then have a separate wrapper that simply “crunches the data” (runs the kernel) over and over until the optimization is complete. However, when I do it this way, the data doesn’t seem to persist.
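In other words, the call pattern I am after is roughly this (just a sketch; the wrapper names are the ones used in the code below, and the loop condition is a stand-in for the optimizer’s stopping test):

	// One-time setup: allocate the device buffers and upload the image.
	if (InitializeGPU(himage, dimage, dcost, dpoint, iW, iH) != 0)
		return -1;

	// Optimization loop: only the tiny point/cost transfers and the kernel run here.
	while (!converged)                       // stand-in for the optimizer's own loop
	{
		hpoint[0] = currentX;                // current parameter guess (stand-in names)
		hpoint[1] = currentY;
		float c = ExecuteCostfunction(hpoint, hcost, dpoint, dcost);
		// ... the optimizer updates currentX / currentY based on c ...
	}

	// Tear-down once at the end.
	CleanUpGPU(dimage, dcost, dpoint);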

For example, here is some more code. In this wrapper function, I transfer host variables (including an image) to the GPU:

.cpp file:

int Initialize()
	{
		himage = (float*)malloc(iW*iH*sizeof(float));
		hpoint = (float*)malloc(2*sizeof(float));
		hcost  = (float*)malloc(1*sizeof(float));

		int rv = InitializeGPU(himage, dimage, dcost, dpoint, iW, iH);
		fprintf(stderr, "CXX - Host Address - %p : Device Address - %p \n", himage, dimage);
		return rv;
	}

.cu file

extern "C"
int InitializeGPU(float* himage, cudaArray* dimage, float* dcost, float* dpoint, int W, int H)
{
	cudaError_t cudaStatus;
	cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
	
	//Allocate the image on the GPU
	cudaStatus = cudaMallocArray(&dimage,&channelDesc,W,H);
	if (cudaStatus != cudaSuccess) 
	{
        fprintf(stderr, "Allocating the image array failed!");
		return -1;
    }

	// Set texture parameters
    texI.addressMode[0] = cudaAddressModeClamp;
    texI.addressMode[1] = cudaAddressModeClamp;
    texI.filterMode     = cudaFilterModeLinear;
    texI.normalized     = true;    // access with normalized texture coordinates

	//Bind the texture to the array
	cudaStatus = cudaBindTextureToArray(texI, dimage,  channelDesc);
	if (cudaStatus != cudaSuccess) 
	{
        fprintf(stderr, "Binding the texture failed!");
		return -1;
    }

	cudaStatus = cudaMalloc((void **) &dcost, 1*sizeof(float));
	if (cudaStatus != cudaSuccess){
		fprintf(stderr, "Creating a device cost failed!");
		return -1;
	}

	cudaStatus = cudaMalloc((void **) &dpoint, 2*sizeof(float));
	if (cudaStatus != cudaSuccess){
		fprintf(stderr, "Creating a device cost failed!");
		return -1;
	}

	//Send the image to the GPU
	fprintf(stderr, "KERNEL Host Address - %p : Device Address - %p \n", himage, dimage);
	
	cudaStatus = cudaMemcpyToArray(dimage, 0,0,himage, W*H*sizeof(float), cudaMemcpyHostToDevice);
	if (cudaStatus != cudaSuccess) 
	{
        fprintf(stderr, "Memcpy to image array failed!");
		return -1;
    }


	return 0;
}

You can notice that each function has a line that prints the pointer addresses. I check the device pointer address during the execution of the wrapper function (InitializeGPU, in the .cu file) and again right after it in the .cpp code. What I notice is that the device address does not persist - it changes once the wrapper has finished.

Any advice on how to solve my problem would be appreciated.

If you want a pointer that has been allocated (i.e. modified) in a function to show up with the modified value in the calling environment, you cannot pass the bare pointer. You must either pass the address of the pointer (so your function parameter would be float ** dcost, for example, and your function would have to be suitably modified to handle the double pointer), or else you need to pass the pointer by reference to the function:

InitializeGPU(my_himage, &my_dimage, &my_dcost, ...

To some degree this is a guess, because you don’t actually show the invocation of any of the functions you are having trouble with, but I think this is likely the issue. It’s generally a lot more helpful if you show a complete but simplified example, rather than showing bits and pieces of the stuff you think may be important. By the way, what I’m suggesting here is purely a C/C++ programming issue. It has nothing to do with CUDA.
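To illustrate, here is a trimmed-down sketch of the two variants, using hypothetical AllocCost/AllocCostRef functions and a plain float* with cudaMalloc (not your actual buffers):

	// Variant 1: pass the address of the pointer (double pointer)
	extern "C"
	int AllocCost(float **dcost)            // caller passes &my_dcost
	{
		return (cudaMalloc((void **)dcost, sizeof(float)) == cudaSuccess) ? 0 : -1;
	}

	// Variant 2: pass the pointer by reference (C++ only)
	extern "C"
	int AllocCostRef(float * &dcost)        // caller passes my_dcost directly
	{
		return (cudaMalloc((void **)&dcost, sizeof(float)) == cudaSuccess) ? 0 : -1;
	}

	// Call site (my_dcost is a hypothetical name):
	//   float *my_dcost = 0;
	//   AllocCost(&my_dcost);     // my_dcost now holds the device address
	//   AllocCostRef(my_dcost);   // same effect, different signature

In both cases the allocation done inside the function is visible to the caller afterwards; passing the bare pointer by value is what makes it disappear.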

I think I might get what you’re saying. I will post all of my code here:

.cpp

#include <iostream>
#include <stdlib.h>
#include <stdio.h>
#include <cmath>
#include <cuda_runtime.h>

#include <vnl/vnl_vector.h>
#include <vnl/vnl_cost_function.h>

#include <vnl/algo/vnl_lbfgs.h> //limited memory BFGS algorithm (general uncontrained optimization)
#include <vnl/algo/vnl_lbfgsb.h>//constrained BFGS
#include <vnl/algo/vnl_amoeba.h>

using namespace std;

//extern "C"
//void SetConstantDataGPU(float* camera);

extern "C"
int InitializeGPU(float* himage, cudaArray* dimage, float* dcost, float* dpoint, int W, int H);

extern "C"
int ReInitializeGPU(float* himage, cudaArray* dimage, int W, int H);

extern "C"
void CleanUpGPU(cudaArray* dimage, float* dcost, float* dpoint);

extern "C"
float ExecuteCostfunction(float * hpoint, float* hcost, float* dpoint, float* dcost);

//void BFGS();
//void BFGSB();
void Amoeba();

int main()
{
	//BFGS();
	//BFGSB();
	Amoeba();
	return 0;
}

class DistanceFunction : public vnl_cost_function
{
public:

	
	float* himage;
	cudaArray* dimage;
	float* hcost;
	float* dcost;
	float* hpoint;
	float* dpoint;
	float fW;
	float fH;
	int   iW;
	int   iH;
	
	DistanceFunction(const int NumVars) : vnl_cost_function(NumVars)
	{
		iW=512;
		iH=512;
		fW=512.0;
		fH=512.0;
	}

	double f(vnl_vector<double> const &x)
	{
		hpoint[0] = (float)x(0);
		hpoint[1] = (float)x(1);

		ExecuteCostfunction(hpoint, hcost, dpoint, dcost);

		return (double)hcost[0];				

	}

	void gradf(vnl_vector<double> const &x, vnl_vector<double> &dx)
	{	
		/*
		//specify manually
		dx(0) = 2*(x(0) - 2);
		dx(1) = 2*(x(1) - 4);
		*/

		//OR, use Finite Difference gradient
		fdgradf(x, dx);
	}

	void MakeCostImage()
	{

		float mX = fW/2;
		float mY = fH/2;
		float val=0.0;
		int k=0;
		for(float i=0; i<fW; i++)
		{
			for(float j=0; j<fH; j++)
			{
				val = sqrt((i-mX)*(i-mX) + (j-mY)*(j-mY));
				himage[k] = val;
				k++;
			}
		}
	}

	int Initialize()
	{
		int rv = 0;
		himage = (float*)malloc(iW*iH*sizeof(float));
		hpoint = (float*)malloc(2*sizeof(float));
		hcost  = (float*)malloc(1*sizeof(float));

		fprintf(stderr, "1. CXX - Host Address - %p : Device Address - %p \n", himage, dimage);
		rv = InitializeGPU(himage, dimage, dcost, dpoint, iW, iH);
		fprintf(stderr, "3. CXX - Host Address - %p : Device Address - %p \n", himage, dimage);
		return rv;
	}

	int ReInitialize()
	{
		MakeCostImage();
		hpoint[0] = 0;
		hpoint[1] = 0;
		hcost[0]  = 0;
		fprintf(stderr, "CXX - Host Address - %p : Device Address - %p \n", himage, dimage);
		int rv = ReInitializeGPU(himage, dimage, iW, iH);
		fprintf(stderr, "CXX - Host Address - %p : Device Address - %p \n", himage, dimage);
		return rv;
	}

	void CleanUp()
	{
		free(himage);
		free(hpoint);
		free(hcost);
		CleanUpGPU(dimage, dcost, dpoint);
	}

};

void Amoeba()
{
	int ok = 0;
	DistanceFunction Distance(2);
	ok = Distance.Initialize();
	if(ok==-1){Distance.CleanUp();cout << "Error, exiting\n"; return;}


	vnl_amoeba Minimizer(Distance);
	//Minimizer.set_f_tolerance(1e-3);
	Minimizer.set_f_tolerance(1.0);
	Minimizer.set_max_iterations (5);

	//you can define the scale of the space ???
	vnl_vector<double> dx(2);
	dx(0) = 1.0;
	dx(1) = 1.0;

	vnl_vector<double> x0(2);
	x0(0) = 10;
	x0(1) = 5;

	cout << "Started at: " << x0 << endl;
	//Minimizer.minimize(x0);
	Minimizer.minimize(x0, dx);

	cout << "Ended at: " << x0 << endl;	

	cout << "NumEvals: " << Minimizer.get_num_evaluations() << endl;
	
	Distance.CleanUp();
}
.cu

#define N 65536
#define TPB 512
#define TPBJ 512
#define SHRSIZE 512


#include <cuda_runtime.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

texture<float, 2, cudaReadModeElementType> texI;
//texture<float, 2, cudaReadModeElementType> texdIdu;
//texture<float, 2, cudaReadModeElementType> texdIdv;

__global__ void CostFunction(float* point, float* cost)
{

	float x = point[0];
	float y = point[1];

	//Move to center of image
	x = x+((512-1)/2);
	y = y+((512-1)/2);

	//Normalize Coordinates
	x = x / 512;
    y = y / 512;

	cost[0] = tex2D(texI,x,y);

}

extern "C"
float ExecuteCostfunction(float * hpoint, float* hcost, float* dpoint, float* dcost)
{
	cudaError_t cudaStatus;

	cudaStatus = cudaMemcpy(dpoint, hpoint, 2*sizeof(float), cudaMemcpyHostToDevice);
	if (cudaStatus != cudaSuccess){
		fprintf(stderr, "Copying the point to the device failed!");
		return -1;
	}

	CostFunction<<<1,1>>>(dpoint, dcost);

	cudaStatus = cudaDeviceSynchronize();
	if (cudaStatus != cudaSuccess) 
	{
		fprintf(stderr, "cudaDeviceSynchronize Error in cost function!");
		return -1;
		
	}

	cudaStatus = cudaMemcpy(hcost, dcost, 1*sizeof(float), cudaMemcpyDeviceToHost);
	if (cudaStatus != cudaSuccess) 
	{
		fprintf(stderr, "Copying the cost to the host failed!");
		return -1;
	}
	
	return hcost[0];
}


extern "C"
int InitializeGPU(float* himage, cudaArray* dimage, float* dcost, float* dpoint, int W, int H)
{
	cudaError_t cudaStatus;
	cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
	
	//Allocate the image on the GPU
	cudaStatus = cudaMallocArray(&dimage,&channelDesc,W,H);
	if (cudaStatus != cudaSuccess) 
	{
        fprintf(stderr, "Allocating the image array failed!");
		return -1;
    }

	// Set texture parameters
    texI.addressMode[0] = cudaAddressModeClamp;
    texI.addressMode[1] = cudaAddressModeClamp;
    texI.filterMode     = cudaFilterModeLinear;
    texI.normalized     = true;    // access with normalized texture coordinates

	//Bind the texture to the array
	cudaStatus = cudaBindTextureToArray(texI, dimage,  channelDesc);
	if (cudaStatus != cudaSuccess) 
	{
        fprintf(stderr, "Binding the texture failed!");
		return -1;
    }

	cudaStatus = cudaMalloc((void **) &dcost, 1*sizeof(float));
	if (cudaStatus != cudaSuccess){
		fprintf(stderr, "Creating a device cost failed!");
		return -1;
	}

	cudaStatus = cudaMalloc((void **) &dpoint, 2*sizeof(float));
	if (cudaStatus != cudaSuccess){
		fprintf(stderr, "Creating a device cost failed!");
		return -1;
	}

	//Send the image to the GPU
	fprintf(stderr, "2. KERNEL Host Address - %p : Device Address - %p \n", himage, dimage);
	
	cudaStatus = cudaMemcpyToArray(dimage, 0,0,himage, W*H*sizeof(float), cudaMemcpyHostToDevice);
	if (cudaStatus != cudaSuccess) 
	{
        fprintf(stderr, "Memcpy to image array failed!");
		return -1;
    }


	return 0;
}


extern "C"
void CleanUpGPU(cudaArray* dimage, float* dcost, float* dpoint)
{
	cudaFree(dcost);
	cudaFree(dpoint);
	cudaFreeArray(dimage); 	
}

Lines 115-117 of the .cpp code are where my test is. The output (see the fprintf statements) is:

1. CXX - Host Address - 0x7f419c740010 : Device Address - 0x626c20
2. KERNEL Host Address - 0x7f419c740010 : Device Address - 0x22636d0
3. CXX - Host Address - 0x7f419c740010 : Device Address - 0x626c20

You can see that the device address is different in the .cu code than in the .cpp code. That’s why I’m confused.

OK I’m not going to rewrite your code for you, but for that one test case, try making this change:

from:

extern "C"
int InitializeGPU(float* himage, cudaArray* dimage, float* dcost, float* dpoint, int W, int H){

to:

extern "C"
int InitializeGPU(float* himage, cudaArray* &dimage, float* dcost, float* dpoint, int W, int H){

and re-run that test case.

Take a look at this:
http://stackoverflow.com/questions/10240161/reason-to-pass-a-pointer-by-reference-in-c
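For what it’s worth, a rough sketch of how that change plays out end to end (only dimage is shown, and only the relevant lines; dcost and dpoint are allocated inside InitializeGPU too, so they would eventually need the same treatment):

	// .cu file: the & makes cudaMallocArray write into the caller's pointer
	extern "C"
	int InitializeGPU(float* himage, cudaArray* &dimage, float* dcost, float* dpoint, int W, int H)
	{
		cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
		cudaError_t cudaStatus = cudaMallocArray(&dimage, &channelDesc, W, H);   // caller's dimage now holds the device array
		...
	}

	// .cpp file: the forward declaration must use the same signature
	extern "C"
	int InitializeGPU(float* himage, cudaArray* &dimage, float* dcost, float* dpoint, int W, int H);

	// the call site itself does not change:
	rv = InitializeGPU(himage, dimage, dcost, dpoint, iW, iH);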

I’ll give that a go. I also found a reference, based on your reply, that suggests doing it another way:

cudaArray* dimage → cudaArray** dimage

Is that the same thing as this?

cudaArray* dimage → cudaArray* &dimage

OK, that seems to have worked, in the sense that it is now getting the correct pointer address. I’m still a bit confused about what is actually happening. My assumption is that the address of the pointer, rather than the pointer itself, is being passed.

Now I am having a little trouble because it seems that the image is not actually being read from the texture. The texture-read part of the code should be solid; I’ve been using it successfully for a while.

Is it possible that this part of “InitializeGPU” needs to be changed?

cudaStatus = cudaMemcpyToArray(dimage, 0,0,himage, W*H*sizeof(float), cudaMemcpyHostToDevice);

to

cudaStatus = cudaMemcpyToArray(*dimage, 0,0,himage, W*H*sizeof(float), cudaMemcpyHostToDevice);

?

That’s my best guess, but it doesn’t compile:
error: no suitable conversion function from “cudaArray” to “cudaArray_t” exists
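For reference, cudaMemcpyToArray expects a cudaArray* (i.e. a cudaArray_t), so which expression to pass depends on which signature was adopted; roughly:

	// reference form:       int InitializeGPU(..., cudaArray* &dimage, ...)
	cudaStatus = cudaMemcpyToArray(dimage,  0, 0, himage, W*H*sizeof(float), cudaMemcpyHostToDevice);

	// double-pointer form:  int InitializeGPU(..., cudaArray** dimage, ...)
	cudaStatus = cudaMemcpyToArray(*dimage, 0, 0, himage, W*H*sizeof(float), cudaMemcpyHostToDevice);

With the reference form, *dimage dereferences the cudaArray* down to a cudaArray object, which is where the “no suitable conversion from cudaArray to cudaArray_t” error comes from.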

The problem is solved; it’s all working now (I had a minor unrelated bug). Your suggestion was correct, thanks for the help.