Modularizing code: dividing large .cpp and .cu files into several small files

The project I am working on has several ideas and directions we want to explore, but the general approach is the same for all of them: we are simulating environmental phenomena using combined advection, diffusion, and reaction dynamics. Currently all of the code is folded into one .cpp file and one .cu file that contain every function. We are unsure of the best way to structure the code and how to deal with issues such as latency and asynchronous computing, so we are developing several versions of the code to test different methods, then adding flags to tell the code which method we want to use.

I am pushing to have the code modularized, i.e. have a central file that never changes and a separate file for each piece of the code: the graphics, the advection, the diffusion, the reaction, etc. This way we can all modify the code in parallel, then drop the updated modules into the main source folder, overwriting the module file of the same name. This will let us update the source quickly without having to dig through a massive .cu or .cpp file to find every place we want to change; a rough sketch of the layout is below.
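Roughly, the layout we have in mind looks like this (the file names are just placeholders, not our real ones):

project/
  main.cu       // central file: memory allocation, texture binding, kernel launches
  graphics.h    // rendering module
  advection.h   // advection kernels
  diffusion.h   // diffusion kernels (d_diffuse_tex would live here eventually)
  reaction.h    // reaction kernels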

We are having some problems pulling the current code apart into modules. Several of the subroutines use texture memory which is allocated and bound on the CPU, while the function itself runs on the GPU. If we allocate and bind the texture memory in the central .cu file, the modules (currently .h files) do not recognize references to those textures. If we allocate and bind the texture in the module files, the compiler complains that we are attempting to call a host function (cudaMemcpyToArray) from a device function.

Is there a way to allocate and bind the texture memory in the central .cu file and then simply pass this information to the modules? Or does anybody have any other suggestions? We are working from examples in books and the CUDA example code that came with 4.0, modifying it where necessary, but none of it is set up with this module architecture. I know this is not common practice, but I believe it will help organize our files and allow for quicker code iteration; we can combine everything into a final version once we are happy with the results.

Some pieces of code are pasted below:

This allocates the texture memory:

texture<float, 2> tex;
texture<float, 2> tex1;
texture<float, 2> tex2;

cudaArray* d_float_array;
cudaArray* d1_float_array;
cudaArray* d2_float_array;

extern "C" 

void initTexture(int width, int height)

{

  // allocate floating-point Cuda array

  cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

  cudaChannelFormatDesc channelDesc1 = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

  cudaChannelFormatDesc channelDesc2 = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

CUDA_SAFE_CALL( cudaMallocArray  ( &d_float_array, &channelDesc, width, height ));

  CUDA_SAFE_CALL( cudaMallocArray  ( &d1_float_array, &channelDesc1, width, height )); 

  CUDA_SAFE_CALL( cudaMallocArray  ( &d2_float_array, &channelDesc2, width, height )); 

// set texture parameters

  tex.addressMode[0] = cudaAddressModeClamp;

  tex.addressMode[1] = cudaAddressModeClamp;

  tex.filterMode = cudaFilterModePoint;

  tex.normalized = false;

tex1.addressMode[0] = cudaAddressModeClamp;

  tex1.addressMode[1] = cudaAddressModeClamp;

  tex1.filterMode = cudaFilterModePoint;

  tex1.normalized = false;

tex2.addressMode[0] = cudaAddressModeClamp;

  tex2.addressMode[1] = cudaAddressModeClamp;

  tex2.filterMode = cudaFilterModePoint;

  tex2.normalized = false;

}

extern "C"

void freeTextures()

{

  CUDA_SAFE_CALL(cudaFreeArray(d_float_array));

  CUDA_SAFE_CALL(cudaFreeArray(d1_float_array));

  CUDA_SAFE_CALL(cudaFreeArray(d2_float_array));

}

This is an example of what we currently have in our main .cu file:

// copy the current field into the array and bind it to the texture before launching
CUDA_SAFE_CALL( cudaMemcpyToArray( d_float_array, 0, 0, a, width * height * sizeof(float), cudaMemcpyDeviceToDevice ));
CUDA_SAFE_CALL( cudaBindTextureToArray( tex, d_float_array ));

// one thread per row; assumes height is a multiple of nthreads
d_diffuse_tex<<< height / nthreads, nthreads, 0 >>>( a, ftemp, width, height, Db1, dt, hh );

// diffuse in both directions, texture version
// (out-of-range values are clamped to edge)
__global__ void
d_diffuse_tex(float *grid, float *ftemp, int w, int h, float alpha, float dt, float hh)
{
  Clamp_Memory(grid, w, h);   // helper defined elsewhere in our code

  // each thread sweeps one row of the grid
  int y = blockIdx.x*blockDim.x + threadIdx.x;
  grid  = &grid[y*w];
  ftemp = &ftemp[y*w];

  // 5-point Laplacian, read through the texture
  for (int x = 0; x < w; x++) {
    ftemp[x] = tex2D(tex, x-1, y) + tex2D(tex, x+1, y) +
               tex2D(tex, x, y-1) + tex2D(tex, x, y+1) - 4.0f * tex2D(tex, x, y);
  }

  // explicit Euler update of the diffusion term
  for (int x = 0; x < w; x++) {
    grid[x] += dt * (alpha/(hh*hh)) * ftemp[x];
  }
}

Currently d_diffuse_tex is in the main .cu file as well. We would like to move this function into its own file, .cu or .h, but then it cannot access the texture. If we move the texture initialization code into the d_diffuse_tex file instead, the compiler complains that cudaMemcpyToArray is a host function being called from a device function.

I know there was quite a bit of redundancy in this post, but hopefully it got the point across.

Any suggestions would be great!

Move the texture declaration up before the #includes?

I know this is considered really bad style in CPU programming. However, since CUDA has no device-code linker, slamming all of the code into a single translation unit (whether doing it by hand or using the preprocessor) is the only clean solution (unless you are willing to consider dirty pointer tricks). A minimal sketch of what I mean is below.
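Here is the pattern as a minimal sketch (the file and kernel names are made up, not your actual modules): the main .cu is the only file handed to nvcc, everything else is pulled in textually, and the texture declarations come first so every module that follows can see them.

// textures.cuh -- texture reference, declared exactly once
texture<float, 2> tex;

// smooth.cuh -- a module; it can use tex because main.cu
// textually includes it after textures.cuh
__global__ void d_smooth_tex(float *out, int w, int h)
{
  int y = blockIdx.x*blockDim.x + threadIdx.x;
  for (int x = 0; x < w; x++)
    out[y*w + x] = 0.25f * (tex2D(tex, x-1, y) + tex2D(tex, x+1, y) +
                            tex2D(tex, x, y-1) + tex2D(tex, x, y+1));
}

// main.cu -- the single translation unit nvcc compiles
#include "textures.cuh"
#include "smooth.cuh"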

That sounds like it might work, let me give it a shot.

Thanks!

So to make it not-so-bad, I moved the texture declaration to its own header file and put its #include before the others. That seems to do the trick!
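For reference, in case anyone else runs into this, the top of our main .cu now looks roughly like this (the module names are ours; the point is only the include order):

// textures.cuh -- nothing but the texture and array declarations
texture<float, 2> tex;
texture<float, 2> tex1;
texture<float, 2> tex2;
cudaArray* d_float_array;
cudaArray* d1_float_array;
cudaArray* d2_float_array;

// main.cu
#include "textures.cuh"   // must come first so the modules below see the textures
#include "diffusion.h"    // d_diffuse_tex lives here now
#include "advection.h"
#include "reaction.h"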

Thank you for your help!