The project I am working on has several possible directions, but the core idea is the same for all of them: we are simulating environmental phenomena using combined advection, diffusion, and reaction dynamics. Currently the code is all folded into one .cpp file and one .cu file that hold all of the functions. We are unsure of the best way to structure the code and how to deal with issues such as latency and asynchronous execution, so we are developing several versions to test different methods, with flags to tell the code which method to use.
I am pushing to have the code modularized, i.e. a central file that never changes plus a separate file for each piece of the code: the graphics, the advection, the diffusion, the reaction, and so on. That way we can all modify the code in parallel and then drop updated modules into the main source folder, overwriting the module file of the same name. This lets us update the source quickly without digging through a massive .cu or .cpp file to find every place that needs changing.
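As a sketch of the layout we are aiming for, the central file would reduce to little more than a list of includes plus the driver code (the module file names here are placeholders):

// main.cu -- central file, ideally never edited
#include "graphics.h"
#include "advection.h"
#include "diffusion.h"
#include "reaction.h"
// driver / glue code only below this point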
We are having some problems pulling the current code apart into modules. Several of the subroutines use texture memory that is allocated and bound on the host, while the kernel itself runs on the GPU. If we allocate and bind the texture memory in the central .cu file, the modules (currently .h files) do not recognize references to those textures. If we instead allocate and bind the textures in the module files, the compiler complains that we are calling a host function (cudaMemcpyToArray) from a device function.
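To make the two errors concrete, here is a stripped-down sketch of the arrangements that fail for us (diffusion.h is a placeholder name):

// Case 1: texture declared in the central .cu file, kernel in the module
// diffusion.h
__global__ void d_diffuse_tex(float *grid, float *ftemp, int w, int h)
{
    float v = tex2D(tex, 0, 0);   // error: 'tex' was not declared in this scope
}

// Case 2: allocation and binding moved into the module's kernel
// diffusion.h
texture<float, 2> tex;
__global__ void d_diffuse_tex(float *grid, cudaArray *arr, int w, int h)
{
    // error: a host function (cudaMemcpyToArray) cannot be called
    // from a __device__/__global__ function
    cudaMemcpyToArray(arr, 0, 0, grid, w * h * sizeof(float),
                      cudaMemcpyDeviceToDevice);
}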
Is there a way to allocate and bind the texture memory in the central .cu file and then simply pass that information to the modules? Or does anybody have other suggestions? We are working from examples in books and from the sample code that ships with CUDA 4.0, modifying it where necessary, but none of it is set up with this module architecture. I know this is not common practice, but I believe it will help organize our files and allow quicker code iterations; we can combine everything into final form once we are happy with the results.
Some pieces of code are pasted below:
This allocates the texture memory:
texture<float, 2> tex;
texture<float, 2> tex1;
texture<float, 2> tex2;
cudaArray* d_float_array;
cudaArray* d1_float_array;
cudaArray* d2_float_array;
extern "C"
void initTexture(int width, int height)
{
// allocate floating-point Cuda array
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
cudaChannelFormatDesc channelDesc1 = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
cudaChannelFormatDesc channelDesc2 = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
CUDA_SAFE_CALL( cudaMallocArray ( &d_float_array, &channelDesc, width, height ));
CUDA_SAFE_CALL( cudaMallocArray ( &d1_float_array, &channelDesc1, width, height ));
CUDA_SAFE_CALL( cudaMallocArray ( &d2_float_array, &channelDesc2, width, height ));
// set texture parameters
tex.addressMode[0] = cudaAddressModeClamp;
tex.addressMode[1] = cudaAddressModeClamp;
tex.filterMode = cudaFilterModePoint;
tex.normalized = false;
tex1.addressMode[0] = cudaAddressModeClamp;
tex1.addressMode[1] = cudaAddressModeClamp;
tex1.filterMode = cudaFilterModePoint;
tex1.normalized = false;
tex2.addressMode[0] = cudaAddressModeClamp;
tex2.addressMode[1] = cudaAddressModeClamp;
tex2.filterMode = cudaFilterModePoint;
tex2.normalized = false;
}
extern "C"
void freeTextures()
{
CUDA_SAFE_CALL(cudaFreeArray(d_float_array));
CUDA_SAFE_CALL(cudaFreeArray(d1_float_array));
CUDA_SAFE_CALL(cudaFreeArray(d2_float_array));
}
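For context, the host side sees these only through their extern "C" prototypes; a minimal usage sketch (run() is a hypothetical driver, initTexture/freeTextures are as above):

extern "C" void initTexture(int width, int height);
extern "C" void freeTextures();

void run(int width, int height)
{
    initTexture(width, height);  // allocate the three cudaArrays, set texture parameters
    // ... simulation loop: copy, bind, launch (see the next snippet) ...
    freeTextures();              // release the cudaArrays
}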
This is an example of what we currently have in our main .cu file (copy the latest grid into the array, bind it, then launch the kernel):
CUDA_SAFE_CALL( cudaMemcpyToArray( d_float_array, 0, 0, a, width * height * sizeof(float), cudaMemcpyDeviceToDevice));
CUDA_SAFE_CALL( cudaBindTextureToArray(tex, d_float_array) );
d_diffuse_tex<<< height / nthreads, nthreads, 0 >>>( a, ftemp, width, height, Db1, dt, hh);
// diffuse in both directions, texture version
// (out-of-range reads are clamped to the edge by the texture hardware)
__global__ void
d_diffuse_tex(float *grid, float *ftemp, int w, int h, float alpha, float dt, float hh)
{
    Clamp_Memory(grid, w, h);                       // defined elsewhere in our code
    int y = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row
    if (y >= h) return;                             // guard in case h is not a multiple of nthreads
    grid  = &grid[y * w];
    ftemp = &ftemp[y * w];
    // five-point Laplacian read through the texture
    for (int x = 0; x < w; x++) {
        ftemp[x] = tex2D(tex, x - 1, y) + tex2D(tex, x + 1, y) +
                   tex2D(tex, x, y - 1) + tex2D(tex, x, y + 1) - 4.0f * tex2D(tex, x, y);
    }
    // explicit Euler update with diffusion coefficient alpha
    for (int x = 0; x < w; x++) {
        grid[x] += dt * (alpha / (hh * hh)) * ftemp[x];
    }
}
Currently d_diffuse_tex is in the main .cu file as well. We would like to move this function to its own file, .cu or .h, but it then cannot access the texture. If we move the initialization code for tex into the d_diffuse_tex file, the compiler complains that cudaMemcpyToArray, a host function, is being called from a device function.
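One arrangement we are considering, sketched below with placeholder names (diffusion.cu, diffuse_init, diffuse_step), is to give each module its own .cu file that owns its texture and exposes only extern "C" host wrappers, so the central file never touches the texture directly. We have not fully tested this, so corrections are welcome:

// diffusion.cu -- the module owns everything that touches its texture
texture<float, 2> diffuse_tex;   // file scope, visible to the kernel below
cudaArray *diffuse_array;

__global__ void d_diffuse_tex(float *grid, float *ftemp, int w, int h,
                              float alpha, float dt, float hh)
{
    // ... same body as above, reading diffuse_tex instead of tex ...
}

extern "C" void diffuse_init(int width, int height)
{
    cudaChannelFormatDesc desc =
        cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    cudaMallocArray(&diffuse_array, &desc, width, height);
    diffuse_tex.addressMode[0] = cudaAddressModeClamp;
    diffuse_tex.addressMode[1] = cudaAddressModeClamp;
    diffuse_tex.filterMode     = cudaFilterModePoint;
    diffuse_tex.normalized     = false;
}

extern "C" void diffuse_step(float *a, float *ftemp, int width, int height,
                             int nthreads, float alpha, float dt, float hh)
{
    // host code stays in a host function: copy, bind, then launch
    cudaMemcpyToArray(diffuse_array, 0, 0, a,
                      width * height * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaBindTextureToArray(diffuse_tex, diffuse_array);
    d_diffuse_tex<<< height / nthreads, nthreads >>>(a, ftemp, width, height,
                                                     alpha, dt, hh);
}

The central file would then only declare and call diffuse_init/diffuse_step. If each module compiles as its own translation unit, this should avoid both errors above, but we would still like to know whether there is a cleaner way to share one set of textures across modules.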
I know there was quite a bit of redundancy in this post, but hopefully it got the point across.
Any suggestions would be great!