Referring to __device__ variables in host code?

I have a rather large set of functions that are callable both on the host (invoked in OpenMP loops) and on the device.

This is achieved using __host__ __device__ function declarations.

Some of these functions, ported from CPU code, rely on a lookup table stored in a global variable. I would like to make this variable visible to the device-compiled kernels as well.

(I know I should use texture or constant memory for best performance, and I have seen the macro trick for using textures in the SDK, but this post is about variable scoping, not performance.)

The code construct attached at the bottom of this post seems to work.

I observe that a __device__ variable seems to be fully allocated and visible in host scope, where it can be used like a regular variable.

Copying data onto the device is done using a combination of cudaGetSymbolAddress and cudaMemcpy.
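For comparison, the same upload can be done in a single call with cudaMemcpyToSymbol, which resolves the device address of the symbol internally. A minimal sketch (host_lut is a hypothetical host-side staging array, not from the code below):

```cuda
__device__ float luttable[10][4];

// Hypothetical host-side staging copy of the table.
static float host_lut[10][4];

void uploadTable()
{
    // cudaMemcpyToSymbol looks up the device address of `luttable`
    // itself, so no explicit cudaGetSymbolAddress call is needed.
    cudaMemcpyToSymbol( luttable, host_lut, sizeof(host_lut) );
}
```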

Is this a supported and future-proof way of sharing global variable names between host and device functions?

Can the host-visible variable be used just like any other global variable? When and where is it allocated and deallocated?

This observed behavior seems to contradict Sec. B.2.1 of the Programming Guide, which states that a __device__ variable resides in device memory and is accessible from the host only through the runtime library (e.g. cudaGetSymbolAddress, cudaMemcpyToSymbol).

What we really should have is the ability to declare __host__ __device__ for variables as well as for functions, but that yields an "invalid attribute for variable" warning message.
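In the absence of such an attribute, one workaround (my own sketch, not from the SDK, and assuming a toolkit that defines __CUDA_ARCH__ during device compilation) is to keep separate host and device copies of the table and dispatch between them inside the shared function:

```cuda
// Sketch: two copies of the table, selected at compile time.
// __CUDA_ARCH__ is defined only in nvcc's device compilation pass,
// so each side of the #ifdef sees only its own table.
__device__ float d_luttable[10][4];  // device copy
static float h_luttable[10][4];      // host copy (hypothetical name)

__host__ __device__ float lookup( int i, int j )
{
#ifdef __CUDA_ARCH__
    return d_luttable[i][j];
#else
    return h_luttable[i][j];
#endif
}
```

The drawback is that the two tables must be kept in sync manually, e.g. by copying h_luttable to d_luttable with cudaMemcpyToSymbol after initialization.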

#include <iostream>
#include <vector>

__device__ float luttable[10][4];

__host__ __device__ void readVar( float* output, int i ) {
  output[i] = luttable[i][3];
}

__global__ void readDeviceValue( float* output ) {
  readVar( output, threadIdx.x );
}

int main() {
  void* devlutptr;
  cudaGetSymbolAddress( &devlutptr, luttable );

  // Init CPU table
  for( int i = 0; i < 10; ++i )
    for( int j = 0; j < 4; ++j )
      luttable[i][j] = i + j/10.0;

  // Copy to device
  cudaMemcpy( devlutptr, luttable, sizeof(float)*40,
              cudaMemcpyHostToDevice );

  const size_t N = 8;
  dim3 gridDim = dim3( 1 );
  dim3 blockDim = dim3( N );

  float* devoutput;
  cudaMalloc( &devoutput, sizeof(float)*N );

  readDeviceValue<<<gridDim, blockDim>>>( devoutput );

  std::vector<float> v(N);
  cudaMemcpy( &v[0], devoutput, sizeof(float)*v.size(), cudaMemcpyDeviceToHost );

  for( size_t i = 0; i < v.size(); ++i )
    std::cout << v[i] << " ";
  std::cout << std::endl;

  for( size_t i = 0; i < v.size(); ++i )
    readVar( &v[0], i );

  for( size_t i = 0; i < v.size(); ++i )
    std::cout << v[i] << " ";
  std::cout << std::endl;

  return 0;
}

Johan Seland, PhD

Researcher SINTEF ICT