Problem using templates in linked static CUDA libraries


I’m trying to use some CUDA preprocessing in conjunction with OptiX, so I set up separate CUDA kernels and wrapping callers that I compile as a static lib in the OptiX build, which I can then call on OptiX CUDA interop buffers. (I first tried to compile an obj directly for the CUDA sources but got hit by that error and didn’t push further.)

It basically works, but I have some questions.

This is the caller, a C++ class method in the OptiX main:

void OptixContext::computeTriangleAreasCUDA(Geometry mesh)
{
		unsigned int nbTriangles = mesh->getPrimitiveCount();
		Buffer indexBuffer = mesh["index_buffer"]->getBuffer();	// RT_FORMAT_INT3, nbTriangles
		RTsize nbIdx;
		indexBuffer->getSize(nbIdx);
		assert(nbTriangles == nbIdx);
		Buffer coordsBuffer = mesh["vertex_buffer"]->getBuffer(); // RT_FORMAT_FLOAT3, nbVertices
		Buffer triAreaBuffer = mesh["area_buffer"]->getBuffer();

		int optixDeviceId = mEnabledDevices[0];	// Run on primary device
		int cudaDeviceId;
		mContext->getDeviceAttribute(optixDeviceId, RT_DEVICE_ATTRIBUTE_CUDA_DEVICE_ORDINAL, sizeof(int), &cudaDeviceId);
		// std::cerr << "Cuda device ordinal " << cudaDeviceId << std::endl;

		void* indexPtr;
		void* coordsPtr;
		void* triAreaPtr;
		indexBuffer->getDevicePointer(cudaDeviceId, &indexPtr);
		coordsBuffer->getDevicePointer(cudaDeviceId, &coordsPtr);
		triAreaBuffer->getDevicePointer(cudaDeviceId, &triAreaPtr);

		cuTriangleAreas(indexPtr, coordsPtr, triAreaPtr, nbTriangles);
}

And this is the called function and an example kernel:

#include "ULcuda.h"
#include "triangle_area.h"

__global__ void cuTriangleAreasKernel(
	const int3* __restrict__ indexPtr,
	const float3* __restrict__ coordsPtr,
	float* __restrict__ triAreaPtr,
	const int size)
{
	int primIdx = blockDim.x * blockIdx.x + threadIdx.x;

	if (primIdx < size) {
		const int3* vertexId = &indexPtr[primIdx];
		float area = triangle_area(coordsPtr[vertexId->x], coordsPtr[vertexId->y], coordsPtr[vertexId->z]);
		triAreaPtr[primIdx] = area;
	}
}
void cuTriangleAreas(void* indexPtr, void* coordsPtr, void* triAreaPtr, int size)
{
	// get device capability, to avoid block/grid size exceeding the upper bound
	cudaDeviceProp prop;
	int device;
	checkCudaErrors(cudaGetDevice(&device));
	checkCudaErrors(cudaGetDeviceProperties(&prop, device));
	int maxThreads = prop.maxThreadsPerBlock;
	int maxBlocks = prop.maxGridSize[0];

	// for number of triangles < THREADS_PER_BLOCK, but a small number of triangles isn't really our concern here
	int threads = min(size, maxThreads);
	int blocks = 1 + (size - 1) / threads;

	std::cerr << "Calling cuda triangle areas with " << blocks << " blocks, " << threads << " threads" << std::endl;

	dim3 dimBlock(threads, 1, 1);
	dim3 dimGrid(blocks, 1, 1);

	cuTriangleAreasKernel <<< dimGrid, dimBlock >>> (
		static_cast<const int3*>(indexPtr),
		static_cast<const float3*>(coordsPtr),
		static_cast<float*>(triAreaPtr),
		size);
	getLastCudaError("cuTriangleAreasKernel execution failed");
}

This is with OptiX SDK 4.1, CUDA 8 and Visual Studio 2015. cuTriangleAreas is prototyped in the include; triangle_area.h is just the device inline doing the cross product for the area calculation. At some point mEnabledDevices is initialized with the vector of enabled OptiX device ordinals, but we can assume mEnabledDevices[0] = 0 for a single device.

It works all right, but as I tried to make something a bit cleaner than passing void pointers and assuming I know what I’m doing, I noticed that:

  1. If I try to reinterpret_cast my pointers to the expected types (int3, float3) and declare cuTriangleAreas accordingly, I get a linking error: the linker cannot find the corresponding cuTriangleAreas function. Clearly an nvcc / Visual Studio name mangling mismatch, solved by declaring cuTriangleAreas as extern "C", but...
  2. I was tempted to use templates for the different cases, and then extern "C" can't save me anymore. Some of the CUDA samples like reduction show cases where templates come in really handy, but there the main program is not linking a separate lib.

So if I understand well, it comes from nvcc compiling the pure CUDA lib and Visual C++ compiling the OptiX main, with different name mangling conventions. Defining cuTriangleAreas as extern "C" works for point 1. Is there anything I can do, via compile or link options, so I can stay in the C++ world and use templates, or is it a no go? By the way, is it a bad idea to use the runtime API, and should I use the driver API instead so I can load PTX for my CUDA utilities the same way I do with the OptiX programs? (That would be a pain, since at some point I might want to use the existing CUDA utility libraries.)