I’m having an issue creating a static library containing CUDA code on Windows 7, using Visual Studio 2010. I’m using CUDA 5.0, which supports relocatable device code.
I have created a simple VS2010 solution that replicates the issue. The solution contains two projects. One is an application project, which has a single .cpp file containing a main function.
//main.cpp
#include <iostream>
#include "../cuda_separate_library/test_kernel.cuh"

int main()
{
    int a[100];
    int b[100];
    int out[100];
    for(int i = 0; i < 100; i++)
    {
        a[i] = i+5;
        b[i] = i*4-2;
    }
    addArrayWrapper(100, out, a, b);
    for(int i = 0; i < 100; i++)
    {
        std::cout << a[i] << " + " << b[i] << " = " << out[i] << std::endl;
    }
}
The other project is a static library project containing two CUDA headers and two CUDA source files. One of the source files contains a kernel function that calls a device function defined in the other source file. This is not possible without CUDA 5.0 and the -rdc=true flag (relocatable device code) on the CUDA compiler.
//test_kernel.cuh:
#ifndef TEST_KERNEL_HEADER
#define TEST_KERNEL_HEADER
void addArrayWrapper(unsigned int size, int* out, int* a, int* b);
#endif // TEST_KERNEL_HEADER
//test_kernel.cu:
#include "test_kernel.cuh"
#include "test_math.cuh"
#include <cuda_runtime_api.h>

__global__ void addArrayKernel(unsigned int size, int* out, int* a, int* b)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx < size)
    {
        // Index into array.
        int *aIndex = a+idx;
        int *bIndex = b+idx;
        int *outIndex = out+idx;
        add(*outIndex, *aIndex, *bIndex);
    }
}

void addArrayWrapper(unsigned int size, int* out, int* a, int* b)
{
    if(size > 256) size = 256;
    // Allocate device memory.
    int * aDevice;
    int * bDevice;
    int * outDevice;
    cudaMalloc(&aDevice, size*sizeof(int));
    cudaMalloc(&bDevice, size*sizeof(int));
    cudaMalloc(&outDevice, size*sizeof(int));
    // Copy input array to device memory.
    cudaMemcpyAsync(aDevice, a, size*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpyAsync(bDevice, b, size*sizeof(int), cudaMemcpyHostToDevice);
    // Launch kernel.
    addArrayKernel<<<1,256>>>(size, outDevice, aDevice, bDevice);
    // Copy output array to host memory.
    cudaMemcpy(out, outDevice, size*sizeof(int), cudaMemcpyDeviceToHost);
    // Free device memory.
    cudaFree(aDevice);
    cudaFree(bDevice);
    cudaFree(outDevice);
}
//test_math.cuh:
#ifndef TEST_MATH_HEADER
#define TEST_MATH_HEADER
__device__ void add(int& out, const int& a, const int& b);
#endif // TEST_MATH_HEADER
//test_math.cu:
#include "test_math.cuh"

__device__ void add(int& out, const int& a, const int& b)
{
    out = a + b;
}
All the CUDA source files compile correctly. The separate object files are then passed to Lib.exe to create the library, using the following command:
Lib.exe /OUT:"L:\cuda_separate_compile_test\Debug\cuda_separate_library.lib" /NOLOGO Debug\test_kernel.cu.obj Debug\test_math.cu.obj "Debug\cuda_separate_library.device-link.obj"
This fails with the following error:
LINK : fatal error LNK1181: cannot open input file 'Debug\cuda_separate_library.device-link.obj'
cuda_separate_library.device-link.obj is supposed to be the output of the CUDA Linker, which doesn’t appear to be invoked at all in this case. However, if I temporarily switch the library project to an application and rebuild, the CUDA Linker is invoked and creates cuda_separate_library.device-link.obj properly; the build then fails when that file is passed to the Visual Studio linker. When I switch the project back to a static library and build, Lib.exe finds the device-link.obj file left over from the application build and the library is created. I can then build my executable (the other project), which links against the static library, and it all works.
Alternatively, I can copy the add() function directly into the kernel source file and turn off the -rdc switch, and the library builds without problems. However, this defeats the whole purpose of having reusable device code in a separate file.
It appears that the integration between CUDA and VS2010 isn’t working correctly for static library builds. I’ve examined the property sheet provided for VS integration (CUDA 5.0.props), and the following change seems to fix the issue (made either in the provided props sheet itself or in another imported props sheet):
On line 7,
<CudaLinkBeforeTargets>Link</CudaLinkBeforeTargets>
becomes
<CudaLinkBeforeTargets>Link;Lib</CudaLinkBeforeTargets>
This forces the CUDA Linker to run before the Lib target that creates the static library, in addition to running before the Link target that links an application. This seems to work, but I’m not particularly well versed in how VS targets work. Is this a good idea? Is there a better way to fix this?
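For reference, this is roughly what the override looks like as a separate property sheet, so that the installed CUDA 5.0.props stays untouched. It is only a sketch: the file name cuda_lib_link.props is made up, and I am assuming the sheet is imported at a point where its value is the one that actually takes effect (I have not checked how CUDA 5.0.props guards its defaults).

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- cuda_lib_link.props (hypothetical file name) -->
<Project xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <PropertyGroup>
    <!-- Run the CUDA link step before both the Link target (applications)
         and the Lib target (static libraries). -->
    <CudaLinkBeforeTargets>Link;Lib</CudaLinkBeforeTargets>
  </PropertyGroup>
</Project>
```

The only property set here is CudaLinkBeforeTargets, the one from line 7 of the stock sheet; everything else is left to CUDA 5.0.props.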
Thanks.