5.0 SDK Separate Compilation Question

My issue is simple: I like to keep device and host code in separate .cu and .cpp source files, respectively. This has worked pretty well for me with the new 5.0 SDK, with one exception: when I define global variables (i.e., __device__ or __constant__) in a .cu source file. Consider the two source files listed below.

foo.cu:

__constant__ int dev_foo[10]; // definition of the constant-memory symbol

__global__ void kernel(int* bar)
{
    // copy the constant-memory array into the output buffer
    for(int i=0;i<10;++i)
    {
        bar[i] = dev_foo[i];
    }
}

// host-side wrapper so the kernel can be launched from a .cpp file
void LaunchKernel(int* bar)
{
    kernel<<<1,1>>>(bar);
}

mainFoo.cpp:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

static void HandleError( cudaError_t err,
                         const char *file,
                         int line ) {
    if (err != cudaSuccess) {
        printf( "%s in %s at line %d\n", cudaGetErrorString( err ),
                file, line );
        exit( EXIT_FAILURE );
    }
}
#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))

extern __constant__ int dev_foo[10]; // declaration of the __constant__ symbol defined in foo.cu
void LaunchKernel(int* bar);

int main()
{
    int tmp[10];
    int *bar;

    for(int i=0;i<10;++i)
        tmp[i] = i;

    // copy tmp into the __constant__ symbol defined in foo.cu
    HANDLE_ERROR(cudaMemcpyToSymbol(dev_foo,tmp,sizeof(int)*10));
    HANDLE_ERROR(cudaMalloc((void**)&bar,sizeof(int)*10));
    for(int i=0;i<10;++i)
        tmp[i] = 0;
    HANDLE_ERROR(cudaMemcpy(bar,tmp,sizeof(int)*10,cudaMemcpyHostToDevice));
    LaunchKernel(bar);
    HANDLE_ERROR(cudaMemcpy(tmp,bar,sizeof(int)*10,cudaMemcpyDeviceToHost));

    printf("tmp: ");
    for(int i=0;i<9;++i)
        printf("%i,",tmp[i]);
    printf("%i\n",tmp[9]);

    return 0;
}

I compile these files into an executable as follows: nvcc -arch=sm_20 -rdc=true foo.cu mainFoo.cpp -o foo

Running the executable results in the following runtime error:

invalid device symbol in mainFoo.cpp at line XX (i.e. the cudaMemcpyToSymbol line).

So, is it possible to achieve this sort of decomposition? If so, then what am I doing wrong?

Post-script: it seems that this specific kind of separation is only supported on compute capability 3.0+ devices. Accordingly, the following modification to the one-step compilation command builds and runs as intended: nvcc -arch=sm_30 -rdc=true foo.cu mainFoo.cpp -o foo

Alternatively, and for completeness, the following four-step compilation process (useful in the build config for larger projects) also works on compute capability 3.0+ systems:
nvcc -arch=sm_30 -dc foo.cu -o foo.o
nvcc -arch=sm_30 -dlink foo.o -o link.o
g++ -Wall -c mainFoo.cpp -o mainFoo.o
g++ foo.o link.o mainFoo.o -o foo -lcudart
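
Post-post-script, in case it helps anyone stuck on a pre-3.0 card: the cross-file symbol access can be avoided entirely by wrapping cudaMemcpyToSymbol in foo.cu, right next to the symbol definition, and calling the wrapper from mainFoo.cpp. A minimal sketch (the wrapper name CopyToDevFoo is my own invention; with this change neither -rdc nor the extern declaration is needed):

// in foo.cu: host wrapper defined in the same translation unit as dev_foo
cudaError_t CopyToDevFoo(const int* src)
{
    return cudaMemcpyToSymbol(dev_foo, src, sizeof(int)*10);
}

// in mainFoo.cpp: forward-declare the wrapper instead of the symbol
cudaError_t CopyToDevFoo(const int* src);
// ...then in main(), replace the cudaMemcpyToSymbol call with:
// HANDLE_ERROR(CopyToDevFoo(tmp));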

Which card are you using, and what is its compute capability? In the first experiment, did you compile the code with the -arch=sm_20 option for a card whose compute capability is 3.0?
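
If you are unsure, the compute capability can be queried with the runtime API. A minimal standalone check (just a sketch, querying device 0):

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // query properties of device 0
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}

Compile with, e.g., nvcc query.cu -o query and run it on the machine in question.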

Hi,
Can I use CUDA C in an Intel DPDK environment, i.e., use the GPU together with Intel DPDK?

Regards,

mati ur rahman

@Mati86 I think it would be better if you created a new topic for your question; otherwise it will be difficult for other users to find it and answer it.