Only integer host variables can be directly read in a device function?

Does anyone know why only constant integer host variables can be read directly in a device function (i.e., without using cudaMallocHost)? Furthermore, is this really a direct CPU memory access (zero-copy), or are those int variables copied automagically to the GPU’s constant memory (or to some other memory)? Please see the code below; only ‘gf’ is reported as undefined.

And why can I not find such a basic explanation in any CUDA material (book, tutorial, or programming guide), when they all use #define in their very first example and at the same time state that you cannot access host memory from a device function (without unified memory support)?!

TIA.

#define N 10
const int gn = 20;
const float gf = 3.1415f;

__global__ void kernel(){
    int n = N + gn;
    float f = gf;
}

int main(){
    kernel<<<1,1>>>();
    return 0;
}

A #define is more or less just a textual replacement in the code, so no copying is necessary because it is not really a variable. I’m sure the C purists will object to such a broad statement! :)

Automatic mapping of const int variables has been there for as long as I can remember, from before zero-copy was part of the toolkit, so it is likely that the variables are copied automatically to constant memory, though I have never verified this.

As for why it does not work with floats, I have wondered that myself and always brushed it aside, convincing myself that it had to do with possible differences in the internal representation (e.g. denormals, inf, etc.).

#define is a C/C++ preprocessor macro. Detailed explanations of it are outside the scope of most CUDA material because those materials assume knowledge of C/C++. The CUDA device compiler doesn’t handle #define any differently than any of the host compilers you’re aware of.
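For example (conceptually; the actual preprocessed output contains other details as well), after the preprocessor runs on the kernel from the question, the macro is gone and only the literal value remains for both the host and device compilers to see:

__global__ void kernel(){
    int n = 10 + gn;   // N was replaced with the literal 10 by the preprocessor
    float f = gf;      // gf is untouched; it is a real variable, not a macro
}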

Regarding const-qualified variables, those are not handled by the preprocessor, but the compiler is nevertheless pretty much free to optimize the usage of such a variable in any way it sees fit. It may choose to store it in memory somewhere, it may choose to load it into a register and use it from there, or it may choose to inject the const value as an immediate operand to an instruction. Again, none of this is specific to CUDA.
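As a conceptual illustration only (not actual compiler output): since gn and gf are compile-time constants here, the compiler is entitled to treat the kernel as if it had been written with the literal values, and then to fold them further:

__global__ void kernel(){
    int n = 10 + 20;    // gn propagated as a constant; typically folded further to 30
    float f = 3.1415f;  // gf propagated as a constant in the same way
}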

I don’t see any indication that “it does not work with floats”.

With a slight modification to your code to facilitate the remaining investigation:

#include <stdio.h>
#define N 10
const int gn = 20;
const float gf = 3.1415f;

__global__ void kernel(){
    int n = N + gn;
    float f = gf+1.0f;
    printf("f = %f, n = %d\n", f, n);
}

int main(){
    kernel<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}

I can compile and run it successfully on CUDA 6.5, and observe the expected results.

Furthermore, we can take a look at the PTX to get an idea of where the compiler is headed with this:

(nvcc -arch=sm_20 -ptx -src-in-ptx -G …)

//t545.cu:6 __global__ void kernel(){
        .loc    1 6 0

        .loc 1 6 1

        mov.u64         %SPL, __local_depot0;
        cvta.local.u64  %SP, %SPL;
        mov.u32         %r1, 30;
func_exec_begin0:

//t545.cu:7     int n = N + gn;
        .loc    1 7 1
tmp0:
        mov.b32         %r2, %r1;
tmp1:
        mov.f32         %f1, 0f4084872B;

We see that the compiler has observed that:

  1. gf = 3.1415f
  2. it is being added to 1.0f
  3. the result must be 4.1415f

and it has taken the binary representation of this number as an immediate operand, and loaded it into a register, to prepare for use by the in-kernel printf function later:

mov.f32         %f1, 0f4084872B;

The binary representation of 4.1415 (as a float) is 4084872B (hex).
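You can confirm that bit pattern with a small host-side snippet of my own (not part of the code above) that reinterprets the float’s bytes as an unsigned integer; on an IEEE-754 system it should print 4084872B:

#include <stdio.h>
#include <string.h>

int main(){
    float f = 3.1415f + 1.0f;        // the same single-precision sum the compiler folded
    unsigned int bits;
    memcpy(&bits, &f, sizeof(bits)); // reinterpret the float's bytes as an integer
    printf("%08X\n", bits);          // prints 4084872B
    return 0;
}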

In the preceding lines, the compiler is doing something similar with N and gn (N + gn = 30):

mov.u32         %r1, 30;
func_exec_begin0:

//t545.cu:7     int n = N + gn;
        .loc    1 7 1
tmp0:
        mov.b32         %r2, %r1;

None of this is actually CUDA-specific. This is all pretty ordinary compiler behavior.

(disclaimer: I don’t ordinarily recommend use of PTX instead of SASS for analysis of code. But in this case it is easier to parse and it illustrates the concept in a similar fashion to what SASS analysis would uncover.)