Defining global variables on the host and device at once?

Is there a way I can define a constant as such

int c = 500;

and have it be defined as global for both the host and device at once? Or is the only way to define on the host first and then use a memory copy function?

Yes, just pass the constant as a parameter to the kernel.

An idiom I’ve been using lately is aggregating all constants – values, constant pointers, etc. – into a single constant structure and passing that constant “environment” down through the kernel’s device functions as needed (while being careful to maintain its const’ness).

The compiler does a great job recognizing that these values and pointers reside in the constant memory space. Works well for sm_20+.

2 Likes

I’m a little bit uncertain of that idiom because where do I define what the constants are in value? What I’ve tried so far, for one small structure is the following

Out in host global space

struct matl1
{
	static const double cond;
};

const double matl1::cond = 420.5;

Then inside of main()

matl1 * h_matl1 = (matl1*)malloc(sizeof(matl1));
matl1 * d_matl1;
cudaMalloc((void**)&d_matl1, sizeof(matl1));
cudaMemcpy(d_matl1, h_matl1, sizeof(matl1), cudaMemcpyHostToDevice);
kernel<<<1,1>>>(d_matl1,...);

Then inside of kernel()

__global__ void mainCalc(matl1* d_matl1,...)
{
double cond = d_matl1->cond;
}

And I get the following error:

error : identifier “matl1::cond” is undefined in device code

As a quick test, if I do the following on the host in main()

cout << h_matl1->cond << endl;

It shows me the correct output of 420.5. I am not sure why this isn’t making it into the device.

Anyone with any ideas, or fixes? I haven’t gotten anywhere in the past day. :(

Actually I think this is the same problem allanmac helped me solved a few weeks ago. You might want to check out

https://devtalk.nvidia.com/default/topic/628417/help-with-pointers-inside-structs/#3991261

Alternatively I have copied his reply below

“I think the issue is that you’re passing “hostStruct” by value. Passing the struct by reference should fix it:”

__host__ void fcreate( testStruct& hostStruct ) {
...
}

I don’t think that was my. I can’t even define it like that. That is a host function and I am working with the device.

I’m going to try going the other way. Start by defining everything on the device, and then hope it works copying it over to the host, as opposed to my initial failure of host to device.

Just an update, from device to host works. So I guess I’ll be working with that. That is, defining all of my constants in the device code, and passing them back to the host, as opposed to vice versa.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>
#include <stdio.h>

using namespace std;

typedef struct {
    double cond, rho, cp;
} matl1;

__device__  void foo(matl1 *d_matl1) {
    d_matl1->cond = 420.5;
    d_matl1->rho = 8960.0;
	d_matl1->cp = 390.0;
}

__global__ void myKernel(matl1 *d_matl1){
	foo(d_matl1);
	printf("Device is %.2f\n",d_matl1->cond);
}

int main() {

	matl1 *h_matl1, *d_matl1;

	h_matl1 = (matl1*)malloc(sizeof(matl1));
    cudaMalloc((void**)&d_matl1, sizeof(matl1));
	
	myKernel  <<<1,1 >>>(d_matl1);
    cudaThreadSynchronize();

	cudaMemcpy(h_matl1, d_matl1, sizeof(matl1), cudaMemcpyDeviceToHost);

	printf("Host is %.2f\n",h_matl1->cond);
	cin.ignore();
    return 0;
}

I’m sure you could have replaced the host with a global in my example.

const int c = 500;

defined at global scope, should be visible to both host and device code defined in that compilation unit.

1 Like

But what about constant variables of type char[]?
How can I make these visible for host and device at the same time?
Just by defining them at global scope (like for type int) doesn’t work.

That’s correct. Global-scope constant usage in device code has various limitations

One possible approach is what was suggested already in this thread:

$ cat t107.cu
#include <cstdio>

struct my_const{
const char hello[6] = {'h', 'e', 'l', 'l', 'o', 0};
} my_c;

__global__ void k(int d, my_const c){

        printf(" d   = %d\n", d);
        printf(" str = %s\n", c.hello);
}

int main(){

        k<<<1,1>>>(2, my_c);
        cudaDeviceSynchronize();
}
$ nvcc -o t107 t107.cu
$ cuda-memcheck ./t107
========= CUDA-MEMCHECK
 d   = 2
 str = hello
========= ERROR SUMMARY: 0 errors
$
1 Like

Before posting here, I checked CUDA docs and came up with exactly the page you linked in above post with limitations. However I believe it isn’t really 100% obvious that only type int can be used with space annotation const at global scope to be visible for host and device.

I’m presently in optimization process of my code and checked this video and also the previous ones. VERY interesting and mandatory - must see!

Regarding performance and referring to your code snippet above, in which GPU memory (global, shared, …) are function arguments stored when they are passed to global function?

On the other side, where are __constant__ and const int data stored on the GPU memory?

I would like to have read access as fast as possible!

I don’t know what you have done if you make no mention of it. My purpose here is not just to help you individually, but to help others who may be reading this and looking for answers. It seems you’ve come away with an incorrect description, anyway. It’s not just “only type int”. The wording is “builtin integral types” and also builtin floating point types on linux. So on linux, float and double are possible, and on linux and windows int, short, size_t and others should be possible. In fact char is a builtin integral type (and can be used as a constant this way), but the documentation doesn’t mention array of integral types, so those are not supported. If you think the documentation can be improved, that is a possible reason to file a bug report.

Where __global__ function arguments are stored is also covered in the programming guide. They are stored in __constant__ memory, which is a resource that is:

  1. Its own logical space
  2. Physically backed by GPU DRAM
  3. serviced via a per-SM cache

const int data, defined at global scope as being discussed here lately, will generally be delivered to the GPU via the instruction stream, as an immediate constant operand, in my experience. This behavior is really up to the compiler, not specified AFAIK.

BTW there are many questions on the web forums already about constant data usage and their implications.

1 Like

Thanks. Your test code works fine and nicely shows how a char[] can be successfully passed by value as argument through a struct to the kernel.

So, if a char[] can be passed as kernel argument by value through a struct, why is it not possible to pass the char[] by value directly as argument to the kernel? The struct in your example has a variable length (not fixed as int, float, …), so has char[] (keeping in mind the max. 256 bytes argument size limitation ([ [is there any limit on # of arguments in cuda kernel?] (is there any limit on # of arguments in cuda kernel?).

I spent 3 cups of coffee and didn’t find anything inside the exhaustive NVIDIA docs, SO, Google, …

Where can I find the limitations regarding which kernel argument types can be passed by value?

It’s variable in the sense that you as a programmer can change its size, but its not variable in the sense that its size is known at compile time, and that characteristic is important for the discussion topic here.

You may wish to file a bug to have the docs improved.

For the remainder here, this is my conjecture.

I think a generally correct statement would be that __global__ function parameters are expected to be of non-array type. From C++ language usage, I’m pretty sure this means they can be of scalar type or of class type, which lines up with my understanding.

I suspect one of the problems of array types is that the name of an array when used by itself decays to a pointer. AFAIK this is a language feature of C and C++. This doesn’t really matter if I use a pointer to point to the beginning of an array in host code, because that pointer is usable in a roughly interchangeable fashion with an array reference. However, when passing data from host to device, a pointer (to host data) is not what you want. It’s not usable in device code. You might be able to work around this using managed memory, depending on your goals.

I acknowledge you can probably punch holes in these statements. Furthermore, the CUDA developers have worked around other limitations that would seem to arise from language behavior and host/device separation, so my guess is that this topic was just never really enough of a problem/limitation to invest development effort to address. Or there may really be extremely-difficult-to-resolve problems around the usage of fixed-length array types. I don’t know.

That’s probably about the extent of what I can say, so feel free to ask away but I think there is a good chance I wouldn’t be able to respond to further questions in this vein. I’m not very good with “why is it this way?” questions.

1 Like