Per-Thread global variables


I have a class in device code with a structure similar to this:

int p;

struct A
{
	int v;

	__device__ A( int vv ) { this->v = vv; }

	__device__ A operator+( A a ) { return A((v+a.v)%p); }
};

Here I need p to be a per-thread global variable, like in plain C/C++. p is different for each thread and it is set in one of the first lines during the kernel execution.

But I don’t know how to declare the variable p. Using

__device__ int p;

puts p into global memory, which is slow and not per thread.

Any advice? p is accessed very frequently and should be placed in a register for each thread. But I don’t know how to achieve that.

By default, variables are already stored per-thread in registers… you don’t need to do anything special nor need any keywords.

Though perhaps you know this and what you really mean is you want them globally SCOPED but still a regular plain old per-thread private register variable. The only thing that would give you would be the ability to call functions and avoid having to pass an extra argument to the function call. There’s no easy way to do that. But it’s a bad idea anyway… explicitly passing the variables by reference is much better coding practice (no magic indirect side effects hidden from yourself) and has no loss of functionality or efficiency.

Hi, thanks for your reply.

Yes, what I mean is a globally scoped variable. I know that it is bad coding practice, and I ALWAYS (really!) avoid such things when coding C++ on the CPU. The problem here is that I use operator overloading for arithmetic in prime fields (that’s why the global variable is called “p” ;-)) and I cannot pass an additional argument to the operator. I also don’t want to fall back to procedural programming, because the operators make my code much more readable (ever used Java for implementing +, -, * on your own number/matrix types?).

A static class variable that is per thread would fit perfectly. But I don’t know how to achieve this. Any ideas?

Ordinary C++ global variables aren’t per thread on the CPU either. There you need thread-local storage: either OS-dependent APIs, which are awkward and far from transparent, or the C++11 thread_local keyword, which host compilers support but CUDA device code does not.

But getting back to the GPU…

You could just give up the elegance of operator overloading and make your own intrinsics like C = add(A, B, my_extra_data).

If your per-thread data is small (a word or two) then you can store it in shared memory, and access it inside any function by indexing based on the thread ID.

This is even reasonably efficient… the downside is it eats your shared memory and limits you to a small number of values per thread.

You could define a new class (“BoundClass”) that has any per-thread constants “baked in”… though this may use redundant registers if you have many variables. You could even bind them per operation, which would be ugly but it would work and be efficient, since the inline expansion would bake them out again: C = BoundClass(A, my_extra_data) + BoundClass(B, my_extra_data); You’d need a conversion operator from BoundClass back to your original type, but that’s easy.
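The per-operation binding described above could look roughly like the following. This is only a minimal sketch, not the poster’s actual code: the name BoundA, its members, and the macro fallback (which lets the same source compile as plain host C++ when nvcc is not used) are assumptions made for illustration. It also assumes both operands carry the same modulus and that p is below 2^31, so that v + b.v cannot overflow.

```cpp
// Fallback so the sketch also compiles as plain host C++ (assumption: built without nvcc).
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Plain value type, modelled on the struct from the question.
struct A
{
    unsigned int v;

    __host__ __device__ explicit A(unsigned int vv) : v(vv) {}
};

// Hypothetical "bound" wrapper: a value together with its per-thread modulus p.
struct BoundA
{
    unsigned int v;
    unsigned int p;

    // Bind a value to a modulus; the value is reduced once on construction.
    __host__ __device__ BoundA(A a, unsigned int pp) : v(a.v % pp), p(pp) {}

    // Assumes both operands were bound to the same modulus and p < 2^31.
    __host__ __device__ BoundA operator+(BoundA b) const
    {
        return BoundA(A((v + b.v) % p), p);
    }

    // Conversion back to the unbound type.
    __host__ __device__ operator A() const { return A(v); }
};
```

With this, an expression such as A c = BoundA(a, p) + BoundA(b, p); keeps the operator notation while p stays an ordinary per-thread local variable. Whether the compiler really removes the extra copy of p depends on inlining, so it is worth checking the generated code.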

The intrinsic method is likely the cleanest… it’s clear, reasonably portable, and doesn’t have any hacks. You just don’t have operator notation.

Maybe you are right. The operator notation would be quite nice, but it is hard to code in my case.

Now I defined my own intrinsics as ordinary methods of my class:

struct A
{
	unsigned int v;

	__device__ A( int vv ) { this->v = vv; }

	__device__ inline A add( A a, const unsigned int &p ) { return A((v+a.v)%p); }
};

Last question: What is the most efficient way to pass p to the function? By reference? By const reference? By value? Redundant copies of the constant p should be avoided (that was what I originally wanted to achieve with the globally scoped variable), so I think const unsigned int &p should be the right way!?

No need, you can just pass ints and other fundamental types by value.

There will be no temporary use. Nearly all function calls in CUDA become inline expanded, meaning function arguments are not pushed on a stack or saved in temporaries unless the function itself needs a temporary (perhaps because an argument itself is an expression).

For classes and structs, just pass by reference or const reference. Though again, the function is usually inline expanded, so even passing by value would be equally efficient in most cases.
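To make the pass-by-value advice concrete, here is a minimal sketch under some assumptions: the mul method, the helper square_plus_self, the kernel name and its parameters are all hypothetical and not from the thread, and the macro fallback only exists so the non-kernel part also compiles as plain host C++. The sum in add assumes p is below 2^31.

```cpp
// Fallback so the sketch also compiles as plain host C++ (assumption: built without nvcc).
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

struct A
{
    unsigned int v;

    __host__ __device__ explicit A(unsigned int vv) : v(vv) {}

    // p is a fundamental type, so it is simply passed by value.
    __host__ __device__ A add(A a, unsigned int p) const
    {
        return A((v + a.v) % p);
    }

    // Hypothetical companion method; the product is widened to 64 bits before reducing.
    __host__ __device__ A mul(A a, unsigned int p) const
    {
        return A((unsigned int)(((unsigned long long)v * a.v) % p));
    }
};

// Hypothetical helper: evaluates x*x + x modulo p, passing p along explicitly.
__host__ __device__ inline A square_plus_self(A x, unsigned int p)
{
    return x.mul(x, p).add(x, p);
}

#ifdef __CUDACC__
// Hypothetical kernel: each thread loads its own modulus into a local variable
// at the start and hands it to every call.
__global__ void eval_kernel(const unsigned int* moduli, const unsigned int* in,
                            unsigned int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned int p = moduli[i];  // ordinary per-thread local variable
        out[i] = square_plus_self(A(in[i] % p), p).v;
    }
}
#endif
```

Because p is a plain local variable in the kernel, the compiler is free to keep it in a register; after inlining, passing it by value to each call should not by itself create extra copies, though that is best confirmed by inspecting the generated code.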

Ok, thanks for your help.