Per-Thread global variables

porst17 · July 5, 2010, 4:23pm

Hi,

I have a class in device code with a structure similar to this

int p;

struct A
{
	int v;
	__device__ A( int vv ) { this->v = vv; }
	__device__ A operator+( A a ) { return A((v+a.v)%p); }
};

Here I need p to be a per-thread global variable, like in plain C/C++. p is different for each thread and it is set in one of the first lines during the kernel execution.

But I don’t know how to declare the variable p. Using

__device__ int p;

puts p into global memory, which is slow and not per thread.

Any advice? p is accessed very frequently and should live in a register for each thread, but I don’t know how to achieve that.

SPWorley · July 5, 2010, 5:30pm

By default, variables are already stored per-thread in registers… you don’t need to do anything special nor need any keywords.

Though perhaps you know this and what you really mean is you want them globally SCOPED but still a regular plain old per-thread private register variable. The only thing that would give you would be the ability to call functions and avoid having to pass an extra argument to the function call. There’s no easy way to do that. But it’s a bad idea anyway… explicitly passing the variables by reference is much better coding practice (no magic indirect side effects hidden from yourself) and has no loss of functionality or efficiency.

porst17 · July 5, 2010, 7:32pm

Hi, thanks for your reply.

Yes, what I mean is a globally scoped variable. I know that it is bad coding practice, and I ALWAYS (really!) avoid such things when coding C++ on the CPU. The problem here is that I use operator overloading for arithmetic in prime fields (that’s why the global variable is called “p” ;-)) and I cannot pass an additional argument to the operator. I also don’t want to fall back to procedural programming, because the operators make my code much more readable (ever used Java for implementing +, -, * on your own number/matrix types?).

A static class variable that is per thread would fit perfectly, but I don’t know how to achieve this. Any ideas?

SPWorley · July 6, 2010, 12:03am

You can’t do this on the CPU either. C++ global variables aren’t per thread. You need to use OS-dependent thread-local storage, which is awkward and far from transparent.

But getting back to the GPU…

You could just give up the elegance of operator overloading and make your own intrinsics like C = add(A, B, my_extra_data).

If your per-thread data is small (a word or two) then you can store it in shared memory, and access it inside any function by indexing based on the thread ID.

This is even reasonably efficient… the downside is that it eats your shared memory and limits you to a small number of values per thread.

You could define a new class (“BoundClass”) that has any per-thread constants “baked in”… though this may use redundant registers if you have many variables. You could even bind them per operation, which would be ugly but would work and be efficient, since the inline expansion would bake them out again: C = BoundClass(A, my_extra_data) + BoundClass(B, my_extra_data); You’d need a conversion operator from BoundClass back to your original variables, but that’s easy.
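A minimal sketch of the BoundClass idea, to make the per-operation binding concrete. The name BoundA, its members, and the `__CUDACC__` guard (which only lets the same code build with a plain host compiler) are assumptions for illustration and do not come from the thread. It also assumes both operands are already reduced modulo p and that p is below 2^31, so the sum cannot wrap.

```cpp
// Builds with nvcc; the guard lets a plain C++ compiler build it as well.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Plain field element, as in the thread: it does not store the modulus.
struct A
{
	unsigned int v;
	__host__ __device__ explicit A( unsigned int vv ) : v( vv ) {}
};

// Hypothetical wrapper: a value bundled with the per-thread modulus p.
struct BoundA
{
	unsigned int v;
	unsigned int p;

	__host__ __device__ BoundA( A a, unsigned int pp ) : v( a.v ), p( pp ) {}

	// Assumes both operands carry the same p, are reduced (< p), and p < 2^31.
	__host__ __device__ BoundA operator+( BoundA b ) const
	{
		return BoundA( A( ( v + b.v ) % p ), p );
	}

	// Conversion back to the unbound type.
	__host__ __device__ operator A() const { return A( v ); }
};
```

With this in place, an expression such as `A c = BoundA(a, p) + BoundA(b, p);` keeps operator notation, at the cost of writing the binding at every use.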

The intrinsic method is likely the cleanest… it’s clear, reasonably portable, and doesn’t have any hacks. You just don’t have operator notation.

porst17 · July 6, 2010, 1:29pm

Maybe you are right. The operator notation would be quite nice, but it is hard to code in my case.

I have now defined my own intrinsics as ordinary methods of my class:

struct A
{
	unsigned int v;
	__device__ A( unsigned int vv ) { this->v = vv; }
	// p is passed explicitly instead of being read from a global
	__device__ inline A add( A a, const unsigned int &p ) { return A((v+a.v)%p); }
};

Last question: What is the most efficient way to pass p to the function? By reference? By const reference? By value? Redundant copies of the constant p should be avoided (that was what I originally wanted to achieve with the globally scoped variable), so I think const unsigned int &p should be the right way!?

SPWorley · July 6, 2010, 3:14pm

No need, you can just pass ints and other fundamental types by value.

No temporary will be used. Nearly all function calls in CUDA become inline expanded, meaning function arguments are not pushed on a stack or saved in temporaries unless the function itself needs a temporary (perhaps because an argument itself is an expression).

For classes and structs, just pass by reference or const reference. Though again, the function is usually inline expanded, so even passing by value would be equally efficient in most cases.
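As a rough sketch of that advice, the intrinsic-style methods can take p by value. The mul method and the mul_add helper are hypothetical additions, included only to show how p is threaded through a longer expression. The `__CUDACC__` guard is likewise an assumption that lets the snippet build with a host compiler. Operands are assumed to be reduced modulo p, with p below 2^31.

```cpp
// Builds with nvcc; the guard lets a plain C++ compiler build it as well.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

struct A
{
	unsigned int v;
	__host__ __device__ explicit A( unsigned int vv ) : v( vv ) {}

	// p is taken by value. Assumes v, a.v < p and p < 2^31, so the sum cannot wrap.
	__host__ __device__ A add( A a, unsigned int p ) const
	{
		return A( ( v + a.v ) % p );
	}

	// Hypothetical companion: the product is formed in 64 bits before reducing.
	__host__ __device__ A mul( A a, unsigned int p ) const
	{
		return A( (unsigned int)( ( (unsigned long long)v * a.v ) % p ) );
	}
};

// Hypothetical helper computing a*b + c (mod p), passing p along explicitly.
__host__ __device__ inline A mul_add( A a, A b, A c, unsigned int p )
{
	return a.mul( b, p ).add( c, p );
}
```

Inside a kernel, p would be an ordinary local variable computed per thread and handed to each call.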

porst17 · July 6, 2010, 4:19pm

Ok, thanks for your help.
