Scopes, local and global variables

Hi all,

I guess I don’t understand some basic issue here about what variable is allowed to live in what scope. For the sake of an example I took the matrix multiplication example from the docs (which works fine) and tried to modify it so that it computes not the matrix product but the matrix product multiplied by a scalar.

The code is left as in the example, only a new variable is introduced, x, which will multiply the product. x lives and is set on the host while cuda_x lives on the device and is set to be equal to x.

This is what I thought would be a good approach:

float x;

__device__ float * cuda_x;

__global__ void Muld( float*, float*, int, int, float* );

void Mul( const float* A, const float* B, int hA, int wA, int wB, float* C )


     ........... same as in the docs, plus the following.............

   cudaMalloc( (void**)&cuda_x, sizeof( float ) );

    cudaMemcpy( cuda_x, &x, sizeof( float ), cudaMemcpyHostToDevice );


__global__ void Muld( float* A, float* B, int wA, int wB, float* C )


     ........... same as in the docs, but the last line is changed to ................


     C[c + wB * ty + tx] = (*cuda_x) * Csub;


int main( void )


     x = 2.0;

    .......... create the matrices on the host...............

     .......... call Mul( ) ...........

     .......... calculate the product on the host too and compare......


Without the modification everything works perfectly, but when I include the variables x and cuda_x I start getting ‘unspecified launch failure’ from cudaGetLastError.

What am I doing wrong?

While trying to debug this I compiled the above code with emulation turned on, and the strange thing there was that if the declaration of the variable cuda_x is left as it is,

__device__ float * cuda_x;

the emulation version gave a seg fault, while changing the declaration to

float * cuda_x;

it ran fine. Isn’t the point of emulation mode that I don’t have to change declarations and such, because the compiler figures it out? Probably I’m missing some basic thing here too.

Or is it so that any variable that the kernel uses must be passed to it explicitly as an argument?

You shouldn’t use the __device__ qualifier on the variable cuda_x.

cudaMalloc is a host function that allocates memory on the device and writes a pointer to that memory into a host variable. You then pass that pointer to the kernel as a parameter. Because the CUDA runtime passes parameters to __global__ functions by value, the (actual) parameter cuda_x needs to be a host variable even though it contains a device address.

So cuda_x should be a host variable, not a device variable.
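A minimal sketch of that arrangement (kernel body elided as in the docs; d_x and hostX are names I’ve introduced here, and error checking is omitted):

```cuda
#include <cuda_runtime.h>

// The scalar's device address is held in an ordinary host variable
// and handed to the kernel by value, like any other parameter.
__global__ void Muld(float* A, float* B, int wA, int wB, float* C, float* x)
{
    // ... same as in the docs, except the last line becomes ...
    // C[c + wB * ty + tx] = (*x) * Csub;
}

void Mul(const float* A, const float* B, int hA, int wA, int wB,
         float* C, float hostX)
{
    float* d_x = 0;  // host variable that will hold a device address
    cudaMalloc((void**)&d_x, sizeof(float));
    cudaMemcpy(d_x, &hostX, sizeof(float), cudaMemcpyHostToDevice);

    // ... allocate/copy A, B, C and launch as in the docs, e.g. ...
    // Muld<<<grid, block>>>(d_A, d_B, wA, wB, d_C, d_x);

    cudaFree(d_x);
}
```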


Thanks, now it makes more sense.

What is still not clear is whether a kernel can use variables that are not passed to it as an argument. So for instance if there is a float that is used by many kernels I thought I would put it into constant memory and each kernel will be able to get it from there and it won’t be necessary to pass it to every kernel as an extra argument. Is this possible? If so, how would such a value be declared and actually how would it be put into constant memory? I understand that for an array one needs to use cudaMalloc but is this the case for a single value as well?

Just declare it so that it is visible to the __global__ function. This usually means putting it at the top of the .cu file.

__constant__ float deviceA;

Then all following __global__ and __device__ functions can access it by simply using the symbol, just as you would expect from a C constant.

result = 2*x*y + deviceA;

Before starting the kernel, the host has to upload a value to the address the symbol points to. Since you do not know this address on the host side (the kernel is not running yet), you cannot pass it to cudaMemcpy directly. Instead you have to ask CUDA where the symbol will live at run time; cudaMemcpyToSymbol is your friend.

float hostA = 1.0f;
cudaMemcpyToSymbol( deviceA, &hostA, sizeof( float ) );


The manual says that the constant has the lifetime of the application, so it lives as long as a global C variable, i.e. it is valid across all kernel invocations.
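Putting those pieces together, a minimal end-to-end sketch might look like this (the Scale kernel and d_data are illustrative names, not from the original example; error checking omitted):

```cuda
#include <cuda_runtime.h>

__constant__ float deviceA;  // constant-memory symbol, visible to all kernels in this file

__global__ void Scale(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= deviceA;  // read the symbol directly; no extra parameter
}

int main(void)
{
    float hostA = 1.0f;
    // Let the runtime resolve the symbol's device address and copy the value there.
    cudaMemcpyToSymbol(deviceA, &hostA, sizeof(float));

    float* d_data;
    cudaMalloc((void**)&d_data, 256 * sizeof(float));
    Scale<<<1, 256>>>(d_data, 256);
    cudaFree(d_data);
    return 0;
}
```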


Thanks very much, this clarified a lot. :)

So if I want to fill an array of non-constant constants :-), I would do:

__constant__ float lengths[100];

float hostLengths[100];

for ( int i = 0; i < 100; i++ )
{
    FillLengths( hostLengths );
    cudaMemcpyToSymbol( lengths, hostLengths, 100*sizeof(float) );
}



i.e., is it ok to set the same “constants” multiple times?

What is the advantage, the caching mechanism etc.? Are there negatives, positives we should be aware of?

And cudaMemcpyToSymbol says that the “memory areas may not overlap” … isn’t this a pretty easy constraint to meet since the host and device are physically on different chips :-) ?



The upside is that constant access is cached. The downside is that the cache is linear, not 2D as with textures. Constant memory is also limited in size (as is its cache). See the manual for details.

The “not overlap” means that the memory blocks you are copying to the device may not overlap on the device. For example, you cannot upload a constant block, say A[100], and then at a later kernel call update part of it, say at A[10+i]. That works with textures or linear device memory. Constant memory has the lifetime of the application.


You’re right, for now, this is not a constraint for the reason you state, but the next release of CUDA should allow cudaMemcpyToSymbol() to copy between different areas of device memory too.

Yes, you can. “Constant memory has the lifetime of the application” means that its content remains the same across kernel calls. The application can update its content at any point in time in between kernel calls.
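So re-uploading between launches would look something like the sketch below (UseLengths and RunPasses are illustrative names, and FillLengths stands in for whatever host-side routine refills the values):

```cuda
#include <cuda_runtime.h>

#define N 100

__constant__ float lengths[N];

__global__ void UseLengths(float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = 2.0f * lengths[i];
}

// Placeholder for the host-side routine that computes new values each pass.
static void FillLengths(float* a)
{
    for (int i = 0; i < N; i++)
        a[i] = (float)i;
}

void RunPasses(float* d_out, int passes)
{
    float hostLengths[N];
    for (int p = 0; p < passes; p++) {
        FillLengths(hostLengths);
        // Overwrite the whole constant array before the next launch;
        // full re-uploads between kernel calls are allowed.
        cudaMemcpyToSymbol(lengths, hostLengths, N * sizeof(float));
        UseLengths<<<1, N>>>(d_out);
        cudaDeviceSynchronize();
    }
}
```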

Sure. I was referring to partial updates though.