Scopes, local and global variables

Hi all,

I guess I don’t understand some basic issue here about what variable is allowed to live in what scope. For the sake of an example I took the matrix multiplication example from the docs (which works fine) and tried to modify it so that it computes not the matrix product but the matrix product multiplied by a scalar.

The code is left as in the example, only a new variable is introduced, x, which will multiply the product. x lives and is set on the host while cuda_x lives on the device and is set to be equal to x.

This is what I thought would be a good approach:

float x;

__device__ float * cuda_x;

__global__ void Muld( float*, float*, int, int, float* );

void Mul( const float* A, const float* B, int hA, int wA, int wB, float* C )


     ........... same as in the docs, plus the following.............

   cudaMalloc( (void**)&cuda_x, sizeof( float ) );

    cudaMemcpy( cuda_x, &x, sizeof( float ), cudaMemcpyHostToDevice );


__global__ void Muld( float* A, float* B, int wA, int wB, float* C )


     ........... same as in the docs, but the last line is changed to ................


     C[c + wB * ty + tx] = (*cuda_x) * Csub;


int main( void )


     x = 2.0;

    .......... create the matrices on the host...............

     .......... call Mul( ) ...........

     .......... calculate the product on the host too and compare......


Without the modification everything works perfectly, but when I include the variables x and cuda_x I start getting ‘unspecified launch failure’ from cudaGetLastError.

What am I doing wrong?

While trying to debug this I compiled the above code with emulation turned on, and the strange thing there was that if the declaration of the variable cuda_x is left as it is,

__device__ float * cuda_x;

the emulation version gave a seg fault, while changing the declaration to

float * cuda_x;

it ran fine. Isn’t the point of emulation mode that I don’t have to change declarations and such, because the compiler figures it out? Probably I’m missing some basic thing here too.

Or is it so that any variable that the kernel uses must be passed to it explicitly as an argument?

You shouldn’t use the __device__ qualifier on the variable cuda_x.

cudaMalloc is a host function that allocates memory on the device and writes a pointer to that memory into a host variable. You then pass that pointer to the kernel as a parameter. Because the CUDA runtime passes parameters to __global__ functions by value, the (actual) parameter cuda_x needs to be a host variable even though it contains a device address.

So cuda_x should be a host variable, not a device variable.
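A minimal sketch of that arrangement (kernel body elided as in the docs; d_x and hostX are names I’ve introduced here, and error checking is omitted):

```cuda
#include <cuda_runtime.h>

// The scalar's device address is held in an ordinary host variable
// and handed to the kernel by value, like any other parameter.
__global__ void Muld(float* A, float* B, int wA, int wB, float* C, float* x)
{
    // ... same as in the docs, except the last line becomes ...
    // C[c + wB * ty + tx] = (*x) * Csub;
}

void Mul(const float* A, const float* B, int hA, int wA, int wB,
         float* C, float hostX)
{
    float* d_x = 0;  // host variable that will hold a device address
    cudaMalloc((void**)&d_x, sizeof(float));
    cudaMemcpy(d_x, &hostX, sizeof(float), cudaMemcpyHostToDevice);

    // ... allocate/copy A, B, C and launch as in the docs, e.g. ...
    // Muld<<<grid, block>>>(d_A, d_B, wA, wB, d_C, d_x);

    cudaFree(d_x);
}
```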


Thanks, now it makes more sense.

What is still not clear is whether a kernel can use variables that are not passed to it as an argument. So for instance if there is a float that is used by many kernels I thought I would put it into constant memory and each kernel will be able to get it from there and it won’t be necessary to pass it to every kernel as an extra argument. Is this possible? If so, how would such a value be declared and actually how would it be put into constant memory? I understand that for an array one needs to use cudaMalloc but is this the case for a single value as well?

Just declare it so that it is visible to the __global__ function. This usually means putting it at the top of the .cu file.

__constant__ float deviceA;

Then all following __global__ and __device__ functions can access it by simply using the symbol, just as you would expect from a C constant.

result = 2*x*y + deviceA;

Before starting the kernel, the host has to upload a value to the address the symbol points to. Since you do not know this address on the host side (the kernel is not running yet), you cannot pass it to cudaMemcpy directly. Instead you have to ask CUDA where the symbol will live at run time; cudaMemcpyToSymbol is your friend.

float hostA = 1.0f;
cudaMemcpyToSymbol( deviceA, &hostA, sizeof( float ) );


The manual says that the constant has the lifetime of the application, so it lives as long as a global C variable, i.e. it is valid across all kernel invocations.
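Putting those pieces together, a minimal end-to-end sketch might look like this (the Scale kernel and d_data are illustrative names, not from the original example; error checking omitted):

```cuda
#include <cuda_runtime.h>

__constant__ float deviceA;  // constant-memory symbol, visible to all kernels in this file

__global__ void Scale(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= deviceA;  // read the symbol directly; no extra parameter
}

int main(void)
{
    float hostA = 1.0f;
    // Let the runtime resolve the symbol's device address and copy the value there.
    cudaMemcpyToSymbol(deviceA, &hostA, sizeof(float));

    float* d_data;
    cudaMalloc((void**)&d_data, 256 * sizeof(float));
    Scale<<<1, 256>>>(d_data, 256);
    cudaFree(d_data);
    return 0;
}
```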


Thanks very much, this clarified a lot. :)

So if I want to fill an array of non-constant constants :-), I would do:

__constant__ float lengths[100];

float hostLengths[100];

for ( int i = 0; i < 100; i++ )
{
    FillLengths( hostLengths );
    cudaMemcpyToSymbol( lengths, hostLengths, 100*sizeof(float) );
}



i.e., is it ok to set the same “constants” multiple times?

What is the advantage, the caching mechanism etc.? Are there negatives, positives we should be aware of?

And cudaMemcpyToSymbol says that the “memory areas may not overlap” … isn’t this a pretty easy constraint to meet since the host and device are physically on different chips :-) ?



The upside is that constant access is cached. The downside is that the cache is linear, not 2D as with textures. Constant memory is also limited in size (as is its cache). See the manual for details.

The “not overlap” means that the memory blocks you are copying to the device may not overlap on the device. For example, you cannot upload a constant block, say A[100], and then at a later kernel call update part of it, say at A[10+i]. That works with textures or linear device memory. Constant memory has the lifetime of the application.


You’re right, for now, this is not a constraint for the reason you state, but the next release of CUDA should allow cudaMemcpyToSymbol() to copy between different areas of device memory too.

Yes, you can. “Constant memory has the lifetime of the application” means that its content remains the same across kernel calls. The application can update its content at any point in time in between kernel calls.
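So re-uploading between launches would look something like the sketch below (UseLengths and RunPasses are illustrative names, and FillLengths stands in for whatever host-side routine refills the values):

```cuda
#include <cuda_runtime.h>

#define N 100

__constant__ float lengths[N];

__global__ void UseLengths(float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        out[i] = 2.0f * lengths[i];
}

// Placeholder for the host-side routine that computes new values each pass.
static void FillLengths(float* a)
{
    for (int i = 0; i < N; i++)
        a[i] = (float)i;
}

void RunPasses(float* d_out, int passes)
{
    float hostLengths[N];
    for (int p = 0; p < passes; p++) {
        FillLengths(hostLengths);
        // Overwrite the whole constant array before the next launch;
        // full re-uploads between kernel calls are allowed.
        cudaMemcpyToSymbol(lengths, hostLengths, N * sizeof(float));
        UseLengths<<<1, N>>>(d_out);
        cudaDeviceSynchronize();
    }
}
```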

Sure. I was referring to partial updates though.