Define variable as on device within device code?

This is kinda a silly question, but I can’t quite figure it out just reading documentation. If I have a subroutine

attributes(device) subroutine test(x)
  ! variable declarations
  a = x * 2
  b = a * 2
end subroutine test

is there a difference between declaring the variables within it as
real, device :: a,b
real, device :: x

and

real :: a,b
real :: x

I’ve tested it a little and I can’t seem to find much of a difference, but I’d like to make sure before I do anything stupid.

Technically, by adding the “device” attribute you are saying that these variables should be placed in the device’s global memory. However, since they are local scalars, they must live in thread-local storage, so the “device” attribute is essentially ignored.
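As a minimal sketch of that point, here is the routine from the question written with plain declarations. The module name scalars_m, the extra output argument y, and the small kernel run are assumptions added only so the result can be observed from the host; they are not in the original post.

```fortran
module scalars_m
  implicit none
contains
  ! Device routine from the question, with plain declarations.
  attributes(device) subroutine test(x, y)
    real, intent(in)  :: x   ! scalar dummy argument, no device attribute
    real, intent(out) :: y   ! hypothetical output, added for illustration
    real :: a, b             ! thread-local scalars
    a = x * 2
    b = a * 2
    y = b
  end subroutine test

  ! Hypothetical kernel so the device routine can be exercised from the host.
  attributes(global) subroutine run(xs, ys, n)
    integer, value :: n            ! scalar passed by value
    real, device :: xs(n), ys(n)   ! arrays in device global memory
    integer :: i                   ! thread-local index
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) call test(xs(i), ys(i))
  end subroutine run
end module scalars_m
```

From the host this would be launched with something like call run<<<1, 4>>>(xd, yd, 4), where xd and yd are device arrays.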

Here’s the relevant section in the CUDA Fortran Programming Guide:

Variables declared in a device program unit may have one of three new attributes: they may be declared to be in device global memory, in constant memory space, in the thread block shared memory, or without any additional attribute they will be allocated in thread local memory. For performance and useability reasons, the value attribute can also be used on scalar dummy arguments so they are passed by value, rather than the Fortran default to pass arguments by reference.

3.2.1. Device data

A variable or array with the device attribute is defined to reside in the device global memory. The device attribute can be specified with the attributes statement, or as an attribute on the type declaration statement. The following example declares two arrays, a and b, to be device arrays of size 100.
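A sketch of the two declaration forms that paragraph describes, using the array names a and b from the quoted text. The module wrapper device_data_m is an assumption added to make the fragment self-contained.

```fortran
module device_data_m
  implicit none
  ! Form 1: attributes statement applied to an already typed array.
  real :: a(100)
  attributes(device) :: a
  ! Form 2: device attribute on the type declaration statement.
  real, device :: b(100)
end module device_data_m
```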

-Mat

Would one need to do it for, say, arrays that are passed up and down between multiple subroutines and functions?

I’m not 100% sure on that, but believe that when calling a device routine from another, the “device” attribute on an array would be implied.

Are you encountering an example that’s not working as expected if “device” is not used?

Not that I can think of, just thought it would be wise to ask just in case. Never hurts to figure out what’s going on at a lower level methinks.

That said I do have a semi-related question about memory locations, should I ask that in a new thread or just put it here?

It’s only really important to define the memory attribute in the “global” routines: use “device” for passed-in arguments so that the data type checking is correct, and “shared” when you want to use CUDA shared memory. At the “device” level, the program is passing around device pointers, and which memory a pointer refers to (global, shared, or even the stack) gets resolved by the hardware.
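Here is a small sketch of a “global” routine that uses both attributes. The names reverse_m, reverse_chunks and nthreads are made up for illustration, and the code assumes that n is a multiple of nthreads and that the kernel is launched with nthreads threads per block.

```fortran
module reverse_m
  implicit none
  integer, parameter :: nthreads = 64   ! hypothetical block size
contains
  ! Kernel: reverses each block-sized chunk of d through shared memory.
  ! Assumes n is a multiple of nthreads and blockDim%x == nthreads.
  attributes(global) subroutine reverse_chunks(d, n)
    integer, value :: n          ! scalar passed by value
    real, device :: d(n)         ! array in device global memory
    real, shared :: s(nthreads)  ! one copy per thread block
    integer :: t, base           ! thread-local scalars
    t = threadIdx%x
    base = (blockIdx%x - 1) * nthreads
    s(t) = d(base + t)           ! stage this block's chunk in shared memory
    call syncthreads()           ! wait until the whole chunk is staged
    d(base + t) = s(nthreads - t + 1)
  end subroutine reverse_chunks
end module reverse_m
```

A launch for a 128-element device array d would look like call reverse_chunks<<<2, nthreads>>>(d, 128).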

That said I do have a semi-related question about memory locations, should I ask that in a new thread or just put it here?

Either way.

When I have multiple threads doing the same calculation and I print loc(variable), they all output the same location. Is that supposed to be the case? Is the “same” variable across multiple threads stored in the same location in memory? Or is loc the wrong thing to use to find out where it’s located? I intended to use loc to check that the correct thing was being passed around, but the fact that every thread reports the same address is throwing me a bit.

For global arrays, this would make sense, though I’ve not tried using LOC on a local variable before. I just tried it and see the same thing.

My best guess is that since local variables are stored in registers, the addresses may look the same but resolve to an individual thread’s register. If you need a more definitive answer, I’ll need to do some research to confirm.

No that’s fine, that makes sense. Is there an easy way then to check that the proper chunk of memory is being moved around? Or that it’s not crossing variables? I just wanted to make sure I wrote it all up correctly so being able to check that things are different or the same in location is helpful, but if there’s no easy way that’s fine.

It’s not something that I’ve ever worried about, at least not at runtime, so I haven’t really thought about this. Local variables are private, so there’s no possibility of crossing. Gang private variables are shared amongst the vectors in the gang, so there’s some possibility, but you should be able to inspect the code rather than checking addresses to ensure that the proper index is being accessed by each vector. Global variables are shared, but again, you can look at how indices are being accessed to ensure that there’s no race condition. If there is a collision, you’d use atomics to ensure the reads and writes are seen by all other threads.
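As a rough sketch of the atomics case, the kernel below has every thread add its element into a single global total. The names count_m, sum_all and total are hypothetical. Integers are used so the result does not depend on the order in which threads arrive.

```fortran
module count_m
  use cudafor
  implicit none
contains
  ! Kernel: every thread adds its element into one global accumulator.
  attributes(global) subroutine sum_all(x, n, total)
    integer, value :: n
    integer, device :: x(n)
    integer, device :: total   ! single accumulator in global memory
    integer :: old             ! thread-local; receives the previous value
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    ! A plain "total = total + x(i)" would race between threads.
    if (i <= n) old = atomicadd(total, x(i))
  end subroutine sum_all
end module count_m
```

The host has to zero the accumulator before the launch, for example td = 0 followed by call sum_all<<<1, 128>>>(xd, 100, td), where td is an integer, device scalar.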

Okay, good to know. I appreciate your time.