PTX arrays as operands not working

The PTX manual (version 2.3) states:

Array elements can be accessed using an explicitly calculated byte address, or by indexing into the array using square-bracket notation. The expression within square brackets is either a constant integer, a register variable, or a simple “register with constant offset” expression, where the offset is a constant expression that is either added or subtracted from a register variable. If more complicated indexing is desired, it must be written as an address calculation prior to use.

ld.global.u32 s, a[0];
ld.global.u32 s, a[N-1];
mov.u32 s, a[1]; // move address of a[1] into s

When I try this, only the pointer-plus-byte-offset form works, i.e. [a+0] but not a[0].

For example, this code fails to load:

.reg .f32 f0;
.global .f32 a[10];
ld.global.f32 f0,a[0];

Whereas this loads fine:

.reg .f32 f0;
.global .f32 a[10];
ld.global.f32 f0,[a+0];

The problem with the byte-offset version is that it really is a byte offset, so one has to take the size of the underlying type into account: the second element is [a+4], whereas a[1] is supposed to work this out for you.

There is an even more severe issue: the text quoted above states that a register variable can be used to index the array, like:

ld.global.f32 f0,a[u0];

where u0 is presumably a .reg .u32 or some other compatible integer register.

However, with the pointer plus byte offset method this is not possible. It is illegal to do something like:

mul.u32 u1,u0,4;
ld.global.f32 f0,[a+u1]; // here a reg variable is not allowed.

This is a severe limitation. One can do a separate address calculation prior to the load statement, but this complicates things.

Ideas what’s going wrong?

Thanks for alerting us to this issue. The manual does not document the addressing syntax correctly, but at this point I do not know the proper syntax for the various supported addressing variants. If I find out, I will post it here.

It seems this question was also posted to StackOverflow. In general, I think it is helpful to mention cross-posts in the form of a cross reference.

The PTX documentation will be corrected in a future CUDA release. PTX requires square brackets around address expressions in ld/st/etc. instructions, as you noticed when you found that [a+0] is accepted, but a[0] is not accepted.

The expression (addressing mode) has limited forms, e.g. immediate, variable name, register, sym/reg+ImmOffset. As already noted above, it is possible to do things like ld.const.u32 %r,[ptr+10*4], since simple constant folding is provided during PTX processing.

To access array element A[i], you need to generate the low-level code that computes A+(typesize*i) to get the byte address, as noted in the current documentation. Keep in mind that PTX is for the most part a thin abstraction layer on top of GPU hardware instructions. Our GPU instruction sets are RISC oriented, and unlike x86 (e.g. SIB encoding) do not provide things like scaling as part of the address mode.
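As a rough illustration of that address calculation, here is a minimal sketch of loading a[i] for a .f32 array. It assumes 64-bit addressing and PTX 2.3 targeting sm_20; the kernel name index_demo, the parameter p_i, and the register names are hypothetical and chosen only for this example.

```
.version 2.3
.target sm_20
.address_size 64              // assumption: 64-bit addresses

.global .f32 a[10];

// Hypothetical kernel: loads a[i], where i is passed as a parameter.
.entry index_demo (.param .u32 p_i)
{
    .reg .u32 %i;
    .reg .u64 %base, %off, %addr;
    .reg .f32 %f0;

    ld.param.u32  %i, [p_i];

    // Constant index: the offset is an immediate, so it can stay in
    // the address expression (constant folding turns 1*4 into 4).
    ld.global.f32 %f0, [a+1*4];        // a[1]

    // Register index: compute the byte address explicitly.
    mov.u64       %base, a;            // address of a
    mul.wide.u32  %off, %i, 4;         // byte offset = i * 4 bytes per .f32
    add.u64       %addr, %base, %off;  // a + 4*i
    ld.global.f32 %f0, [%addr];        // a[i]

    ret;
}
```

The multiply and add could probably be merged into a single mad.wide.u32 %addr, %i, 4, %base; the separate steps are kept here so that each part of A+(typesize*i) is visible.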