What you wrote actually compiles to more than a single instruction.
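(For reference, I’m assuming the kernel in question looks roughly like this - the name amulb comes from demangling the parameter names in the PTX below, but the body, the parameter order and the exact tid computation are my reconstruction:)

__global__ void amulb( float* a, float* b, float* c )
{
    int tid = threadIdx.x;      // assuming a 1D, single-block launch for simplicity
    a[tid] = b[tid] * c[tid];   // the one-liner being discussed
}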
First, the hardware will issue a read that places the value of b[tid] in a register (implicitly). Then it does the same for c[tid]. This is because we can’t do arithmetic or logic operations on anything but registers (and perhaps shared memory, but I’d bet not). And finally, it performs a mul on the two new registers. So it basically translates the above into something like this (conceptually):
double a;
double implicitB = load( b[tid] );   // read b[tid] from global memory into a register
double implicitC = load( c[tid] );   // same for c[tid]
a = implicitB * implicitC;           // multiply the two registers
Now, to be fair, something like load( b[tid] ) actually compiles to three PTX instructions:
- load the address of array b into a register (let’s say r1)
- since b[tid] is really the address of b plus the offset given by tid, there’s an addition, and the result of that addition goes to r2
- implicitB = load( r2 );

(and that’s assuming we already have tid in some register)
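In C-ish pseudocode (load() and the register names r1, r2 are just my illustration, not real CUDA), those three steps would look something like:

float* r1 = b;           // 1. load the parameter - the base address of b - into a register
float* r2 = r1 + tid;    // 2. add the tid offset, giving the address of b[tid]
float implicitB = *r2;   // 3. read that address from global memory into a register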
Here’s the actual PTX, to see how it looks through the looking glass:
ld.param.u32 %r3, [__cudaparm__Z5amulbPfS_S__b];
add.u32 %r4, %r3, %r2;
ld.global.f32 %f1, [%r4+0];
ld.param.u32 %r5, [__cudaparm__Z5amulbPfS_S__c];
add.u32 %r6, %r5, %r2;
ld.global.f32 %f2, [%r6+0];
mul.f32 %f3, %f1, %f2;
The first three instructions do the load in the way I just described, broken down into:
- load the address passed in the function parameter into a register (here, into %r3)
- add our tid to this (tid, scaled to a byte offset, sits in %r2), giving %r4 - the address of the specific element of the b array that we want to read (the syntax for “x = y + z” is “add.u32 x, y, z”)
- read the value pointed to by %r4 from global memory into a local floating point register, %f1. The [%r4+0] notation means we don’t apply any fixed offset to the addressing.
Then we do the same for c[tid], and finally we do a mul.f32 on those two registers we just read from global memory, writing the output to %f3. %f3 is our ‘a’ variable (note, I actually used single precision floats here, not doubles).
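The quoted PTX stops at the mul; actually writing ‘a’ back out would presumably be the mirror image of the loads - something like this (my guess at the continuation, not real compiler output):

ld.param.u32 %r7, [__cudaparm__Z5amulbPfS_S__a]; // load the address of array a
add.u32 %r8, %r7, %r2;                           // same tid byte offset as before
st.global.f32 [%r8+0], %f3;                      // store the product to a[tid]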
Now, this is how it would look in hardware as well, if it weren’t massively parallel. Notice a funny thing in the code above - there’s nothing telling you that there are multiple threads performing this.
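The only per-thread magic hides in %r2: somewhere above the quoted lines it would have been derived from the %tid special register (and scaled by sizeof(float) to turn an element index into a byte offset), something like this - again my reconstruction, not compiler output:

cvt.u32.u16 %r1, %tid.x; // read this thread's x index into a 32-bit register
mul.lo.u32 %r2, %r1, 4;  // scale by sizeof(float) to get a byte offset

Every thread runs the exact same instructions; only the value of %tid differs.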
It gets slightly more complicated by the fact that we run threads in warps. Also, PTX is not the actual assembly - it later gets compiled to machine code, and some wild things may happen during that process. Perhaps decuda demystifies this, but I’ve never toyed with it. PTX is as low as I’m willing to go at this point :)
BTW: If I totally missed the point and gave an unnecessary lecture on assembly coding while you wanted the sub-assembly hardware stuff, then sorry, my bad :)