I’m trying to get more familiar with CUDA by writing small test routines. I can’t figure out where the problem is in this one. I suspect it’s a very basic question, but I can’t quite get a grasp on it.

The test program initialises an array of 1000 random real numbers and generates a random multiplier. The subroutine I have written should then go through the array and multiply each element by the multiplier. Ideally each thread would handle 2 multiplications, so that 500 multiplications get done in one step. Unfortunately that doesn’t appear to be the case, and I’m not sure why. Again, I apologise for the basic nature of this question, but the documentation I have isn’t really helping. The program compiles and runs without any errors, and the subroutine is definitely executed, since a print statement placed inside it gets called. So I’m guessing the problem is that I misunderstand threads and blocks and how to use them as indices for the operations.

My code is below:

    attributes(global) subroutine globalReferencePass(x, a)
      implicit none
      integer :: i, n
      real :: x(:), a

      n = size(x)
      i = threadIdx%x + blockIdx%x * blockDim%x
      x(i) = x(i) * a
    end subroutine globalReferencePass

    ! initializing everything in main program
    x_d1 = x1
    a_d = randMulti
    call globalReferencePass<<<500,2>>>(x_d1, a_d)
    x1 = x_d1

As far as I can tell from inspecting x1 after the call, the multiplication isn’t being applied correctly.
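
To narrow down the indexing guess, a small host-only sketch (plain Fortran, no GPU needed) can enumerate the indices that the kernel’s expression would produce for the <<<500,2>>> launch. It assumes that CUDA Fortran’s threadIdx%x and blockIdx%x both start at 1, unlike CUDA C, where they start at 0. The module index_check, its two functions and the program name are hypothetical names made up for this check, and shifted_index is the commonly used 1-based form, not something taken from the code above.

```fortran
module index_check
  implicit none
contains

  ! Index as computed by the expression in the posted kernel:
  !   i = threadIdx%x + blockIdx%x * blockDim%x
  pure integer function posted_index(tid, bid, bdim) result(i)
    integer, intent(in) :: tid, bid, bdim
    i = tid + bid * bdim
  end function posted_index

  ! Assumed alternative: shift the 1-based block index down by one
  ! so that the first block starts at element 1.
  pure integer function shifted_index(tid, bid, bdim) result(i)
    integer, intent(in) :: tid, bid, bdim
    i = tid + (bid - 1) * bdim
  end function shifted_index

end module index_check

program enumerate_indices
  use index_check
  implicit none
  ! Launch configuration from the post: 500 blocks of 2 threads each.
  integer, parameter :: nblocks = 500, nthreads = 2
  integer :: bid, tid
  integer :: lo_posted, hi_posted, lo_shifted, hi_shifted

  lo_posted = huge(0)
  hi_posted = 0
  lo_shifted = huge(0)
  hi_shifted = 0

  ! Walk over every (block, thread) pair, using 1-based ids (assumption).
  do bid = 1, nblocks
    do tid = 1, nthreads
      lo_posted  = min(lo_posted,  posted_index(tid, bid, nthreads))
      hi_posted  = max(hi_posted,  posted_index(tid, bid, nthreads))
      lo_shifted = min(lo_shifted, shifted_index(tid, bid, nthreads))
      hi_shifted = max(hi_shifted, shifted_index(tid, bid, nthreads))
    end do
  end do

  print *, 'posted  expression covers ', lo_posted,  ' to ', hi_posted
  print *, 'shifted expression covers ', lo_shifted, ' to ', hi_shifted
end program enumerate_indices
```

Under that assumption the posted expression covers indices 3 to 1002, so elements 1 and 2 are never touched and the last block indexes past the end of the 1000-element array, while the shifted form covers 1 to 1000. A guard such as `if (i <= n)` before the multiplication would also make use of the n that the kernel already computes.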