I’m trying to get more familiar with CUDA by writing small test routines, but I can’t figure out where the problem is in this one. I suspect it’s a very basic question, but I can’t quite get a grasp on it.
The test program initializes an array of 1000 random real numbers and then generates a random multiplier. The subroutine I have written should then go through the array and multiply each element by the multiplier. Ideally each thread would handle 2 multiplications, which would mean that 500 multiplications get done in one step. That doesn’t appear to be the case and I’m not sure why. Again, I apologise for the basic nature of this question, but the documentation I have isn’t really helping. The code compiles fine and seems to run without a hitch, and the subroutine is definitely executed, since a print statement placed inside it gets called. So I’m guessing the issue is that I’m misunderstanding threads/blocks and how to use them as indexes for operations.
My code is below:
attributes(global) subroutine globalReferencePass(x, a)
    implicit none
    integer :: i, n
    real :: x(:), a
    n = size(x)
    i = threadIdx%x + blockIdx%x * blockDim%x
    x(i) = x(i) * a
end subroutine globalReferencePass
!initializing everything in main program
x_d1 = x1
a_d = randMulti
call globalReferencePass<<<500,2>>>(x_d1, a_d)
x1 = x_d1
As far as I can tell from inspecting x1 after the call, the result isn’t quite correct.
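One thing that can be checked without a GPU is which array indices the kernel's index formula actually produces for a <<<500,2>>> launch (500 blocks of 2 threads each, so 1000 threads in total). Below is a minimal host-only sketch of that check. It assumes that threadIdx%x and blockIdx%x start at 1, as the CUDA Fortran documentation describes; the program name index_check, the loop variables, and the "offset" variant of the formula are made up for illustration and are not part of my original code.

```fortran
! Host-only emulation of the kernel's index arithmetic (no GPU needed).
! Assumption: threadIdx%x and blockIdx%x are 1-based, as in CUDA Fortran.
program index_check
    implicit none
    integer, parameter :: n = 1000         ! array length from the post
    integer, parameter :: nblocks = 500    ! first launch parameter in <<<500,2>>>
    integer, parameter :: nthreads = 2     ! second launch parameter (threads per block)
    integer :: b, t, i_posted, i_offset
    integer :: lo_posted, hi_posted, lo_offset, hi_offset

    lo_posted = huge(0); hi_posted = -huge(0)
    lo_offset = huge(0); hi_offset = -huge(0)

    do b = 1, nblocks            ! stands in for blockIdx%x
        do t = 1, nthreads       ! stands in for threadIdx%x
            ! Formula exactly as written in the kernel
            i_posted = t + b * nthreads
            ! Hypothetical variant that subtracts 1 from the block index
            i_offset = t + (b - 1) * nthreads

            lo_posted = min(lo_posted, i_posted)
            hi_posted = max(hi_posted, i_posted)
            lo_offset = min(lo_offset, i_offset)
            hi_offset = max(hi_offset, i_offset)
        end do
    end do

    print *, 'posted formula covers indices ', lo_posted, ' to ', hi_posted
    print *, 'offset formula covers indices ', lo_offset, ' to ', hi_offset
    print *, 'array bounds are              ', 1, ' to ', n
end program index_check
```

Under that 1-based assumption, the formula as posted covers indices 3 to 1002, so elements 1 and 2 are never touched and two threads index past the end of the array, while the variant with (blockIdx%x - 1) covers 1 to 1000. Either way, this launch configuration gives each thread one element rather than two.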