GPU parallel version of code slower?

I’ve got some code that I believe I’ve written correctly, but it’s running slower than the CPU version. I know that some tasks are ill-suited to GPUs, but I can’t figure out whether this is one of those cases or whether I’ve made a mistake somewhere or am missing something. I’ve included the code below. I apologise if the answer is obvious; I just can’t seem to find where the problem lies.

This is the CPU version I’ve got:

subroutine branchingCPUtest(x, b, c, d)
    implicit none
    integer :: i, n
    real*8 :: x(:), b(:), c(:), d(:)
    n = size(x)
    ! Call the inner work loop once per array element
    do i = 1, n
        call branchedCPUloop(x(i), b(i), c(i), d(i))
    enddo
end subroutine branchingCPUtest

subroutine branchedCPUloop(x, b, c, d)
    implicit none
    integer :: i
    real*8 :: n, x, b, c, d
    do i = 1, 50
        n = x * b * c * d * i
        !print *, n
    enddo
end subroutine branchedCPUloop

This is the GPU version:

attributes(global) subroutine branchingGlobalCall(x, b, c, d)
    implicit none
    integer :: i, n, var
    real*8 :: x(:), b(:), c(:), d(:)
    n = size(x)
    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    ! Every thread in a block works on the same array element
    var = blockIdx%x
    !print *, var
    call branchingDeviceCalls(x(var), b(var), c(var), d(var))
end subroutine branchingGlobalCall

attributes(device) subroutine branchingDeviceCalls(x, b, c, d)
    implicit none
    integer :: i, lowerThreadBound, upperThreadBound, blockNum, blockSize
    real*8 :: x, n, b, c, d
    blockNum = blockIdx%x
    blockSize = blockDim%x
    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    ! 64 because we launched with 64 threads per block, but we need a better way
    ! to handle this. Probably gridDim and blockDim (see the sketch after this subroutine).
    lowerThreadBound = 1 - blockSize + (blockSize * blockNum)
    upperThreadBound = lowerThreadBound + 64 - 1
    !print *, "Running on block", blockNum
    !print *, "Using variable", x
    !print *, "block", blockNum, "has bounds", lowerThreadBound, upperThreadBound
    if ((i >= lowerThreadBound) .and. (i <= upperThreadBound) .and. (threadIdx%x <= 50)) then
        !print *, threadIdx%x
        n = x * b * c * d * threadIdx%x
    endif

end subroutine branchingDeviceCalls
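
As the comment above hints, the hard-coded 64 can be derived from blockDim%x so the bounds follow the launch configuration automatically. A minimal sketch of that variant (the subroutine name boundsFromBlockDim is mine, not part of the original code):

attributes(device) subroutine boundsFromBlockDim(x, b, c, d)
    implicit none
    integer :: i, lowerThreadBound, upperThreadBound
    real*8 :: x, n, b, c, d
    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    ! Same bounds as above, but computed entirely from the built-in variables
    lowerThreadBound = (blockIdx%x - 1) * blockDim%x + 1
    upperThreadBound = lowerThreadBound + blockDim%x - 1
    if ((i >= lowerThreadBound) .and. (i <= upperThreadBound) .and. (threadIdx%x <= 50)) then
        n = x * b * c * d * threadIdx%x
    endif
end subroutine boundsFromBlockDim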

And here is the initialization and timing of both. The arrays are filled with random real numbers for this example; the 50-iteration loop and the 20-element arrays are arbitrary choices.

real*8 :: deviceArray(20), deviceArrayB(20), deviceArrayC(20), deviceArrayD(20)
real*8, device :: deviceArray_d(20), deviceArray_dB(20), deviceArray_dC(20), &
                  deviceArray_dD(20)
integer :: istat

call CPU_TIME(start)
deviceArray_d = deviceArray
deviceArray_dB = deviceArrayB
deviceArray_dC = deviceArrayC
deviceArray_dD = deviceArrayD
! <<<grid, block>>>: 20 blocks of 64 threads
call branchingGlobalCall<<<20,64>>>(deviceArray_d, deviceArray_dB, deviceArray_dC, deviceArray_dD)
! Kernel launches are asynchronous, so wait for the kernel to finish before
! stopping the timer (cudaDeviceSynchronize comes from the cudafor module)
istat = cudaDeviceSynchronize()
call CPU_TIME(finish)
print *, "GPU test took ", finish-start

call CPU_TIME(start)
call branchingCPUtest(deviceArray,deviceArrayB,deviceArrayC,deviceArrayD)
call CPU_TIME(finish)
print * , "CPU test took ", finish-start

Am I doing something wrong that causes the GPU version to be consistently slower?

Bumping with some new information. This version does run faster for much larger array sizes or for more loop iterations inside the device call, but I’m still not sure whether the speed issue is that the problem is simply too small to benefit from the GPU, or whether I did something wrong in the code itself.
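
For reference, this is roughly how I’d check the problem-size question by timing the kernel on its own with CUDA events at a much larger size. This is only a sketch: it assumes use cudafor in the enclosing scope, and the names nBig, hostA, devA, startEv, etc. and the 1,000,000-element size are placeholders of mine, not part of the original code.

integer, parameter :: nBig = 1000000
real*8 :: hostA(nBig), hostB(nBig), hostC(nBig), hostD(nBig)
real*8, device :: devA(nBig), devB(nBig), devC(nBig), devD(nBig)
type(cudaEvent) :: startEv, stopEv
real :: msKernel
integer :: istat

call random_number(hostA); call random_number(hostB)
call random_number(hostC); call random_number(hostD)

istat = cudaEventCreate(startEv)
istat = cudaEventCreate(stopEv)

! Host-to-device copies are deliberately left outside the timed region
devA = hostA; devB = hostB; devC = hostC; devD = hostD

istat = cudaEventRecord(startEv, 0)
! One block per array element, as in the 20-element case
call branchingGlobalCall<<<nBig,64>>>(devA, devB, devC, devD)
istat = cudaEventRecord(stopEv, 0)
istat = cudaEventSynchronize(stopEv)
istat = cudaEventElapsedTime(msKernel, startEv, stopEv)
print *, "kernel time (ms): ", msKernel

Timing the kernel separately from the transfers like this would show whether the copies and launch overhead, rather than the kernel itself, dominate the small 20-element case.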