I’ve got some code that I believe I’ve written correctly, but it’s running slower than the CPU version. I know that some tasks are ill suited for GPUs, but I can’t figure out if this is one of those or if I’ve made a mistake somewhere or am missing something. I’ve included the code below. I apologise if this answer is obvious, I just can’t seem to find where the problem lies.
This is the CPU version I’ve got:
subroutine branchingCPUtest(x,b,c,d)
implicit none
integer :: i,n
real*8 :: x(:), b(:), c(:), d(:)
n = size(x)
do i = 1, n
call branchedCPUloop(x(i),b(i),c(i),d(i))
enddo
end subroutine branchingCPUtestsubroutine branchedCPUloop(x,b,c,d)
implicit none
integer :: i
real*8 :: n, x,b,c,d
do i = 1, 50
n = x * b * c *d * i
!print *, n
enddo
end subroutine branchedCPUloop
This is the GPU version
attributes (global) subroutine branchingGlobalCall(x,b,c,d)
implicit none
integer :: i,n,var
real*8 :: x(:),b(:),c(:),d(:)
n = size(x)
i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
var = blockIdx%x
!print *, var
call branchingDeviceCalls(x(var),b(var),c(var),d(var))
end subroutine branchingGlobalCallattributes(device) subroutine branchingDeviceCalls(x,b,c,d)
implicit none
integer :: i,lowerThreadBound,upperThreadBound,blockNum, blockSize
real*8 :: x,n,b,c,d
blockNum = blockIdx%x
blockSize = blockDim%x
i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
!32 because we chose 32 for thread size but we need a better way to handle this. Probably GridDim and BlockDim
lowerThreadBound = 1-blockSize +(blockSize *blockNum)
upperThreadBound = lowerThreadBound + 64 -1
!print *, “Running on block”, blockNum
!print *, “Using variable”, x
!print *, “block”, blockNum, “has bounds”, lowerThreadBound,UpperThreadBound
if ((i >= lowerThreadBound).and.(i <= upperThreadBound).and.(threadIdx%x <= 50)) then
!print *, threadIdx%x
n = x * b * c *d * threadIdx%x
endifend subroutine branchingDeviceCalls
And here is the initialization of both. The arrays are filled with random real numbers for this example. The loop size is 50 for this example but this is arbitrary and so is the size of the array.
real8 :: deviceArray(20), deviceArrayB(20), deviceArrayC(20), deviceArrayD(20)
real8, device :: deviceArray_d(20), deviceArray_dB(20), deviceArray_dC(20),
deviceArray_dD(20)call CPU_TIME(start)
deviceArray_d = deviceArray
deviceArray_dB = deviceArrayB
deviceArray_dC = deviceArrayC
deviceArray_dD = deviceArrayD
! GridDim,BlockDim
call branchingGlobalCall<<<20,64>>>(deviceArray_d,deviceArray_dB,deviceArray_dC,deviceArray_dD)
call CPU_TIME(finish)
print * , "GPU test took ", finish-startcall CPU_TIME(start)
call branchingCPUtest(deviceArray,deviceArrayB,deviceArrayC,deviceArrayD)
call CPU_TIME(finish)
print * , "CPU test took ", finish-start
Am I doing something wrong that causes the GPU version to be consistently slower?