Blocks/Threads cause different outcomes

Cattaneo · December 29, 2020, 9:12pm

Hi, I’m concerned I’m misunderstanding how blocks and threads work and I can’t figure out what to do. I have some code which seems to crash when run on multiple blocks of 1 thread each (inefficient I know, but this is just a first version) but runs when it’s 1 block with multiple threads. I know there are warp issues so there are performance losses when using one which is why I’m trying to avoid using just threads. Is there a way to figure out why this crash is happening? The crash in question occurs when I try and do integration of a function but it only happens when I try and do it with multiple blocks. Is there some limit to the types of math a block can do at the same time as another block? I will include the code in question in a reply as it’s a little long.

Cattaneo · December 29, 2020, 9:19pm

Here’s the code in question. Running it in one block with many threads works fine, but running it on multiple blocks with one thread each causes it to crash. Is there an obvious reason this may be crashing that I’m just missing? Something here it cannot easily do in parallel? Or if it’s not immediately obvious is there a tool I could use to debug this easily? Compiling it for gdb causes it to behave differently which is also complicating affairs.

I hope you can forgive the longwinded nature of this question

module GPUFunctions
  use cudafor

  PARAMETER (N=50,NN=50)
  real*8,device :: xvec(NN),yvec(NN)
  real*8,device :: xvec0(NN),yvec0(NN)
  real*8,device :: xvec1(NN),yvec1(NN)

contains
    attributes(device) SUBROUTINE integrate(func,a,b,ss,v) 
              INTEGER KMAX,KMAXP,J,JM
              real*8 :: a,b,ss,tol
              integer :: func,v
              PARAMETER (KMAX=14, KMAXP=KMAX+1, J=5, JM=J-1, tol=1.d-1) !#CONTROL Should work even for EPS=1e-1
              INTEGER k
              real*8 :: dss,h(KMAXP),s(KMAXP)
              

              h(1)=1.d0
              do k=1,KMAX 
                 call halfway(func,a,b,s(k),k,v)
                 if (k.ge.J) then
                    call pol_interpol(h(k-JM),s(k-JM),J,0.d0,ss,dss)
                    if (dabs(dss).le.tol*dabs(ss)) return
                 endif
                 s(k+1)=s(k)
                 h(k+1)=h(k)/9.d0
              end do
              !write(*,*)  'ERROR in subroutine Integrate'
              print *, "ERROR in subroutine integrate"
              print *, "Error on thread", threadIdx%x
              print *, "Error on block", blockIdx%x
              !Keep running on other ones
              stop
            END SUBROUTINE integrate

            attributes(device) SUBROUTINE pol_interpol(xa,ya,n,x,y,dy)
              INTEGER n,NMAX
              real*8 :: dy,x,y,xa(n),ya(n)
              PARAMETER (NMAX=10)
              INTEGER i,m,ns
              real*8 :: den,dif,dift,ho,hp,w,c(NMAX),d(NMAX)
              ns=1
              dif=dabs(x-xa(1))
              do i=1,n
                 dift=abs(x-xa(i))
                 if (dift.lt.dif) then
                    ns=i
                    dif=dift
                 endif
                 c(i)=ya(i)
                 d(i)=ya(i)
              end do
              y=ya(ns)
              ns=ns-1
              do m=1,n-1
                 do i=1,n-m
                    ho=xa(i)-x
                    hp=xa(i+m)-x
                    w=c(i+1)-d(i)
                    den=ho-hp
                    if(den.eq.0.d0)then
                       !write(*,*)  'ERROR in subroutine interpol'
                       print *, "ERROR in subroutine interpol"
                       stop
                    end if
                    den=w/den
                    d(i)=hp*den
                    c(i)=ho*den
                 end do
                 if (2*ns.lt.n-m)then
                    dy=c(ns+1)
                 else
                    dy=d(ns)
                    ns=ns-1
                 endif
                 y=y+dy
              end do
              return
            END SUBROUTINE pol_interpol

          attributes(device) SUBROUTINE halfway(func,a,b,s,n,v)
          INTEGER n
          real*8 :: a,b,s
          integer :: func,v
          INTEGER :: it,j
          real*8 :: ddel,del,sum,tnm,x
          
          if (v == 1) then
            ! print *, "block", blockIdx%x, "has entered halfway"
            ! print *, "a:", a
            ! print *, "b:", b
            v = 1
          endif
          
          Select Case (func)
          
            case (1)
              if (n.eq.1) then
                s=(b-a)*funcexp(0.5d0*(a+b))
              else
                it=3**(n-2)
                tnm=it
                del=(b-a)/(3.d0*tnm)
                ddel=del+del
                x=a+0.5d0*del
                sum=0.d0
                do j=1,it
                  sum=sum+funcexp(x)
                  x=x+ddel
                  sum=sum+funcexp(x)
                  x=x+del
               end do
                s=(s+(b-a)*sum/tnm)/3.d0
              endif
            case (2)
              if (n.eq.1) then
                s=(b-a)*funcp1(0.5d0*(a+b))
              else
                it=3**(n-2)
                tnm=it
                del=(b-a)/(3.d0*tnm)
                ddel=del+del
                x=a+0.5d0*del
                sum=0.d0
                do j=1,it
                  sum=sum+funcp1(x)
                  x=x+ddel
                  sum=sum+funcp1(x)
                  x=x+del
               end do
                s=(s+(b-a)*sum/tnm)/3.d0
              endif
            case (3)
              if (n.eq.1) then
                s=(b-a)*funcp0(0.5d0*(a+b))
              else
                it=3**(n-2)
                tnm=it
                del=(b-a)/(3.d0*tnm)
                ddel=del+del
                x=a+0.5d0*del
                sum=0.d0
                do j=1,it
                  sum=sum+funcp0(x)
                  x=x+ddel
                  sum=sum+funcp0(x)
                  x=x+del
               end do
                s=(s+(b-a)*sum/tnm)/3.d0
              endif
            case (4)
              if (n.eq.1) then
                s=(b-a)*funcxi(0.5d0*(a+b))
              else
                it=3**(n-2)
                tnm=it
                del=(b-a)/(3.d0*tnm)
                ddel=del+del
                x=a+0.5d0*del
                sum=0.d0
                do j=1,it
                  sum=sum+funcxi(x)
                  x=x+ddel
                  sum=sum+funcxi(x)
                  x=x+del
               end do
                s=(s+(b-a)*sum/tnm)/3.d0
              endif
            case default
                return
          end select
          return
          END SUBROUTINE halfway
end module GPUFunctions

Cattaneo · January 1, 2021, 7:27pm

An update for anyone who was wondering. The problem also occurs if I go over 32 threads, which presumably if I’m understanding correctly means that it’s a warp issue. My issue still stands, why does having multiple warps cause it to fail. Can multiple warps not do the same math? Or is there an issue with the way warps access memory and having multiple ones trying to integrate is a no go?

Robert_Crovella · January 1, 2021, 7:51pm

A few suggestions, in no particular order:

using the word “crash” by itself is not sufficiently communicative. It’s better if you describe the outcome completely, or even better, provide the exact output observed, when the “crash” occurs.
even though you didn’t get anyone diving in to do a full debug on your previous request, I still strongly encourage people who are asking for debug assistance to provide a complete code. What you have shown here has no host code and not even the kernel framework, so its really difficult to guess what might be the issue. You could easily be making an error in something you haven’t shown. In addition, its best if you reduce your code as much as possible, while still preserving the error to be debugged. This is such a common request/expectation for these types of activities that there are even webpages describing the best case behavior for someone asking for help like this.
In your previous question, I demonstrated the use of cuda-memcheck for you. In the “crashing” case here, have you run your code under cuda-memcheck or the newer equivalent, compute-sanitizer ? These are powerful tools. If one of them is reporting an issue, you should not have to stumble around in the dark. It gives you something concrete to discuss and chase. Furthermore I strongly encourage that these tools be used before asking for help. If no errors are reported, that is useful info for those trying to help you. Even if you don’t understand the output, providing it along with your question is useful info for those trying to help you.

Speaking for myself, I’m more likely to spend time with a request like this if you provide a complete code. And I am more likely to try some debugging on 100 lines of code vs. 400.

Do as you wish, of course. Just offering suggestions.

Cattaneo · January 1, 2021, 10:51pm

I shall do that as soon as possible. The holidays have prevented me from putting in as much work as I should, so I made the post mainly to keep track of where I am if that makes sense.

Also, forgive me if it came across as a request for debugging, that wasn’t the intention. It was more a request about whether threads/blocks have limitation that may be immediately obvious that I’ve missed. I apologize for confusion or frustration I may have caused.