 Hi, I’m concerned I’m misunderstanding how blocks and threads work and I can’t figure out what to do. I have some code which seems to crash when run on multiple blocks of 1 thread each (inefficient I know, but this is just a first version) but runs when it’s 1 block with multiple threads. I know there are warp issues so there are performance losses when using one which is why I’m trying to avoid using just threads. Is there a way to figure out why this crash is happening? The crash in question occurs when I try and do integration of a function but it only happens when I try and do it with multiple blocks. Is there some limit to the types of math a block can do at the same time as another block? I will include the code in question in a reply as it’s a little long.

Here’s the code in question. Running it in one block with many threads works fine, but running it on multiple blocks with one thread each causes it to crash. Is there an obvious reason this may be crashing that I’m just missing? Something here it cannot easily do in parallel? Or if it’s not immediately obvious is there a tool I could use to debug this easily? Compiling it for gdb causes it to behave differently which is also complicating affairs.

I hope you can forgive the longwinded nature of this question

``````module GPUFunctions
use cudafor

PARAMETER (N=50,NN=50)
real*8,device :: xvec(NN),yvec(NN)
real*8,device :: xvec0(NN),yvec0(NN)
real*8,device :: xvec1(NN),yvec1(NN)

contains
attributes(device) SUBROUTINE integrate(func,a,b,ss,v)
INTEGER KMAX,KMAXP,J,JM
real*8 :: a,b,ss,tol
integer :: func,v
PARAMETER (KMAX=14, KMAXP=KMAX+1, J=5, JM=J-1, tol=1.d-1) !#CONTROL Should work even for EPS=1e-1
INTEGER k
real*8 :: dss,h(KMAXP),s(KMAXP)

h(1)=1.d0
do k=1,KMAX
call halfway(func,a,b,s(k),k,v)
if (k.ge.J) then
call pol_interpol(h(k-JM),s(k-JM),J,0.d0,ss,dss)
if (dabs(dss).le.tol*dabs(ss)) return
endif
s(k+1)=s(k)
h(k+1)=h(k)/9.d0
end do
!write(*,*)  'ERROR in subroutine Integrate'
print *, "ERROR in subroutine integrate"
print *, "Error on block", blockIdx%x
!Keep running on other ones
stop
END SUBROUTINE integrate

attributes(device) SUBROUTINE pol_interpol(xa,ya,n,x,y,dy)
INTEGER n,NMAX
real*8 :: dy,x,y,xa(n),ya(n)
PARAMETER (NMAX=10)
INTEGER i,m,ns
real*8 :: den,dif,dift,ho,hp,w,c(NMAX),d(NMAX)
ns=1
dif=dabs(x-xa(1))
do i=1,n
dift=abs(x-xa(i))
if (dift.lt.dif) then
ns=i
dif=dift
endif
c(i)=ya(i)
d(i)=ya(i)
end do
y=ya(ns)
ns=ns-1
do m=1,n-1
do i=1,n-m
ho=xa(i)-x
hp=xa(i+m)-x
w=c(i+1)-d(i)
den=ho-hp
if(den.eq.0.d0)then
!write(*,*)  'ERROR in subroutine interpol'
print *, "ERROR in subroutine interpol"
stop
end if
den=w/den
d(i)=hp*den
c(i)=ho*den
end do
if (2*ns.lt.n-m)then
dy=c(ns+1)
else
dy=d(ns)
ns=ns-1
endif
y=y+dy
end do
return
END SUBROUTINE pol_interpol

attributes(device) SUBROUTINE halfway(func,a,b,s,n,v)
INTEGER n
real*8 :: a,b,s
integer :: func,v
INTEGER :: it,j
real*8 :: ddel,del,sum,tnm,x

if (v == 1) then
! print *, "block", blockIdx%x, "has entered halfway"
! print *, "a:", a
! print *, "b:", b
v = 1
endif

Select Case (func)

case (1)
if (n.eq.1) then
s=(b-a)*funcexp(0.5d0*(a+b))
else
it=3**(n-2)
tnm=it
del=(b-a)/(3.d0*tnm)
ddel=del+del
x=a+0.5d0*del
sum=0.d0
do j=1,it
sum=sum+funcexp(x)
x=x+ddel
sum=sum+funcexp(x)
x=x+del
end do
s=(s+(b-a)*sum/tnm)/3.d0
endif
case (2)
if (n.eq.1) then
s=(b-a)*funcp1(0.5d0*(a+b))
else
it=3**(n-2)
tnm=it
del=(b-a)/(3.d0*tnm)
ddel=del+del
x=a+0.5d0*del
sum=0.d0
do j=1,it
sum=sum+funcp1(x)
x=x+ddel
sum=sum+funcp1(x)
x=x+del
end do
s=(s+(b-a)*sum/tnm)/3.d0
endif
case (3)
if (n.eq.1) then
s=(b-a)*funcp0(0.5d0*(a+b))
else
it=3**(n-2)
tnm=it
del=(b-a)/(3.d0*tnm)
ddel=del+del
x=a+0.5d0*del
sum=0.d0
do j=1,it
sum=sum+funcp0(x)
x=x+ddel
sum=sum+funcp0(x)
x=x+del
end do
s=(s+(b-a)*sum/tnm)/3.d0
endif
case (4)
if (n.eq.1) then
s=(b-a)*funcxi(0.5d0*(a+b))
else
it=3**(n-2)
tnm=it
del=(b-a)/(3.d0*tnm)
ddel=del+del
x=a+0.5d0*del
sum=0.d0
do j=1,it
sum=sum+funcxi(x)
x=x+ddel
sum=sum+funcxi(x)
x=x+del
end do
s=(s+(b-a)*sum/tnm)/3.d0
endif
case default
return
end select
return
END SUBROUTINE halfway
end module GPUFunctions``````

An update for anyone who was wondering. The problem also occurs if I go over 32 threads, which presumably if I’m understanding correctly means that it’s a warp issue. My issue still stands, why does having multiple warps cause it to fail. Can multiple warps not do the same math? Or is there an issue with the way warps access memory and having multiple ones trying to integrate is a no go?

A few suggestions, in no particular order:

• using the word “crash” by itself is not sufficiently communicative. It’s better if you describe the outcome completely, or even better, provide the exact output observed, when the “crash” occurs.

• even though you didn’t get anyone diving in to do a full debug on your previous request, I still strongly encourage people who are asking for debug assistance to provide a complete code. What you have shown here has no host code and not even the kernel framework, so its really difficult to guess what might be the issue. You could easily be making an error in something you haven’t shown. In addition, its best if you reduce your code as much as possible, while still preserving the error to be debugged. This is such a common request/expectation for these types of activities that there are even webpages describing the best case behavior for someone asking for help like this.

• In your previous question, I demonstrated the use of `cuda-memcheck` for you. In the “crashing” case here, have you run your code under `cuda-memcheck` or the newer equivalent, `compute-sanitizer` ? These are powerful tools. If one of them is reporting an issue, you should not have to stumble around in the dark. It gives you something concrete to discuss and chase. Furthermore I strongly encourage that these tools be used before asking for help. If no errors are reported, that is useful info for those trying to help you. Even if you don’t understand the output, providing it along with your question is useful info for those trying to help you.

Speaking for myself, I’m more likely to spend time with a request like this if you provide a complete code. And I am more likely to try some debugging on 100 lines of code vs. 400.

Do as you wish, of course. Just offering suggestions.

I shall do that as soon as possible. The holidays have prevented me from putting in as much work as I should, so I made the post mainly to keep track of where I am if that makes sense.

Also, forgive me if it came across as a request for debugging, that wasn’t the intention. It was more a request about whether threads/blocks have limitation that may be immediately obvious that I’ve missed. I apologize for confusion or frustration I may have caused.