Question about attributes(global/device)

Hello,

I’m working with a scientific model that contains three routines I want to speed up with CUDA Fortran. I wrote a first version that uses only global kernels, one per routine, and it works. However, I wrote a second version where each routine is a device subroutine called from a single global kernel, and the final results of the model are wrong.

Below is a simplified scheme of both versions. mainsub() is called from the main program, and all subroutines live inside modules and use a module that holds all the device variables. The routines must be called sequentially because sub2 depends on sub1, sub3 on sub2, and sub1 on sub3 (on the first iteration, sub1 uses initial values). I added syncthreads() calls because I believe successive global kernel launches are implicitly synchronized, whereas calls to device subroutines are not.

Working version:

subroutine mainsub()
    call modelsub()
end subroutine

subroutine modelsub()
    do while(condition)
        call modelsub1<<<blocks,threads>>>()
        call modelsub2<<<blocks,threads>>>()
        call modelsub3<<<blocks,threads>>>()
    enddo
end subroutine

attributes(global) subroutine modelsub1()
end subroutine

attributes(global) subroutine modelsub2()
end subroutine

attributes(global) subroutine modelsub3()
end subroutine

Non-working version:

subroutine mainsub()
    call modelsub<<<blocks,threads>>>()
end subroutine

attributes(global) subroutine modelsub()
    do while(condition)
        call modelsub1()
        call syncthreads()
        call modelsub2()
        call syncthreads()
        call modelsub3()
        call syncthreads()
    enddo
end subroutine

attributes(device) subroutine modelsub1()
end subroutine

attributes(device) subroutine modelsub2()
end subroutine

attributes(device) subroutine modelsub3()
end subroutine

What could cause the second version to produce different results from the first?

Thanks

Hi Henrique,

My knowledge of CUF is a little limited, but modelsub is not a kernel (it is missing the attributes statement), yet you are calling it like one with the chevron syntax.

subroutine mainsub()
    call modelsub<<<blocks,threads>>>()
end subroutine

subroutine modelsub()
...

“The device attribute, specified on the subroutine or function statement, declares that the subprogram is to be executed on the device; such a routine must be called from a subprogram with the global or device attribute.”
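
In other words, the launcher itself needs the global attribute before it can be launched with chevrons. A rough sketch of the fix, keeping your placeholder names:

attributes(global) subroutine modelsub()
    ! now a kernel: launchable with <<<blocks,threads>>> from the host,
    ! and allowed to call attributes(device) subroutines
    ...
end subroutine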

Sorry, I forgot to add that when copying and pasting. I’ve just edited the code in my message.

I’m still debugging the code to find the problem. One weird thing while debugging: depending on how many variables I print inside the kernels, the printed results change, as if the print statements were affecting the computations.

Hi Henrique,

Looking at it again, you’ll need an interface for those subroutines. Putting them into a module and using it will work, since module procedures automatically get an explicit interface.

module m
contains
    attributes(device) subroutine modelsub1()
    ...
    end subroutine
    ...
end module

...
attributes(global) subroutine modelsub()
    use m
    ...
end subroutine
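
For reference, here is a minimal self-contained sketch of the whole pattern; the module name gpu_mod, the routine names addone and outer, and the array size are invented for illustration:

module gpu_mod
    use cudafor
    implicit none
contains

    ! device routine: callable only from device code
    attributes(device) subroutine addone(a, n, i)
        integer, value :: n, i
        real :: a(n)
        a(i) = a(i) + 1.0
    end subroutine addone

    ! kernel: launched from the host, calls the device routine
    attributes(global) subroutine outer(a, n)
        integer, value :: n
        real :: a(n)
        integer :: i
        i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
        if (i <= n) call addone(a, n, i)
        call syncthreads()   ! barrier for threads within one block only
    end subroutine outer

end module gpu_mod

program test
    use cudafor
    use gpu_mod
    implicit none
    integer, parameter :: n = 256
    real, device :: a_d(n)
    real :: a(n)
    integer :: istat
    a = 0.0
    a_d = a                          ! host-to-device copy
    call outer<<<1, n>>>(a_d, n)     ! kernel launch
    istat = cudaDeviceSynchronize()  ! wait for the kernel to finish
    a = a_d                          ! device-to-host copy
    print *, a(1), a(n)              ! expect: 1.0 1.0
end program test

Because the kernel and the device routines live in the same module, no separate interface blocks are needed.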

I didn’t write out the modules here to save space, but all routines are inside modules, and each caller has a “use” for every module that contains a routine it calls. I’ll try changing the code to keep all routines in a single module instead of one module per routine.

I’m trying to get the second version running because I don’t know whether kernel-launch overhead can be reduced by making only the “outer” routine a global kernel and the “inner” routines device subroutines, so that fewer global kernel launches are made. Do you know if this change would reduce the overhead? If not, then I shouldn’t bother trying to make the second version work.

I think it would depend on how much work each of the inner kernels has. If each one can fully utilize the GPU, then it’s probably better to leave the code as is. Since launches are asynchronous, some of the launch latency can be hidden because the CPU launches the next kernel while the previous one is still running.

If the inner kernels are very short or can’t utilize the GPU well, then you should consider the single outer kernel.
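
To illustrate the first case, here is the host loop from your working version annotated with where the asynchrony comes in; istat is assumed to be declared as an integer, and whether the final sync is needed depends on where “condition” is evaluated:

! Kernel launches return to the CPU immediately and are queued in
! issue order on the default stream, so on the GPU side each kernel
! still waits for the previous one to finish.
do while(condition)
    call modelsub1<<<blocks,threads>>>()   ! returns immediately
    call modelsub2<<<blocks,threads>>>()   ! queued behind modelsub1
    call modelsub3<<<blocks,threads>>>()   ! queued behind modelsub2
enddo
istat = cudaDeviceSynchronize()            ! host waits for all queued work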

-Mat