How to do efficient vectorized loads in CUDA Fortran

Hello!

Context: As a deep-dive learning experience I've been continually optimizing a reduction_sum kernel for arrays of floats. My fastest implementation uses register shuffles and atomicAdds, and is about 3.5x faster than the GPU-overloaded sum(devArr) intrinsic function.
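
For reference, the general shape of that kernel is something like this (a simplified sketch of the technique, not my exact code; the names and launch details are illustrative):

    module reduce_mod
      use cudafor
    contains
      attributes(global) subroutine reduce_sum(total, arr, n)
        real, device :: arr(*)
        real, device :: total        ! must be zeroed by the host before launch
        integer, value :: n
        real :: s, tmp
        integer :: i, offset

        ! Grid-stride loop: each thread accumulates a strided slice into a register
        s = 0.0
        i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
        do while (i <= n)
          s = s + arr(i)
          i = i + gridDim%x * blockDim%x
        end do

        ! Tree reduction within the warp via register shuffles
        offset = warpsize / 2
        do while (offset > 0)
          s = s + __shfl_down(s, offset)
          offset = offset / 2
        end do

        ! Lane 0 of each warp commits its partial sum with one atomic
        if (iand(threadIdx%x - 1, warpsize - 1) == 0) tmp = atomicAdd(total, s)
      end subroutine
    end module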

Nsight Compute profiling tells me that the current bottleneck in my kernel is waiting for data to be fetched from global memory; I've identified the exact line in the source view. The only thing left I can think of (other than hand-managing PTX) is implementing vectorized float2/float4 loads, which is easy in CUDA C++.

Fortran doesn't have the int2, int4, float2, or float4 vector types that CUDA C++ does. Regardless, I would like to get the benefits of vectorized loads. My hope was that if I made an array twoint(2), vectorized loads would be generated at compile time for an assignment like

twoint(:) = global_arr(idx:idx+1)
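
In kernel context, what I tried looks roughly like this (a trimmed sketch; the indexing and surrounding reduction logic are illustrative):

    real :: twoint(2)                     ! badly named on my part; these are floats
    integer :: idx
    idx = 2 * ((blockIdx%x - 1) * blockDim%x + threadIdx%x) - 1
    twoint(:) = global_arr(idx:idx+1)     ! hoping this becomes one 8-byte load
    s = s + twoint(1) + twoint(2)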

However, doing this doesn't seem to yield a speedup on my GTX 1650 (which, to be fair, might not support vectorized loads; I'm not sure where to look for that info). In Nsight Compute, I actually seem to ruin my coalesced accesses of global array data, and I take a slight performance hit.

So my question is: how would one go about ensuring the compiler performs vectorized loads (if that's possible)? Would I have to enter the vectorized loads manually in PTX? If so, what would that look like? An example in a CUDA Fortran kernel would be nice.

Something that may be helpful is that Fortran does have the transfer() function, which plays the role of reinterpret_cast<>. But I’m not sure how to use it in this regard.
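
(For anyone unfamiliar: transfer(source, mold) reinterprets the bits of source as the type of mold. A trivial host-side illustration, unrelated to the loads themselves:)

    real(4)    :: r
    integer(4) :: i
    i = int(z'3F800000', 4)   ! the IEEE-754 bit pattern of 1.0
    r = transfer(i, r)        ! r is now 1.0; no numeric conversion occurs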

There is a way, but we haven’t really productized it, as we have the usual concerns about alignment when creating a general solution. There is some mention of our Fortran vector types in the wmma documentation and examples. The best method is to rely on low-level CUDA C or asm to be sure, but fortunately, there is an existing interface to that which you can create from Fortran. Here is an example:
module mtests
  interface
    attributes(device) subroutine assign16(y,x) bind(c,name='__pgi_assgn_v4real4')
      integer(4), device :: y(*), x(*)
    end subroutine
  end interface
  interface
    attributes(device) subroutine assign16r4(y,x) bind(c,name='__pgi_assgn_v4real4')
      real(4), device :: y(*), x(*)
    end subroutine
  end interface

contains
  attributes(global) subroutine testany( b, a )
    integer, device :: a(*), b(*)
    integer(4) :: x(4)
    real, device :: ia(4), ib(4)
    pointer(pa,ia)
    pointer(pb,ib)
    call assign16(b,a)
    pa = loc(a(9))
    pb = loc(b(9))
    call assign16r4(ib,ia)
    return
  end subroutine
end module mtests

program t
  use mtests
  use cudafor
  integer, allocatable, device :: a(:), b(:)
  integer c(12)
  allocate(a(12))
  allocate(b(12))
  a = -99
  b = 0
  call testany<<<1,1>>> (b,a)
  istat = cudaDeviceSynchronize()
  c = b
  print *,'C:'
  print *,c
end
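
Tracing through it: the first call copies elements 1-4 of a into b, and the loc(a(9))/loc(b(9)) Cray pointers make the second call copy elements 9-12, so the middle four elements of b stay zero. It should print something like:

    C:
    -99 -99 -99 -99 0 0 0 0 -99 -99 -99 -99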

In general, just bind your Fortran device interfaces to the __pgi_assgn* names. Here's the generated PTX for the kernel:
ld.param.u64 %rd1, [mtests_testany__param_0];
ld.param.u64 %rd2, [mtests_testany__param_1];
$L__tmp0:
cvta.to.global.u64 %rd3, %rd2;
cvta.to.global.u64 %rd4, %rd1;
ld.global.v4.u32 {%r1, %r2, %r3, %r4}, [%rd3];
st.global.v4.u32 [%rd4], {%r1, %r2, %r3, %r4};
ld.global.v4.u32 {%r9, %r10, %r11, %r12}, [%rd3+32];
st.global.v4.u32 [%rd4+32], {%r9, %r10, %r11, %r12};

Looking at the PTX is probably the best way to make sure you are generating vector loads; you can keep the intermediate PTX files with the -gpu=keepptx compiler option.
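
For example (assuming the source file is named mtests.cuf; the name of the .ptx file the compiler leaves behind may vary):

    nvfortran -gpu=keepptx mtests.cuf
    grep 'ld.global.v' *.ptx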


Hey, thanks for the assistance! I got your example working after some tinkering (perhaps a copy-paste issue?). In your example, was the integer(4) :: x(4) in the global subroutine supposed to be used for something?

Either way, your example is now working and ensuring the vector loads; the PTX file of your example has the telltale ld.global.v4.u32 instructions. Honestly, putting a line of inline PTX into the code might be easier than writing the interface to the __pgi_assgn_v4real4 subroutines plus the pointer magic.

Is there any documentation on the __pgi_assgn_v4real4 etc functions?

I do hope the Fortran vector types evolve to make vector loads easier though.

Anyways, I've tried implementing this in the sum reduction case, and my performance has tanked.


The "vectorized" kernel in my timings is the one implementing this attempt at loading 4 floats at a time. Nsight Compute profiling seems to indicate there are now many more memory stalls. It doesn't seem to be occupancy-related, but I'm not yet fluent with Nsight Compute, so I'm not sure exactly where to look. I see a lot more Stall MIO Throttle, and even more Stall Long Scoreboard than before. The PTX of my real_vec4_assign function is as follows:

I can see the vectorized load instructions there, so the performance must be tanking somewhere else. The biggest source of stalls appears to be in this line:

This is primarily composed of those Stall Long Scoreboard stalls, which as far as I know refer to waiting on loads from global memory. Is the __pgi_assgn_* function not optimized in terms of memory access, or is the setup somehow ruining coalesced global memory accesses? In my less-optimized kernels, where global memory accesses weren't fully coalesced, I was only getting around a 2x slowdown at worst, so I don't think that's the whole story.

I am not an expert on vector load performance, so you might try other forums. But, I am wondering why you have a vec_assigns_… function. You want the vector loads and stores to be inlined into your code, not in a separate function. If you are making an actual call, that could cause your slowdown.

Hi @bleback,
Thanks for the reply! You are completely right: when I implement the same thing inlined, the severe performance hit goes away. I should have figured!
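
For reference, the inlined pattern looks roughly like this (a trimmed sketch, not my exact kernel; it assumes the assign16r4 interface from earlier is in scope and that n is a multiple of 4):

    real :: buf(4), gchunk(4)
    pointer(pg, gchunk)              ! Cray pointer aliased onto global memory
    s = 0.0
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    do while (4 * i <= n)
      pg = loc(global_arr(4*(i-1) + 1))
      call assign16r4(buf, gchunk)   ! inlines to a single ld.global.v4
      s = s + buf(1) + buf(2) + buf(3) + buf(4)
      i = i + gridDim%x * blockDim%x
    end do

Consecutive threads read consecutive 16-byte chunks, so the accesses should still coalesce.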

As you can see, the fixed version, "vectorized2", no longer suffers a performance penalty. It doesn't get a performance boost either, though.

Looking at the PTX, the "vectorized2" version indeed uses the vectorized load instructions. It also has a couple more instructions than the "device_reduce_warp_" version whose performance it matches, so maybe some more tinkering will help.

EDIT: The below is incorrect; it seems I was getting wrong results on my unit tests. See my next post.
I modified the vectorized kernel to reduce its instruction dependencies and do more arithmetic between loads, taking advantage of the vectorized loads giving me more data at once. I'm kind of shocked that I can still get decent speedups, with another 20% reduction in runtime here.

Interestingly, all the stalls now appear to be in the __shfl instructions:


It says that 98% of my stalls are "Stall Drain" stalls on the above shfl_down instruction line, which occur while waiting for memory operations to drain after an EXIT instruction. Since the source counter from Nsight Compute points at this __shfl instruction, I'm not sure I can do anything more at this point; for all I know, the EXIT it's referring to comes from the __shfl instruction itself.

Either way, thanks for the help everyone! If you have any insight on the above, I'd happily take feedback on removing this bottleneck, if that's still possible at this point!

Not sure which architecture you are on, but there is a new warp reduction instruction on A100, which might be a bit faster. I believe it is only for integers, though.

It seems that I needed to check my unit tests after all. I wasn't getting correct results, and the speedup was an error that I rationalized.

My speeds, even with the vectorized load instructions, are basically the same:


Strangely, my "Stall Long Scoreboard" statistics have gotten worse for the vectorized load kernel, at about 50 cycles per instruction, whereas the un-vectorized one spends about 26 cycles per instruction on Stall Long Scoreboard. Yet somehow they're at the same performance. Interesting! I definitely get the vectorized PTX load instructions, but maybe the final assembly ends up the same on my 1650 or something. I don't know what the assembly looks like for vectorized loads.
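
For what it's worth, the actual machine code could be checked by dumping the SASS from the compiled executable (a.out here stands in for whatever the binary is named); on Turing a 128-bit global load should show up as something like LDG.E.128. I haven't gone down that path yet.

    cuobjdump -sass a.out | grep LDG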

I've definitely learned a fair amount about profiling with Nsight Compute from all this.