Recently I found that float3 was as fast as float4, which looks weird based on my understanding. In the PTX file, float3 uses st.global.f32 / ld.global.f32, while float4 uses st.global.v4.u32 / ld.global.v4.u32. Attached is my test .cu file, which was compiled for an RTX 3060 with nvcc 11.5.
The output :
Elapsed time for float3: 11.52 ms
Elapsed time for float4: 11.52 ms
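For context, the comparison is essentially between two plain copy kernels like these (a simplified sketch to illustrate the PTX difference, not the attached file itself; names are made up):

```cuda
// float3 copy: the 12-byte, 4-byte-aligned struct cannot be vectorized,
// so each element becomes three scalar ld.global.f32 / st.global.f32.
__global__ void copy3(const float3* __restrict__ in, float3* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// float4 copy: the 16-byte, 16-byte-aligned struct becomes a single
// vectorized ld.global.v4 / st.global.v4 per element.
__global__ void copy4(const float4* __restrict__ in, float4* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}
```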
Directly loading and storing float3 from global memory is not ideal.
I would let the 32 threads of each warp cooperate (a sketch follows after these steps):
1. Load the first, second and third group of 32 floats normally as a plain float array, not a float3 array (thread 0 loads the first float - effectively [0].x, thread 1 loads the second float [0].y, …).
2. Store them into a shared memory array of size 96.
3. Read them back as float3, or as the three components separately (which should give the same code). You can do it with a union or with reinterpret_cast (which strictly is UB, but typically works in CUDA).
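Roughly like this (a quick, untested sketch of the idea; it assumes n is a multiple of 32, the block size is a multiple of 32, and the kernel is launched with (blockDim.x / 32) * 96 * sizeof(float) bytes of dynamic shared memory; the kernel name and the plain copy are just placeholders):

```cuda
__global__ void copyFloat3Staged(const float3* __restrict__ in,
                                 float3* __restrict__ out,
                                 int n)
{
    extern __shared__ float smem[];                  // 96 floats per warp
    const int lane     = threadIdx.x & 31;
    const int warp     = threadIdx.x >> 5;
    const int warpBase = blockIdx.x * blockDim.x + warp * 32;  // first float3 of this warp
    float* ws = smem + warp * 96;
    if (warpBase >= n) return;                       // whole warp out of range

    // 1. Three coalesced 32-float loads per warp (the reinterpret_cast is the
    //    UB mentioned above, but works in practice).
    const float* inF = reinterpret_cast<const float*>(in) + warpBase * 3;
    ws[lane]      = inF[lane];
    ws[lane + 32] = inF[lane + 32];
    ws[lane + 64] = inF[lane + 64];
    __syncwarp();

    // 2. Read back "my" element component-wise from shared memory.
    float3 v;
    v.x = ws[3 * lane + 0];
    v.y = ws[3 * lane + 1];
    v.z = ws[3 * lane + 2];

    // ... do the actual work on v here ...

    // 3. Store it the same way, staged through shared memory.
    __syncwarp();
    ws[3 * lane + 0] = v.x;
    ws[3 * lane + 1] = v.y;
    ws[3 * lane + 2] = v.z;
    __syncwarp();

    float* outF = reinterpret_cast<float*>(out) + warpBase * 3;
    outF[lane]      = ws[lane];
    outF[lane + 32] = ws[lane + 32];
    outF[lane + 64] = ws[lane + 64];
}
```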
Alternatively, store your float3 data as 3 separate arrays for x, y and z.
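E.g. a trivial structure-of-arrays kernel (illustrative only, names made up): with separate x/y/z arrays, every access is a fully coalesced plain-float access.

```cuda
__global__ void scaleSoA(const float* __restrict__ x,
                         const float* __restrict__ y,
                         const float* __restrict__ z,
                         float* __restrict__ outX,
                         float* __restrict__ outY,
                         float* __restrict__ outZ,
                         float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        outX[i] = s * x[i];   // consecutive threads touch consecutive floats
        outY[i] = s * y[i];
        outZ[i] = s * z[i];
    }
}
```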
Yes, that was my thought as well. But my test result shows that float3 is as fast as float4. Please check out my test code - I'm not sure if there is something wrong with it.
In your test code you directly load and store float3. As you say, those accesses will be split into 3, and the memory accesses are no longer coalesced.
In my previous post I described how you can access float3 in a coalesced manner.
The way described there should be fine, but there is a more advanced way that saves shared memory bandwidth (which should not be an issue in your case):
Pairs of threads cooperate to load 3 neighbouring 32-byte blocks with float4 accesses, which gives the pair 8x float3; they then shuffle between them so that each thread ends up with 4x float3. The shuffle throughput needed for this is less than the shared memory bandwidth of the first method.
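Roughly like this (my rough, untested sketch; it assumes the kernel is launched with exactly n / 4 threads, n is a multiple of 128 so every warp is full, the block size is a multiple of 32, and the input is 16-byte aligned; the copy to out at the end is only there to show the result):

```cuda
__global__ void loadFloat3ViaFloat4Shuffle(const float3* __restrict__ in,
                                           float3* __restrict__ out,
                                           int n)
{
    const int lane = threadIdx.x & 31;
    const int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    const int pair = tid >> 1;             // each pair of lanes handles 8 float3

    // The pair's 8 float3 occupy 24 floats = 6 float4 = 3 x 32-byte blocks.
    const float4* in4 = reinterpret_cast<const float4*>(in) + pair * 6;

    // Even lane takes the first float4 of each 32-byte block, odd lane the second.
    const int off = lane & 1;
    float4 a = in4[off + 0];   // even: floats 0-3,   odd: floats 4-7
    float4 b = in4[off + 2];   // even: floats 8-11,  odd: floats 12-15
    float4 c = in4[off + 4];   // even: floats 16-19, odd: floats 20-23

    // Exchange one float4 with the partner lane: the even lane hands over
    // floats 16-19 and receives floats 4-7; the odd lane does the opposite.
    float4 send = (lane & 1) ? a : c;
    float4 recv;
    recv.x = __shfl_xor_sync(0xffffffff, send.x, 1);
    recv.y = __shfl_xor_sync(0xffffffff, send.y, 1);
    recv.z = __shfl_xor_sync(0xffffffff, send.z, 1);
    recv.w = __shfl_xor_sync(0xffffffff, send.w, 1);

    // Each lane now holds 12 consecutive floats = 4 consecutive float3.
    // Even lane: floats 0-11 (a, recv, b); odd lane: floats 12-23 (b, recv, c).
    float4 r0 = (lane & 1) ? b : a;
    float4 r1 = recv;
    float4 r2 = (lane & 1) ? c : b;
    float f[12];
    f[0] = r0.x; f[1] = r0.y; f[2]  = r0.z; f[3]  = r0.w;
    f[4] = r1.x; f[5] = r1.y; f[6]  = r1.z; f[7]  = r1.w;
    f[8] = r2.x; f[9] = r2.y; f[10] = r2.z; f[11] = r2.w;

    // 4 float3 per thread; written out here only to verify the reordering.
    const int base = pair * 8 + (lane & 1) * 4;
    for (int i = 0; i < 4; ++i)
        out[base + i] = make_float3(f[3 * i], f[3 * i + 1], f[3 * i + 2]);
}
```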
You’re loading the same amount of data, and the problem is memory bound. Furthermore, although the load pattern for float3 may appear to be “not as efficient” as the load pattern for float4, the memory controller (and caches) are evidently “fixing” the problem. There is considerable temporal locality in the requests - the LD instructions are quite close to each other. Therefore the memory controller can spot this and not issue duplicate loads. So the memory bus utilization is likely to be very similar in both cases. And when it comes to distribution of the data after it returns on the memory bus, the caches will help there.
The data doesn’t support any other conclusion, in my view.
The screenshots show 66% L1 cache utilization in the float3 case and 0% in the float4 case, and the memory pipes are 3x as busy for float3. That is not a problem for your kernel yet: the L1 and L2 hit rates are still quite low, so with slow global memory you cannot process more memory operations anyway.
Just seeing that you are using more elements in the float3 case: 4 * N instead of 3 * N. So the same overall amount of data is loaded and stored in both cases: 4 * N * 12 bytes for float3 equals 3 * N * 16 bytes for float4, i.e. 48 * N bytes each.