Is float3 as fast as float4?

Hi there,

Recently I found that float3 is as fast as float4, which seems odd based on my understanding. In the PTX file, float3 uses st.global.f32 / ld.global.f32, while float4 uses st.global.v4.u32 / ld.global.v4.u32. Attached is my test .cu file, compiled with nvcc 11.5 and run on an RTX 3060.

The output :
Elapsed time for float3: 11.52 ms
Elapsed time for float4: 11.52 ms

Please educate me & thanks in advance.

Best,
Sway

test.zip (876 Bytes)

Please post your code inline, as text, using forum tools. Not as an attachment.

Directly loading and storing float3 from global memory is not ideal.

I would let the 32 threads of each warp cooperate (see the sketch at the end of this post):

  • Load the first, second and third groups of 32 floats as a plain float array, not as a float3 array (thread 0 loads the first float, effectively [0].x; thread 1 loads the second float, [0].y; …)

  • Store them into a shared memory array of size 96.

  • Then read them back as float3, or as the three components separately (which should generate the same code). You can do this with a union or with reinterpret_cast (strictly UB, but it typically works in CUDA).

Alternatively store your float3 as 3 separate arrays for x, y, z.

I think cub::WarpLoad and cub::WarpStore could also do it. cub::WarpLoad — cub 2.5 documentation
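
To make this concrete, here is a rough, untested sketch of the warp-cooperative copy. The kernel name copyFloat3Staged, the fixed block size of 256 and the omitted tail handling are my own illustration choices, not something prescribed above:

```
#include <cuda_runtime.h>

// Each warp handles 32 float3 (96 floats), staging them through shared memory
// so that every global load and store stays coalesced.
// Assumes blockDim.x == 256 and numFloat3 is a multiple of 32 (no tail handling).
__global__ void copyFloat3Staged(const float* in, float* out, int numFloat3)
{
    __shared__ float smem[3 * 256];              // 96 floats per warp, 8 warps per block
    int warp = threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    float* ws = smem + 96 * warp;                // this warp's staging buffer

    int warpBase = blockIdx.x * blockDim.x + warp * 32;   // first float3 of this warp
    if (warpBase + 32 > numFloat3) return;       // the whole warp exits together

    // Coalesced loads: the warp reads 96 consecutive floats in three transactions.
    ws[lane]      = in[3 * warpBase + lane];
    ws[lane + 32] = in[3 * warpBase + lane + 32];
    ws[lane + 64] = in[3 * warpBase + lane + 64];
    __syncwarp();

    // Each lane reassembles "its" float3 (conflict-free: 3 is coprime to the 32 banks).
    // Do whatever per-float3 work you need here; this sketch just copies.
    float3 v = make_float3(ws[3 * lane], ws[3 * lane + 1], ws[3 * lane + 2]);
    __syncwarp();                                // all reads done before overwriting

    // Write back through shared memory, then store coalesced again.
    ws[3 * lane]     = v.x;
    ws[3 * lane + 1] = v.y;
    ws[3 * lane + 2] = v.z;
    __syncwarp();
    out[3 * warpBase + lane]      = ws[lane];
    out[3 * warpBase + lane + 32] = ws[lane + 32];
    out[3 * warpBase + lane + 64] = ws[lane + 64];
}
```

It would be launched with one thread per float3, e.g. copyFloat3Staged<<<(numFloat3 + 255) / 256, 256>>>(in, out, numFloat3).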

Here is my test code:

```
#include <cuda_runtime.h>
#include <stdio.h>

#define N 40000000

// Each thread copies one float3; the 12-byte access is split into three 4-byte transactions
__global__ void testFloat3(float3* data_in, float3* data) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < 4*N) {
        data[tid] = data_in[tid];
    }
}

// Each thread copies one float4 with a single vectorized 16-byte access
__global__ void testFloat4(float4* data_in, float4* data) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < 3*N) {
        data[tid] = data_in[tid];
    }
}

int main() {
    float3* d_data3;
    float4* d_data4;
    float* d_data_in;

    cudaMalloc((void**)&d_data3, 4*N * sizeof(float3));
    cudaMalloc((void**)&d_data4, 3*N * sizeof(float4));
    cudaMalloc((void**)&d_data_in, 12*N * sizeof(float));

    float* h_data = (float*)malloc(12*N * sizeof(float));

    for (int i = 0; i < 12*N; i++) {
        h_data[i] = 1.0f;
    }

    cudaMemcpy(d_data_in, h_data, 12*N * sizeof(float), cudaMemcpyHostToDevice);
    //cudaMemcpy(d_data3, h_data3, 4*N * sizeof(float3), cudaMemcpyHostToDevice);
    //cudaMemcpy(d_data4, h_data4, 3*N * sizeof(float4), cudaMemcpyHostToDevice);

    int blockSize = 256;
    int numBlocks_3 = (4*N + blockSize - 1) / blockSize;
    int numBlocks_4 = (3*N + blockSize - 1) / blockSize;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Test float3 performance
    cudaEventRecord(start);
    testFloat3<<<numBlocks_3, blockSize>>>((float3*)d_data_in, d_data3);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float millisecondsFloat3;
    cudaEventElapsedTime(&millisecondsFloat3, start, stop);
    printf("Elapsed time for float3: %.2f ms\n", millisecondsFloat3);

    cudaEvent_t start1, stop1;
    cudaEventCreate(&start1);
    cudaEventCreate(&stop1);

    // Test float4 performance
    cudaEventRecord(start1);
    testFloat4<<<numBlocks_4, blockSize>>>((float4*)d_data_in, d_data4);
    cudaEventRecord(stop1);
    cudaEventSynchronize(stop1);
    float millisecondsFloat4;
    cudaEventElapsedTime(&millisecondsFloat4, start1, stop1);

    //cudaMemcpy(h_data3, d_data3, 4*N * sizeof(float3), cudaMemcpyDeviceToHost);
    //cudaMemcpy(h_data4, d_data4, 3*N * sizeof(float4), cudaMemcpyDeviceToHost);

    
    printf("Elapsed time for float4: %.2f ms\n", millisecondsFloat4);

    cudaFree(d_data3);
    cudaFree(d_data4);
    cudaFree(d_data_in);
    free(h_data);

    return 0;
}
```

Yes, that was my thought as well, but my test result shows that float3 is as fast as float4. Please check out my test code; I am not sure if there is something wrong with it.

Thanks.

In your test code you directly load and store float3. As you say, those accesses are split into three separate 4-byte accesses and are no longer coalesced.

In my previous post I described how you can access float3 in a coalesced manner.

The approach I described should be fine, but there is a more advanced way that saves shared memory bandwidth (which should not be an issue in your case):
Pairs of threads cooperate to load 3 neighbouring 32-byte blocks with float4 accesses, which yields 8 float3 values, and then shuffle between them so that each thread ends up with 4 float3 values. The shuffle traffic needed is less than the shared memory bandwidth of the first method.
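
For what it is worth, here is a rough, untested sketch of that pairwise exchange. The helper name loadFloat3Pairwise, its indexing convention and the full-warp shuffle mask are assumptions for illustration only:

```
#include <cuda_runtime.h>

// Two neighbouring lanes (2k, 2k+1) together own 8 float3 = 24 floats = 6 float4,
// i.e. three 32-byte sectors. Each lane issues three float4 loads, then one
// register exchange with its partner leaves every lane holding 4 float3 values.
// pairBase is the pair's first float4 index, e.g. 6 * (globalThreadIndex / 2).
__device__ void loadFloat3Pairwise(const float4* in, int pairBase, float3 out[4])
{
    bool even = (threadIdx.x & 1) == 0;

    // Even lane loads float4s 0, 2, 4 of the pair's range; odd lane loads 1, 3, 5.
    float4 a = in[pairBase + (even ? 0 : 1)];
    float4 b = in[pairBase + (even ? 2 : 3)];
    float4 c = in[pairBase + (even ? 4 : 5)];

    // Exchange one float4 with the partner lane: the even lane hands over c
    // (float4 #4, which belongs to the odd lane) and receives float4 #1.
    // The 0xFFFFFFFF mask assumes the whole warp is active.
    float4 send = even ? c : a;
    float4 recv;
    recv.x = __shfl_xor_sync(0xFFFFFFFF, send.x, 1);
    recv.y = __shfl_xor_sync(0xFFFFFFFF, send.y, 1);
    recv.z = __shfl_xor_sync(0xFFFFFFFF, send.z, 1);
    recv.w = __shfl_xor_sync(0xFFFFFFFF, send.w, 1);

    // Even lane now holds float4s #0, #1, #2 (floats 0..11); odd lane holds
    // #3, #4, #5 (floats 12..23). Repack the three float4 into four float3.
    float4 f0 = even ? a : b;
    float4 f1 = recv;
    float4 f2 = even ? b : c;
    out[0] = make_float3(f0.x, f0.y, f0.z);
    out[1] = make_float3(f0.w, f1.x, f1.y);
    out[2] = make_float3(f1.z, f1.w, f2.x);
    out[3] = make_float3(f2.y, f2.z, f2.w);
}
```

Storing would presumably mirror this: pack four float3 into three float4 per lane, exchange in the opposite direction, then issue the float4 stores.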

Thank you. But the performance of float3 (not coalesced) is the same as that of float4. That is what seems weird to me.

You can look at it with Nsight Compute to see exactly what happens.

There is not much difference between the two kernels.

You’re loading the same amount of data, and the problem is memory bound. Furthermore, although the load pattern for float3 may appear to be “not as efficient” as the load pattern for float4, the memory controller (and caches) are evidently “fixing” the problem. There is considerable temporal locality in the requests - the LD instructions are quite close to each other. Therefore the memory controller can spot this and not issue duplicate loads. So the memory bus utilization is likely to be very similar in both cases. And when it comes to distribution of the data after it returns on the memory bus, the caches will help there.

The data doesn’t support any other conclusion, in my view.

The screenshots show 66% L1 cache utilization in the float3 case and 0% in the float4 case, and the memory pipes are 3x as busy for float3. That is not a problem for your kernel yet: the L1 and L2 hit rates are still quite low, so with slow global memory you cannot process more memory operations anyway.

I just noticed you are using more elements in the float3 case: 4 * N instead of 3 * N. So the same overall amount of data is loaded and stored in both cases.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.