Recently I found that float3 was as fast as float4, which looks weird based on my understanding. In the PTX file, float3 uses st.global.f32 / ld.global.f32, while float4 uses st.global.v4.u32 / ld.global.v4.u32. Attached is my test .cu file, which was compiled for an RTX 3060 with nvcc 11.5.
The output :
Elapsed time for float3: 11.52 ms
Elapsed time for float4: 11.52 ms
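For context, the comparison is essentially between two plain copy kernels like these (a simplified sketch to illustrate the PTX difference, not the attached file itself; names are made up):

```cuda
// float3 copy: the 12-byte, 4-byte-aligned struct cannot be vectorized,
// so each element becomes three scalar ld.global.f32 / st.global.f32.
__global__ void copy3(const float3* __restrict__ in, float3* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// float4 copy: the 16-byte, 16-byte-aligned struct becomes a single
// vectorized ld.global.v4 / st.global.v4 per element.
__global__ void copy4(const float4* __restrict__ in, float4* __restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}
```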
Directly loading and storing float3 from global memory is not ideal.
I would let the 32 threads of each warp cooperate (a sketch follows after these steps):
1. Load the first, second and third group of 32 floats normally as a plain float array, not a float3 array (thread 0 loads the first float - effectively [0].x, thread 1 loads the second float [0].y, …).
2. Store them into a shared memory array of size 96.
3. Read them back as float3, or as the three components separately (which should give the same code). You can do it with a union or with reinterpret_cast (which strictly is UB, but typically works in CUDA).
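Roughly like this (a quick, untested sketch of the idea; it assumes n is a multiple of 32, the block size is a multiple of 32, and the kernel is launched with (blockDim.x / 32) * 96 * sizeof(float) bytes of dynamic shared memory; the kernel name and the plain copy are just placeholders):

```cuda
__global__ void copyFloat3Staged(const float3* __restrict__ in,
                                 float3* __restrict__ out,
                                 int n)
{
    extern __shared__ float smem[];                  // 96 floats per warp
    const int lane     = threadIdx.x & 31;
    const int warp     = threadIdx.x >> 5;
    const int warpBase = blockIdx.x * blockDim.x + warp * 32;  // first float3 of this warp
    float* ws = smem + warp * 96;
    if (warpBase >= n) return;                       // whole warp out of range

    // 1. Three coalesced 32-float loads per warp (the reinterpret_cast is the
    //    UB mentioned above, but works in practice).
    const float* inF = reinterpret_cast<const float*>(in) + warpBase * 3;
    ws[lane]      = inF[lane];
    ws[lane + 32] = inF[lane + 32];
    ws[lane + 64] = inF[lane + 64];
    __syncwarp();

    // 2. Read back "my" element component-wise from shared memory.
    float3 v;
    v.x = ws[3 * lane + 0];
    v.y = ws[3 * lane + 1];
    v.z = ws[3 * lane + 2];

    // ... do the actual work on v here ...

    // 3. Store it the same way, staged through shared memory.
    __syncwarp();
    ws[3 * lane + 0] = v.x;
    ws[3 * lane + 1] = v.y;
    ws[3 * lane + 2] = v.z;
    __syncwarp();

    float* outF = reinterpret_cast<float*>(out) + warpBase * 3;
    outF[lane]      = ws[lane];
    outF[lane + 32] = ws[lane + 32];
    outF[lane + 64] = ws[lane + 64];
}
```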
Alternatively, store your float3 data as 3 separate arrays for x, y and z.
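E.g. a trivial structure-of-arrays kernel (illustrative only, names made up): with separate x/y/z arrays, every access is a fully coalesced plain-float access.

```cuda
__global__ void scaleSoA(const float* __restrict__ x,
                         const float* __restrict__ y,
                         const float* __restrict__ z,
                         float* __restrict__ outX,
                         float* __restrict__ outY,
                         float* __restrict__ outZ,
                         float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        outX[i] = s * x[i];   // consecutive threads touch consecutive floats
        outY[i] = s * y[i];
        outZ[i] = s * z[i];
    }
}
```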
Yes, that was my thought as well. But my test result shows that float3 is as fast as float4. Please check out my test code - I'm not sure if there is something wrong with it.
In your test code you directly load and store float3. As you say, those accesses will be split into 3, and the memory accesses are no longer coalesced.
In my previous post I described how you can access float3 in a coalesced manner.
The way described there should be fine, but there is a more advanced way that saves shared memory bandwidth (which should not be an issue in your case):
Pairs of threads cooperate to load 3 neighbouring 32-byte blocks with float4 accesses, which gives the pair 8x float3; they then shuffle between them so that each thread ends up with 4x float3. The shuffle throughput needed for this is less than the shared memory bandwidth of the first method.
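Roughly like this (my rough, untested sketch; it assumes the kernel is launched with exactly n / 4 threads, n is a multiple of 128 so every warp is full, the block size is a multiple of 32, and the input is 16-byte aligned; the copy to out at the end is only there to show the result):

```cuda
__global__ void loadFloat3ViaFloat4Shuffle(const float3* __restrict__ in,
                                           float3* __restrict__ out,
                                           int n)
{
    const int lane = threadIdx.x & 31;
    const int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    const int pair = tid >> 1;             // each pair of lanes handles 8 float3

    // The pair's 8 float3 occupy 24 floats = 6 float4 = 3 x 32-byte blocks.
    const float4* in4 = reinterpret_cast<const float4*>(in) + pair * 6;

    // Even lane takes the first float4 of each 32-byte block, odd lane the second.
    const int off = lane & 1;
    float4 a = in4[off + 0];   // even: floats 0-3,   odd: floats 4-7
    float4 b = in4[off + 2];   // even: floats 8-11,  odd: floats 12-15
    float4 c = in4[off + 4];   // even: floats 16-19, odd: floats 20-23

    // Exchange one float4 with the partner lane: the even lane hands over
    // floats 16-19 and receives floats 4-7; the odd lane does the opposite.
    float4 send = (lane & 1) ? a : c;
    float4 recv;
    recv.x = __shfl_xor_sync(0xffffffff, send.x, 1);
    recv.y = __shfl_xor_sync(0xffffffff, send.y, 1);
    recv.z = __shfl_xor_sync(0xffffffff, send.z, 1);
    recv.w = __shfl_xor_sync(0xffffffff, send.w, 1);

    // Each lane now holds 12 consecutive floats = 4 consecutive float3.
    // Even lane: floats 0-11 (a, recv, b); odd lane: floats 12-23 (b, recv, c).
    float4 r0 = (lane & 1) ? b : a;
    float4 r1 = recv;
    float4 r2 = (lane & 1) ? c : b;
    float f[12];
    f[0] = r0.x; f[1] = r0.y; f[2]  = r0.z; f[3]  = r0.w;
    f[4] = r1.x; f[5] = r1.y; f[6]  = r1.z; f[7]  = r1.w;
    f[8] = r2.x; f[9] = r2.y; f[10] = r2.z; f[11] = r2.w;

    // 4 float3 per thread; written out here only to verify the reordering.
    const int base = pair * 8 + (lane & 1) * 4;
    for (int i = 0; i < 4; ++i)
        out[base + i] = make_float3(f[3 * i], f[3 * i + 1], f[3 * i + 2]);
}
```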
You’re loading the same amount of data, and the problem is memory bound. Furthermore, although the load pattern for float3 may appear to be “not as efficient” as the load pattern for float4, the memory controller (and caches) are evidently “fixing” the problem. There is considerable temporal locality in the requests - the LD instructions are quite close to each other. Therefore the memory controller can spot this and not issue duplicate loads. So the memory bus utilization is likely to be very similar in both cases. And when it comes to distribution of the data after it returns on the memory bus, the caches will help there.
The data doesn’t support any other conclusion, in my view.
The screenshots show 66% L1 cache utilization in the float3 case and 0% in the float4 case, and the memory pipes are 3x as busy for float3. That is not a problem for your kernel yet: the L1 and L2 hit rates are still quite low, so with slow global memory you cannot process more memory operations anyway.
Just seeing that you are using more elements in the float3 case: 4 * N instead of 3 * N. So the same overall amount of data is loaded and stored in both cases: 4 * N * 12 bytes for float3 equals 3 * N * 16 bytes for float4, i.e. 48 * N bytes each.