Intermediate multiplications degrade warp shuffle reduction performance

Hi! I find that intermediate multiplications degrade the performance of a warp-shuffle reduction by about 40%. Why do those multiplications introduce such a high overhead?

Thank you very much!

#include <cstdio>

__global__ void test_shfl(float * A){
    float x = threadIdx.x;
    float sum = 0;
    // computation
    for(int i = 0; i < 10; ++i){
        x = __cosf(x);
        x = __cosf(1 - x);
        x = __cosf(1 - x);
        x = __cosf(1 - x);
        x = __cosf(1 - x);
        x = __cosf(1 - x);
        x = __cosf(1 - x);
        x = __cosf(1 - x);
        x = __cosf(1 - x);
        x = __cosf(1 - x);
        x = __cosf(1 - x);
    }

    // Reduction with warp shuffling, execution time = 0.229 ms
    sum =  __shfl_xor_sync(0xffffffff, x, 16, 32);
    sum += __shfl_xor_sync(0xffffffff, sum, 8, 32);
    sum += __shfl_xor_sync(0xffffffff, sum, 4, 32);
    sum += __shfl_xor_sync(0xffffffff, sum, 2, 32);
    sum += __shfl_xor_sync(0xffffffff, sum, 1, 32);
    
    // Intermediate multiplication + reduction with warp shuffling, execution time = 0.39 ms
    // sum =  0.001 * __shfl_xor_sync(0xffffffff, x, 16, 32);
    // sum += 0.001 * __shfl_xor_sync(0xffffffff, sum, 8, 32);
    // sum += 0.001 * __shfl_xor_sync(0xffffffff, sum, 4, 32);
    // sum += 0.001 * __shfl_xor_sync(0xffffffff, sum, 2, 32);
    // sum += 0.001 * __shfl_xor_sync(0xffffffff, sum, 1, 32);
    
    // Memory
    atomicAdd(A, sum);
}

int main(int argc, char** argv){
    cudaEvent_t fft_begin, fft_end;
    float elapsed_time;
    float *dA, *A;
    A = (float*)malloc(sizeof(float));
    cudaEventCreate(&fft_begin);
    cudaEventCreate(&fft_end); 
    cudaMalloc((void**) &dA, sizeof(float) * 1);
    cudaFuncSetAttribute(test_shfl, cudaFuncAttributeMaxDynamicSharedMemorySize, 65536);
    cudaEventRecord(fft_begin);
    
    for(int i = 0; i < 10; ++i){
        test_shfl <<<1, 1024, 65536>>>(dA);
    }
    cudaEventRecord(fft_end);
    cudaEventSynchronize(fft_begin);
    cudaEventSynchronize(fft_end);
    cudaEventElapsedTime(&elapsed_time, fft_begin, fft_end);

    cudaMemcpy((void*)A, (void*)dA, sizeof(float), cudaMemcpyDeviceToHost);

    printf("%d, %f\n", elapsed_time, *A);

    return 0;
}

Probably because you are doing more work.

The conventional wisdom that flops are cheap on a GPU must be considered in light of the comparison being made. If you have a conventional reduction from global memory with a large number of global loads, the additional arithmetic work may well be negligible to overall performance. But you don’t have that. All of your work consists of relatively low-latency instructions (compared to global load latency), at least up to the final atomic, so adding more work takes proportionally more time.

Furthermore, 0.001 is a double-precision constant. You are taking a float quantity, promoting it to double, doing a double-precision multiply, promoting a float add to double, and then converting the result back to float. Especially on a GeForce GPU, this can have a noticeable additional impact on performance. You would likely get better performance by writing it as a float constant, i.e. 0.001f.
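For illustration only, the two forms side by side (same variables as in your kernel): the first literal forces the multiply into double precision, the second keeps everything in single precision.

    sum = 0.001  * __shfl_xor_sync(0xffffffff, x, 16, 32);  // double literal: promote to double, FP64 multiply, convert back to float
    sum = 0.001f * __shfl_xor_sync(0xffffffff, x, 16, 32);  // float literal: the multiply stays entirely in FP32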

There are other possibly noteworthy aspects to your code, such as the fact that you don’t initialize the global data before doing an atomic add to it, and it’s also not clear why you would do an atomic add from every thread when you have already reduced values across each warp. You’re also using an incorrect format specifier (%d) to print the float quantity elapsed_time.
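A minimal sketch of the first two points, assuming the butterfly reduction leaves the same reduced value in every lane of the warp (not a drop-in fix for your code as written):

    // host side, before the kernel launch: zero the accumulator
    cudaMemset(dA, 0, sizeof(float));

    // device side, at the end of the kernel: one atomic per warp instead of one per thread
    if ((threadIdx.x % 32) == 0)
        atomicAdd(A, sum);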

After a bit of testing on a GeForce 1660 Super, my guess is that the float/double conversions are the primary issue:

$ cat t52.cu
#include <cstdio>

__global__ void test_shfl(float * A){
    float x = threadIdx.x;
    float sum = 0;
    // computation
    for(int i = 0; i < 10; ++i){
        x = __cosf(x);
        x = __cosf(1 - x);
        x = __cosf(1 - x);
        x = __cosf(1 - x);
        x = __cosf(1 - x);
        x = __cosf(1 - x);
        x = __cosf(1 - x);
        x = __cosf(1 - x);
        x = __cosf(1 - x);
        x = __cosf(1 - x);
        x = __cosf(1 - x);
    }
#ifndef USE_MULT
    // Reduction with warp shuffling, execution time = 0.229 ms
    sum =  __shfl_xor_sync(0xffffffff, x, 16, 32);
    sum += __shfl_xor_sync(0xffffffff, sum, 8, 32);
    sum += __shfl_xor_sync(0xffffffff, sum, 4, 32);
    sum += __shfl_xor_sync(0xffffffff, sum, 2, 32);
    sum += __shfl_xor_sync(0xffffffff, sum, 1, 32);
#else
    // Intermediate multiplication + reduction with warp shuffling, execution time = 0.39 ms
#ifndef MY_CONST
#define MY_CONST 0.001f
#endif
    sum =  MY_CONST * __shfl_xor_sync(0xffffffff, x, 16, 32);
    sum += MY_CONST * __shfl_xor_sync(0xffffffff, sum, 8, 32);
    sum += MY_CONST * __shfl_xor_sync(0xffffffff, sum, 4, 32);
    sum += MY_CONST * __shfl_xor_sync(0xffffffff, sum, 2, 32);
    sum += MY_CONST * __shfl_xor_sync(0xffffffff, sum, 1, 32);
#endif
    // Memory
    atomicAdd(A, sum);
}

int main(int argc, char** argv){
    cudaEvent_t fft_begin, fft_end;
    float elapsed_time;
    float *dA, *A;
    A = (float*)malloc(sizeof(float));
    cudaEventCreate(&fft_begin);
    cudaEventCreate(&fft_end);
    cudaMalloc((void**) &dA, sizeof(float) * 1);
    cudaMemset(dA, 0, sizeof(float));
    cudaFuncSetAttribute(test_shfl, cudaFuncAttributeMaxDynamicSharedMemorySize, 65536);
    cudaEventRecord(fft_begin);

    for(int i = 0; i < 10; ++i){
        test_shfl <<<1, 1024, 65536>>>(dA);
    }
    cudaEventRecord(fft_end);
    cudaEventSynchronize(fft_begin);
    cudaEventSynchronize(fft_end);
    cudaEventElapsedTime(&elapsed_time, fft_begin, fft_end);

    cudaMemcpy((void*)A, (void*)dA, sizeof(float), cudaMemcpyDeviceToHost);

    printf("%f, %f\n", elapsed_time, *A);

    return 0;
}
$ nvcc -o t52 t52.cu
$ ./t52
0.098528, 163840.000000
$ nvcc -o t52 t52.cu -DUSE_MULT -DMY_CONST=0.001
$ ./t52
0.153792, 10.282268
$ nvcc -o t52 t52.cu -DUSE_MULT -DMY_CONST=0.001f
$ ./t52
0.098912, 10.282268
$

There may be other possibilities as well, such as measuring performance on a debug build (compiled with the debug switch -G).
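For reference, a debug build would be produced with something like the following (the output name t52_debug is just an arbitrary choice here); it is not something to time against, since -G disables most device code optimizations:

$ nvcc -G -o t52_debug t52.cu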


Hi Robert,

Many thanks for your kindly reply.

This demo is just for validation of the idea, not for practical use; I implemented it only to measure the overhead of the intermediate multiplications.

After changing 0.001 to 0.001f, the overhead drops from 40% to 1%. It seems the datatype conversion was introducing the overhead.

Thanks for your help again!

Best,
Shixun

That is only part of the problem. On modern consumer GPUs, double-precision operations have a massively lower throughput than single-precision operations. So the double-precision multiply itself also contributes not insignificantly to the slowdown observed.


Thanks for your reply! I tested the demo on a Tesla T4 (Turing) GPU with sm_75, compiled with -O3. From the CUDA Programming Guide, the throughput of double precision is 32 per cycle, while single precision is 64 per cycle.

Check section 19.6.1 of the current Programming Guide: For sm_75 (Turing), there are 64 FP32 units per SM, but only 2 FP64 units per SM, so the throughput ratio FP32 / FP64 is 32. You may have looked at sm_70 (Volta), which is used in professional GPUs.

Exactly! I found this detail in the reference [5]: the throughput of double-precision operations is 2 per cycle for compute capability 7.5 GPUs.
