Vector global load/store regression from CUDA 6.5 to CUDA 7.0

I run many-GPU simulations of nanomotors using hybrid molecular dynamics and multi-particle collision dynamics. After upgrading from the CUDA 6.5 driver (340.65) to the CUDA 7.0 driver (346.82), I noticed that the overall run-time of my hybrid OpenCL/MPI code increased by about 10%.

I was able to isolate the performance regression to a specific memory-bandwidth-bound kernel, whose run-time increases by 35% under the CUDA 7.0 driver. The issue can be reproduced with these self-contained test cases:

git clone https://gist.github.com/fdb18f394e33f8eb5b27 vector_global_load_store

cd vector_global_load_store && make

Each work-item loads two n-dimensional vectors from global memory, performs a component-wise multiply-add, and stores the resulting vector back to global memory:

#if __OPENCL_VERSION__ < 120
/* double precision requires an extension prior to OpenCL 1.2 */
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#endif

/* Euler step: d_pos += d_vel * timestep, one double3 per work-item */
__kernel void test_double3(__global double3 *restrict d_pos,
                           __global double3 *restrict d_vel)
{
  const uint gid = get_global_id(0);
  const double timestep = 0.001;
  d_pos[gid] += d_vel[gid]*timestep;
}

The tests measure the mean run-time of 1000 kernel executions and dump the compiled PTX pseudo-assembly code to a file.

These results were obtained using a Tesla K20m on CentOS 6.6 x86_64:

OpenCL 1.1 CUDA 6.5.33

test_double: 0.672 ms
test_double2: 1.36 ms
test_double3: 2.94 ms
test_double4: 2.99 ms

OpenCL 1.1 CUDA 7.0.53

test_double: 0.673 ms
test_double2: 1.31 ms
test_double3: 3.83 ms
test_double4: 3.82 ms

The run-times for the scalar (double) and 2-component vector (double2) cases are comparable across drivers. The run-time for the 3- and 4-component vectors (double3 and double4), however, increases by roughly 30% with the CUDA 7.0 driver.

For readers familiar with CUDA vector types, note that OpenCL C vector types are subtly different. Every OpenCL C vector type, including the 3-component vectors, is aligned to a power of two, which makes all of the vector types suitable for coalesced access to global memory as far as the device supports it. For example, CUDA aligns float3 to 3 × sizeof(float) = 12 bytes, whereas OpenCL C aligns float3 to 4 × sizeof(float) = 16 bytes.

Further, OpenCL C implements component-wise vector algebra for these types, such as addition and multiplication. While a float3 vector occupies four components' worth of storage in global memory, only three components take part when two float3 vectors are added. The OpenCL C vector types float3 and double3 are therefore convenient in particle simulation codes for storing coordinates, velocities and forces.

A comparison of the PTX pseudo-assembly codes produced by the two driver versions shows that global loads and stores of double3 and double4 vector types are compiled differently:

//
// Generated by NVIDIA NVVM Compiler
// Compiler built on Tue Dec  2 13:17:36 2014 (1417544256)
// Driver 340.65
//

.version 3.0
.target sm_30, texmode_independent
.address_size 32

.entry test_double3(
        .param .u32 .ptr .global .align 32 test_double3_param_0,
        .param .u32 .ptr .global .align 32 test_double3_param_1
)
{
        .reg .f64       %fd<19>;
        .reg .s32       %r<12>;

ld.param.u32    %r5, [test_double3_param_0];
        ld.param.u32    %r6, [test_double3_param_1];
        // inline asm
        mov.u32         %r1, %envreg3;
        // inline asm
        // inline asm
        mov.u32         %r2, %ntid.x;
        // inline asm
        // inline asm
        mov.u32         %r3, %ctaid.x;
        // inline asm
        // inline asm
        mov.u32         %r4, %tid.x;
        // inline asm
        add.s32         %r7, %r4, %r1;
        mad.lo.s32      %r8, %r3, %r2, %r7;
        shl.b32         %r9, %r8, 5;
        add.s32         %r10, %r6, %r9;
        mov.f64         %fd1, 0d3F50624DD2F1A9FC;
        add.s32         %r11, %r5, %r9;
        mov.f64         %fd2, 0d0000000000000000;
        ld.global.v2.f64        {%fd3, %fd4}, [%r11];
        ld.global.v2.f64        {%fd5, %fd6}, [%r10];
        fma.rn.f64      %fd9, %fd5, %fd1, %fd3;
        fma.rn.f64      %fd10, %fd6, %fd1, %fd4;
        ld.global.v2.f64        {%fd11, %fd12}, [%r11+16];
        ld.global.v2.f64        {%fd13, %fd14}, [%r10+16];
        st.global.v2.f64        [%r11], {%fd9, %fd10};
        fma.rn.f64      %fd15, %fd13, %fd1, %fd11;
        st.global.v2.f64        [%r11+16], {%fd15, %fd2};
        ret;
}
//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-19683103
// Driver 346.82
// Based on LLVM 3.4svn
//

.version 4.2
.target sm_35, texmode_independent
.address_size 32

        // .globl       test_double3

.entry test_double3(
        .param .u32 .ptr .global .align 32 test_double3_param_0,
        .param .u32 .ptr .global .align 32 test_double3_param_1
)
{
        .reg .s32       %r<12>;
        .reg .f64       %fd<10>;

ld.param.u32    %r1, [test_double3_param_0];
        ld.param.u32    %r2, [test_double3_param_1];
        mov.b32 %r3, %envreg3;
        mov.u32         %r4, %ntid.x;
        mov.u32         %r5, %ctaid.x;
        mad.lo.s32      %r6, %r5, %r4, %r3;
        mov.u32         %r7, %tid.x;
        add.s32         %r8, %r6, %r7;
        shl.b32         %r9, %r8, 5;
        add.s32         %r10, %r2, %r9;
        ld.global.f64   %fd1, [%r10+16];
        ld.global.f64   %fd2, [%r10+8];
        ld.global.f64   %fd3, [%r10];
        add.s32         %r11, %r1, %r9;
        ld.global.f64   %fd4, [%r11];
        fma.rn.f64      %fd5, %fd3, 0d3F50624DD2F1A9FC, %fd4;
        ld.global.f64   %fd6, [%r11+8];
        fma.rn.f64      %fd7, %fd2, 0d3F50624DD2F1A9FC, %fd6;
        ld.global.f64   %fd8, [%r11+16];
        fma.rn.f64      %fd9, %fd1, 0d3F50624DD2F1A9FC, %fd8;
        st.global.f64   [%r11+16], %fd9;
        st.global.f64   [%r11+8], %fd7;
        st.global.f64   [%r11], %fd5;
        ret;
}

The Kepler microarchitecture supports global loads and stores of up to 16 bytes. For the double3 and double4 vector types, the CUDA 6.5 driver splits each load/store into two 16-byte operations. The CUDA 7.0 driver, however, splits each load/store into three or four 8-byte operations. The latter is less efficient: issuing three or four memory transactions instead of two increases the exposed memory latency and reduces the achieved bandwidth.

I would suggest filing a bug report at http://developer.nvidia.com.

I would also suggest testing with the latest available driver for K20m, such as the 352.39 driver that is included with the CUDA 7.5 toolkit.

Using the latest CUDA 7.5 driver (version 352.41), the run-time for the double3 case is indeed better than with the CUDA 7.0 driver:

OpenCL 1.2 CUDA 7.5.18
Tesla K20m
./test_double: 0.68 ms
./test_double2: 1.31 ms
./test_double3: 3.16 ms
./test_double4: 3.01 ms

However, my simulation code is still 10% slower with the CUDA 7.5 driver compared to the CUDA 6.5 driver.

I wrote a second test case, which is closer to the original kernel that incurred the 35% performance regression:

git clone https://gist.github.com/d97adc6e3ab2d8eb9171
cd d97adc6e3ab2d8eb9171 && make
./verlet

These are the results for the CUDA 6.5/7.0/7.5 drivers:

OpenCL 1.1 CUDA 6.5.33
Tesla K20m
./verlet: 6.45 ms
OpenCL 1.1 CUDA 7.0.53
Tesla K20m
./verlet: 8.42 ms
OpenCL 1.2 CUDA 7.5.18
Tesla K20m
./verlet: 8.35 ms

The CUDA 7.0 driver is 30.5% slower than the CUDA 6.5 driver.
The CUDA 7.5 driver is 29.5% slower than the CUDA 6.5 driver.

I have filed a bug report in the NVIDIA developer zone:

https://partners.nvidia.com/bug/viewbug/1687670