Vector global load/store regression from CUDA 6.5 to CUDA 7.0

I run many-GPU simulations of nanomotors using hybrid molecular dynamics and multi-particle collision dynamics. After upgrading from the CUDA 6.5 driver (340.65) to the CUDA 7.0 driver (346.82), I noticed that the overall run-time of my hybrid OpenCL/MPI code increased by about 10%.

I was able to isolate the performance regression to a specific memory-bandwidth-bound kernel, whose run-time increases by 35% under the CUDA 7.0 driver. The issue can be reproduced with these self-contained test cases:

git clone https://gist.github.com/fdb18f394e33f8eb5b27 vector_global_load_store

cd vector_global_load_store && make

Each work-item loads two n-dimensional vectors from global memory, performs a component-wise multiply-add, and stores the resulting vector back to global memory:

#if __OPENCL_VERSION__ < 120
/* double precision requires an extension prior to OpenCL 1.2 */
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#endif

/* Euler step: d_pos += d_vel * timestep, one double3 per work-item */
__kernel void test_double3(__global double3 *restrict d_pos,
                           __global double3 *restrict d_vel)
{
  const uint gid = get_global_id(0);
  const double timestep = 0.001;
  d_pos[gid] += d_vel[gid]*timestep;
}

The tests measure the mean run-time of 1000 kernel executions and dump the compiled PTX pseudo-assembly code to a file.

These results were obtained using a Tesla K20m on CentOS 6.6 x86_64:

OpenCL 1.1 CUDA 6.5.33

test_double: 0.672 ms
test_double2: 1.36 ms
test_double3: 2.94 ms
test_double4: 2.99 ms

OpenCL 1.1 CUDA 7.0.53

test_double: 0.673 ms
test_double2: 1.31 ms
test_double3: 3.83 ms
test_double4: 3.82 ms

The run-times for the scalar (double) and 2-component vector (double2) cases are comparable across drivers. The run-time for the 3- and 4-component vectors (double3 and double4), however, increases by roughly 30% with the CUDA 7.0 driver.

For readers familiar with CUDA vector types, note that OpenCL C vector types are subtly different. Every OpenCL C vector type, including the 3-component vectors, is aligned to a power of two, which makes all of the vector types suitable for coalesced access to global memory as far as the device supports it. For example, CUDA aligns float3 to 3 × sizeof(float) = 12 bytes, whereas OpenCL C aligns float3 to 4 × sizeof(float) = 16 bytes.

Further, OpenCL C implements component-wise vector algebra for these types, such as addition and multiplication. While a float3 vector occupies four components' worth of storage in global memory, only three components take part when two float3 vectors are added. The OpenCL C vector types float3 and double3 are therefore convenient in particle simulation codes for storing coordinates, velocities and forces.

A comparison of the PTX pseudo-assembly codes produced by the two driver versions shows that global loads and stores of double3 and double4 vector types are compiled differently:

//
// Generated by NVIDIA NVVM Compiler
// Compiler built on Tue Dec  2 13:17:36 2014 (1417544256)
// Driver 340.65
//

.version 3.0
.target sm_30, texmode_independent
.address_size 32

.entry test_double3(
        .param .u32 .ptr .global .align 32 test_double3_param_0,
        .param .u32 .ptr .global .align 32 test_double3_param_1
)
{
        .reg .f64       %fd<19>;
        .reg .s32       %r<12>;

ld.param.u32    %r5, [test_double3_param_0];
        ld.param.u32    %r6, [test_double3_param_1];
        // inline asm
        mov.u32         %r1, %envreg3;
        // inline asm
        // inline asm
        mov.u32         %r2, %ntid.x;
        // inline asm
        // inline asm
        mov.u32         %r3, %ctaid.x;
        // inline asm
        // inline asm
        mov.u32         %r4, %tid.x;
        // inline asm
        add.s32         %r7, %r4, %r1;
        mad.lo.s32      %r8, %r3, %r2, %r7;
        shl.b32         %r9, %r8, 5;
        add.s32         %r10, %r6, %r9;
        mov.f64         %fd1, 0d3F50624DD2F1A9FC;
        add.s32         %r11, %r5, %r9;
        mov.f64         %fd2, 0d0000000000000000;
        ld.global.v2.f64        {%fd3, %fd4}, [%r11];
        ld.global.v2.f64        {%fd5, %fd6}, [%r10];
        fma.rn.f64      %fd9, %fd5, %fd1, %fd3;
        fma.rn.f64      %fd10, %fd6, %fd1, %fd4;
        ld.global.v2.f64        {%fd11, %fd12}, [%r11+16];
        ld.global.v2.f64        {%fd13, %fd14}, [%r10+16];
        st.global.v2.f64        [%r11], {%fd9, %fd10};
        fma.rn.f64      %fd15, %fd13, %fd1, %fd11;
        st.global.v2.f64        [%r11+16], {%fd15, %fd2};
        ret;
}
//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-19683103
// Driver 346.82
// Based on LLVM 3.4svn
//

.version 4.2
.target sm_35, texmode_independent
.address_size 32

        // .globl       test_double3

.entry test_double3(
        .param .u32 .ptr .global .align 32 test_double3_param_0,
        .param .u32 .ptr .global .align 32 test_double3_param_1
)
{
        .reg .s32       %r<12>;
        .reg .f64       %fd<10>;

ld.param.u32    %r1, [test_double3_param_0];
        ld.param.u32    %r2, [test_double3_param_1];
        mov.b32 %r3, %envreg3;
        mov.u32         %r4, %ntid.x;
        mov.u32         %r5, %ctaid.x;
        mad.lo.s32      %r6, %r5, %r4, %r3;
        mov.u32         %r7, %tid.x;
        add.s32         %r8, %r6, %r7;
        shl.b32         %r9, %r8, 5;
        add.s32         %r10, %r2, %r9;
        ld.global.f64   %fd1, [%r10+16];
        ld.global.f64   %fd2, [%r10+8];
        ld.global.f64   %fd3, [%r10];
        add.s32         %r11, %r1, %r9;
        ld.global.f64   %fd4, [%r11];
        fma.rn.f64      %fd5, %fd3, 0d3F50624DD2F1A9FC, %fd4;
        ld.global.f64   %fd6, [%r11+8];
        fma.rn.f64      %fd7, %fd2, 0d3F50624DD2F1A9FC, %fd6;
        ld.global.f64   %fd8, [%r11+16];
        fma.rn.f64      %fd9, %fd1, 0d3F50624DD2F1A9FC, %fd8;
        st.global.f64   [%r11+16], %fd9;
        st.global.f64   [%r11+8], %fd7;
        st.global.f64   [%r11], %fd5;
        ret;
}

The Kepler microarchitecture supports global loads and stores of up to 16 bytes. For the double3 and double4 vector types, the CUDA 6.5 driver splits each load/store into two 16-byte operations. The CUDA 7.0 driver, however, splits each load/store into three or four 8-byte operations. The latter is less efficient: issuing three or four memory transactions instead of two increases the exposed memory latency and reduces the achieved bandwidth.

I would suggest filing a bug report at http://developer.nvidia.com.

I would also suggest testing with the latest available driver for K20m, such as the 352.39 driver that is included with the CUDA 7.5 toolkit.

Using the latest CUDA 7.5 driver (version 352.41), the run-time for the double3 case is indeed better than with the CUDA 7.0 driver:

OpenCL 1.2 CUDA 7.5.18
Tesla K20m
./test_double: 0.68 ms
./test_double2: 1.31 ms
./test_double3: 3.16 ms
./test_double4: 3.01 ms

However, my simulation code is still 10% slower with the CUDA 7.5 driver compared to the CUDA 6.5 driver.

I wrote a second test case, which is closer to the original kernel that incurred the 35% performance regression:

git clone https://gist.github.com/d97adc6e3ab2d8eb9171
cd d97adc6e3ab2d8eb9171 && make
./verlet

These are the results for the CUDA 6.5/7.0/7.5 drivers:

OpenCL 1.1 CUDA 6.5.33
Tesla K20m
./verlet: 6.45 ms
OpenCL 1.1 CUDA 7.0.53
Tesla K20m
./verlet: 8.42 ms
OpenCL 1.2 CUDA 7.5.18
Tesla K20m
./verlet: 8.35 ms

The CUDA 7.0 driver is 30.5% slower than the CUDA 6.5 driver.
The CUDA 7.5 driver is 29.5% slower than the CUDA 6.5 driver.

I have filed a bug report in the NVIDIA developer zone:

https://partners.nvidia.com/bug/viewbug/1687670