I run many-GPU simulations of nanomotors using hybrid molecular dynamics and multi-particle collision dynamics. After upgrading from the CUDA 6.5 driver (340.65) to the CUDA 7.0 driver (346.82), I noticed that the overall run-time of the hybrid OpenCL/MPI code increased by about 10%.
I was able to isolate the performance regression to a specific memory-bandwidth-bound kernel, whose run-time increased by 35% under the CUDA 7.0 driver. The issue can be reproduced with these self-contained test cases:
git clone https://gist.github.com/fdb18f394e33f8eb5b27 vector_global_load_store
cd vector_global_load_store && make
Each work-item loads two n-dimensional vectors from global memory, performs a component-wise multiply-add, and stores the resulting vector back to global memory:
#if __OPENCL_VERSION__ < 120
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#endif
__kernel void test_double3(__global double3 *restrict d_pos,
                           __global double3 *restrict d_vel)
{
    const uint gid = get_global_id(0);
    const double timestep = 0.001;

    d_pos[gid] += d_vel[gid]*timestep;
}
The tests measure the mean time of 1000 executions, and dump the compiled PTX pseudo-assembly code to a file.
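For reference, the timing and PTX-dump part of the harness can be sketched roughly as follows. This is an illustrative host-side fragment only (the variables `queue`, `kernel`, `program`, and `global_size` are placeholders, and error checking is omitted); it assumes a command queue created with CL_QUEUE_PROFILING_ENABLE:

```c
/* Sketch: time the kernel via OpenCL profiling events. */
cl_event event;
cl_ulong start, end;
double total_ms = 0;

for (int i = 0; i < 1000; ++i) {
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, NULL, 0, NULL, &event);
    clWaitForEvents(1, &event);
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    total_ms += (end - start) * 1e-6;  /* nanoseconds to milliseconds */
    clReleaseEvent(event);
}
printf("mean: %g ms\n", total_ms / 1000);

/* On NVIDIA's OpenCL, clGetProgramInfo(CL_PROGRAM_BINARIES) returns
   the PTX pseudo-assembly as text, which can be written to a file. */
size_t size;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, NULL);
unsigned char *ptx = malloc(size);
clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(ptx), &ptx, NULL);
fwrite(ptx, 1, size, fopen("test_double3.ptx", "w"));
```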
These results were obtained using a Tesla K20m on CentOS 6.6 x86_64:
OpenCL 1.1 CUDA 6.5.33
test_double: 0.672 ms
test_double2: 1.36 ms
test_double3: 2.94 ms
test_double4: 2.99 ms
OpenCL 1.1 CUDA 7.0.53
test_double: 0.673 ms
test_double2: 1.31 ms
test_double3: 3.83 ms
test_double4: 3.82 ms
The run-times for the scalar (double) and the 2-component vector (double2) are comparable across driver versions. The run-times for the 3- and 4-component vectors (double3 and double4), however, have increased by about 30% with the CUDA 7.0 driver.
For readers familiar with CUDA vector types, please note that OpenCL C vector types are subtly different. All of the OpenCL C vector types, including the 3-component vectors, are aligned to a power-of-two size, which makes all of the vector types suitable for coalesced access to global memory as far as supported by the device. For example, a CUDA float3 occupies 3*sizeof(float) bytes, whereas OpenCL C aligns float3 to 4*sizeof(float) bytes.
Further, OpenCL C implements vector algebra for its vector types, such as component-wise addition and multiplication. While a float3 vector occupies four components in global memory, only three components are actually used when adding two float3 vectors. The OpenCL C vector types float3 and double3 are therefore useful in particle simulation codes for storing coordinates, velocities, and forces.
A comparison of the PTX pseudo-assembly codes produced by the two driver versions shows that global loads and stores of double3 and double4 vector types are compiled differently:
//
// Generated by NVIDIA NVVM Compiler
// Compiler built on Tue Dec 2 13:17:36 2014 (1417544256)
// Driver 340.65
//
.version 3.0
.target sm_30, texmode_independent
.address_size 32
.entry test_double3(
.param .u32 .ptr .global .align 32 test_double3_param_0,
.param .u32 .ptr .global .align 32 test_double3_param_1
)
{
.reg .f64 %fd<19>;
.reg .s32 %r<12>;
ld.param.u32 %r5, [test_double3_param_0];
ld.param.u32 %r6, [test_double3_param_1];
// inline asm
mov.u32 %r1, %envreg3;
// inline asm
// inline asm
mov.u32 %r2, %ntid.x;
// inline asm
// inline asm
mov.u32 %r3, %ctaid.x;
// inline asm
// inline asm
mov.u32 %r4, %tid.x;
// inline asm
add.s32 %r7, %r4, %r1;
mad.lo.s32 %r8, %r3, %r2, %r7;
shl.b32 %r9, %r8, 5;
add.s32 %r10, %r6, %r9;
mov.f64 %fd1, 0d3F50624DD2F1A9FC;
add.s32 %r11, %r5, %r9;
mov.f64 %fd2, 0d0000000000000000;
ld.global.v2.f64 {%fd3, %fd4}, [%r11];
ld.global.v2.f64 {%fd5, %fd6}, [%r10];
fma.rn.f64 %fd9, %fd5, %fd1, %fd3;
fma.rn.f64 %fd10, %fd6, %fd1, %fd4;
ld.global.v2.f64 {%fd11, %fd12}, [%r11+16];
ld.global.v2.f64 {%fd13, %fd14}, [%r10+16];
st.global.v2.f64 [%r11], {%fd9, %fd10};
fma.rn.f64 %fd15, %fd13, %fd1, %fd11;
st.global.v2.f64 [%r11+16], {%fd15, %fd2};
ret;
}
//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-19683103
// Driver 346.82
// Based on LLVM 3.4svn
//
.version 4.2
.target sm_35, texmode_independent
.address_size 32
// .globl test_double3
.entry test_double3(
.param .u32 .ptr .global .align 32 test_double3_param_0,
.param .u32 .ptr .global .align 32 test_double3_param_1
)
{
.reg .s32 %r<12>;
.reg .f64 %fd<10>;
ld.param.u32 %r1, [test_double3_param_0];
ld.param.u32 %r2, [test_double3_param_1];
mov.b32 %r3, %envreg3;
mov.u32 %r4, %ntid.x;
mov.u32 %r5, %ctaid.x;
mad.lo.s32 %r6, %r5, %r4, %r3;
mov.u32 %r7, %tid.x;
add.s32 %r8, %r6, %r7;
shl.b32 %r9, %r8, 5;
add.s32 %r10, %r2, %r9;
ld.global.f64 %fd1, [%r10+16];
ld.global.f64 %fd2, [%r10+8];
ld.global.f64 %fd3, [%r10];
add.s32 %r11, %r1, %r9;
ld.global.f64 %fd4, [%r11];
fma.rn.f64 %fd5, %fd3, 0d3F50624DD2F1A9FC, %fd4;
ld.global.f64 %fd6, [%r11+8];
fma.rn.f64 %fd7, %fd2, 0d3F50624DD2F1A9FC, %fd6;
ld.global.f64 %fd8, [%r11+16];
fma.rn.f64 %fd9, %fd1, 0d3F50624DD2F1A9FC, %fd8;
st.global.f64 [%r11+16], %fd9;
st.global.f64 [%r11+8], %fd7;
st.global.f64 [%r11], %fd5;
ret;
}
The Kepler microarchitecture supports global loads and stores of up to 16 bytes. For the double3 and double4 vector types, the CUDA 6.5 driver splits each load/store into two 16-byte operations. The CUDA 7.0 driver, however, splits each load/store into three or four 8-byte operations. The latter appears to be less efficient due to the increased memory latency of three or four operations versus two.
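Until this is fixed in the driver, one possible workaround is to force the 16-byte accesses by hand, viewing each 32-byte double3/double4 element as a pair of double2 values. A sketch only (the kernel name and the reinterpretation are mine, and I have not inspected the PTX this variant produces):

```c
/* Workaround sketch: element gid occupies the two double2 slots
   2*gid and 2*gid + 1 of the reinterpreted arrays, so each element
   is moved with two 16-byte accesses, matching the CUDA 6.5 output. */
__kernel void test_double3_as_double2(__global double2 *restrict d_pos,
                                      __global double2 *restrict d_vel)
{
    const uint gid = get_global_id(0);
    const double timestep = 0.001;

    /* This also updates the unused fourth component, which is
       harmless here since it merely stores back a padding value. */
    d_pos[2*gid]     += d_vel[2*gid]*timestep;
    d_pos[2*gid + 1] += d_vel[2*gid + 1]*timestep;
}
```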