Could anyone give advice on optimizing elementwise add/mul?

siri · April 4, 2019, 5:47am

Hi, it’s siri again. To optimize elementwise add, I wrote the kernel below, but it does not seem to improve performance. Could anyone give some advice on that? Thanks!

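__global__ void device_add_vector4_kernel(int* d_in, int* d_in1, int* d_out, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    // Vectorized part: each iteration loads, adds, and stores four ints at once.
    // The int4 accesses assume all three pointers are 16-byte aligned.
    for (int i = idx; i < N / 4; i += stride) {
        int4 tmp;
        int4 in0 = reinterpret_cast<int4*>(d_in)[i];
        int4 in1 = reinterpret_cast<int4*>(d_in1)[i];
        tmp.x = in0.x + in1.x;
        tmp.y = in0.y + in1.y;
        tmp.z = in0.z + in1.z;
        tmp.w = in0.w + in1.w;
        reinterpret_cast<int4*>(d_out)[i] = tmp;
    }
    // Scalar tail: the last N % 4 elements do not fill a whole int4.
    for (int i = (N / 4) * 4 + idx; i < N; i += stride) {
        d_out[i] = d_in[i] + d_in1[i];
    }
}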

Compared with the kernel below:

__global__ void device_add_v1_kernel(int* d_in, int* d_in1, int* d_out, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Grid-stride loop: each thread adds one int per iteration.
    for (int i = idx; i < N; i += blockDim.x * gridDim.x) {
        int in0 = d_in[i];
        int in1 = d_in1[i];
        d_out[i] = in0 + in1;
    }
}

njuffa · April 4, 2019, 6:28am

Does not improve performance compared to what?

What does the CUDA profiler tell you about the performance characteristics of this kernel? I recently explained in another thread that on modern GPUs, the width of individual accesses should make little to no difference in memory throughput (this was different on GPUs ten years ago).

Please note that GPUs require data to be naturally aligned, so accessing ‘int4’ requires 128-bit alignment, which may not be guaranteed by the d_in, d_in1, and d_out pointers, which point to ‘int’ data which is 32-bit aligned.
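The alignment requirement can be checked on the host before choosing the int4 kernel. The following is a minimal sketch, not code from this thread: the helper names `is_int4_aligned` and `can_use_int4_path` are hypothetical, and it assumes the caller falls back to the scalar kernel when the check fails.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// int4 is declared with 16-byte (128-bit) alignment in the CUDA headers.
static_assert(alignof(int4) == 16, "int4 is expected to be 16-byte aligned");

// Hypothetical helper: true if p may be reinterpreted as an int4*.
inline bool is_int4_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % alignof(int4) == 0;
}

// Hypothetical helper: true only if all three buffers are aligned for int4 access.
inline bool can_use_int4_path(const int* in0, const int* in1, const int* out) {
    return is_int4_aligned(in0) && is_int4_aligned(in1) && is_int4_aligned(out);
}
```

Pointers returned directly by cudaMalloc are aligned to at least 256 bytes, so they pass this check. A pointer into the middle of an allocation, such as d_in + 1, does not, and the vectorized kernel should not be used with it.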

siri · April 4, 2019, 9:12am

Thanks for your reply. I added the kernel I compared against. If throughput is not the bottleneck, which part should I look at?
I ran nvprof on device_add_v1_kernel:
HW: V100
Achieved Occupancy 0.96540
Global Load Throughput 513.20GB/s
Global Store Throughput 257.47GB/s

So is there any way to improve performance?

njuffa · April 4, 2019, 5:38pm

Those numbers look good to me. The read bandwidth is twice the write bandwidth, as expected for a kernel that reads two values for every value it writes, and the sum (about 770 GB/sec) is in the range I would expect for a V100. I have never used a V100 myself, but the general expectation is that a GPU achieves around 80% of theoretical bandwidth in real life.
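To put the profiler numbers in context: 513.20 + 257.47 = 770.67 GB/sec, which is roughly 86% of the 900 GB/sec peak commonly quoted for V100 (that peak figure is an assumption here, not something stated in the thread). The same kind of figure can be estimated without a profiler by timing the kernel with CUDA events. The sketch below is illustrative only: `add_kernel`, `time_add_ms`, and `effective_bandwidth_gbs` are hypothetical names, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Scalar element-wise add with a grid-stride loop, same shape as device_add_v1_kernel.
__global__ void add_kernel(const int* a, const int* b, int* out, int n) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        out[i] = a[i] + b[i];
    }
}

// Effective bandwidth in GB/sec (1 GB = 1e9 bytes) for `bytes` moved in `ms` milliseconds.
double effective_bandwidth_gbs(double bytes, double ms) {
    return bytes / (ms * 1.0e-3) / 1.0e9;
}

// Times a single launch with CUDA events and returns the elapsed milliseconds.
// a, b, and out must be device pointers holding at least n ints each.
float time_add_ms(const int* a, const int* b, int* out, int n, int blocks, int threads) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    add_kernel<<<blocks, threads>>>(a, b, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel and the stop event complete
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

For this kernel the traffic is two reads and one write per element, so the byte count to pass is 3.0 * n * sizeof(int). A single timed launch is noisy; averaging several launches after a warm-up run gives a steadier estimate.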

siri · April 8, 2019, 2:24am

Thanks njuffa!