Could anyone give advice on optimizing elementwise add/mul?

Hi, it’s siri again. To optimize elementwise add, I wrote the kernel below, but it doesn’t seem to improve performance. Could anyone give some advice on that? Thanks!

__global__ void device_add_vector4_kernel(int* d_in, int* d_in1, int* d_out, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = idx; i < N / 4; i += stride) {
        int4 in0 = reinterpret_cast<int4*>(d_in)[i];
        int4 in1 = reinterpret_cast<int4*>(d_in1)[i];
        int4 tmp;
        tmp.x = in0.x + in1.x;
        tmp.y = in0.y + in1.y;
        tmp.z = in0.z + in1.z;
        tmp.w = in0.w + in1.w;
        reinterpret_cast<int4*>(d_out)[i] = tmp;
    }
}

compared with the kernel below:

__global__ void device_add_v1_kernel(int* d_in, int* d_in1, int* d_out, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = idx; i < N; i += blockDim.x * gridDim.x) {
        int in0 = d_in[i];
        int in1 = d_in1[i];
        d_out[i] = in0 + in1;
    }
}

Does not improve performance compared to what?

What does the CUDA profiler tell you about the performance characteristics of this kernel? I recently explained in another thread that on modern GPUs, there should be little to no difference in memory throughput based on the width of individual accesses (this was different on GPUs ten years ago).

Please note that GPUs require data to be naturally aligned, so accessing ‘int4’ requires 128-bit alignment, which may not be guaranteed by the d_in, d_in1, and d_out pointers, which point to ‘int’ data which is 32-bit aligned.

Thanks for your reply. I’ve added the comparison kernel. If throughput is not the bottleneck, which part should I be considering?
I used nvprof on the device_add_v1_kernel kernel:
Achieved Occupancy 0.96540
Global Load Throughput 513.20GB/s
Global Store Throughput 257.47GB/s

So is there any possibility to improve performance?

Those numbers look good to me. The read bandwidth is twice the write bandwidth as expected, and the sum (780 GB/sec) is in the range I would expect for a V100. I have never used a V100 myself, but the general expectation is that a GPU achieves around 80% of theoretical bandwidth in real life.

Thanks njuffa!