Hi, it’s siri again. To optimize an elementwise add, I wrote the kernel below, but it does not seem to improve performance. Could anyone give me some advice on that? Thanks!
What does the CUDA profiler tell you about the performance characteristics of this kernel? I recently explained in another thread that on modern GPUs, the width of individual accesses should make little to no difference to memory throughput (this was different on GPUs ten years ago).
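One way to check this without the profiler is to time both kernel variants and convert the elapsed time into effective bandwidth. Below is a minimal sketch, not a definitive benchmark. The helper names `effective_bandwidth_gbs` and `time_kernel_ms` are made up for illustration, and the byte count assumes an elementwise add over `int` data (two loads and one store of 4 bytes per element). Error checking is left out to keep it short.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Effective bandwidth in GB/s for an elementwise add of n ints.
// Assumption: each element costs two 4-byte loads and one 4-byte store.
inline double effective_bandwidth_gbs(size_t n, float elapsed_ms) {
    const double bytes = 3.0 * static_cast<double>(n) * sizeof(int);
    return bytes / (static_cast<double>(elapsed_ms) * 1e-3) / 1e9;
}

// Returns the average time in milliseconds of whatever `launch` enqueues
// on the default stream. `launch` is any callable that starts the kernel.
template <typename Launch>
float time_kernel_ms(Launch launch, int repeats = 10) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    launch();                      // warm-up run, not timed
    cudaEventRecord(start);
    for (int i = 0; i < repeats; ++i) {
        launch();
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);    // wait until all timed launches finished

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / static_cast<float>(repeats);
}
```

If the scalar and the vectorized kernel report roughly the same GB/s with this kind of measurement, the access width is not what limits the kernel.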
Please note that GPUs require data to be naturally aligned, so accessing ‘int4’ requires 128-bit (16-byte) alignment. That may not be guaranteed by the d_in, d_in1, and d_out pointers, which point to ‘int’ data that only needs to be 32-bit aligned.
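To make the alignment point concrete, here is a rough sketch of how the vectorized path could be guarded. It is not the kernel from the original post: `is_aligned_16`, `add_scalar`, `add_int4`, and `launch_add` are hypothetical names, and the block size of 256 and the grid cap are arbitrary choices. The host checks all three pointers and falls back to a scalar kernel if any of them is not 16-byte aligned; the vectorized kernel also handles the last `n % 4` elements separately.

```cuda
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

// True if p is aligned to 16 bytes (128 bits), as int4 accesses require.
inline bool is_aligned_16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Scalar fallback: one int per loop iteration, grid-stride loop.
__global__ void add_scalar(const int* a, const int* b, int* out, size_t n) {
    const size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
    size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    for (; i < n; i += stride) {
        out[i] = a[i] + b[i];
    }
}

// Vectorized path: four ints per loop iteration through int4.
// Only valid when a, b, and out are all 16-byte aligned.
__global__ void add_int4(const int* a, const int* b, int* out, size_t n) {
    const size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
    const size_t tid = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    const size_t n4 = n / 4;  // number of complete int4 groups

    const int4* a4 = reinterpret_cast<const int4*>(a);
    const int4* b4 = reinterpret_cast<const int4*>(b);
    int4* out4 = reinterpret_cast<int4*>(out);

    for (size_t i = tid; i < n4; i += stride) {
        const int4 x = a4[i];
        const int4 y = b4[i];
        out4[i] = make_int4(x.x + y.x, x.y + y.y, x.z + y.z, x.w + y.w);
    }
    // Tail: the remaining n % 4 elements are added one int at a time.
    for (size_t i = n4 * 4 + tid; i < n; i += stride) {
        out[i] = a[i] + b[i];
    }
}

// Picks the vectorized kernel only if all three pointers are 16-byte aligned.
inline cudaError_t launch_add(const int* a, const int* b, int* out, size_t n,
                              cudaStream_t stream = 0) {
    if (n == 0) return cudaSuccess;
    const unsigned threads = 256;  // arbitrary choice for this sketch
    const bool vec = is_aligned_16(a) && is_aligned_16(b) && is_aligned_16(out);
    const size_t work = vec ? (n + 3) / 4 : n;
    size_t blocks = (work + threads - 1) / threads;
    if (blocks > 65535) blocks = 65535;  // arbitrary cap; grid-stride loops cover the rest
    if (vec) {
        add_int4<<<static_cast<unsigned>(blocks), threads, 0, stream>>>(a, b, out, n);
    } else {
        add_scalar<<<static_cast<unsigned>(blocks), threads, 0, stream>>>(a, b, out, n);
    }
    return cudaGetLastError();
}
```

Pointers returned directly by cudaMalloc are aligned to at least 256 bytes, so in practice the check mostly matters when d_in, d_in1, or d_out are offsets into a larger allocation, for example `d_in + 1`.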
Thanks for your reply. I have added the kernel I compared against. If throughput is not the bottleneck, which part should I be looking at?
I profiled device_add_v1_kernel with nvprof:
HW: V100
Achieved Occupancy 0.96540
Global Load Throughput 513.20GB/s
Global Store Throughput 257.47GB/s
Those numbers look good to me. The read bandwidth is twice the write bandwidth, as expected for two loads and one store per element, and the sum (about 771 GB/sec) is in the range I would expect for a V100. I have never used a V100 myself, but the general expectation is that a GPU achieves around 80% of theoretical bandwidth in real life.
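For reference, the arithmetic behind that estimate can be written down as a small host-side sketch. The helper names are invented for illustration. The peak formula assumes double-data-rate memory (hence the factor 2), which is the usual convention for this calculation but still an assumption. The commonly quoted V100 figures of an 877 MHz memory clock and a 4096-bit bus are used here only as example inputs; `query_peak_bandwidth_gbs` reads the real values from the device instead.

```cuda
#include <cuda_runtime.h>

// Sum of the load and store throughput reported by the profiler, in GB/s.
inline double combined_throughput_gbs(double load_gbs, double store_gbs) {
    return load_gbs + store_gbs;
}

// Theoretical peak in GB/s from memory clock (kHz) and bus width (bits).
// Assumption: double data rate, so two transfers per clock.
inline double peak_bandwidth_gbs(int mem_clock_khz, int bus_width_bits) {
    return 2.0 * mem_clock_khz * 1e3 * (bus_width_bits / 8.0) / 1e9;
}

// Achieved bandwidth as a fraction of the theoretical peak.
inline double fraction_of_peak(double achieved_gbs, double peak_gbs) {
    return achieved_gbs / peak_gbs;
}

// Reads clock and bus width from the device; returns a negative value on error.
inline double query_peak_bandwidth_gbs(int device) {
    int clock_khz = 0;
    int bus_bits = 0;
    if (cudaDeviceGetAttribute(&clock_khz, cudaDevAttrMemoryClockRate, device) != cudaSuccess ||
        cudaDeviceGetAttribute(&bus_bits, cudaDevAttrGlobalMemoryBusWidth, device) != cudaSuccess) {
        return -1.0;
    }
    return peak_bandwidth_gbs(clock_khz, bus_bits);
}
```

With the numbers from the profile, `combined_throughput_gbs(513.20, 257.47)` gives 770.67 GB/s, `peak_bandwidth_gbs(877000, 4096)` gives about 898 GB/s, and the ratio of the two is roughly 0.86, which is in line with the rule of thumb above.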