From what I can tell based on the extremely limited and fragmentary information presented, your kernel has sufficient parallelism to cover relevant latencies. One thing I would recommend trying is to make each thread block smaller (say, 16x16) and run more of them. Smaller blocks often allow the per-SM resources (registers, shared memory, thread slots) to be used more fully in the presence of granularity constraints, but it is impossible to predict whether this yields a measurable performance advantage in any particular use case.
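As a hypothetical sketch (your kernel's name and signature are unknown to me, so "process", "buffer", and "result" here are placeholders), switching the launch configuration to 16x16 blocks might look like this:

```cuda
// Placeholder kernel over a width x height grid of elements.
__global__ void process(const float *buffer, float *result,
                        int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        result[y * width + x] = 2.0f * buffer[y * width + x];
    }
}

// 16x16 = 256 threads per block instead of, say, 32x32 = 1024.
// Smaller blocks often pack more evenly into an SM's limits on
// registers, shared memory, and resident threads.
dim3 block(16, 16);
dim3 grid((width  + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);
process<<<grid, block>>>(buffer, result, width, height);
```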
Your code seems to be memory bandwidth limited, assuming "result" and "buffer" reside in global memory. So you will want to get the best possible memory interface utilization by ensuring coalesced access patterns, and you may want to consider routing loads through the texture (read-only) path.
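A minimal sketch of both points, assuming a simple 1D element-wise kernel (the kernel name and scaling operation are made up for illustration): adjacent threads read adjacent elements for coalescing, and the load is routed through the read-only data cache via `__ldg()` on sm_35 and later (marking the pointer `const __restrict__` lets the compiler do this on its own in many cases):

```cuda
__global__ void scale(const float * __restrict__ buffer,
                      float *result, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Thread i reads element i: adjacent threads touch adjacent
        // addresses, so the access is fully coalesced. __ldg() pulls
        // the load through the texture/read-only cache path.
        result[i] = factor * __ldg(&buffer[i]);
    }
}
```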
The profiler can tell you quite a bit about the performance characteristics of your kernel and guide your optimization process. If you have not done so yet, I would suggest familiarizing yourself with this important tool.
If the code is memory bandwidth limited, as I suspect (use the profiler to confirm or refute this working hypothesis), giving more work to each thread is unlikely to increase performance, as it does not change the computation/memory ratio, i.e. FLOPs per byte consumed.
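To make the ratio argument concrete, here is a back-of-the-envelope example with made-up numbers, using a grid-stride loop to assign multiple elements per thread:

```cuda
// A kernel computing result[i] = a * buffer[i] + b performs 2 FLOPs
// per element while moving 8 bytes (one 4-byte load, one 4-byte store):
// 0.25 FLOP/byte. A thread processing 4 elements performs 8 FLOPs and
// moves 32 bytes: still 0.25 FLOP/byte. The per-thread workload changes,
// the arithmetic intensity does not, so the kernel stays bandwidth bound.
__global__ void saxpb(const float * __restrict__ buffer, float *result,
                      int n, float a, float b)
{
    int stride = gridDim.x * blockDim.x;
    // Grid-stride loop: each thread handles n / (grid size) elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        result[i] = a * buffer[i] + b;
    }
}
```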