Fast constant memory access from all threads in CUDA

My code has three arrays of 10 elements each. The grid has 1024 blocks of 512 threads, and all of these threads read elements of the three arrays simultaneously during kernel execution.

I tried putting these three arrays in constant memory instead of global memory, and I did get some speedup, but it’s still not enough. If I remove the accesses to just one of these arrays from my kernel, the execution time drops significantly. I don’t want to lose that much time just to memory access.
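For concreteness, here is a minimal sketch of the kind of setup described above. It is not the actual kernel: the names `c_a`, `c_b`, `c_c`, `sketch_kernel`, and `run_sketch`, the `float` element type, the fill values, and the loop in which every thread reads the same index at the same time are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

constexpr int N = 10;            // array length from the question
constexpr int THREADS = 512;     // threads per block from the question
constexpr int BLOCKS = 1024;     // blocks in the grid from the question

// Three small read-only arrays in constant memory (hypothetical names,
// float element type assumed).
__constant__ float c_a[N];
__constant__ float c_b[N];
__constant__ float c_c[N];

// Assumed access pattern: all threads of a warp read the same index i in
// the same iteration. The constant cache can broadcast such a read; reads
// to different addresses within a warp are serialized instead.
__global__ void sketch_kernel(float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float acc = 0.0f;
    for (int i = 0; i < N; ++i)
        acc += c_a[i] * c_b[i] + c_c[i];
    out[tid] = acc;
}

// Host helper: fills the arrays with made-up values, launches the kernel,
// and returns the value computed by thread 0 (or -1.0f on a CUDA error).
float run_sketch()
{
    float h_a[N], h_b[N], h_c[N];
    for (int i = 0; i < N; ++i) {
        h_a[i] = static_cast<float>(i);
        h_b[i] = 2.0f;
        h_c[i] = 1.0f;
    }

    // Copy the host arrays into the __constant__ symbols.
    if (cudaMemcpyToSymbol(c_a, h_a, sizeof(h_a)) != cudaSuccess) return -1.0f;
    if (cudaMemcpyToSymbol(c_b, h_b, sizeof(h_b)) != cudaSuccess) return -1.0f;
    if (cudaMemcpyToSymbol(c_c, h_c, sizeof(h_c)) != cudaSuccess) return -1.0f;

    const int n = BLOCKS * THREADS;
    float* d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) return -1.0f;

    sketch_kernel<<<BLOCKS, THREADS>>>(d_out, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    float first = -1.0f;
    cudaError_t err = cudaMemcpy(&first, d_out, sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return (err == cudaSuccess) ? first : -1.0f;
}
```

Calling `run_sketch()` from a host `main` launches the 1024 x 512 grid once. If the real kernel indexes the arrays with a value that differs between threads of the same warp, the pattern is not the one sketched here, and that difference may matter for the timing.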

Is there a way to make these array accesses faster?