Some problems with CUDA shared memory

There are two problems that bother me.
1. My kernel contains a loop like the one sketched below.
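Here is a simplified sketch of its shape. Only the constant-memory array names are from my real code; the kernel name, the spin variable, and the arithmetic inside the branch are placeholders:

```
// Placeholder sketch only -- not my real kernel. The constant arrays keep
// their real names; everything else is a stand-in.
#include <cuda_runtime.h>

#define NUM_SPECIES 12

__constant__ int    cudaNumberOfEachSpecies[NUM_SPECIES];
__constant__ double cudaMuOfEachSpecies[NUM_SPECIES];
__constant__ double cudaJ;

__global__ void energyKernel(const int *spin, double *energy, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int    s = spin[tid];
    double e = 0.0;

    // The loop in question: each thread runs it about 12 times, and the
    // data-dependent branch is what I suspect causes warp divergence.
    for (int sp = 0; sp < NUM_SPECIES; ++sp) {
        if (s == sp) {
            e += cudaMuOfEachSpecies[sp] * cudaNumberOfEachSpecies[sp];
        } else {
            e -= cudaJ * cudaNumberOfEachSpecies[sp];
        }
    }
    energy[tid] = e;
}
```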

This loop body seems to have a lot of warp divergence. (cudaNumberOfEachSpecies, cudaMuOfEachSpecies, and cudaJ in the code are all in constant memory, so I don't think there should be much performance loss even with the warp divergence.)
But it still bothers me. I found a trick on the Internet to avoid the warp divergence, but I did not expect that it would actually lower performance:
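The variant I tried looks roughly like this, applied to the sketch above: the if/else is replaced by arithmetic on the comparison result, so every thread executes the same instruction stream (again, the exact arithmetic is a placeholder):

```
// Branch-free variant of the loop in the sketch above (placeholder math).
// The comparison result is used as a 0/1 weight instead of branching.
for (int sp = 0; sp < NUM_SPECIES; ++sp) {
    double match = (double)(s == sp);   // 1.0 for my own species, else 0.0
    e += match * cudaMuOfEachSpecies[sp] * cudaNumberOfEachSpecies[sp]
       - (1.0 - match) * cudaJ * cudaNumberOfEachSpecies[sp];
}
```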

What causes this? In theory the warp divergence is avoided here, so why does performance drop?
Each thread executes this loop body about 12 times.

2. The second problem is in another part of the same kernel, sketched below.
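This is a simplified sketch of its shape. cudaSpin and neiNspin keep their real names; the kernel name and the exact indexing are placeholders:

```
// Placeholder sketch of the second part. Only cudaSpin and neiNspin are the
// real names; the kernel name and the indexing arithmetic are stand-ins.
__global__ void neighbourKernel(const int *cudaSpin, const int *neiNspin,
                                double *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    double acc = 0.0;
    for (int sp = 0; sp < 12; ++sp) {                  // outer loop, ~12 iterations
        for (int k = 0; k < 3; ++k) {                  // inner loop, 3 iterations
            int nb = neiNspin[tid * 36 + sp * 3 + k];  // effectively random index
            acc += cudaSpin[nb];                       // uncoalesced global load
        }
    }
    out[tid] = acc;
}
```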


cudaSpin here lives in global memory, and the indices in neiNspin are essentially random (the addresses accessed within one block are not adjacent).
Since the outer loop runs 12 times and this inner loop runs 3 times, each thread makes 36 such accesses, so I think this is where most of the time is spent.

In a tutorial (an old video that I can no longer find), I saw a method: first read the global memory into shared memory, then read it back out from there (with no sharing between threads). It gave a speedup (he did not explain why), and when I tried it, it really was faster.
What is the reason for this?
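Roughly, the staged version I tried looks like this (block size, buffer layout, and names are placeholders; the point is only the copy through shared memory):

```
// Placeholder sketch of the shared-memory staging version. Each thread first
// copies the 36 cudaSpin values it needs into its own slice of shared memory,
// then reads them back from there; nothing is actually shared between threads.
#define THREADS_PER_BLOCK 256

__global__ void neighbourKernelStaged(const int *cudaSpin, const int *neiNspin,
                                      double *out, int n)
{
    // 256 threads * 36 ints * 4 bytes = 36 KB of shared memory per block,
    // which can be close to half of an SM's shared memory, so fewer blocks
    // fit per SM and occupancy drops.
    __shared__ int stage[THREADS_PER_BLOCK * 36];

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Stage: the same 36 uncoalesced global loads, written to shared memory.
    for (int i = 0; i < 36; ++i)
        stage[threadIdx.x * 36 + i] = cudaSpin[neiNspin[tid * 36 + i]];

    // Compute: read the staged copies back out of shared memory.
    double acc = 0.0;
    for (int sp = 0; sp < 12; ++sp)
        for (int k = 0; k < 3; ++k)
            acc += stage[threadIdx.x * 36 + sp * 3 + k];

    out[tid] = acc;
}
```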


At this point one block requests nearly half of the shared memory, which makes the occupancy of each SM drop very low, yet the performance is still better. The shared memory is only used as a staging buffer, and there is no actual sharing between threads.
What causes this performance improvement?

Thank you!