Why does my program run no faster on multiple SMs than on a single SM?

According to my understanding of SMs, they all run in parallel and independently of each other. So when I moved my program from a single SM to multiple SMs, I expected it to run 82x faster than before (my GPU is a 3090, which has 82 SMs). But there is no difference in speed between the two schemes, and I don’t know why. Is my understanding wrong?

Perhaps you are confusing SMs with streams. They are essentially unrelated.

I see no evidence in the code you have posted that in one case you are using a single SM, and in another case you are using multiple SMs.

A single SM launch might look like this:

cudaHashRandom <<<1, THREAD_3090, 0, stream[i]>>>(...);

A multiple SM launch might look like:

cudaHashRandom <<<160, THREAD_3090, 0, stream[i]>>>(...);
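One way to check how many SMs a launch actually touches is to have each block record the SM it ran on. The following is only a minimal sketch, not your code: `recordSm`, `currentSmId`, and `countDistinctSms` are hypothetical names introduced for illustration, and the block size of 256 is an arbitrary assumption. It reads the PTX special register `%smid` from thread 0 of every block.

```cuda
#include <cstdio>
#include <set>
#include <vector>
#include <cuda_runtime.h>

// Read the ID of the SM the calling thread is running on
// (PTX special register %smid).
__device__ unsigned int currentSmId() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Hypothetical stand-in kernel: thread 0 of each block records its SM.
__global__ void recordSm(unsigned int *smOfBlock) {
    if (threadIdx.x == 0) smOfBlock[blockIdx.x] = currentSmId();
}

// Launch `blocks` blocks and return how many distinct SMs they ran on.
int countDistinctSms(int blocks, int threadsPerBlock) {
    unsigned int *d = nullptr;
    cudaMalloc(&d, blocks * sizeof(unsigned int));
    recordSm<<<blocks, threadsPerBlock>>>(d);
    std::vector<unsigned int> h(blocks);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(h.data(), d, blocks * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d);
    return (int)std::set<unsigned int>(h.begin(), h.end()).size();
}

int main() {
    printf("1 block    -> %d distinct SM(s)\n", countDistinctSms(1, 256));
    printf("160 blocks -> %d distinct SM(s)\n", countDistinctSms(160, 256));
    return 0;
}
```

A one-block grid can only ever report one SM, because a block never spans SMs. For the 160-block grid the count depends on how the block scheduler distributes the blocks, so treat whatever number you see as an observation on your device rather than a guarantee.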

You haven’t shown any code that sets BLOCK_NUM, and even if you did, I see no conditional behavior inside the while loop that switches between alternate values.

If your question is actually about stream usage, that is unclear. But I will repeat: there is no connection between stream usage and SM usage. A single kernel launch running in a single stream can easily use all the SMs in your device.
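To see that the grid size, not the stream, determines how much of the device is used, you could time the same fixed amount of work in one stream with two different grid sizes. This is a rough sketch under stated assumptions: `busyWork` and `timeLaunch` are hypothetical names, the workload is an arbitrary arithmetic loop rather than your hash kernel, and the choice of two blocks per SM is only a starting point.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical workload: a grid-stride loop, so the total work is the same
// no matter how many blocks are launched.
__global__ void busyWork(float *data, size_t n) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        float x = data[i];
        for (int k = 0; k < 256; ++k) x = x * 1.0001f + 0.5f;
        data[i] = x;
    }
}

// Time one launch of `blocks` blocks in stream `s`, in milliseconds.
float timeLaunch(int blocks, int threads, float *d, size_t n, cudaStream_t s) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, s);
    busyWork<<<blocks, threads, 0, s>>>(d, n);
    cudaEventRecord(stop, s);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const size_t n = (size_t)1 << 24;  // arbitrary problem size
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);  // both launches use this one stream

    timeLaunch(1, 256, d, n, s);  // warm-up, result discarded
    float oneBlock = timeLaunch(1, 256, d, n, s);
    float manyBlocks = timeLaunch(2 * prop.multiProcessorCount, 256, d, n, s);

    printf("SMs on device: %d\n", prop.multiProcessorCount);
    printf("1 block:       %.3f ms\n", oneBlock);
    printf("%d blocks:     %.3f ms\n", 2 * prop.multiProcessorCount, manyBlocks);

    cudaStreamDestroy(s);
    cudaFree(d);
    return 0;
}
```

Both launches go through the same stream; only the first launch parameter changes. The multi-block launch would generally be expected to finish sooner, but the ratio depends on the workload and on memory bandwidth, so it will not necessarily match the SM count.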