As I understand SMs, they all run in parallel and independently of each other. So when I moved my program from a single SM to multiple SMs, I expected it to run about 82x faster (my GPU is a 3090, which has 82 SMs). But there is no difference between the two schemes, and I don't know why. Is my understanding wrong?
Please do not post pictures of code. Post the code as text, and use the available text formatting tools in the edit box.
Perhaps you are confusing SMs with streams. They are essentially unrelated.
I see no evidence in the code you have posted that in one case you are using a single SM, and in another case you are using multiple SMs.
A single SM launch might look like this:
cudaHashRandom <<<1, THREAD_3090, 0, stream[i]>>>(...);
A multiple SM launch might look like:
cudaHashRandom <<<160, THREAD_3090, 0, stream[i]>>>(...);
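To see the effect of the grid size in isolation, here is a minimal sketch that times the same fixed amount of work with one block and then with two blocks per SM. busyKernel and timeLaunch are hypothetical stand-ins I made up for illustration (your cudaHashRandom is not shown as text, so I can't use it), and the problem size and thread count are arbitrary assumptions. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical compute-bound kernel standing in for cudaHashRandom.
// The grid-stride loop keeps the total work fixed at n items,
// no matter how many blocks are launched.
__global__ void busyKernel(float *out, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        float v = (float)i;
        for (int k = 0; k < 2000; ++k)
            v = v * 1.0000001f + 0.5f;
        out[i] = v;
    }
}

// Time one launch with CUDA events; returns elapsed milliseconds.
static float timeLaunch(int blocks, int threads, float *d_out, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    busyKernel<<<blocks, threads>>>(d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 20;      // assumed problem size
    const int threads = 256;    // assumed block size
    float *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(float));

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    const int manyBlocks = 2 * prop.multiProcessorCount;

    timeLaunch(1, threads, d_out, n);  // warm-up launch, result discarded
    float oneBlockMs = timeLaunch(1, threads, d_out, n);
    float manyBlockMs = timeLaunch(manyBlocks, threads, d_out, n);

    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("1 block:    %.3f ms\n", oneBlockMs);
    printf("%d blocks: %.3f ms\n", manyBlocks, manyBlockMs);

    cudaFree(d_out);
    return 0;
}
```

If the two times come out about the same in your own program, that suggests the grid size is not actually changing between your two cases, or that the kernel is not the bottleneck. Don't expect exactly 82x even in a clean test; the ratio depends on occupancy, memory traffic and launch overhead.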
You haven’t shown any code that sets BLOCK_NUM, and even if you did, I see no conditional behavior inside the while loop that uses alternate values.
If your question is actually about stream usage, that is unclear. But I will repeat, there is no connection between stream usage and SM usage. A single kernel launch running in a single stream can easily use all the SMs in your device.
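If you want to check this on your own machine, the following is a rough sketch that launches one kernel into one stream and counts how many distinct SMs its blocks ran on. It reads the PTX special register %smid through inline assembly. The names recordSm and smId are my own, and the grid size of four blocks per SM is an arbitrary assumption. Error checking is again omitted.

```cuda
#include <cstdio>
#include <set>
#include <vector>
#include <cuda_runtime.h>

// Read the PTX special register %smid: the ID of the SM this thread runs on.
__device__ unsigned int smId()
{
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Hypothetical kernel: each block records which SM it was scheduled on.
__global__ void recordSm(unsigned int *smOfBlock)
{
    if (threadIdx.x == 0)
        smOfBlock[blockIdx.x] = smId();
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    const int blocks = 4 * prop.multiProcessorCount;  // assumed grid size

    unsigned int *d_sm = nullptr;
    cudaMalloc(&d_sm, blocks * sizeof(unsigned int));

    // One stream, one kernel launch.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    recordSm<<<blocks, 128, 0, stream>>>(d_sm);
    cudaStreamSynchronize(stream);

    std::vector<unsigned int> h_sm(blocks);
    cudaMemcpy(h_sm.data(), d_sm, blocks * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);

    // Count distinct SM IDs; a set is used because %smid values
    // are not guaranteed to be contiguous.
    std::set<unsigned int> distinct(h_sm.begin(), h_sm.end());
    printf("Device reports %d SMs; this launch touched %zu distinct SMs\n",
           prop.multiProcessorCount, distinct.size());

    cudaStreamDestroy(stream);
    cudaFree(d_sm);
    return 0;
}
```

With a grid this size the count will typically be at or near the device's SM count, although the block scheduler makes no guarantee about placement. Change the launch to a single block and the count drops to 1, with the stream left untouched. The grid dimensions control SM usage; the stream argument does not.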