- Can the target address of `store_matrix_sync` be Shared Memory (SMEM), or does it have to be Global Memory?
- If the target address is SMEM, is there a possibility of bank conflicts?
- Does `store_matrix_sync` have built-in support for swizzling to avoid bank conflicts, or does it need to be manually implemented?
From the programming guide: “individual matrix elements must be accessed from memory (shared or global) after calling `store_matrix_sync`.”
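Since the quote names shared memory as a valid location, here is a minimal sketch of storing an accumulator fragment to SMEM and then reading individual elements back from it. The kernel name `wmma_store_to_smem`, the 16x16x16 half/float tile shape, and the all-ones inputs are my own choices for illustration; it assumes a single warp and a GPU of compute capability 7.0 or newer (e.g. `nvcc -arch=sm_70`).

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>

using namespace nvcuda;

// One warp computes a 16x16 tile C = A * B and stores it to shared memory.
__global__ void wmma_store_to_smem(const half *a, const half *b, float *out) {
    // store_matrix_sync needs a 256-bit (32-byte) aligned destination.
    __shared__ __align__(32) float c_smem[16 * 16];

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    // The destination is shared memory here; a global pointer works the same way.
    wmma::store_matrix_sync(c_smem, c_frag, 16, wmma::mem_row_major);
    __syncwarp();

    // Individual elements are read from memory, not from the opaque fragment.
    for (int i = threadIdx.x; i < 16 * 16; i += 32) {
        out[i] = c_smem[i];
    }
}

int main() {
    half *a, *b;
    float *out;
    cudaMallocManaged(&a, 256 * sizeof(half));
    cudaMallocManaged(&b, 256 * sizeof(half));
    cudaMallocManaged(&out, 256 * sizeof(float));

    // All-ones inputs: every element of C should be 16 (sum over k = 16).
    for (int i = 0; i < 256; ++i) {
        a[i] = __float2half(1.0f);
        b[i] = __float2half(1.0f);
    }

    wmma_store_to_smem<<<1, 32>>>(a, b, out);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        printf("kernel failed\n");
        return 1;
    }

    int bad = 0;
    for (int i = 0; i < 256; ++i) {
        if (out[i] != 16.0f) ++bad;
    }
    printf("mismatches: %d\n", bad);

    cudaFree(a);
    cudaFree(b);
    cudaFree(out);
    return bad != 0;
}
```

This only shows that the store to SMEM is legal and that the tile can be read back element by element afterwards. It says nothing about which thread writes which address inside `store_matrix_sync`.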
Since the pattern is unspecified (again, quoting from the programming guide; please read the entire section I linked), it’s not really sensible to answer that question, in my opinion. At least, I wouldn’t be able to answer it. (Even if I could offer an answer, it might vary by CUDA version, or by GPU architecture, or perhaps other unknown factors.) Or if you prefer, the answer is “yes”. You could try to use Nsight Compute to test a particular case, if that is of interest.
It’s unspecified, and AFAIK there would be no way for you to manually implement it, since the function is intentionally (by design, and by specification) opaque in its behavior.
There is the option to switch to a PTX `mma` op, which could expose the behavior. There are numerous forum posts with examples.
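As a rough sketch of what that could look like: with the PTX `mma.sync` instruction the mapping from lanes to accumulator elements is documented in the PTX ISA, so you choose the shared-memory address for each register yourself. Everything named here is an assumption for illustration: the m16n8k16 half/float shape, the kernel name `mma_to_smem`, and especially the `swz` helper, which is a hypothetical XOR swizzle and not something taken from NVIDIA documentation. It needs compute capability 8.0 or newer (e.g. `nvcc -arch=sm_80`).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical swizzle for a 16x8 float tile: flips the low column bit for
// rows 4-7 and 12-15. Swap in whatever mapping you want to test.
__device__ __forceinline__ int swz(int row, int col) {
    return row * 8 + (col ^ ((row >> 2) & 1));
}

__global__ void mma_to_smem(float *out) {
    __shared__ float tile[16 * 8];

    const int lane = threadIdx.x % 32;
    const int g = lane >> 2;  // "groupID" in the PTX fragment layout
    const int t = lane & 3;   // "threadID_in_group"

    // Two packed half values of 1.0 (0x3C00) per 32-bit register, so A and B
    // are all ones regardless of their fragment layouts.
    const unsigned ones = 0x3C003C00u;
    unsigned a0 = ones, a1 = ones, a2 = ones, a3 = ones;
    unsigned b0 = ones, b1 = ones;
    float d0, d1, d2, d3;

    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d0), "=f"(d1), "=f"(d2), "=f"(d3)
        : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "r"(b0), "r"(b1),
          "f"(0.0f), "f"(0.0f), "f"(0.0f), "f"(0.0f));

    // Documented C/D layout: d0,d1 sit in row g, d2,d3 in row g + 8,
    // at columns 2t and 2t + 1.
    tile[swz(g, 2 * t)]         = d0;
    tile[swz(g, 2 * t + 1)]     = d1;
    tile[swz(g + 8, 2 * t)]     = d2;
    tile[swz(g + 8, 2 * t + 1)] = d3;
    __syncwarp();

    // Undo the swizzle on the way out so `out` is plain row-major.
    for (int i = lane; i < 16 * 8; i += 32) {
        out[i] = tile[swz(i / 8, i % 8)];
    }
}

int main() {
    float *out;
    cudaMallocManaged(&out, 128 * sizeof(float));

    mma_to_smem<<<1, 32>>>(out);
    if (cudaDeviceSynchronize() != cudaSuccess) {
        printf("kernel failed\n");
        return 1;
    }

    // All-ones inputs with k = 16: every element of D should be 16.
    int bad = 0;
    for (int i = 0; i < 128; ++i) {
        if (out[i] != 16.0f) ++bad;
    }
    printf("mismatches: %d\n", bad);

    cudaFree(out);
    return bad != 0;
}
```

Under the usual model of 32 banks that are 4 bytes wide, this particular XOR should place the 32 lanes of each scalar store in distinct banks, but the compiler is free to reorder or combine the stores, so I would still confirm the actual conflict counts with Nsight Compute rather than trust the arithmetic.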
Very good answer! Very helpful, thanks!!!