Why do I need to convert a pointer to shared address space before using the ldmatrix instruction?

I’m a little confused about the first bit of assembly in this code:

unsigned int addr;
__asm__ __volatile__(
 "{ .reg .u64 addr; cvta.to.shared.u64 addr, %1; cvt.u32.u64 %0, addr; }\n"
 : "=r"(addr)
 : "l"((void *)(
  A_shared 
  + ((warpIdx % 2) * A_elems / 2)
  + (ax0_0 * A_elems / 8)
  + ((threadIdx.x % 16) * shared_stride) 
  + (k_0_1 * 16)
  + ((threadIdx.x / 16) * 8)
 )));

unsigned* aOff = (unsigned *)(A_shared_warp + (ax0_0 * 8));
__asm__ __volatile__(
  "ldmatrix.sync.aligned.m8n8.x4.shared.b16"
  "{%0, %1, %2, %3}, [%4];\n"
  : "=r"(aOff[0]), "=r"(aOff[1]), "=r"(aOff[2]), "=r"(aOff[3])
  : "r"(addr)
);

From my understanding (and please correct me if I’m wrong), it’s converting %1 (a 64-bit generic address) into an address in the shared state space, storing that in the PTX register addr, and then truncating that value to 32 bits, which is written to %0 (the C variable addr).
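
For reference, I believe that first asm block does the same thing as the following simplified sketch, assuming the __cvta_generic_to_shared() intrinsic is available in the toolkit (the index arithmetic is reduced to a single offset here, and A_shared is just a stand-in buffer):

#include <cuda_fp16.h>

__global__ void cvta_sketch(int offset)
{
    // Stand-in for the real shared buffer from the code above.
    __shared__ __half A_shared[4096];

    // Convert a 64-bit generic pointer into a 32-bit shared state space
    // address, as the cvta.to.shared.u64 / cvt.u32.u64 pair does.
    unsigned int addr =
        (unsigned int)__cvta_generic_to_shared(A_shared + offset);

    (void)addr;   // addr would then be passed to ldmatrix as operand %4
}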

What I don’t understand is why this is necessary when A_shared is already declared as a __shared__ variable. Presumably any pointer to A_shared is already a shared address?

I don’t know of any statement or guarantee in CUDA C++ that a __shared__ address will numerically fall within the PTX shared logical space. In fact, a simple experiment can be devised to show that, at least in some cases, it does not (a sketch of one such experiment follows below). Since all pointers in CUDA C++ on current systems/setups are 64-bit, and the shared state space in PTX land is evidently a 32-bit space, we have another data point saying that a __shared__ pointer in CUDA C++ cannot be considered identical to a shared state space address in PTX. This question may also be of interest.
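
A minimal sketch of one such experiment (the names here are mine, and it assumes the __cvta_generic_to_shared() intrinsic; inline PTX as in your snippet would work just as well): print the numeric value of a generic pointer to a __shared__ array alongside its converted shared state space value. The two printed values need not match, which illustrates the point.

#include <cstdio>

__global__ void compare_addresses()
{
    __shared__ int buf[32];

    // 64-bit generic (CUDA C++) pointer to the shared array.
    void *generic = &buf[0];

    // The corresponding PTX shared state space address (fits in 32 bits).
    unsigned long long ss = __cvta_generic_to_shared(generic);

    if (threadIdx.x == 0)
        printf("generic = %p, shared state space = 0x%llx\n", generic, ss);
}

int main()
{
    compare_addresses<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}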

Got it. I guess I don’t understand the difference between pointers and address spaces. (Why is one 64-bit and the other 32-bit?) Do you mind pointing me to any docs that would help me better understand this?

There is no concept of state spaces in CUDA C++ itself. The documents I am aware of are the ones I already linked: follow the links in this thread to the linked thread, read that thread in its entirety, and take note of the documentation links there. More generally, refer to the PTX docs (since the concept doesn’t exist anywhere else) and do text searches for state space information. The CUDA docs library is here (or just go to docs.nvidia.com and click on CUDA); look at the left-hand column there for all the “books in the library”. Here is the PTX doc (“book”). Section 5 of that doc covers state spaces, and a description of the shared state space is here.

If you are asking why pointers are 64-bit, that is a convention chosen by CUDA C++ as well as by many other C++ implementations. This article covers some history of 64-bit machine architectures (on Windows, CUDA adheres to the LLP64 model mentioned there, and on Linux it adheres to the LP64 model). As I mentioned, the question of why the shared state space is 32-bit can only be answered in docs relevant to PTX, since it only has meaning there. Also, “why?” is an open-ended question that may ultimately end with the answer “because that is the way the designers chose to design the machine.”

If I were speculating, I would guess that the CUDA designers anticipated that the shared space would never need 4 GB+ of actual storage, and that they saw some benefit in representing shared state space addresses in 32 bits. Since a 32-bit address is half the size of a 64-bit address, this could have ramifications for instruction encoding (shorter instructions, smaller code space, less time to fetch, etc.) as well as for the hardware design itself (32-bit pathways presumably require half the area of 64-bit pathways). I would imagine the designers felt that tradeoff provided enough benefit to swing the decision toward 32 bits for the shared state space.

