Understanding PTX

Where can I find the assembly instruction reference for CUDA? For example, I want to read about “ld.global.ca.f64”. I know it sounds like loading an f64 from global memory into a register (but what does “ca” mean?).
How is an instruction of the form X.Y composed?

I didn’t get the answer from the PTX manuals [1, 2].

[1] https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html
[2] https://docs.nvidia.com/cuda/parallel-thread-execution/index.html

For ld see https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-ld

For .global see https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#state-spaces

For .ca see https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cache-operators

[i]Cache at all levels, likely to be accessed again.

The default load instruction cache operation is ld.ca, which allocates cache lines in all levels (L1 and L2) with normal eviction policy. Global data is coherent at the L2 level, but multiple L1 caches are not coherent for global data. If one thread stores to global memory via one L1 cache, and a second thread loads that address via a second L1 cache with ld.ca, the second thread may get stale L1 cache data, rather than the data stored by the first thread. The driver must invalidate global L1 cache lines between dependent grids of parallel threads. Stores by the first grid program are then correctly fetched by the second grid program issuing default ld.ca loads cached in L1.[/i]