Shared box’s inner dimension smaller than 32 bytes with TMA CU_TENSOR_MAP_SWIZZLE_32B

| Pattern | Swizzle width | Shared box’s inner dimension | Repeats after | Shared memory alignment | Global memory alignment |
|---|---|---|---|---|---|
| CU_TENSOR_MAP_SWIZZLE_128B | 128 bytes | <=128 bytes | 1024 bytes | 128 bytes | 128 bytes |
| CU_TENSOR_MAP_SWIZZLE_64B | 64 bytes | <=64 bytes | 512 bytes | 128 bytes | 128 bytes |
| CU_TENSOR_MAP_SWIZZLE_32B | 32 bytes | <=32 bytes | 256 bytes | 128 bytes | 128 bytes |
| CU_TENSOR_MAP_SWIZZLE_NONE (default) | | | | 128 bytes | 16 bytes |

This table from the CUDA programming guide lists the requirements for the shared memory box under each swizzle pattern.

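Could someone clarify what the <= in the “Shared box’s inner dimension” column means here? In particular, is an inner dimension strictly smaller than the swizzle width allowed?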


I tried this configuration:

    CUtensorMap tensor_map{};
    constexpr uint32_t rank    = 2;

    uint64_t size[rank]        = {96, 1024};
    uint32_t box_size[rank]    = {4, 64};

    uint64_t stride[rank - 1]  = {size[0] * sizeof(uint32_t)};  // row pitch in bytes
    uint32_t elem_stride[rank] = {1, 1};

    auto cuTensorMapEncodeTiled = get_cuTensorMapEncodeTiled();
    CUresult res = cuTensorMapEncodeTiled(
        &tensor_map, 
        CUtensorMapDataType::CU_TENSOR_MAP_DATA_TYPE_UINT32,
        rank, 
        tensor_ptr,
        size,
        stride,
        box_size,
        elem_stride,
        CUtensorMapInterleave::CU_TENSOR_MAP_INTERLEAVE_NONE,
        CUtensorMapSwizzle::CU_TENSOR_MAP_SWIZZLE_32B,
        CUtensorMapL2promotion::CU_TENSOR_MAP_L2_PROMOTION_NONE,
        CUtensorMapFloatOOBfill::CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE
    );

The data type is uint32_t, the box’s inner dimension is 4 elements, and the swizzle mode is CU_TENSOR_MAP_SWIZZLE_32B.
So the inner dimension is 16 bytes, which is less than the 32-byte swizzle width.
Will the remaining 16 bytes be padded?
And how large should the shared memory buffer be? I frequently hit an illegal memory access error.
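
For reference, here is a minimal sketch of the byte sizes involved for the {4, 64} box of uint32_t in the configuration above. The names are only illustrative. Whether TMA actually pads each row up to the 32-byte swizzle width is exactly what I am unsure about, so `padded_box_bytes` is a hypothetical figure under that assumption, not confirmed behavior.

```cpp
#include <cstddef>
#include <cstdint>

// Values taken from the configuration above.
constexpr std::size_t kElemBytes    = sizeof(std::uint32_t);  // 4 bytes per element
constexpr std::size_t kBoxInner     = 4;                      // box_size[0], in elements
constexpr std::size_t kBoxOuter     = 64;                     // box_size[1], in rows
constexpr std::size_t kSwizzleBytes = 32;                     // swizzle width of SWIZZLE_32B per the table

// Bytes covered by one row of the box.
constexpr std::size_t inner_bytes() { return kBoxInner * kElemBytes; }

// Box size if rows are stored back to back with no padding.
constexpr std::size_t dense_box_bytes() { return inner_bytes() * kBoxOuter; }

// ASSUMPTION (unconfirmed): box size if every row were padded up to the swizzle width.
constexpr std::size_t padded_box_bytes() {
    return (inner_bytes() < kSwizzleBytes ? kSwizzleBytes : inner_bytes()) * kBoxOuter;
}

static_assert(inner_bytes() == 16, "4 elements * 4 bytes");
static_assert(dense_box_bytes() == 1024, "16 bytes * 64 rows");
static_assert(padded_box_bytes() == 2048, "32 bytes * 64 rows, if rows are padded");
```

So the buffer would need 1024 bytes if rows are dense, or 2048 bytes if rows are padded to 32 bytes. Per the table, it would need 128-byte alignment in either case.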

Thank you for your assistance!