I don’t understand the mechanism of the ldmatrix operation. Take
ldmatrix.sync.aligned.m8n8.x2.trans.shared.b16 {d0, d1}, [addr];
as an example.
Does that mean only threads 0-15 will be active during the operation?
If only threads 0-15 are active during the operation, why can there be bank conflicts during ldmatrix?
Consider this situation:
__shared__ half A[128][16];
// use ldmatrix to load A[idx0][0-15], A[idx1][0-15], ... A[idx7][0-15],
// where idx0, idx1, ... idx7 depend on the input data.
Will this situation cause bank conflicts? If so, is there any way to avoid them?
It is described in the PTX documentation: PTX ISA 8.5
All threads are active. An 8x8 matrix of b16 elements has 64 half elements. Each 16-byte row is distributed to four threads (4 bytes per thread), so 8 rows × 4 threads = 32 threads receive data.
The input row pointers are distributed across the warp. For m8n8.x2, 16 rows are loaded, and the corresponding pointer values are taken from threads 0-15.
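To make the pointer distribution concrete, here is a minimal sketch of an x1 wrapper (my own illustration in the same style as the x2 wrapper further down, not taken from the documentation). For .x1 only the row addresses held by lanes 0-7 are consumed, but the whole warp executes the instruction and every lane receives one 32-bit register holding two b16 elements:

// sketch only: x1 variant; the addresses held by lanes 8-31 are ignored,
// yet all 32 lanes participate and each receives one 32-bit fragment
__device__
void ldmatrix_x1(unsigned int &x, const void* rowPtr){
    asm volatile("ldmatrix.sync.aligned.m8n8.x1.shared.b16 {%0}, [%1];"
        : "=r"(x)
        : "l"(__cvta_generic_to_shared(rowPtr)));
}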
Thanks, but I still have no idea whether there will be bank conflicts in ldmatrix.
If bank conflicts can still occur when using ldmatrix, why do we use this operation? We could manipulate the data directly, which is a more flexible way.
I would assume the ordinary rules for bank conflicts apply here as well. You could set up a simple experiment and profile it.
If the register layout of the matrix matches your use case, ldmatrix can load multiple matrices in a single instruction. With manual loading, you would need n instructions for n matrices.
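For example, with the x4 form all 32 lanes supply row addresses and four 8x8 b16 matrices arrive in registers with a single instruction. A sketch of such a wrapper (my own, in the same style as the x2 wrapper in the code below, not part of the original post):

// sketch: x4 variant, four 8x8 b16 matrices in one instruction
// lanes 0-7 address matrix 0, lanes 8-15 matrix 1, 16-23 matrix 2, 24-31 matrix 3;
// each lane receives four 32-bit registers, one fragment per matrix
__device__
void ldmatrix_x4(unsigned int (&x)[4], const void* rowPtr){
    asm volatile("ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0, %1, %2, %3}, [%4];"
        : "=r"(x[0]), "=r"(x[1]), "=r"(x[2]), "=r"(x[3])
        : "l"(__cvta_generic_to_shared(rowPtr)));
}

Loading the same 512 bytes by hand would take several separate shared-memory loads per thread, and the data would still have to end up in the register layout the consuming instruction expects.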
Here is example code for ldmatrix x2. When profiled with ncu, it shows bank conflicts.
//nvcc -g -lineinfo -std=c++17 -arch=native main.cu -o main
#include <iostream>
#include <cuda_fp16.h>
#include <thrust/device_vector.h>
/*
ldmatrix.sync.aligned.shape.num{.trans}{.ss}.type r, [p];
.shape = {.m8n8};
.num = {.x1, .x2, .x4};
.ss = {.shared{::cta}};
.type = {.b16};
*/
__device__
void ldmatrix_x2(unsigned int (&x)[2], const void* ptr){
    asm volatile("ldmatrix.sync.aligned.m8n8.x2.shared.b16 {%0, %1}, [%2];"
        : "=r"(x[0]), "=r"(x[1])
        : "l"(__cvta_generic_to_shared(ptr)));
}
__global__
void mykernel(const int* loadOffsets, bool print){
    alignas(16) __shared__ half A[128 * 16];
    // fill shared memory with an ascending pattern so the loaded fragments are recognizable
    for(int i = threadIdx.x; i < 128*16; i += blockDim.x){
        A[i] = i;
    }
    __syncthreads();

    const int lane = threadIdx.x % 32;
    unsigned int result[2];
    // each lane passes its own row address; for x2 only the addresses of lanes 0-15 are used
    const int offset = loadOffsets[lane];
    ldmatrix_x2(result, &A[offset]);

    half2 loaded[2];
    memcpy(&loaded[0], &result[0], sizeof(half2) * 2);

    if(print){
        // print the two half elements each lane received, for both 8x8 matrices
        for(int m = 0; m < 2; m++){
            for(int t = 0; t < 32; t++){
                if(lane == t){
                    printf("%4d %4d ", int(loaded[m].x), int(loaded[m].y));
                    if(lane % 4 == 3){
                        printf("\n");
                    }
                }
                __syncwarp();
            }
            if(lane == 0){
                printf("\n");
            }
            __syncwarp();
        }
    }
}
int main(){
    thrust::device_vector<int> d_loadOffsets(32, 0);
    for(int i = 0; i < 16; i++){
        const int row = i % 8;
        const int matrix = i / 8;
        d_loadOffsets[i] = row * 16 + matrix * 8;
    }
    mykernel<<<1,32>>>(d_loadOffsets.data().get(), true);
    cudaDeviceSynchronize();

    // Shared Load Matrix: Requests 16.384, Wavefronts 33.393, Bank Conflicts 0
    for(int i = 0; i < 16; i++){
        const int row = i / 2;
        const int matrix = i % 2;
        d_loadOffsets[i] = row * 16 + matrix * 8;
    }
    std::cout << "offsets: ";
    for(int i = 0; i < 16; i++){
        std::cout << d_loadOffsets[i] << " ";
    }
    std::cout << "\n";
    mykernel<<<1024,512>>>(d_loadOffsets.data().get(), false);
    cudaDeviceSynchronize();

    // Shared Load Matrix: Requests 16.384, Wavefronts 131.674, Bank Conflicts 98.304
    for(int i = 0; i < 16; i++){
        const int row = i / 2;
        const int matrix = i % 2;
        d_loadOffsets[i] = (4*row) * 16 + matrix * 8;
    }
    std::cout << "offsets: ";
    for(int i = 0; i < 16; i++){
        std::cout << d_loadOffsets[i] << " ";
    }
    std::cout << "\n";
    mykernel<<<1024,512>>>(d_loadOffsets.data().get(), false);
    cudaDeviceSynchronize();

    // Shared Load Matrix: Requests 16.384, Wavefronts 66.488, Bank Conflicts 32.768
    for(int i = 0; i < 16; i++){
        const int row = i % 8;
        const int matrix = i / 8;
        d_loadOffsets[i] = row * 16 + matrix * 8;
    }
    std::cout << "offsets: ";
    for(int i = 0; i < 16; i++){
        std::cout << d_loadOffsets[i] << " ";
    }
    std::cout << "\n";
    mykernel<<<1024,512>>>(d_loadOffsets.data().get(), false);
    cudaDeviceSynchronize();

    // Shared Load Matrix: Requests 16.384, Wavefronts 263.070, Bank Conflicts 229.376
    for(int i = 0; i < 16; i++){
        const int row = i % 8;
        const int matrix = i / 8;
        d_loadOffsets[i] = (4*row) * 16 + matrix * 8;
    }
    std::cout << "offsets: ";
    for(int i = 0; i < 16; i++){
        std::cout << d_loadOffsets[i] << " ";
    }
    std::cout << "\n";
    mykernel<<<1024,512>>>(d_loadOffsets.data().get(), false);
    cudaDeviceSynchronize();
}
offsets: 0 8 16 24 32 40 48 56 | 64 72 80 88 96 104 112 120
Shared Load Matrix: Requests 16.384, Wavefronts 33.393, Bank Conflicts 0
offsets: 0 8 64 72 128 136 192 200 | 256 264 320 328 384 392 448 456
Shared Load Matrix: Requests 16.384, Wavefronts 131.674, Bank Conflicts 98.304
offsets: 0 16 32 48 64 80 96 112 | 8 24 40 56 72 88 104 120
Shared Load Matrix: Requests 16.384, Wavefronts 66.488, Bank Conflicts 32.768
offsets: 0 64 128 192 256 320 384 448 | 8 72 136 200 264 328 392 456
Shared Load Matrix: Requests 16.384, Wavefronts 263.070, Bank Conflicts 229.376
Case 3 seems a bit strange to me. In total, the same offsets as in case 1 are accessed. However, it appears that the two sets of rows for the two matrices are processed independently, which doubles the number of wavefronts and introduces bank conflicts.
(The same observation applies to cases 4 and 2.)
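As a rough cross-check of that interpretation (my own sketch using the usual model of 32 banks, each 4 bytes wide, and assuming the two 8-row groups are indeed served separately, as the measurements suggest), one can simply print the bank in which each row address starts. In case 1 each group of eight 16-byte rows starts at banks 0, 4, 8, ..., 28 and covers all 32 banks exactly once, while in case 3 the rows of each group start only at banks 0, 8, 16, 24, so two rows always share the same four banks, which fits the doubled wavefront count and the conflict numbers above:

// host-only sketch: starting bank of each row address, per 8-row group
#include <cstdio>
int main(){
    // offsets in half (2-byte) elements, as printed by the kernel above
    const int case1[16] = {0,8,16,24,32,40,48,56, 64,72,80,88,96,104,112,120};
    const int case3[16] = {0,16,32,48,64,80,96,112, 8,24,40,56,72,88,104,120};
    const int* cases[2] = {case1, case3};
    for(int c = 0; c < 2; c++){
        printf("%s\n", c == 0 ? "case 1" : "case 3");
        for(int matrix = 0; matrix < 2; matrix++){
            printf("  matrix %d row start banks:", matrix);
            for(int r = 0; r < 8; r++){
                const int byteOffset = cases[c][matrix * 8 + r] * 2; // half = 2 bytes
                printf(" %2d", (byteOffset / 4) % 32);               // 32 banks, 4 bytes per bank
            }
            printf("\n");
        }
    }
}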
Are all the offsets you provide to ldmatrix aligned to 16-byte boundaries?