When doing sparse convolution, I may need to work with points that are not adjacent in smem. What is the best way to load them into the matrices consumed by wmma operations? For example, if I have 128 points’ features in smem and each point has 64 in_channels, I may need to load points 1, 3, 5, 7, 9, … for one calculation. Should I copy them into a contiguous buffer and then use load_matrix_sync to load them, or should I implement the load myself at the mma level?
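For reference, here is a minimal sketch of the first option (gather-then-load), assuming half-precision features and the 16×16×16 wmma tile shape; the names `smem_feats`, `point_idx`, `staging`, and `gather_and_load` are illustrative, not from any library:

```cuda
// Sketch: gather non-contiguous point features from shared memory into a
// contiguous staging tile, then hand that tile to wmma::load_matrix_sync.
#include <mma.h>
using namespace nvcuda;

constexpr int IN_CHANNELS = 64;  // channels per point, as in the question
constexpr int TILE = 16;         // wmma m16n16k16 tile for half precision

__device__ void gather_and_load(
    const half* smem_feats,      // [128][IN_CHANNELS] features in shared memory
    const int*  point_idx,       // 16 selected (non-adjacent) point indices
    half*       staging,         // [TILE][TILE] contiguous staging buffer in smem
    int         k_tile,          // which 16-channel slice of IN_CHANNELS
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major>& a_frag)
{
    // Each of the 32 threads in the warp copies 8 of the 256 tile elements.
    int lane = threadIdx.x % 32;
    for (int i = lane; i < TILE * TILE; i += 32) {
        int row = i / TILE;      // which selected point
        int col = i % TILE;      // which channel within this k slice
        staging[row * TILE + col] =
            smem_feats[point_idx[row] * IN_CHANNELS + k_tile * TILE + col];
    }
    __syncwarp();
    // The staging tile is now contiguous with leading dimension TILE.
    wmma::load_matrix_sync(a_frag, staging, TILE);
}
```

The extra smem-to-smem copy costs bandwidth but keeps the code on the portable wmma API; going down to the mma/ldmatrix level instead would let each thread compute its own gather address directly, at the cost of hand-managing the per-thread fragment layout.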