When implementing sparse convolution, I may need to operate on points that are not adjacent in shared memory. What is the best way to load them into the matrices consumed by wmma operations? For example, suppose I have the features of 128 points in shared memory, each point with 64 in_channels, and I need to load points 1, 3, 5, 7, 9, … for one computation. Should I first copy them into contiguous memory and then use load_matrix_sync to load them, or should I implement this loading myself at the mma level?
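To make the first option concrete, here is a minimal sketch of the "gather into a contiguous staging buffer, then load_matrix_sync" approach. All names (`gather_and_load`, `staging`, `point_idx`), the 16x16x16 half-precision tile shape, and the warp-cooperative copy loop are my own assumptions for illustration, not code from the question:

```cuda
#include <mma.h>
using namespace nvcuda;

// Hypothetical sizes matching the question: 128 points x 64 in_channels, half precision.
constexpr int IN_CHANNELS = 64;
constexpr int TILE = 16;  // wmma m = n = k = 16

// Gather 16 non-adjacent points (e.g. indices 1, 3, 5, ...) from the shared-memory
// feature array into a contiguous staging tile, then hand that tile to
// wmma::load_matrix_sync, which requires a fixed leading dimension.
__device__ void gather_and_load(
    const half* smem_feats,   // [128][IN_CHANNELS], features of all points in smem
    const int*  point_idx,    // 16 point indices to gather, e.g. {1, 3, 5, ...}
    half*       staging,      // [TILE][TILE] contiguous staging buffer in smem
    int         k_tile,       // which 16-wide slice of the 64 channels to load
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major>& a_frag)
{
    // Warp-cooperative copy: each lane copies whole rows of the tile.
    int lane = threadIdx.x % 32;
    for (int r = lane; r < TILE; r += 32) {
        const half* src = smem_feats + point_idx[r] * IN_CHANNELS + k_tile * TILE;
        for (int c = 0; c < TILE; ++c)
            staging[r * TILE + c] = src[c];
    }
    __syncwarp();  // staging must be fully written before the fragment load

    // Rows are now contiguous with leading dimension TILE.
    wmma::load_matrix_sync(a_frag, staging, TILE);
}
```

The trade-off: the extra smem-to-smem copy costs bandwidth and a sync, but keeps the simple wmma API; loading per-lane at the mma/ldmatrix level avoids the copy but ties the code to the architecture-specific fragment layout.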