Does a 3-element vector type exist in CUDA?

Hi! I am trying to use a 3-element vector type, as shown below:

__device__ __forceinline__ void stg96(const float &reg0, const float &reg1,
                                       const float &reg2,
                                       const float *addr) {
  asm volatile("st.global.v3.f32 [%0], {%1, %2, %3};\n"
               :
               : "l"(addr), "f"(reg0), "f"(reg1), "f"(reg2));
}

The compiler reports errors for this, so I modified it like this:

__device__ __forceinline__ void stg96(const float &reg0, const float &reg1,
                                       const float &reg2,
                                       const float *addr) {
  asm volatile("st.global.f32 [%0], {%1, %2, %3};\n"  // only change: removed ".v3"
               :
               : "l"(addr), "f"(reg0), "f"(reg1), "f"(reg2));
}

It still fails to compile!
I am trying to get “only one instruction but read three values”, but my solution failed. In the end I had to use:

C[i] = sum[j];
C[i+1] = sum[j+1];
C[i+2] = sum[j+2];

which seems slow. Does anyone know why .v3 does not work, or can anyone suggest a more efficient solution? Thank you!

By the way, the error output is:

ptxas C:/Users/20247/AppData/Local/Temp/tmpxft_00002398_00000000-7_add2_kernel.ptx, line 2061; error   : Unknown modifier '.v3'
ptxas C:/Users/20247/AppData/Local/Temp/tmpxft_00002398_00000000-7_add2_kernel.ptx, line 2061; error   : Illegal vector size: 3
ptxas C:/Users/20247/AppData/Local/Temp/tmpxft_00002398_00000000-7_add2_kernel.ptx, line 2061; error   : Argument vector size mismatch for instruction 'st'
ptxas fatal   : Ptx assembly aborted due to errors
error: command 'C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v11.3\\bin\\nvcc.exe' failed with exit code 255

I also believe NVIDIA has a float3 type, for example: https://developer.download.nvidia.com/cg/length.html
https://developer.download.nvidia.com/cg/cross.html
Searching for float3 turns up more results.

Sure, you can use float3 in CUDA C++. What you are showing here is PTX. And whether in CUDA C++ or PTX, this is not possible:

“only one instruction but read three values”

You can do this in CUDA C++:

__global__ void kernel(float3 *a, float3 *b){
     float3 temp = a[0];
     b[0] = temp;
}

But you won’t get a single SASS instruction doing those loads and stores. The compiler will break each one up into multiple instructions. CUDA (and PTX) support native, single-instruction load/store operations of 1, 2, 4, 8, or 16 bytes only, which is why ptxas rejects a 12-byte .v3 access.
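If the goal is fewer store instructions per 3-float result, two common workarounds follow from those supported sizes. The sketch below is only an illustration, not something from the original post: the kernel names store_padded and store_packed and the helper run_store3_demo are hypothetical, and it assumes the output buffers come from cudaMalloc (so they are at least 16-byte aligned). Option A pads each record to a float4 so one 16-byte store can cover it; option B keeps the packed layout and splits the 12 bytes into an 8-byte store plus a 4-byte store. Whether the compiler actually emits single wide stores is worth confirming in the SASS, for example with cuobjdump -sass.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Option A: pad each 3-float record to a float4 so that one 16-byte store
// can cover it. The w component is padding and costs 4 extra bytes per record.
__global__ void store_padded(float4 *out, const float *in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = make_float4(in[3 * i], in[3 * i + 1], in[3 * i + 2], 0.0f);
  }
}

// Option B: keep the packed layout and split 12 bytes into 8 + 4.
// Record i starts at byte offset 12 * i, which is 8-byte aligned only for
// even i, so odd records use 4 + 8 instead. Assumes `out` is 8-byte aligned.
__global__ void store_packed(float *out, const float *in, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  float x = in[3 * i], y = in[3 * i + 1], z = in[3 * i + 2];
  float *p = out + 3 * i;
  if ((i & 1) == 0) {
    *reinterpret_cast<float2 *>(p) = make_float2(x, y);      // 8-byte store
    p[2] = z;                                                // 4-byte store
  } else {
    p[0] = x;                                                // 4-byte store
    *reinterpret_cast<float2 *>(p + 1) = make_float2(y, z);  // 8-byte store
  }
}

// Hypothetical host-side check: runs both kernels and compares the results
// with the input. Returns 0 on success, 1 on a CUDA error or mismatch.
int run_store3_demo() {
  const int n = 1024;
  std::vector<float> h_in(3 * n), h_packed(3 * n);
  std::vector<float4> h_padded(n);
  for (int k = 0; k < 3 * n; ++k) h_in[k] = static_cast<float>(k);

  float *d_in = nullptr, *d_packed = nullptr;
  float4 *d_padded = nullptr;
  if (cudaMalloc(&d_in, 3 * n * sizeof(float)) != cudaSuccess) return 1;
  if (cudaMalloc(&d_packed, 3 * n * sizeof(float)) != cudaSuccess) return 1;
  if (cudaMalloc(&d_padded, n * sizeof(float4)) != cudaSuccess) return 1;
  cudaMemcpy(d_in, h_in.data(), 3 * n * sizeof(float), cudaMemcpyHostToDevice);

  store_padded<<<(n + 255) / 256, 256>>>(d_padded, d_in, n);
  store_packed<<<(n + 255) / 256, 256>>>(d_packed, d_in, n);
  cudaError_t err = cudaDeviceSynchronize();

  cudaMemcpy(h_packed.data(), d_packed, 3 * n * sizeof(float),
             cudaMemcpyDeviceToHost);
  cudaMemcpy(h_padded.data(), d_padded, n * sizeof(float4),
             cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_packed);
  cudaFree(d_padded);
  if (err != cudaSuccess) return 1;

  for (int i = 0; i < n; ++i) {
    if (h_padded[i].x != h_in[3 * i] || h_padded[i].y != h_in[3 * i + 1] ||
        h_padded[i].z != h_in[3 * i + 2]) return 1;
  }
  for (int k = 0; k < 3 * n; ++k) {
    if (h_packed[k] != h_in[k]) return 1;
  }
  return 0;
}

int main() {
  int rc = run_store3_demo();
  printf("%s\n", rc == 0 ? "ok" : "mismatch or CUDA error");
  return rc;
}
```

Option A trades 33% more memory traffic for a single store per record, so it tends to pay off only when the padded layout is also convenient for whoever reads the data later. Option B keeps the buffer compact at the cost of two stores and a branch on the record index.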
