How to convert a 32-bit operation to a 4-bit or 8-bit operation on a CPU?

To the best of my knowledge, existing quantization methods still operate on 32-bit values.
I want to quantize the weights of a CNN to reduce its memory footprint and then port the quantized model to a mobile device. How can I convert a 32-bit operation to a 4-bit or 8-bit operation on the CPU?

It is not clear what you are trying to do. You might want to look at CUDA’s SIMD intrinsics:

https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SIMD.html

I applied a quantization technique using PyTorch. The weights are quantized, but the computation on the CPU is not.

In other words, when I quantize the weights of the deep learning model, a value still looks like this on my CPU:
32-bit CPU : 0000 0000 … 0010

I want to make this work on a 4-bit-only CPU as follows:
4-bit CPU : 0010

The goal is to reduce the amount of computation.

I initially assumed a typo when the OP referred to the CPU. CUDA is a programming environment for GPUs.

I don’t know PyTorch.

To process multiple chunks of 8-bit data in SIMD fashion, you can use AVX2 intrinsics on an Intel or AMD CPU. There is an overview here:

https://software.intel.com/sites/landingpage/IntrinsicsGuide/

I don’t know of any processors that offer nibble-wise SIMD instructions (nibble = 4 bits). You could write the code yourself, processing 16 nibbles at once in a 64-bit scalar register. While this would be emulated using multiple native CPU instructions, it could still be faster because the instruction sequence would produce 16 results at once.
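To illustrate the do-it-yourself approach, here is a minimal sketch in C of a nibble-wise add on 16 packed 4-bit lanes held in one 64-bit integer. The function names (pack_nibbles, get_nibble, add_nibbles) and the lane layout (lane i occupies bits 4*i to 4*i+3) are my own choices for illustration, not part of any library. Each lane wraps modulo 16; a saturating add or a multiply would need a longer instruction sequence.

```c
#include <stdint.h>

#define NIBBLE_HI 0x8888888888888888ULL  /* top bit of every 4-bit lane   */
#define NIBBLE_LO 0x7777777777777777ULL  /* low 3 bits of every 4-bit lane */

/* Pack 16 values (only the low 4 bits of each are used) into one word.
   Lane i ends up in bits 4*i .. 4*i+3. */
uint64_t pack_nibbles(const uint8_t vals[16])
{
    uint64_t w = 0;
    for (int i = 0; i < 16; i++) {
        w |= (uint64_t)(vals[i] & 0xF) << (4 * i);
    }
    return w;
}

/* Extract lane i (0..15) from a packed word. */
uint8_t get_nibble(uint64_t w, int i)
{
    return (uint8_t)((w >> (4 * i)) & 0xF);
}

/* Add 16 packed 4-bit lanes at once. Each lane wraps modulo 16 and
   no carry crosses from one lane into the next. */
uint64_t add_nibbles(uint64_t a, uint64_t b)
{
    /* Add the low 3 bits of each lane. The sum is at most 14, so any
       carry lands in bit 3 of the same lane, never in a neighbor. */
    uint64_t low = (a & NIBBLE_LO) + (b & NIBBLE_LO);

    /* Bit 3 of the result is carry XOR a3 XOR b3; the carry is already
       sitting in bit 3 of 'low', so XOR in the top bits of a and b. */
    return low ^ ((a ^ b) & NIBBLE_HI);
}
```

The add itself takes five logical and arithmetic operations on a 64-bit register and yields 16 results, which is where the potential speedup over handling one value at a time comes from. For example, add_nibbles(0x123456789ABCDEF0, 0x1111111111111111) gives 0x23456789ABCDEF01: every lane is incremented by one, and the lane holding 0xF wraps to 0 without disturbing its neighbor.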

GPUs are currently 32-bit processors, and CPUs are commonly 64-bit processors these days, with 32-bit processors still used in embedded applications (costing as little as $2). If you want a true scalar 4-bit processor you would have to build one yourself, for example by using an FPGA.