Setting maxreg in Nsight

How do I set the maximum number of registers per thread in Nsight, or otherwise in the code?

The best approach is to use the launch bounds mechanism (the __launch_bounds__ qualifier), which is discussed in the programming guide. This won't depend on any Nsight settings.
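As a rough illustration of the launch bounds approach, here is a minimal sketch. The kernel name `scale`, the helper `run_scale_demo`, and the limits of 256 threads per block and 2 blocks per multiprocessor are made up for this example and are not from the thread; pick values that match how you actually launch your kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical limits, chosen only for illustration.
#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_SM 2

// __launch_bounds__ tells the compiler the largest block size this kernel
// will be launched with and, optionally, the minimum number of blocks that
// should be resident per multiprocessor. The compiler derives a per-thread
// register limit from these two numbers.
__global__ void __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_SM)
scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Runs the kernel on a small array and reports whether every element was scaled.
bool run_scale_demo()
{
    const int n = 1024;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float *dev = nullptr;
    if (cudaMalloc(&dev, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

    // The block size must not exceed MAX_THREADS_PER_BLOCK, or the launch fails.
    int blocks = (n + MAX_THREADS_PER_BLOCK - 1) / MAX_THREADS_PER_BLOCK;
    scale<<<blocks, MAX_THREADS_PER_BLOCK>>>(dev, 2.0f, n);

    bool ok = cudaMemcpy(host, dev, n * sizeof(float),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);
    for (int i = 0; i < n && ok; ++i) ok = (host[i] == 2.0f);
    return ok;
}

int main()
{
    printf("%s\n", run_scale_demo() ? "ok" : "failed");
    return 0;
}
```

Compiling with `nvcc -Xptxas -v` prints the register count the compiler ended up with for each kernel, so you can check whether the bounds had the effect you wanted.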

The nvcc compiler has a -maxrregcount switch which is covered in the nvcc manual.

If you poke around in the Nsight CUDA settings, you will find a place to specify it.

Another question: in order to save some shared memory, is there any way I can store some data in cache or some other place that has relatively fast read access?

Constant memory, i.e. __constant__, provides fast access (similar to registers) provided the access is uniform (all threads in a warp access the same data). Limitations: the data must be read-only within the kernel, and the size is limited to 64 KB minus whatever is needed by math library functions.
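A small sketch of what uniform constant memory access might look like. The names `coeffs`, `poly`, and `run_poly_demo`, and the idea of storing polynomial coefficients, are hypothetical choices for this example. Every thread reads the same `coeffs[k]` in each loop iteration, which is the uniform pattern described above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_COEFFS 4

// Read-only table in constant memory (hypothetical example data).
__constant__ float coeffs[NUM_COEFFS];

// Evaluates c0 + c1*x + c2*x^2 + c3*x^3 with Horner's rule.
__global__ void poly(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float acc = 0.0f;
        // All threads in a warp read coeffs[k] for the same k: uniform access.
        for (int k = NUM_COEFFS - 1; k >= 0; --k)
            acc = acc * x[i] + coeffs[k];
        y[i] = acc;
    }
}

// Fills x with 1.0, so the expected result is 1 + 2 + 3 + 4 = 10 everywhere.
bool run_poly_demo()
{
    const int n = 256;
    const float h_coeffs[NUM_COEFFS] = {1.0f, 2.0f, 3.0f, 4.0f};
    float h_x[n], h_y[n];
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    // Constant memory is written from the host with cudaMemcpyToSymbol.
    if (cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs)) != cudaSuccess)
        return false;

    float *d_x = nullptr, *d_y = nullptr;
    if (cudaMalloc(&d_x, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&d_y, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_x);
        return false;
    }
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    poly<<<(n + 127) / 128, 128>>>(d_x, d_y, n);

    bool ok = cudaMemcpy(h_y, d_y, n * sizeof(float),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_x);
    cudaFree(d_y);
    for (int i = 0; i < n && ok; ++i) ok = (h_y[i] == 10.0f);
    return ok;
}

int main()
{
    printf("%s\n", run_poly_demo() ? "ok" : "failed");
    return 0;
}
```

If each thread indexed `coeffs` with its own thread-dependent index instead, the reads within a warp would be serialized and most of the benefit would be lost.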

In Nsight maxregcount is specified in:

project properties > build > Settings > Tool Settings tab > Optimization

Besides shared memory and constant memory, there is also a read-only cache. The compiler might use it for you (if you qualify your pointers with const and __restrict__), and you can force its use with the __ldg() intrinsic.
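Both ways of reaching the read-only cache can be sketched in one kernel. The kernel `add_readonly` and the helper `run_readonly_demo` are invented for this example. Note that __ldg() requires a device of compute capability 3.5 or higher, and that `const` plus `__restrict__` only gives the compiler permission to use the read-only path; it does not guarantee it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// a: loaded explicitly through the read-only cache with __ldg().
// b: marked const and __restrict__, so the compiler may choose the
//    read-only path on its own (it is allowed to, not required to).
__global__ void add_readonly(const float *a,
                             const float *__restrict__ b,
                             float *__restrict__ out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __ldg(&a[i]) + b[i];
}

// Adds an array of 1.0s to an array of 2.0s and checks for 3.0 everywhere.
bool run_readonly_demo()
{
    const int n = 512;
    const size_t bytes = n * sizeof(float);
    float h_a[n], h_b[n], h_out[n];
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a = nullptr, *d_b = nullptr, *d_out = nullptr;
    bool ok = cudaMalloc(&d_a, bytes) == cudaSuccess &&
              cudaMalloc(&d_b, bytes) == cudaSuccess &&
              cudaMalloc(&d_out, bytes) == cudaSuccess;
    if (ok) {
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        add_readonly<<<(n + 127) / 128, 128>>>(d_a, d_b, d_out, n);

        ok = cudaMemcpy(h_out, d_out, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    // cudaFree accepts a null pointer, so this is safe after a failed malloc.
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    for (int i = 0; i < n && ok; ++i) ok = (h_out[i] == 3.0f);
    return ok;
}

int main()
{
    printf("%s\n", run_readonly_demo() ? "ok" : "failed");
    return 0;
}
```

The data read this way must not be written by any thread while the kernel is running, otherwise the cached values may be stale.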