How do V100 and A100 GPUs handle two users submitting jobs at the same time, without any special partitioning?

The memory is shared. If user A allocates 30 GB on a 32 GB GPU, user B can allocate at most the remaining 2 GB (in practice somewhat less, since each process's CUDA context also takes up some device memory). Any larger request fails with a CUDA out-of-memory error.
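To make the arithmetic concrete, here is a toy model in Python of a single memory pool shared by two users. It is only a sketch: `SharedGpuMemory` and its methods are hypothetical names made up for illustration, not a CUDA API, and the model ignores the per-process context overhead.

```python
class SharedGpuMemory:
    """Toy model of one GPU memory pool shared by all users (not a CUDA API)."""

    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.used_gb = {}  # user name -> GB currently held

    def free_gb(self):
        return self.capacity_gb - sum(self.used_gb.values())

    def allocate(self, user, size_gb):
        # Every request draws on the same pool, first come, first served.
        if size_gb > self.free_gb():
            raise MemoryError(
                f"{user}: out of memory "
                f"({size_gb} GB requested, {self.free_gb()} GB free)"
            )
        self.used_gb[user] = self.used_gb.get(user, 0) + size_gb


gpu = SharedGpuMemory(capacity_gb=32)
gpu.allocate("A", 30)      # user A takes 30 GB
gpu.allocate("B", 2)       # user B still fits in the remaining 2 GB
try:
    gpu.allocate("B", 1)   # nothing left, so this request fails
except MemoryError as err:
    print(err)
```

On a real GPU the failure shows up as an out-of-memory error from the CUDA runtime in user B's process; user A's existing allocations are not affected.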

The compute resources are time-sliced. The details of the time slicing are neither published nor controllable. On both V100 and A100, kernels from user A are allowed to run for a period of time and are then halted (if not yet complete), after which kernels from user B are allowed to run for a period of time. This A/B alternation continues until all kernels have finished.
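The alternation can be pictured with a small round-robin simulation in Python. This is a toy model under loud assumptions: the real slice length and scheduling policy are not published, so the fixed `slice_ms` value and the function `time_slice` are purely hypothetical.

```python
from collections import deque


def time_slice(work_ms, slice_ms):
    """Round-robin toy model.

    work_ms maps each user to the total kernel time they still need.
    Returns a list of (user, ms run in that turn) in execution order.
    """
    queue = deque(work_ms.items())
    timeline = []
    while queue:
        user, remaining = queue.popleft()
        ran = min(slice_ms, remaining)
        timeline.append((user, ran))
        if remaining > ran:
            # Work is halted at the end of the slice and resumes later.
            queue.append((user, remaining - ran))
    return timeline


# User A needs 5 ms of GPU time, user B needs 3 ms, assumed slice is 2 ms.
print(time_slice({"A": 5, "B": 3}, slice_ms=2))
# prints [('A', 2), ('B', 2), ('A', 2), ('B', 1), ('A', 1)]
```

The point of the sketch is only that neither user gets the GPU exclusively: each job's wall-clock time grows because it waits while the other user's kernels are running.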