Functions like cudaMalloc() take the size as an argument of type ‘size_t’. This is an unsigned 64-bit integer type on all platforms supported by CUDA. Your variables like ‘A_num_rows’ are presumably of type ‘int’, a signed 32-bit integer type on all platforms supported by CUDA.
The compiler warns that the size computation is performed using ‘int’ arithmetic and overflows; the already-overflowed 32-bit result is then converted to the 64-bit argument type. That’s not what you want. You want the correct size computed as a 64-bit quantity, using 64-bit arithmetic throughout.
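If the allocation in your code looks roughly like the sketch below (the names ‘dA’ and ‘A_num_rows’ are assumptions based on your post), the overflow happens in the int-by-int multiplication, before anything is ever widened to ‘size_t’:

double *dA;
int A_num_rows = 164986;
// A_num_rows * A_num_rows is evaluated as 'int' * 'int' and overflows;
// only the already-wrapped result is then converted to 'size_t':
cudaMalloc((void **)&dA, A_num_rows * A_num_rows * sizeof(double));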
The best-practices idiom for this kind of computation is therefore to put the sizeof() part first:
sizeof(double) * A_num_rows * A_num_rows
Since the result of sizeof() is of type ‘size_t’, and the multiplications are evaluated left to right, all subsequent computation is performed using 64-bit unsigned integer arithmetic, so no intermediate product can overflow.
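Put together, a minimal sketch of the corrected allocation might look like this (again, ‘dA’ and ‘A_num_rows’ are names assumed from your post; add whatever error handling you normally use):

#include <cuda_runtime.h>

double *dA = NULL;
int A_num_rows = 164986;
// sizeof(double) has type 'size_t', so the product is carried out in
// 64-bit unsigned arithmetic from the first multiplication onward:
size_t bytes = sizeof(double) * A_num_rows * A_num_rows;
cudaError_t err = cudaMalloc((void **)&dA, bytes);
if (err != cudaSuccess) {
    // handle allocation failure, e.g. report cudaGetErrorString(err)
}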
You certainly won’t be able to store a dense matrix of 164986 x 164986 ‘double’ elements on the GPU, as that would require 218 GB of storage, whereas the maximum on-board memory of GPUs is ≤ 48 GB (the Quadro RTX 8000 is currently the GPU with the largest on-board memory).
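For reference, that figure comes from: 164986 * 164986 * 8 bytes = 217,763,041,568 bytes, i.e. about 218 GB.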