Optimizing High Performance Linpack

I am noticing some odd behavior when running the version of HPL downloaded from the dev center. No matter what N I choose, the memory usage is always the same. One would think that an N of 20000 would use much less memory than an N of 40000. However, usage is always stuck at the same ~2GB mark when I look at nvidia-smi. If I go high enough, I get a “failed to connect to cudaHostRegister memory” error, which most likely means I exceeded my memory limit. Below this limit, however, I am still only using ~2GB of memory. My Tesla has much more than 2GB of memory.
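
To put numbers on that expectation: HPL factors a dense N x N matrix of double-precision values, so the matrix alone should take roughly 8 * N^2 bytes in total across all processes. Here is a quick sketch of that arithmetic in Python. The function name is my own, and the estimate ignores HPL's workspace buffers and makes no assumption about whether this build keeps the matrix in host or GPU memory.

```python
def hpl_matrix_bytes(n: int, bytes_per_element: int = 8) -> int:
    """Bytes needed to store a dense n x n matrix (8 bytes per double)."""
    return n * n * bytes_per_element


# Compare the two problem sizes mentioned above.
for n in (20000, 40000):
    gib = hpl_matrix_bytes(n) / 2**30
    print(f"N={n}: {gib:.2f} GiB")
# prints:
# N=20000: 2.98 GiB
# N=40000: 11.92 GiB
```

Doubling N should quadruple the matrix footprint, which is why a constant ~2GB reading looks wrong to me.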

Also, if I tell HPL through HPL.dat to run a variety of configurations, there is no way of knowing which one it is running other than from the P, Q, NDIM or N numbers. In other words, how do I tell from the output which PFACT, broadcast type, etc. was used for a given run? I do not know how someone is supposed to optimize without being able to see this information.