Hi, @suohaoran1
Thanks for testing. Here are the results from our own tests for your reference:
- r36.4.4: allocating a 3GB/4GB/5GB/6GB CUDA buffer works, but allocating 7GB triggers a system hang and reboot.
- r36.4.7: allocating 4GB occasionally fails (depending on the cache size).
Dropping the cache, or testing on a freshly rebooted system, allows CUDA allocations of up to 4GB.
This might help in some use cases (e.g., llama3.2:3b):
$ sudo su
# sync && echo 3 > /proc/sys/vm/drop_caches
# exit
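Since the r36.4.7 result depends on the cache size, it may help to check how much memory the page cache holds before and after dropping it. Below is a minimal sketch that reads the standard `Cached` and `MemAvailable` fields from /proc/meminfo. The helper names `meminfo_kb` and `report_mem` are made up for illustration and are not part of any NVIDIA tool.

```shell
#!/bin/sh
# Hypothetical helpers (not part of JetPack or any NVIDIA tool).

# Print the value in kB of one /proc/meminfo field, e.g. Cached or MemAvailable.
# The optional second argument is an alternative file path, used only for testing.
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Print the page cache size and the available memory in MiB on one line.
report_mem() {
    file="${1:-/proc/meminfo}"
    cached=$(meminfo_kb Cached "$file")
    avail=$(meminfo_kb MemAvailable "$file")
    echo "Cached: $((cached / 1024)) MiB, MemAvailable: $((avail / 1024)) MiB"
}
```

Calling `report_mem` once before and once after the drop_caches command should show `Cached` shrinking and `MemAvailable` growing. This only reports system memory and does not guarantee that a given CUDA allocation will succeed.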
Please note that our internal team is actively working on this issue.
We will keep this topic updated.
Thanks.