Even though my card has 2651 MB of memory, any time I try to use cudaMalloc to allocate more than 1151 MB in a single call, I get the error: Runtime API Error: out of memory.
Note that this only happens when I request more than 1151 MB in a single cudaMalloc call. If I split the request up into smaller chunks, it works fine.
Do you know of any acceptable workaround? (Other than distributing my data structure across several, noncontiguous chunks of memory.) For example, if I issue multiple cudaMalloc requests consecutively, and I am the only person using the device, can I then treat the union of those memory blocks as one large memory block?
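The chunked workaround described above can be sketched as follows. This is a minimal illustration, not code from the thread; the 2000 MB total and 1 GB chunk size are arbitrary assumptions, and it requires a CUDA-capable GPU to run. Note the caveat in the comments, which bears on the question: separate cudaMalloc calls are not guaranteed to return contiguous addresses, so the chunks cannot safely be treated as one flat buffer.

```cpp
// Sketch: allocate a large buffer as several smaller chunks when a single
// cudaMalloc of the full size fails under the WDDM allocation limit.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const size_t total = 2000ULL << 20;   // 2000 MB requested in total (assumption)
    const size_t chunk = 1024ULL << 20;   // 1 GB per cudaMalloc call (assumption)
    std::vector<void*> blocks;

    size_t remaining = total;
    while (remaining > 0) {
        size_t n = remaining < chunk ? remaining : chunk;
        void* p = nullptr;
        if (cudaMalloc(&p, n) != cudaSuccess) {
            fprintf(stderr, "cudaMalloc of %zu MB failed\n", n >> 20);
            break;
        }
        blocks.push_back(p);
        remaining -= n;
    }

    // Caveat: the returned pointers are NOT guaranteed to be contiguous,
    // even when nothing else is using the device. Kernels must index into
    // the correct chunk rather than treating the union as one large block.
    for (void* p : blocks) cudaFree(p);
    return 0;
}
```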
Alternatively, I could install CentOS. Do you know whether this issue affects the Linux drivers?
Linux is completely unaffected by these allocation limits. The TCC driver is also unaffected; you can use it with a C1060 if you don't need decent display output, since TCC can't currently coexist with standard NVIDIA WDDM devices.
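If I understand correctly, switching a supported card to the TCC driver model on Windows can be done with nvidia-smi; a sketch (run from an administrator shell, GPU index 0 assumed, reboot required afterwards):

```shell
# -dm / --driver-model: 0 = WDDM, 1 = TCC (requires admin rights and a reboot)
nvidia-smi -i 0 -dm 1
```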
> It sounds like the reason a single allocation is limited is that, under WDDM, every device allocation must be backed by (pageable to) system memory, so an allocation larger than the system memory available to the GPU could never fit in a system-memory page.
Using the TCC compute driver bypasses this paging requirement and gives you full control of device memory (at least, that is my understanding).