I’m working on the Jetson TK1 and have run into a performance problem.
If I allocate a large array, say 200 MB, with cudaMallocManaged and then access much of it in a kernel I’ve written, performance is good.
However, if I make the array 400 MB, my program runs slower even though the kernel does the same amount of work. It seems as if I’m being penalized simply for having a larger allocation.
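To make this concrete, here is a stripped-down sketch of my setup; the `touch` kernel and the exact sizes are placeholders for my real code, but the pattern is the same: one large managed allocation, a kernel doing a fixed amount of work, then a cudaDeviceSynchronize.

```
#include <cuda_runtime.h>

// Placeholder for my real kernel: touches a fixed-size window of the
// array, so the amount of work doesn't change when the array grows.
__global__ void touch(float *data, size_t workElems)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < workElems)
        data[i] += 1.0f;
}

int main()
{
    const size_t N = 400u * 1024u * 1024u / sizeof(float);       // 200 MB vs 400 MB
    const size_t workElems = 50u * 1024u * 1024u / sizeof(float); // fixed amount of work

    float *data = nullptr;
    cudaMallocManaged(&data, N * sizeof(float));

    touch<<<(unsigned)((workElems + 255) / 256), 256>>>(data, workElems);
    cudaDeviceSynchronize(); // this call is where the extra time shows up

    cudaFree(data);
    return 0;
}
```

The synchronize after the launch is what gets slower when N doubles, even though workElems stays the same.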
My best guess is that each cudaDeviceSynchronize is “copying” the entire array from device to host, as pointed out in bullet 5 of this thread:
I don’t understand why this copy takes place on a TK1 at all, since the device has physically unified memory.
When I allocate the array with cudaHostAlloc instead of managed memory, performance is dramatically worse. The one function that takes much longer makes some atomicAdd calls, but I don’t understand why managed memory would behave any differently from cudaHostAlloc on a device with physically unified memory.
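For comparison, this is roughly the zero-copy variant I tried; the `accumulate` kernel is a simplified stand-in for my real atomicAdd-heavy function, and the index array setup is omitted:

```
#include <cuda_runtime.h>

// Simplified stand-in for my slow function: lots of atomicAdd calls.
__global__ void accumulate(float *data, const int *idx, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&data[idx[i]], 1.0f);
}

int main()
{
    const size_t N = 400u * 1024u * 1024u / sizeof(float);

    // Must be set before any allocations for mapped memory to work.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Pinned, mapped (zero-copy) host allocation instead of managed.
    float *hostPtr = nullptr;
    cudaHostAlloc(&hostPtr, N * sizeof(float), cudaHostAllocMapped);

    // Get the device-side alias of the host pointer.
    float *devPtr = nullptr;
    cudaHostGetDevicePointer(&devPtr, hostPtr, 0);

    // ... launch accumulate<<<blocks, 256>>>(devPtr, idx, n) as before ...

    cudaFreeHost(hostPtr);
    return 0;
}
```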
And if I have to use managed memory, is there a way to increase the buffer size without paying for the entire array being copied at every synchronize?
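One idea I had from reading the programming guide is to attach the allocation to a single stream with cudaStreamAttachMemAsync, so that synchronizing other streams doesn’t touch it. This is an untested sketch, and I’m not sure it changes anything on the TK1:

```
#include <cuda_runtime.h>

int main()
{
    const size_t N = 400u * 1024u * 1024u / sizeof(float);

    float *data = nullptr;
    cudaMallocManaged(&data, N * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Attach the allocation to this one stream so that only work
    // submitted to it triggers coherence operations on the array.
    cudaStreamAttachMemAsync(stream, data, 0, cudaMemAttachSingle);
    cudaStreamSynchronize(stream);

    // ... launch kernels on `stream` and use cudaStreamSynchronize
    //     instead of cudaDeviceSynchronize ...

    cudaStreamDestroy(stream);
    cudaFree(data);
    return 0;
}
```

Would something like this help, or am I misreading what stream attachment does?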