I hope someone can help me with this issue. I've implemented a simple device kernel that modifies a float array through the device pointer. The array is allocated with cudaHostAlloc (mapped memory), and after synchronizing the CPU and GPU I read it through the host pointer to check the results. The problem is that it works correctly for small sizes, but once more than 30 MB is allocated it no longer works correctly. I don't understand what I'm missing here.
Does it have to do with the block and grid dimensions, or with the number of threads?
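For reference, here is a minimal sketch of the kind of setup described above. It is not the original code: the kernel name `addOne`, the helper `runCheck`, the block size of 256, the grid size of 1024, and the 40 MB test size are all illustrative assumptions. It uses a grid-stride loop so that the result does not depend on the grid covering the whole array, and it checks the launch error explicitly. A launch that requests more blocks than the device allows in one grid dimension fails without running the kernel, and the failure stays invisible unless the error is queried. Whether that is the cause here cannot be told from the description alone.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1.0f to every element of the array.
// The grid-stride loop lets any grid size process all n elements.
__global__ void addOne(float *data, size_t n)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        data[i] += 1.0f;
}

// Hypothetical helper: returns the number of wrong elements, or -1 on a CUDA error.
long long runCheck(size_t n)
{
    float *h = nullptr;  // host pointer to the mapped allocation
    float *d = nullptr;  // device alias of the same memory

    // Request mapped-memory support. On an already initialized device this
    // call may fail harmlessly, so the error is cleared instead of treated as fatal.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();

    if (cudaHostAlloc((void **)&h, n * sizeof(float), cudaHostAllocMapped) != cudaSuccess)
        return -1;
    for (size_t i = 0; i < n; ++i)
        h[i] = 0.0f;

    if (cudaHostGetDevicePointer((void **)&d, h, 0) != cudaSuccess) {
        cudaFreeHost(h);
        return -1;
    }

    const int threads = 256;   // assumed block size
    const int blocks  = 1024;  // assumed fixed grid size, well below any device limit
    addOne<<<blocks, threads>>>(d, n);

    // A bad launch configuration is reported here, not by the launch itself.
    cudaError_t launchErr = cudaGetLastError();
    // Wait for the kernel so that its writes are visible through the host pointer.
    cudaError_t syncErr = cudaDeviceSynchronize();
    if (launchErr != cudaSuccess || syncErr != cudaSuccess) {
        fprintf(stderr, "launch: %s, sync: %s\n",
                cudaGetErrorString(launchErr), cudaGetErrorString(syncErr));
        cudaFreeHost(h);
        return -1;
    }

    long long bad = 0;
    for (size_t i = 0; i < n; ++i)
        if (h[i] != 1.0f)
            ++bad;

    cudaFreeHost(h);
    return bad;
}

int main()
{
    // 10 Mi floats = 40 MB, chosen to be above the 30 MB mentioned in the question.
    size_t n = (size_t)10 * 1024 * 1024;
    long long bad = runCheck(n);
    printf("mismatches: %lld\n", bad);
    return bad == 0 ? 0 : 1;
}
```

If the real code instead computes the grid size from the array size (for example one thread per element), printing the result of cudaGetLastError() right after the launch for the failing size should show whether the launch configuration is being rejected.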
Thank you in advance.