I have an application where each PE (CPU thread) creates its own GPU context. At startup, each thread calls cudaMallocHost for buffer sizes ranging from pow(2,8) to pow(2,26) bytes, and each PE allocates multiple buffers of each size. With a smaller number of PEs, say 8, it works, but when I increase the PE count to 64 it throws the error "mapping of buffer object failed".
Any advice on how to approach debugging this would be appreciated.