I have an application where each PE (CPU thread) allocates its own GPU context. At initialization, each thread calls cudaMallocHost for the 19 sizes in memPoolBoundaries[19], ranging from pow(2,8) to pow(2,26) bytes, and each PE makes multiple buffer allocations per size. With a small number of PEs, say 8, this works, but when I increase the PE count to 64 it throws the error "mapping of buffer object failed".
All PEs execute this in the init phase:
for (int i = 0; i < 19; i++) {
    int bufSize = CpvAccess(gpuManager).memPoolBoundaries[i]; // pool buffer size (256, 512, 1024, ...)
    int numBuffers = nbuffers[i];                             // number of buffers to allocate at this size
    pools[i].size = bufSize;
    pools[i].head = NULL;
    Header *hd = pools[i].head;
    Header *previous = NULL;
    for (int j = 0; j < numBuffers; j++) {
        cudaChk(cudaMallocHost((void **)&hd, sizeof(Header) + bufSize)); // pinned host allocation
        if (previous == NULL) pools[i].head = hd; // first buffer becomes the list head
        else previous->next = hd;                 // chain buffers into the pool's free list
        previous = hd;
    }
}
It fails on the cudaMallocHost call with the error "mapping of buffer object failed".