cudaMallocPitch fails when multiple GPUs are controlled by separate CPU processes despite the fac...

I have an NVIDIA Quadro K2100M with 2 GB of memory and compute capability 3.0.
I’m using Visual Studio 2013 on the x64 platform.

I’m using this test to learn how cudaMallocPitch works:
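#include <cstdio>
#include <iostream>
#include <cuda_runtime.h>

// Print the status of the most recent CUDA runtime call.
#define CUDA_CHECK_ERROR(stmt, fname, line) { \
    cudaError_t cudaStatus = cudaGetLastError(); \
    if (cudaStatus != cudaSuccess) { \
        printf("File: %s\nLine: %i\nCUDA statement: \n%s\nCUDA error - %08d\n\n", fname, line, stmt, (int)cudaStatus); \
        printf("CUDA error information: %s\n", cudaGetErrorString(cudaStatus)); \
    } else { \
        printf("File: %s\nLine: %i\nCUDA statement: \n%s\nCUDA error - %08d\n\n", fname, line, stmt, (int)cudaStatus); \
        printf("CUDA No error!!!\n"); \
    } \
}

// Run a CUDA statement, then report its status.
// The tail of this macro was cut off in the post; the body below is an assumed reconstruction.
#define CUDA(stmt) do { \
    stmt; \
    CUDA_CHECK_ERROR(#stmt, __FILE__, __LINE__); \
} while (0)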

int main()
{
    unsigned int TestSize = 104857600;       // cudaMalloc test: request size in bytes
    unsigned int TestSizeWidth = 10485760;   // cudaMallocPitch test: width in bytes
    unsigned int TestSizeHeight = 10485760;  // cudaMallocPitch test: number of rows
    unsigned int Index = 0;
    int *dev_c = 0;
    size_t free;
    size_t total;
    size_t pitch;

    CUDA(cudaMemGetInfo(&free, &total));
    std::cout << " Begin - Available heap memory: " << (float)(free / 1048576.0f) << "MB" << std::endl;

    while (1)
    {
        //CUDA(cudaMalloc((void**)&dev_c, TestSize));
        CUDA(cudaMallocPitch((void**)&dev_c, &pitch, TestSizeWidth, TestSizeHeight));

        CUDA(cudaMemGetInfo(&free, &total));
        std::cout << " Available heap memory: " << (float)(free / 1048576.0f) << "MB" << std::endl;

        // The release step is missing from the pasted code; cudaFree is assumed
        // here because the next message reports memory "After release".
        CUDA(cudaFree(dev_c));
        dev_c = 0;

        CUDA(cudaMemGetInfo(&free, &total));
        std::cout << " After release - Available heap memory: " << (float)(free / 1048576.0f) << "MB";

        TestSize += 104857600;
        Index++;  // assumed: the increment is also missing from the pasted code
        std::cout << " Iteration index: " << Index << std::endl;
    }
}

On every loop iteration, the allocation size requested from GPU memory grows by 100 MB.
With the cudaMalloc API, the loop gets stuck during iteration 20: the last printed Index value is 19.
This is exactly what I expected, since the total GPU memory is 2 GB and I request 100 MB more on each iteration, starting from 100 MB.

But when I run the same loop with the cudaMallocPitch API, which also asks for 100 MB by passing a width of 10 MB and a height of 10 MB, the very first call fails and returns error number 2, which is the CUDA cudaErrorMemoryAllocation error.

Please advise.

10M times 10M is not 100M

10M times 10M is 100 Trillion

You’re not asking for 100MB in the failing case, you’re asking for 100TB
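
The arithmetic in this reply can be checked with a few lines of host-side C++. This is a minimal sketch, not CUDA code: `pitched_request_bytes` and `check_thread_examples` are hypothetical helpers introduced for illustration, and the result is only a lower bound, since cudaMallocPitch may pad each row beyond the requested width.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kMiB = 1024ULL * 1024ULL;
constexpr std::uint64_t kTiB = kMiB * kMiB;

// Hypothetical helper (not part of CUDA): lower bound on the bytes reserved by
// cudaMallocPitch(&ptr, &pitch, width, height), where width is the row width
// in bytes and height is the number of rows. Row padding can only add to this.
constexpr std::uint64_t pitched_request_bytes(std::uint64_t width_bytes,
                                              std::uint64_t height_rows)
{
    return width_bytes * height_rows;
}

// Checks for the two cases relevant to the thread.
inline void check_thread_examples()
{
    // The failing call: width = height = 10485760, i.e. 100 TiB.
    assert(pitched_request_bytes(10485760, 10485760) == 100 * kTiB);
    // One shape that does total 100 MiB: 10240 bytes wide by 10240 rows.
    assert(pitched_request_bytes(10240, 10240) == 100 * kMiB);
}
```

The 10240 by 10240 shape is just one example of dimensions whose product is 100 MiB; any width and height with that product would do for the test.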

You are totally right. Sorry for wasting your time…
Conclusion - Don’t work late at night!
The issue can be closed.

Actually, I wrote this test to learn how cudaMallocPitch behaves, because I got the same error when using it with GeForce GTX 1080 Ti and/or GeForce GTX 1080 GPUs that are part of a system with 4 GPUs (one 1080 Ti and three 1080s).

Each GPU is controlled by a dedicated CPU thread, which calls cudaSetDevice with the right device index when it starts running.

The application reads from a configuration file how many CPU threads to create.

I can also run my application several times as separate processes, each of which controls a different GPU.

I’m using OpenCV version 3.2 to perform image background subtraction.

First, you create the BackgroundSubtractorMOG2 object with cv::cuda::createBackgroundSubtractorMOG2, and then you call its apply method.

The first time the apply method is called, all the required memory is allocated once.

My image size is 10000 cols by 7096 rows. Each pixel is 1 byte (grayscale).

When I run my application as a single process with several threads (one per GPU), everything works fine. But when I run it 4 times as separate processes (one per GPU), the OpenCV apply function starts to fail with a cudaMallocPitch ‘not enough memory’ error.

For all GPUs, I verified that enough memory was available before apply was called for the first time. Each 1080 reports ~5.5 GB free and the 1080 Ti reports ~8.3 GB, while the requested size is width 120000 bytes by height 21288 rows, i.e. ~2.4 GB.
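
For reference, the sizes quoted above work out as follows. This is a host-side sketch only: the constant and function names are illustrative, the free-memory figure is the rounded ~5.5 GB value from the post (taken here as 5.5e9 bytes, an assumption), and the byte counts are lower bounds because cudaMallocPitch may pad rows.

```cpp
#include <cassert>
#include <cstdint>

// Values quoted in the post; the names are illustrative only.
constexpr std::uint64_t kFrameCols = 10000;           // 1 byte per pixel
constexpr std::uint64_t kFrameRows = 7096;
constexpr std::uint64_t kRequestWidthBytes = 120000;  // width passed to cudaMallocPitch
constexpr std::uint64_t kRequestHeightRows = 21288;   // height passed to cudaMallocPitch
constexpr std::uint64_t kFree1080Bytes = 5500000000ULL;  // "~5.5GB", rounded (assumption)

// Size of one grayscale frame in bytes.
constexpr std::uint64_t frame_bytes() { return kFrameCols * kFrameRows; }

// Lower bound on the failing request in bytes (padding not included).
constexpr std::uint64_t request_bytes()
{
    return kRequestWidthBytes * kRequestHeightRows;
}

// Convert a byte count to GiB for comparison with reported free memory.
constexpr double to_gib(std::uint64_t bytes)
{
    return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}

// True when the request is smaller than the reported free device memory.
constexpr bool fits_in_reported_free(std::uint64_t free_bytes)
{
    return request_bytes() < free_bytes;
}
```

With these numbers the request is 2,554,560,000 bytes, which is 36 frames' worth of data and well under the free memory reported on the smaller GPUs, so the failure is not explained by device memory alone.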

Please advise.

The source of the problem was found:

cudaMallocPitch returned cudaErrorMemoryAllocation because no OS virtual memory was available; the OS uses that virtual memory when the process performs read/write accesses to GPU physical memory.

Because of that, the CUDA driver fails any kind of GPU physical memory allocation.

The difficulty here was figuring out why this API fails while enough GPU physical memory exists (as checked with the cudaMemGetInfo API).
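
The reasoning above can be written down as a small decision helper. This is only a sketch under stated assumptions: `AllocDiagnosis` and `classify_alloc_failure` are hypothetical names, the inputs are assumed to come from the failing allocation call and from cudaMemGetInfo, and the "suspect host virtual memory" verdict is a hint to go and check the OS side, not a proof (device memory fragmentation is another possible cause).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical classification of an allocation attempt.
enum class AllocDiagnosis {
    Ok,                        // the allocation succeeded
    DeviceMemoryExhausted,     // request exceeds the reported free device memory
    SuspectHostVirtualMemory   // request fits on the device, yet the call failed
};

// alloc_succeeded:   whether the allocation call returned cudaSuccess
// requested_bytes:   lower bound on the request (width in bytes times rows)
// free_device_bytes: free memory reported by cudaMemGetInfo before the call
inline AllocDiagnosis classify_alloc_failure(bool alloc_succeeded,
                                             std::uint64_t requested_bytes,
                                             std::uint64_t free_device_bytes)
{
    if (alloc_succeeded) {
        return AllocDiagnosis::Ok;
    }
    if (requested_bytes > free_device_bytes) {
        return AllocDiagnosis::DeviceMemoryExhausted;
    }
    return AllocDiagnosis::SuspectHostVirtualMemory;
}
```

For the case in this thread (a request of about 2.55e9 bytes against roughly 5.5e9 bytes free), the helper returns `SuspectHostVirtualMemory`, which is the cue to look at the paging file and the process's virtual memory usage.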

I looked into two questions:

1. Why don’t I have enough virtual memory on my PC? Following the instructions in this link, I changed the paging file size and the problem disappeared: How To Optimize The Paging File In Windows

2. Why does my process consume so much OS virtual memory? In the past I found that, for better performance at processing time, I should allocate all the required GPU physical memory only once at the beginning, because an allocation takes a long time, depending on the requested size. Since I work with frames of ~70 MB and my processing logic requires a large number of auxiliary buffers, massive GPU and CPU memory areas had to be allocated, which exhausted the available OS virtual memory.