Splitting large datasets to fit into device memory Algorithm, implementation, and problems

Mr_Nuke · April 25, 2009, 5:01pm

As some of you may have found the hard way, many problems approachable by CUDA are limited by the amount of global memory. The short story is that the dataset to be processed simply doesn’t fit on the GPU.

The long story is that we realize this, and set up a scheme to split the data into several segments to be processed by different kernel calls. I am going to discuss the memory allocation routine for one of my kernels, how it works, and where and why it fails.

Background (feel free to skip):

We are calculating electric field lines from a set of starting points, and a set amount of static point chages. We have n field lines, and thus the big array holding the field lines is called the ‘n’ array.

Suppose n[sub]i[/sub] represents a field line; then n[sub]ij[/sub] represents the j[sup]th[/sup] element of n[sub]i[/sub]; n[sub]i0[/sub] represents the first element, n[sub]i1[/sub] represents the second element, and so on. The array is organized as n[sub]00[/sub] n[sub]10[/sub] n[sub]20[/sub] n[sub]30[/sub] n[sub]40[/sub] … n[sub]01[/sub] n[sub]11[/sub] n[sub]21[/sub] n[sub]31[/sub] n[sub]41…[/sub] n[sub]02[/sub] n[sub]12[/sub] n[sub]22[/sub] n[sub]32[/sub] n[sub]42[/sub], etc. Ignoring the complications when trying to display the data, this is the best arrangement for the kernel, as it enables linear reads from the memory for each step, since the (j+1)[sup]th[/sup] step is dependent on the j[sup]th[/sup] step.

In order to achieve coalescing, we split each 3-component vector n[sub]ij[/sub] into a 2-component xy and 1-component z vector thus you will see an xyAlloc and zAlloc stage, but they bith refer to the n array.

We also have p point charges, and we call that the ‘p’ array, or the ‘pee’ array. We absolutely need the entire p array in device memory in order to be able to compute the j[sup]th[/sup] step of n without writing a new kernel.

If the n array doesn’t fit in the memory, we can split it along the i or j directions. Since the other routines were already set up for splitting along the i direction from the multi-GPU scheme, I decided to split the data once again along the i direction. Also, splitting it along the i direction makes things more interesting, as we need to worry about efficiently using all the stream processors, and leaving no partially occupied warps, a concept I like to call multiplicity. The concept applies to more generic splitting schemes. That, and I like to take the path of most resistance.

End of Background

The question is just how much memory to allocate so that

a) We don’t get an “out of memory” error, and

B) We consider multiplicity, and don’t idiotically divide n by the number of segments.

The first step is to determine the amount of allocable memory, or safeRAM. I will discuss later why this is a huge problem, but for now, let’s just take it as half of the total global memory.

The question arises on how to partition the little memory that we have. In most kernels, there is a small amount of constant data that is needed in its entirety for the kernel to be able to process the data. In this case, the p array is such data. What we can do is determine how much the p array will occupy, and just subtract that from safeRAM. Hopefully, at this point, we still have enough memory for the n array.

Now we need to find a way to partition the n array efficiently. We are going to assume that we donâ€™t have enough memory to fit all data. The method I chose is to split the n array into minimal grids. These minimal grids should contain just enough data to fill al MPs. We know that each field line gets its own thread, and we know the dimensions of the block, bDim. Thatâ€™s enough information to calculate the amount of memory needed for a block,

blockRAM = bDim * elementsPerThread * sizeOfElement;

If we knew the minimum multiple of the number of blocks to put, we could construct the minimal grid. We need the multiplicity to be a multiple of the number of multiprocessors, to that all MPs are fully utilized. There are two ways to obtain it, either requested by the calling function, or simply as a multiple of the number of multiprocessors. For this implementation, I chose the number of multiprocessors as the multiplicity, if the calling function does not supply it. Therefore, the memory that we need for a grid becomes

gridRAM = multiplicity * blockRAM;

That is also the minimum memory that we need to be able to call a kernel. We can check if that is available, and just return an error otherwise.

Since the minimal grid takes care of underutilization and multiplicity, all we have left to do is see how many grids we can fit in the available memory.

availGrids = safeRAM/gridRAM;

This division will round down, and we are guaranteed to be able to fit the availGrids in safeRAM. Now we need to determine the needed amount of grids:

neededGrids = (neededRAM + gridRAM - 1)/gridRAM;

Notice how this division rounds up, so we are guaranteed to have an extra grid if neededRAM is not a multiple of gridRAM.

At this point, we have two numbers to work with: neededGrids, and availGrids. We need to compute the number of blocks that we will fit per segment, and the number of segments, or kernel call we need to make to finish processing the dataset. We can do that by dividing neededGrids by availGrids;

*segments = ( (neededGrids + availGrids - 1) )/ availGrids;

Again, we round up to make sure we donâ€™t miss any grids. Now we need to compute the number of blocks we are going to allocate. Itâ€™s very tempting to just allocate as many blocks as we can (availGridsblocksPerGrid, or availGrids * multiplicity); however, when we need less grids than are available, and thus only one segment, we would just be wasting memory, so we only need to allocate neededGridsmultiplicity.

When we need more grids than are available, we have two options:

a) Simply allocate as many blocks as available, availGrids*multiplicity. This way each kernel call but the last will process as much data as possible, and the last will process any remainders.

B) Allocate (neededGrids/segments)*multiplicity blocks. Note that we would need to round up, so we would have ( (neededGrids + segments -1 ) / segments) * multiplicity. This way, each kernel call processes an equal amount of data

There are advantages and disadvantages to each of the above methods; I chose the latter. Thus, we have:

*blocksPerSeg = ( (neededGrids > availGrids) ? ( (neededGrids + *segments -1)/ *segments) : neededGrids ) * blockMultiplicity;

This takes care of both cases discussed above.

The rest of the code simply uses *blocksPerSeg to allocate memory, nothing complicated here. To keep the discussion thread cleaner I will discuss the one major fault of this algorithm in the next post.

Aaah, and this is the full implementation of the function:

template<class T>

cudaError_t CalcField_GPUmalloc(PointChargeArray<T> *chargeData, CoalescedFieldLineArray<T> *GPUlines,

								const unsigned int bDim, size_t *segments, size_t* blocksPerSeg, size_t blockMultiplicity = 0, const bool useFullRAM = false)

{

	//--------------------------------Sizing determination--------------------------------------//

	// Find the available memory for the current GPU

	int currentGPU; cudaGetDevice(&currentGPU);

	cudaDeviceProp currentProp; cudaGetDeviceProperties(&currentProp, currentGPU);

	// If multiplicity is not specifies, take it as the number of multiprocessors on the device

	if(!blockMultiplicity) blockMultiplicity = currentProp.multiProcessorCount;

	// Compute the amount of memory that is considered safe for allocation

	size_t safeRAM = currentProp.totalGlobalMem;

	if(!useFullRAM) safeRAM /= 2; // Half of the total global memory is considered safe

	// Find the memory needed for the point charges

	chargeData->paddedSize = ((chargeData->charges + bDim -1)/bDim) * bDim * sizeof(pointCharge<T>);

	// Compute the available safe memory for the field lines

	safeRAM -= chargeData->paddedSize;	// Now find the amount remaining for the field lines

	// FInd the total amount of memory required by the field lines

	const size_t fieldRAM = GPUlines->steps*GPUlines->lines*sizeof(Vector3<T>);

	// Find the memory needed for one grid of data, containing 'blockMultiplicity' blocks

	const size_t gridRAM = blockMultiplicity * bDim * sizeof(Vector3<T>) * GPUlines->steps;

	const size_t availGrids = safeRAM/gridRAM;

	const size_t neededGrids = (fieldRAM + gridRAM - 1)/gridRAM;

	// Make sure enough memory is available for the entire grid

	if(gridRAM > safeRAM)

	{

		// Otherwise, the allocation cannot continue with specified multiplicity

		fprintf(stderr, " Cannot assign enough memory for requested multplicity: %u\n", blockMultiplicity);

		fprintf(stderr, " Minimum of %uMB available video RAM needed\n", (gridRAM + chargeData->paddedSize)/1024/1024);

		*segments = 0;

		return cudaErrorMemoryAllocation;

	}

	// Find the number of segments the kernel needs to be split into

	*segments = ( (neededGrids + availGrids - 1) )/ availGrids;	// (neededGrids + availGrids - 1)/availGrids

	*blocksPerSeg = ( (neededGrids > availGrids) ? ( (neededGrids + *segments -1)/ *segments) : neededGrids ) * blockMultiplicity;		// availGrids * blocksPerGrid ('blockMultiplicity')

	// Find the number of safely allocatable lines per kernel segment given the

	const size_t linesPerSeg = (*blocksPerSeg) * bDim;	// blocksPerSegment * blockDim	*(1 linePerThread)

#if _DEBUG

	// Check amount of needed memory

	size_t neededRAM = fieldRAM + chargeData->paddedSize;

	printf(" Call requires %u MB to complete, %u MB per allocation.\n",

		(unsigned int) neededRAM/1024/1024, (linesPerSeg * sizeof(Vector3<T>) * GPUlines->steps + chargeData->paddedSize)/1024/1024);

#endif

	enum mallocStage {chargeAlloc, xyAlloc, zAlloc};

	cudaError_t errCode;

	//--------------------------------Point charge allocation--------------------------------------//

	errCode = cudaMalloc((void**)& chargeData->chargeArr, chargeData->paddedSize);

	if(errCode != cudaSuccess)

	{

		fprintf(stderr, " Error allocating memory in function %s at stage %u on GPU%i.\n", __FUNCTION__, chargeAlloc, currentGPU);

		fprintf(stderr, " Failed batch size %u\n", chargeData->paddedSize);

		fprintf(stderr, "\t: %s\n", cudaGetErrorString(errCode));

		return errCode;

	};

	//--------------------------------Field lines allocation--------------------------------------//

	const size_t xyCompSize = (sizeof(Vector3<T>)*2)/3; 

	const size_t zCompSize = (sizeof(Vector3<T>))/3;

	// Here we are going to assign padded memory on the device to ensure that transactions will be 

	// coalesced even if the number of lines may create alignment problems

	// We use the given pitches to partition the memory on the host to mimic the device

	// This enables a blazing-fast linear copy between the host and device

	errCode = cudaMallocPitch((void**) &GPUlines->coalLines.xyInterleaved, &GPUlines->xyPitch, xyCompSize * linesPerSeg, GPUlines->steps);

	if(errCode != cudaSuccess)

	{

		fprintf(stderr, " Error allocating memory in function %s at stage %u on GPU%i.\n", __FUNCTION__, xyAlloc, currentGPU);

		fprintf(stderr, " Failed %u batches %u bytes each\n", GPUlines->steps, xyCompSize * linesPerSeg);

		fprintf(stderr, "\t: %s\n", cudaGetErrorString(errCode));

		// Free any previously allocated memory

		cudaFree(chargeData->chargeArr);

		return errCode;

	};

	errCode = cudaMallocPitch((void**) &GPUlines->coalLines.z, &GPUlines->zPitch, zCompSize * linesPerSeg, GPUlines->steps);

	if(errCode != cudaSuccess)

	{

		fprintf(stderr, " Error allocating memory in function %s at stage %u on GPU%i.\n", __FUNCTION__, zAlloc, currentGPU);

		fprintf(stderr, " Failed %u batches %u bytes each\n", GPUlines->steps, zCompSize * linesPerSeg);

		fprintf(stderr, "\t: %s\n", cudaGetErrorString(errCode));

		// Free any previously allocated memory

		cudaFree(chargeData->chargeArr);cudaFree(GPUlines->coalLines.z);

		return errCode;

	};

	return cudaSuccess;

}

Mr_Nuke · April 25, 2009, 5:32pm

If we could determine safeRAM exactly, this scheme would allow us to process indefinitely large datasets, as long as gridRAM fits in safeRAM. We would be guaranteed that the allocation will never fail, and we could move on to a cigarrete break or a beer, then come and brag to tmurray how awesome we are. :P
Unfortunately, the main problem with this implementation is, as I mentioned previously, determining exactly how much memory we have available, or how big safeRAM should be. If we have a GPU with no display, then we should be able to safely assume all available global memory, minus a few megabytes. That’s not the case, as my tests have shown (more on this later).
But what if we only have one GPU, or want to run the kernel on a GPU with a display attached? Is the user running Windows with Aero or XP, or running Linux with KDE, or a very light desktop. Are there any other OpenGL or DirectX programs running?

As it turns out we can only approximate what safeRAM should be. It seems that half the memory is a very conservative estimate. In reality, it is a very idiotic one, since it leaves a large amount of memory unused. At the same time it is too lavish, since I have seen the following cudaMalloc() and cudaMallocPitch() return “out of memory”. There is conflicting data on how much is safe.

eyalhir74 said [post=“534323”]here[/post] that he uses a similar scheme, and considers safe to be 80% of the available memory.
My tests show even more intriguing results. I have seen the memory allocation fail when trying to allocate as little as 196MB on a 512MB GPU with two displays attached, runnng Aero, but I have also seen a 205MB allocation fail on a 512MB GPU with no displays attached, and no other CUDA contexts.
It would be nice if we had a function cudaTmurraysAwesomeAvailableGlobalMemoryFunction(), but until further notice, we are stuck with guessing.

So why would malloc fail on a GPU with nothing else attached, and how much is memory do you think is actually safe to assume available?

Topic		Replies	Views
GPU Allocating memory Memory allocation on GPU CUDA Programming and Performance	2	4734	April 23, 2009
Limit on cublasAlloc? CUDA Programming and Performance	16	10889	October 2, 2010
Generic memory management & optimization CUDA Programming and Performance	6	1443	November 5, 2015
very large data set (big matrix) CUDA Programming and Performance	10	3149	October 17, 2009
Cannot allocate the total amount of global memory benchmark CUDA Programming and Performance	14	10572	November 22, 2010
Cuda memory size and types allocating large chunks of memory CUDA Programming and Performance	10	2132	August 9, 2010
Allocating large arrays. CUDA Programming and Performance	6	3872	October 25, 2009
cudaMalloc fails on huge allocation CUDA Programming and Performance	4	872	March 28, 2011
Memory allocation reliablity CUDA Programming and Performance	8	3301	August 18, 2008
Tesla C1060 Allocation Problem CUDA Programming and Performance	1	2629	July 8, 2009

Splitting large datasets to fit into device memory Algorithm, implementation, and problems

Related topics