Splitting large datasets to fit into device memory: algorithm, implementation, and problems

As some of you may have found the hard way, many problems approachable by CUDA are limited by the amount of global memory. The short story is that the dataset to be processed simply doesn’t fit on the GPU.

The long story is that we realize this, and set up a scheme to split the data into several segments to be processed by different kernel calls. I am going to discuss the memory allocation routine for one of my kernels, how it works, and where and why it fails.

Background (feel free to skip):

We are calculating electric field lines from a set of starting points and a fixed set of static point charges. We have n field lines, and thus the big array holding the field lines is called the ‘n’ array.

Suppose n[sub]i[/sub] represents a field line; then n[sub]ij[/sub] represents the j[sup]th[/sup] element of n[sub]i[/sub]; n[sub]i0[/sub] represents the first element, n[sub]i1[/sub] represents the second element, and so on. The array is organized as n[sub]00[/sub] n[sub]10[/sub] n[sub]20[/sub] n[sub]30[/sub] n[sub]40[/sub] … n[sub]01[/sub] n[sub]11[/sub] n[sub]21[/sub] n[sub]31[/sub] n[sub]41[/sub] … n[sub]02[/sub] n[sub]12[/sub] n[sub]22[/sub] n[sub]32[/sub] n[sub]42[/sub], etc. Ignoring the complications when trying to display the data, this is the best arrangement for the kernel, as it enables linear reads from memory for each step, since the (j+1)[sup]th[/sup] step depends on the j[sup]th[/sup] step.

In order to achieve coalescing, we split each 3-component vector n[sub]ij[/sub] into a 2-component xy vector and a 1-component z vector; thus you will see an xyAlloc and a zAlloc stage, but they both refer to the n array.

We also have p point charges, and we call that the ‘p’ array, or the ‘pee’ array. We absolutely need the entire p array in device memory in order to be able to compute the j[sup]th[/sup] step of n without writing a new kernel.

If the n array doesn’t fit in the memory, we can split it along the i or j directions. Since the other routines were already set up for splitting along the i direction from the multi-GPU scheme, I decided to split the data once again along the i direction. Also, splitting it along the i direction makes things more interesting, as we need to worry about efficiently using all the stream processors, and leaving no partially occupied warps, a concept I like to call multiplicity. The concept applies to more generic splitting schemes. That, and I like to take the path of most resistance.

End of Background

The question is just how much memory to allocate so that

a) We don’t get an “out of memory” error, and

b) We consider multiplicity, and don’t idiotically divide n by the number of segments.

The first step is to determine the amount of allocable memory, or safeRAM. I will discuss later why this is a huge problem, but for now, let’s just take it as half of the total global memory.

The question arises of how to partition the little memory that we have. In most kernels, there is a small amount of constant data that is needed in its entirety for the kernel to be able to process the data. In this case, the p array is such data. What we can do is determine how much the p array will occupy, and just subtract that from safeRAM. Hopefully, at this point, we still have enough memory for the n array.

Now we need to find a way to partition the n array efficiently. We are going to assume that we don’t have enough memory to fit all the data. The method I chose is to split the n array into minimal grids. These minimal grids should contain just enough data to fill all MPs. We know that each field line gets its own thread, and we know the dimension of the block, bDim. That’s enough information to calculate the amount of memory needed for a block:

blockRAM = bDim * elementsPerThread * sizeOfElement;

If we knew the minimum number of blocks to launch, we could construct the minimal grid. We need the multiplicity to be a multiple of the number of multiprocessors, so that all MPs are fully utilized. There are two ways to obtain it: either it is requested by the calling function, or we simply take the number of multiprocessors. For this implementation, I chose the number of multiprocessors as the multiplicity if the calling function does not supply one. Therefore, the memory that we need for a grid becomes

gridRAM = multiplicity * blockRAM;

That is also the minimum memory that we need to be able to call a kernel. We can check if that is available, and just return an error otherwise.

Since the minimal grid takes care of underutilization and multiplicity, all we have left to do is see how many grids we can fit in the available memory.

availGrids = safeRAM/gridRAM;

This division rounds down, so we are guaranteed to be able to fit availGrids grids in safeRAM. Now we need to determine the needed number of grids:

neededGrids = (neededRAM + gridRAM - 1)/gridRAM;

Notice how this division rounds up, so we are guaranteed to have an extra grid if neededRAM is not a multiple of gridRAM.

At this point, we have two numbers to work with: neededGrids and availGrids. We need to compute the number of blocks that we will fit per segment, and the number of segments, or kernel calls, we need to make to finish processing the dataset. We can do that by dividing neededGrids by availGrids:

*segments = (neededGrids + availGrids - 1) / availGrids;

Again, we round up to make sure we don’t miss any grids. Now we need to compute the number of blocks we are going to allocate. It’s very tempting to just allocate as many blocks as we can (availGrids * blocksPerGrid, or availGrids * multiplicity); however, when we need fewer grids than are available, and thus only one segment, we would just be wasting memory, so we only need to allocate neededGrids * multiplicity blocks.

When we need more grids than are available, we have two options:

a) Simply allocate as many blocks as available, availGrids*multiplicity. This way each kernel call but the last will process as much data as possible, and the last will process any remainders.

b) Allocate (neededGrids/segments)*multiplicity blocks. Note that we would need to round up, so we would have ((neededGrids + segments - 1) / segments) * multiplicity. This way, each kernel call processes an equal amount of data.

There are advantages and disadvantages to each of the above methods; I chose the latter. Thus, we have:

*blocksPerSeg = ( (neededGrids > availGrids) ? (neededGrids + *segments - 1) / *segments : neededGrids ) * blockMultiplicity;

This takes care of both cases discussed above.

The rest of the code simply uses *blocksPerSeg to allocate memory, nothing complicated here. To keep the discussion thread cleaner I will discuss the one major fault of this algorithm in the next post.

Aaah, and this is the full implementation of the function:

template<class T>
cudaError_t CalcField_GPUmalloc(PointChargeArray<T> *chargeData, CoalescedFieldLineArray<T> *GPUlines,
								const unsigned int bDim, size_t *segments, size_t *blocksPerSeg,
								size_t blockMultiplicity = 0, const bool useFullRAM = false)
{
	//--------------------------------Sizing determination--------------------------------------//
	// Find the available memory for the current GPU
	int currentGPU; cudaGetDevice(&currentGPU);
	cudaDeviceProp currentProp; cudaGetDeviceProperties(&currentProp, currentGPU);
	// If multiplicity is not specified, take it as the number of multiprocessors on the device
	if(!blockMultiplicity) blockMultiplicity = currentProp.multiProcessorCount;
	// Compute the amount of memory that is considered safe for allocation
	size_t safeRAM = currentProp.totalGlobalMem;
	if(!useFullRAM) safeRAM /= 2;	// Half of the total global memory is considered safe
	// Find the memory needed for the point charges, padded to a multiple of bDim
	chargeData->paddedSize = ((chargeData->charges + bDim - 1)/bDim) * bDim * sizeof(pointCharge<T>);
	// Compute the available safe memory for the field lines
	safeRAM -= chargeData->paddedSize;	// Now find the amount remaining for the field lines
	// Find the total amount of memory required by the field lines
	const size_t fieldRAM = GPUlines->steps * GPUlines->lines * sizeof(Vector3<T>);
	// Find the memory needed for one grid of data, containing 'blockMultiplicity' blocks
	const size_t gridRAM = blockMultiplicity * bDim * sizeof(Vector3<T>) * GPUlines->steps;
	const size_t availGrids = safeRAM/gridRAM;
	const size_t neededGrids = (fieldRAM + gridRAM - 1)/gridRAM;
	// Make sure enough memory is available for at least one minimal grid
	if(gridRAM > safeRAM)
	{
		// Otherwise, the allocation cannot continue with the specified multiplicity
		fprintf(stderr, " Cannot assign enough memory for requested multiplicity: %u\n", (unsigned int)blockMultiplicity);
		fprintf(stderr, " Minimum of %uMB available video RAM needed\n", (unsigned int)((gridRAM + chargeData->paddedSize)/1024/1024));
		*segments = 0;
		return cudaErrorMemoryAllocation;
	}
	// Find the number of segments the kernel needs to be split into
	*segments = (neededGrids + availGrids - 1)/availGrids;
	*blocksPerSeg = ( (neededGrids > availGrids) ? (neededGrids + *segments - 1)/(*segments) : neededGrids ) * blockMultiplicity;
	// Find the number of safely allocatable lines per kernel segment
	const size_t linesPerSeg = (*blocksPerSeg) * bDim;	// blocksPerSegment * blockDim * (1 line per thread)
#if _DEBUG
	// Check the amount of needed memory
	size_t neededRAM = fieldRAM + chargeData->paddedSize;
	printf(" Call requires %u MB to complete, %u MB per allocation.\n",
		(unsigned int)(neededRAM/1024/1024),
		(unsigned int)((linesPerSeg * sizeof(Vector3<T>) * GPUlines->steps + chargeData->paddedSize)/1024/1024));
#endif
	enum mallocStage {chargeAlloc, xyAlloc, zAlloc};
	cudaError_t errCode;
	//--------------------------------Point charge allocation--------------------------------------//
	errCode = cudaMalloc((void**)&chargeData->chargeArr, chargeData->paddedSize);
	if(errCode != cudaSuccess)
	{
		fprintf(stderr, " Error allocating memory in function %s at stage %u on GPU%i.\n", __FUNCTION__, chargeAlloc, currentGPU);
		fprintf(stderr, " Failed batch size %u\n", (unsigned int)chargeData->paddedSize);
		fprintf(stderr, "\t: %s\n", cudaGetErrorString(errCode));
		return errCode;
	}
	//--------------------------------Field lines allocation--------------------------------------//
	const size_t xyCompSize = (sizeof(Vector3<T>)*2)/3;
	const size_t zCompSize = (sizeof(Vector3<T>))/3;
	// Here we are going to assign padded memory on the device to ensure that transactions will be
	// coalesced even if the number of lines may create alignment problems.
	// We use the given pitches to partition the memory on the host to mimic the device.
	// This enables a blazing-fast linear copy between the host and device.
	errCode = cudaMallocPitch((void**)&GPUlines->coalLines.xyInterleaved, &GPUlines->xyPitch, xyCompSize * linesPerSeg, GPUlines->steps);
	if(errCode != cudaSuccess)
	{
		fprintf(stderr, " Error allocating memory in function %s at stage %u on GPU%i.\n", __FUNCTION__, xyAlloc, currentGPU);
		fprintf(stderr, " Failed %u batches %u bytes each\n", GPUlines->steps, (unsigned int)(xyCompSize * linesPerSeg));
		fprintf(stderr, "\t: %s\n", cudaGetErrorString(errCode));
		// Free any previously allocated memory
		cudaFree(chargeData->chargeArr);
		return errCode;
	}
	errCode = cudaMallocPitch((void**)&GPUlines->coalLines.z, &GPUlines->zPitch, zCompSize * linesPerSeg, GPUlines->steps);
	if(errCode != cudaSuccess)
	{
		fprintf(stderr, " Error allocating memory in function %s at stage %u on GPU%i.\n", __FUNCTION__, zAlloc, currentGPU);
		fprintf(stderr, " Failed %u batches %u bytes each\n", GPUlines->steps, (unsigned int)(zCompSize * linesPerSeg));
		fprintf(stderr, "\t: %s\n", cudaGetErrorString(errCode));
		// Free any previously allocated memory
		cudaFree(chargeData->chargeArr);
		cudaFree(GPUlines->coalLines.xyInterleaved);
		return errCode;
	}
	return cudaSuccess;
}


If we could determine safeRAM exactly, this scheme would allow us to process indefinitely large datasets, as long as gridRAM fits in safeRAM. We would be guaranteed that the allocation will never fail, and we could move on to a cigarette break or a beer, then come back and brag to tmurray about how awesome we are. :P
Unfortunately, the main problem with this implementation is, as I mentioned previously, determining exactly how much memory we have available, or how big safeRAM should be. If we have a GPU with no display attached, then we should be able to safely assume all available global memory, minus a few megabytes. That’s not the case, as my tests have shown (more on this later).
But what if we only have one GPU, or want to run the kernel on a GPU with a display attached? Is the user running Windows with Aero, or XP, or running Linux with KDE or a very light desktop? Are there any other OpenGL or DirectX programs running?

As it turns out, we can only approximate what safeRAM should be. Half the total memory seems like a very conservative estimate. In reality, it is a rather idiotic one, since it leaves a large amount of memory unused; yet at the same time it is too lavish, since I have seen the subsequent cudaMalloc() and cudaMallocPitch() calls return “out of memory”. There is conflicting data on how much is safe.

eyalhir74 said [post=“534323”]here[/post] that he uses a similar scheme, and considers 80% of the available memory to be safe.
My tests show even more intriguing results. I have seen the memory allocation fail when trying to allocate as little as 196MB on a 512MB GPU with two displays attached, running Aero, but I have also seen a 205MB allocation fail on a 512MB GPU with no displays attached and no other CUDA contexts.
It would be nice if we had a function cudaTmurraysAwesomeAvailableGlobalMemoryFunction(), but until further notice, we are stuck with guessing.

So why would malloc fail on a GPU with nothing else attached, and how much memory do you think is actually safe to assume available?