Auto Release of CUDA memory

(Edited Feb 9 2021: updated the class to have more functionality.)
In my project I want to make sure that cudaMallocManaged memory blocks always get freed, much like with shared pointers. So I wrote a .cuh file called CUDAAutoMemory: when the wrapper object goes out of scope, it frees the block.

Question: Is there a better way to do this that I have not yet discovered?

Here is the code (only the float version; there are also Double, Char, and Int variants).

#include <cuda_runtime.h>

/// <summary>
/// Allocates CUDA managed memory that is automatically freed on scope exit.
/// Allocation can fail, in which case failedFlag is true.
/// </summary>
class ScopedCUDAMemoryFloat {
public:
	float* m = NULL;
	bool failedFlag = false;

	// Empty constructor. Used in conjunction with AllocateIfNull(...).
	ScopedCUDAMemoryFloat() {
	}

	// The destructor frees m, so copying this object would lead to a double free.
	ScopedCUDAMemoryFloat(const ScopedCUDAMemoryFloat&) = delete;
	ScopedCUDAMemoryFloat& operator=(const ScopedCUDAMemoryFloat&) = delete;

	/// <summary>
	/// Allocate the CUDA memory if m is NULL. Sets failedFlag on failure, which usually indicates out-of-memory.
	/// If dat is not NULL, copy it into this memory block.
	/// If alwaysCopy is true, copy from dat (if not NULL) even when the block is already allocated.
	/// </summary>
	/// <param name="countOfFloats"></param>
	/// <param name="dat"></param>
	/// <param name="alwaysCopy"></param>
	void AllocateIfNull(unsigned long long countOfFloats, float* dat, bool alwaysCopy) {
		if (m == NULL) {
			// Check the return value directly; cudaGetLastError() could also
			// report an unrelated error left over from an earlier asynchronous call.
			cudaError_t err = cudaMallocManaged((void**)&m, countOfFloats * sizeof(float));
			failedFlag = (err != cudaSuccess);
			if (!failedFlag && dat)
				cudaMemcpy(m, dat, countOfFloats * sizeof(float), cudaMemcpyKind::cudaMemcpyHostToDevice);
		}
		else if (alwaysCopy && dat && !failedFlag)
			cudaMemcpy(m, dat, countOfFloats * sizeof(float), cudaMemcpyKind::cudaMemcpyHostToDevice);
	}

	/// <summary>
	/// Make the memory block. If dat is not NULL, copy the data to the new block.
	/// </summary>
	/// <param name="countOfFloats"></param>
	/// <param name="dat"></param>
	ScopedCUDAMemoryFloat(unsigned long long countOfFloats, float* dat) {
		cudaError_t err = cudaMallocManaged((void**)&m, countOfFloats * sizeof(float));
		failedFlag = (err != cudaSuccess);
		if (!failedFlag && dat)
			cudaMemcpy(m, dat, countOfFloats * sizeof(float), cudaMemcpyKind::cudaMemcpyHostToDevice);
	}

	~ScopedCUDAMemoryFloat() {
		if (m != NULL) {
			cudaFree(m);
			m = NULL;
		}
	}

	// Note: overloading operator& hides the object's real address;
	// taking the address of the wrapper yields the managed pointer instead.
	float* operator&() {
		if (m == NULL) throw "Out of memory. NULL pointer access.";
		return m;
	}

	float& operator[](int i) {
		return m[i];
	}
};
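
For context, here is a short usage sketch (the kernel myKernel and the launch sizes are hypothetical placeholders). The block is released automatically when buf goes out of scope at the end of the function:

__global__ void myKernel(float* data, unsigned long long n);	// placeholder kernel

void RunStep(float* hostData, unsigned long long n) {
	// Allocate managed memory and copy hostData into it.
	ScopedCUDAMemoryFloat buf(n, hostData);
	if (buf.failedFlag) return;	// out of memory

	// &buf invokes the overloaded operator& and yields the float* block.
	myKernel<<<(unsigned)((n + 255) / 256), 256>>>(&buf, n);
	cudaDeviceSynchronize();
}	// ~ScopedCUDAMemoryFloat runs here and calls cudaFree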

Good question. I don’t know, but I did something similar. I found an RAII class at codeproject.com and adapted it to work with CUDA and CPU code. The main difference is that mine is templated, so there is only one version of the code. I plan to write an article there about it some day.

Here’s the RAII class: RAII - CodeProject
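
A minimal sketch of what such a templated wrapper might look like (my own illustration of the idea, not the CodeProject class itself):

#include <cuda_runtime.h>
#include <cstddef>

template <typename T>
class ScopedCUDAMemory {
public:
	T* m = NULL;
	bool failedFlag = false;

	explicit ScopedCUDAMemory(unsigned long long count) {
		failedFlag = (cudaMallocManaged((void**)&m, count * sizeof(T)) != cudaSuccess);
	}

	// Copying would double-free in the destructor.
	ScopedCUDAMemory(const ScopedCUDAMemory&) = delete;
	ScopedCUDAMemory& operator=(const ScopedCUDAMemory&) = delete;

	~ScopedCUDAMemory() {
		if (m) cudaFree(m);
	}

	T* get() const { return m; }
	T& operator[](size_t i) { return m[i]; }
};

// ScopedCUDAMemory<float>, ScopedCUDAMemory<double>, etc. replace the
// per-type classes (Float, Double, Char, Int) with a single definition.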

According to the CUDA C Programming Guide, memory allocated in device code with malloc() cannot be freed by the host runtime (cudaFree()); it has to be freed with free() in device code.
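
A small sketch of that rule (my own illustration; it needs a device that supports the device-side heap):

__global__ void deviceAllocDemo() {
	// This malloc() draws from the device runtime heap, not from
	// cudaMalloc*; host-side cudaFree() cannot release it.
	int* p = (int*)malloc(16 * sizeof(int));
	if (p) {
		p[0] = 42;
		free(p);	// must be released in device code
	}
}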