cusparseSpMatDescr_t lifecycle and resource management

Hello everyone,

I have just started implementing a FIT time-domain solver with CUDA, and some of the cuSPARSE functionality is not yet clear to me.
To start with, the cusparseCreateCsr method takes only a few scalar integers, which live on the host; the big data (indices and values) is allocated on the device. The documentation of cusparseDestroySpMat only states that it releases host memory, but it releases the device memory as well, right? I am asking because I have split the generation of my final cusparseSpMatDescr_t object into a few methods. Each gets a reference to the cusparseSpMatDescr_t object and alters the entries in one way or another. Since cuSPARSE doesn't provide an API for changing or deleting entries, I usually create a new cusparseSpMatDescr_t object, destroy the referenced parameter via cusparseDestroySpMat to prevent any memory leaks, and assign the newly created object to the parameter. I will add a sandbox example below. Is there anything wrong with my way of handling the cusparseSpMatDescr_t objects? In my real implementation I get some inconsistent behaviour, so I would like to know whether I have understood the memory handling correctly.

Kind regards,
Jan

#include "CudaMemoryAnalysis.h"
#include <vector>
#include <iostream>

int main()
{
	const int dimension(5);

	try
	{
		CudaMatrixTest tester(dimension);
		tester.AssembleMatrix();
	}
	catch (const std::string & error)
	{
		std::cout << error;
	}
}

---------------------------------- CudaMemoryAnalysis.h -------------------------------------------------

#pragma once
#include <cuda_runtime_api.h> // cudaMalloc, cudaMemcpy, etc.
#include <cusparse.h>         // cusparseCreateCsr, cusparseXcoo2csr, etc.
#include <stdio.h>            
#include <stdlib.h>          
#include <string>
#include <vector>


class CudaMatrixTest
{
	cusparseHandle_t _cusparseHandle;
	int _systemMatrixSize;
	cusparseIndexBase_t	_idxBase;
	cusparseSpMatDescr_t _matrixHandle;

	inline void CheckCUDAError(cudaError_t status);
	inline void CheckCUSPARSEError(cusparseStatus_t status);

	void FillMatCOOFormatVectors(std::vector<int> & rowIndices, std::vector<int> & columnIndices, std::vector<float> & values);
	void CreateCUDACSRMatrixHandleFromCOOData( std::vector<int> &rowsIndx, std::vector<int> &columnIndx, std::vector<float> &values, cusparseSpMatDescr_t & matrixHandle);
	void DeleteZeroEntries(cusparseSpMatDescr_t & matrix);
public:
	CudaMatrixTest(int matrixDimension)
	{
		_systemMatrixSize = matrixDimension;
		CheckCUSPARSEError(cusparseCreate(&_cusparseHandle));
		_idxBase = CUSPARSE_INDEX_BASE_ZERO;
	}

	void AssembleMatrix()
	{
		std::vector<int> rowIndices, columnIndices;
		std::vector<float> values;

		FillMatCOOFormatVectors(rowIndices, columnIndices, values);
		CreateCUDACSRMatrixHandleFromCOOData(rowIndices, columnIndices, values, _matrixHandle);
		ExportCUDAMatrix();
		DeleteZeroEntries(_matrixHandle);
		ExportCUDAMatrix();
	}

	void ExportCUDAMatrix();
};

inline void CudaMatrixTest::CheckCUDAError(cudaError_t status)
{
	if (status != cudaSuccess)
	{
		std::string cudaErrorString = cudaGetErrorString(status);
		std::string error = "CUDA API failed at line with error: " + cudaErrorString;
		throw error;
	}
}

inline void CudaMatrixTest::CheckCUSPARSEError(cusparseStatus_t status)
{
	if (status != CUSPARSE_STATUS_SUCCESS)
	{
		std::string error = "CUDA API failed at line " + std::to_string(__LINE__) +  " with error: " + std::string(cusparseGetErrorString(status)) + "\n";
		throw error;
	}
}

---------------------------------- CudaMemoryAnalysis.cpp -------------------------------------------------

#include "CudaMemoryAnalysis.h"

void CudaMatrixTest::FillMatCOOFormatVectors(std::vector<int> & rowIndices, std::vector<int> & columnIndices, std::vector<float> & values)
{
	int nnz = 5;
	rowIndices.reserve(nnz);
	columnIndices.reserve(nnz);
	values.reserve(nnz);

	/*
		All '-' symbolize values that are not set and therefore implicitly equal to zero.
						1	-	-	5	-
		Matrix = 		-	-	-	-	-
						-	-	3	-	-
						-	0	-	-	-
						-	-	-	-	1
	*/

	rowIndices.push_back(0);
	rowIndices.push_back(0);
	rowIndices.push_back(2);
	rowIndices.push_back(3);
	rowIndices.push_back(4);

	columnIndices.push_back(0);
	columnIndices.push_back(3);
	columnIndices.push_back(2);
	columnIndices.push_back(1);
	columnIndices.push_back(4);

	values.push_back(1.f);
	values.push_back(5.f);
	values.push_back(3.f);
	values.push_back(0.f);
	values.push_back(1.f);
}


void CudaMatrixTest::CreateCUDACSRMatrixHandleFromCOOData(std::vector<int> &rowsIndx, std::vector<int> &columnIndx,std::vector<float> &values, cusparseSpMatDescr_t & matrixHandle)
{
	int numberOfNotZeros = values.size();

	//Create CUSPARSE matrix handle out of coo format indices of non zero entries
	//First allocate device memory and copy content to device
	int *d_columns(nullptr), *d_rows(nullptr);
	float *d_values(nullptr);

	CheckCUDAError(cudaMalloc(reinterpret_cast<void **>(&d_columns), numberOfNotZeros * sizeof(int)));
	CheckCUDAError(cudaMalloc(reinterpret_cast<void **>(&d_rows), numberOfNotZeros * sizeof(int)));
	CheckCUDAError(cudaMalloc(reinterpret_cast<void **>(&d_values), numberOfNotZeros * sizeof(float)));

	CheckCUDAError(cudaMemcpy(d_columns, columnIndx.data(), numberOfNotZeros * sizeof(int), cudaMemcpyHostToDevice));
	CheckCUDAError(cudaMemcpy(d_rows, rowsIndx.data(), numberOfNotZeros * sizeof(int), cudaMemcpyHostToDevice));
	CheckCUDAError(cudaMemcpy(d_values, values.data(), numberOfNotZeros * sizeof(float), cudaMemcpyHostToDevice));

	//Turn coo format in csr by compressing the row vector
	int * d_csrRowIndices(nullptr);
	int csrVectorSize = _systemMatrixSize + 1;
	CheckCUDAError(cudaMalloc(reinterpret_cast<void **>(&d_csrRowIndices), csrVectorSize * sizeof(int)));//The CSR row-offset array always has (number of rows + 1) entries

	/*
		Attention: even though it is not explicitly stated in the documentation, the vector that receives the compressed row indices has to be located on the device.
	*/
	CheckCUSPARSEError(cusparseXcoo2csr(_cusparseHandle, d_rows, numberOfNotZeros, _systemMatrixSize, d_csrRowIndices, _idxBase));

	CheckCUSPARSEError(cusparseCreateCsr(&matrixHandle, _systemMatrixSize, _systemMatrixSize, numberOfNotZeros,
		d_csrRowIndices, d_columns, d_values,
		CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
		CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F));
}


void CudaMatrixTest::DeleteZeroEntries(cusparseSpMatDescr_t & matrix)
{
	//Export CSR format sparse matrix
	int64_t numberOfRows, numberOfColumns, numberOfNonZeros;
	cusparseIndexType_t  indexType;
	cusparseIndexBase_t indexBase;
	cudaDataType valueType;

	void * d_rowIndices, *d_colIumnIndices, *d_values;

	CheckCUSPARSEError(cusparseCsrGet(matrix, &numberOfRows, &numberOfColumns, &numberOfNonZeros, &d_rowIndices, &d_colIumnIndices, &d_values, &indexType, &indexType, &indexBase, &valueType));

	std::vector<int> h_rowIndices(numberOfRows + 1);
	std::vector<int>  h_colIumnIndices(numberOfNonZeros);
	std::vector<float>  h_values(numberOfNonZeros);

	CheckCUDAError(cudaMemcpy(h_rowIndices.data(), d_rowIndices, (numberOfRows + 1) * sizeof(int), cudaMemcpyDeviceToHost));
	CheckCUDAError(cudaMemcpy(h_colIumnIndices.data(), d_colIumnIndices, numberOfNonZeros * sizeof(int), cudaMemcpyDeviceToHost));
	CheckCUDAError(cudaMemcpy(h_values.data(), d_values, numberOfNonZeros * sizeof(float), cudaMemcpyDeviceToHost));

	//Count NNZ
	int valuePointer(0), nnzCounter(0);
	int currentRowStart, currentRowEnd;

	for (int i = 0; i < numberOfRows; i++)
	{
		currentRowStart = h_rowIndices[i];
		currentRowEnd = h_rowIndices[i + 1];

		//Count the nonzero values contained in this row
		for (int j = currentRowStart; j < currentRowEnd; j++)
		{
			if (h_values[valuePointer] != 0.)
			{
				nnzCounter++;
			}
			valuePointer++;
		}
	}

	//Create vectors in COO format for building a new matrix containing only the nonzero entries
	std::vector<int> coo_rowIndx(nnzCounter);
	std::vector<int> coo_colIndx(nnzCounter);
	std::vector<float> notZeroValues(nnzCounter);

	valuePointer = 0;
	nnzCounter = 0;
	for (int i = 0; i < numberOfRows; i++)
	{
		currentRowStart = h_rowIndices[i];
		currentRowEnd = h_rowIndices[i + 1];
		//Copy the nonzero values of this row into the COO vectors
		for (int j = currentRowStart; j < currentRowEnd; j++)
		{
			if (h_values[valuePointer] != 0.)
			{
				coo_rowIndx[nnzCounter] = i;
				coo_colIndx[nnzCounter] = h_colIumnIndices[valuePointer];
				notZeroValues[nnzCounter] = h_values[valuePointer];
				nnzCounter++;
			}
			valuePointer++;
		}
	}
	CheckCUSPARSEError(cusparseDestroySpMat(matrix));
	cusparseSpMatDescr_t tempMat;
	CreateCUDACSRMatrixHandleFromCOOData( coo_rowIndx, coo_colIndx, notZeroValues, tempMat);
	matrix = tempMat;
}


void CudaMatrixTest::ExportCUDAMatrix()
{
	cudaDeviceSynchronize();
	int64_t numberOfRows, numberOfColumns, numberOfNonZeros;
	cusparseIndexType_t indexType;
	cusparseIndexBase_t indexBase;
	cudaDataType valueType;

	void * d_rowIndices, *d_colIumnIndices, *d_values;

	CheckCUSPARSEError(cusparseCsrGet(_matrixHandle, &numberOfRows, &numberOfColumns, &numberOfNonZeros, &d_rowIndices, &d_colIumnIndices, &d_values, &indexType, &indexType, &indexBase, &valueType));

	std::vector<int> h_rowIndices(numberOfRows + 1);
	std::vector<int>  h_colIumnIndices(numberOfNonZeros);
	std::vector<float>  h_values(numberOfNonZeros);

	CheckCUDAError(cudaMemcpy(h_rowIndices.data(), d_rowIndices, (numberOfRows + 1) * sizeof(int), cudaMemcpyDeviceToHost));
	CheckCUDAError(cudaMemcpy(h_colIumnIndices.data(), d_colIumnIndices, numberOfNonZeros * sizeof(int), cudaMemcpyDeviceToHost));
	CheckCUDAError(cudaMemcpy(h_values.data(), d_values, numberOfNonZeros * sizeof(float), cudaMemcpyDeviceToHost));
}

cusparseCreateCsr and cusparseDestroySpMat don’t touch the device memory in any way. They allocate and deallocate internal host memory and should be used like new/delete. Once a descriptor has been destroyed, it cannot be used anymore. Before checking the code, did you try to run valgrind?
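
To make the ownership explicit, here is a minimal lifecycle sketch (just an illustration, not your solver code; error checking omitted, and numRows, numCols, nnz are placeholders). The device arrays belong to the caller and have to be freed with cudaFree after the descriptor is destroyed:

#include <cuda_runtime_api.h>
#include <cusparse.h>
#include <stdint.h>

void DescriptorLifecycleSketch(cusparseHandle_t handle, int64_t numRows, int64_t numCols, int64_t nnz)
{
	// The caller owns these device buffers, not the descriptor.
	int *d_rowOffsets(nullptr), *d_colIndices(nullptr);
	float *d_values(nullptr);
	cudaMalloc(reinterpret_cast<void **>(&d_rowOffsets), (numRows + 1) * sizeof(int));
	cudaMalloc(reinterpret_cast<void **>(&d_colIndices), nnz * sizeof(int));
	cudaMalloc(reinterpret_cast<void **>(&d_values), nnz * sizeof(float));
	// ... fill the arrays (cudaMemcpy, kernels, cusparseXcoo2csr, ...) ...

	cusparseSpMatDescr_t mat;
	cusparseCreateCsr(&mat, numRows, numCols, nnz,
		d_rowOffsets, d_colIndices, d_values,
		CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
		CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);

	// ... use the matrix in SpMV/SpMM calls ...

	cusparseDestroySpMat(mat);  // releases only the descriptor's internal host memory
	cudaFree(d_rowOffsets);     // the device buffers must be freed explicitly by the caller
	cudaFree(d_colIndices);
	cudaFree(d_values);
}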

Hello @fbusato,
I am working on Windows, so I have not worked with valgrind yet. At the moment I am analysing my code with the VS diagnostic tools, and I actually found that there is a memory leak on the host. I am still searching for it.

However, I didn’t expect the destroy method not to handle the device memory. So I have to free all of the arrays, the index as well as the value arrays, manually via cudaFree, don’t I? Something like the sketch below is what I have in mind.
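
Rough sketch of the cleanup I am thinking of (DestroyCsrMatrixAndBuffers is just a hypothetical helper; it assumes cusparseCsrGet returns the same pointers I originally passed to cusparseCreateCsr):

void DestroyCsrMatrixAndBuffers(cusparseSpMatDescr_t matrix)
{
	int64_t rows, cols, nnz;
	void *d_rowOffsets, *d_colIndices, *d_values;
	cusparseIndexType_t rowOffsetType, colIndexType;
	cusparseIndexBase_t idxBase;
	cudaDataType valueType;

	// Retrieve the device pointers stored in the descriptor.
	cusparseCsrGet(matrix, &rows, &cols, &nnz,
		&d_rowOffsets, &d_colIndices, &d_values,
		&rowOffsetType, &colIndexType, &idxBase, &valueType);

	cusparseDestroySpMat(matrix);  // frees only the descriptor (host side)
	cudaFree(d_rowOffsets);        // the device arrays have to be freed explicitly
	cudaFree(d_colIndices);
	cudaFree(d_values);
}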

Regards

Hi Jan_W, here is a sample workflow using cusparseCreateCsr and cusparseDestroySpMat: CUDALibrarySamples/spmm_csr_example.c at master · NVIDIA/CUDALibrarySamples · GitHub

Thanks for the reference.
I actually found the memory leak: following the example, I now destroy the cusparseHandle_t object via cusparseDestroy. I then ran this little test:

#include "CudaMemoryAnalysis.h"

int main()
{
	cusparseHandle_t _cusparseHandle;
	cusparseCreate(&_cusparseHandle);
	cusparseDestroy(_cusparseHandle);
}

If I take heap snapshots after each line, I see that the cusparseHandle_t object requires ~128 MB, which shrinks by only 0.77 KB after calling cusparseDestroy.
Did I miss something?
[Screenshot: MemoryLeak_cusparseHandle]

The CUDA runtime is initialized implicitly at the first CUDA call in your program; in this case that is the call to cusparseCreate. This initialization may consume system resources in addition to those needed for the cuSPARSE handle itself. These runtime resources won’t be freed at the point of the cusparseDestroy call, although the resources needed specifically for the handle will (or should) be.

When I loop your create/destroy sequence for 100,000 iterations, I see no significant change in the process memory usage. So I’m not sure what you are reporting is a “leak”.
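
For reference, the loop I ran was essentially the following (a minimal sketch; process memory was observed externally with the usual OS tools):

#include <cusparse.h>

int main()
{
	// Repeatedly create and destroy a cuSPARSE handle. Apart from the one-time
	// CUDA runtime initialization at the first call, process memory stays flat.
	for (int i = 0; i < 100000; ++i)
	{
		cusparseHandle_t handle;
		cusparseCreate(&handle);
		cusparseDestroy(handle);
	}
	return 0;
}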

The resources in use to support the CUDA runtime will generally be freed at process termination time.

@Robert_Crovella: I wasn’t aware of the behaviour and memory requirements of the CUDA runtime. I misinterpreted it as a memory leak. Thanks for the clarification.
