cuSPARSE coo2csr function hangs

Hello,
I am running into a problem where the COO-to-CSR sparse matrix conversion fails with the error ‘kernel launch suspended’. I don’t see how this can fail, since I believe there are no out-of-bounds accesses. Can someone help me understand this? Thank you!

Here is my code:

#define NNZ 6609553

#define IVEC_SIZE 16384
#define DATA_SIZE 43200

// load system indices
	int *sys_rows, *sys_cols;
	int* sys_rows_h = new int[NNZ];
	int* sys_cols_h = new int[NNZ];
	fp = fopen(SYSMATX_ROWSCOLS,"rb");  // load indices from a file.
	fread(sys_rows_h,sizeof(int),NNZ,fp);
	fread(sys_cols_h,sizeof(int),NNZ,fp);
	fclose(fp);
	cudaMalloc<int> (&sys_cols, NNZ);
	cudaMemcpy(sys_cols,sys_cols_h,sizeof(int)*NNZ,cudaMemcpyHostToDevice);
	cudaMalloc<int> (&sys_rows, NNZ);
	cudaMemcpy(sys_rows,sys_rows_h,sizeof(int)*NNZ,cudaMemcpyHostToDevice);
	
	/* initialize cusparse library */
	cusparseStatus_t status;
	cusparseHandle_t handle=0;
	cusparseMatDescr_t descr=0;
	status= cusparseCreate(&handle);
	if (status != CUSPARSE_STATUS_SUCCESS) { return 1;}

	// convert to Compressed Sparse Row (CSR) format
	int* sys_row_ptr;
	cudaMalloc<int> (&sys_row_ptr, DATA_SIZE + 1);
	status = cusparseXcoo2csr(handle, sys_rows, NNZ, DATA_SIZE, sys_row_ptr, CUSPARSE_INDEX_BASE_ONE);

UPDATE:
I fixed the problem. It seems the templated cudaMalloc calls were not working properly. When I changed all the cudaMalloc calls from the C++ templated form to the C form (for instance, from cudaMalloc<int>(&sys_rows, NNZ) to cudaMalloc(&sys_rows, sizeof(int)*NNZ)), the program works.

Although I fixed the problem, I’d like to use the templated functions in the future. How can I do that?
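
A note on that last question: as far as I know, the templated cudaMalloc overload in cuda_runtime.h only removes the need for the (void**) cast. Its second argument is still a size in bytes, not an element count. So cudaMalloc<int>(&sys_rows, NNZ) would request NNZ bytes rather than NNZ ints, which would explain the failure and why adding sizeof(int) fixed it. Below is a minimal sketch of one way to keep the element count explicit. bytes_for is a hypothetical helper name, not part of the CUDA runtime, and the cudaMalloc calls appear only as comments so the snippet compiles without the CUDA toolkit.

```cpp
#include <cstddef>
#include <limits>
#include <stdexcept>

// Hypothetical helper (not a CUDA API): number of bytes needed to hold
// `count` elements of type T, with a guard against size_t overflow.
template <typename T>
std::size_t bytes_for(std::size_t count) {
    if (count > std::numeric_limits<std::size_t>::max() / sizeof(T)) {
        throw std::overflow_error("allocation size overflows size_t");
    }
    return count * sizeof(T);
}

// Intended use with the CUDA runtime (requires <cuda_runtime.h>):
//   cudaMalloc<int>(&sys_rows,    bytes_for<int>(NNZ));
//   cudaMalloc<int>(&sys_cols,    bytes_for<int>(NNZ));
//   cudaMalloc<int>(&sys_row_ptr, bytes_for<int>(DATA_SIZE + 1));
```

With this, the templated calls can stay as they are, and the byte count always matches the sizeof(int)*NNZ already passed to cudaMemcpy. Checking the cudaError_t returned by each cudaMalloc and cudaMemcpy call would probably also have surfaced the size mismatch before the cuSPARSE call.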