cuSolver SVD not overlapping using streams

This question was asked previously but not answered.

See Streaming CuSolver SVD.

I’m trying to use streams with the cuSOLVER SVD methods, for example cusolverDnCgesvd and cusolverDnCgesvdj, but it appears that these functions operate only synchronously. The example code suggests that streams should be supported. I have tested my code without the SVD and it overlaps correctly; however, adding the SVD methods causes it to become synchronous.

Is there a known reason for these methods not to be streamable?

I am using an RTX 3090 GPU with matrix sizes of about 1024 x 1024.

Hi Chris,

cuSOLVER routines are not necessarily asynchronous with respect to the host thread. With the current implementation, it is expected that cusolverDn?gesvd and cusolverDn?gesvdj synchronize the host thread with the current CUDA stream. As a workaround, you could use multiple host threads, each operating on its own cusolverDn handle and stream; a minimal sketch of this pattern is shown below.
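To illustrate, a sketch could look like the following. The names svdWorker and runConcurrentSvds are placeholders of mine, not cuSOLVER API; error checking and data upload are omitted, and the device buffers are assumed to be allocated and filled elsewhere.

#include <cuComplex.h>
#include <cuda_runtime.h>
#include <cusolverDn.h>

#include <thread>
#include <vector>

// Each host thread owns its own non-blocking stream and cusolverDn handle.
// cusolverDn<t>gesvd requires m >= n; here the matrices are square (n x n).
void svdWorker(int n, cuComplex* d_A, float* d_S, cuComplex* d_U,
               cuComplex* d_VT, int* d_info)
{
	cudaStream_t stream;
	cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

	cusolverDnHandle_t handle;
	cusolverDnCreate(&handle);
	cusolverDnSetStream(handle, stream);

	int lwork = 0;
	cusolverDnCgesvd_bufferSize(handle, n, n, &lwork);
	cuComplex* d_work = nullptr;
	cudaMalloc(reinterpret_cast<void**>(&d_work), sizeof(cuComplex) * lwork);
	float* d_rwork = nullptr;
	cudaMalloc(reinterpret_cast<void**>(&d_rwork), sizeof(float) * n);

	// gesvd still blocks this host thread, but work submitted from different
	// host threads (and streams) can overlap on the device.
	cusolverDnCgesvd(handle, 'A', 'N', n, n, d_A, n, d_S, d_U, n, d_VT, n,
	                 d_work, lwork, d_rwork, d_info);

	cudaStreamSynchronize(stream);
	cudaFree(d_work);
	cudaFree(d_rwork);
	cusolverDnDestroy(handle);
	cudaStreamDestroy(stream);
}

// One host thread per matrix.
void runConcurrentSvds(int n, std::vector<cuComplex*>& A, std::vector<float*>& S,
                       std::vector<cuComplex*>& U, std::vector<cuComplex*>& VT,
                       std::vector<int*>& info)
{
	std::vector<std::thread> workers;
	for (size_t i = 0; i < A.size(); ++i)
		workers.emplace_back(svdWorker, n, A[i], S[i], U[i], VT[i], info[i]);
	for (auto& t : workers)
		t.join();
}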

As we continuously work on improving/extending the cuSOLVER APIs, we would be excited to learn more about your application. Are your input matrices of the exact same size (uniform batched), or of different sizes (variably batched)? How many input matrices are given in your case?

Thanks,
Christoph

Our matrices are all the same size within a set, for example 256x256 or 512x512, and as large as 2048x2048. We have perhaps 100 matrices to compute in a set. The SVD is one part of a larger processing chain.

I have tried using multiple host threads, each with its own handle and stream, and have not seen any amount of overlapping. I have also tried small matrices (64x64) to see if it was a resource issue. I’m attaching an example of how I created the test; I have tried variants with similar results.

#include <cuComplex.h>
#include <cuda_runtime.h>
#include <cusolverDn.h>

#include <algorithm>
#include <chrono>
#include <iostream>
#include <memory>
#include <thread>
#include <vector>

void TestParallelSVD()
{
constexpr int XRES = 256;
auto pfcInputData = std::make_unique<cuComplex[]>(XRES * XRES); // array form, so operator[] below is valid

// Create dummy input
for (int i = 0; i < XRES * XRES; ++i)
{
	pfcInputData[i].x = static_cast<float>(i % 316);
	pfcInputData[i].y = static_cast<float>(i % 247);
}

struct SVD_INSTANCE
{
	cuComplex* h_pfcSrc;
	cuComplex* h_pfcU;
	cuComplex* d_svdWork;
	cuComplex* d_fcV;
	cuComplex* d_fcZ;
	float* d_svdRWork;
	float* d_fSvdS;
	cuComplex* d_fcSvdU;
	int* d_svdInfo;
	cusolverDnHandle_t d_cusolverHandle;
	cudaStream_t stream;
	int lwork;

	int id;
};

constexpr int iNumThreads = 16;

// Initialize instance data for each thread
std::vector<SVD_INSTANCE> vecSVD(iNumThreads);
for (int i = 0; i < iNumThreads; ++i)
{
	auto& SVD = vecSVD[i];

	SVD.id = i;

	auto result = cudaStreamCreateWithFlags(&SVD.stream, cudaStreamNonBlocking);

	auto status = cusolverDnCreate(&SVD.d_cusolverHandle);

	status = cusolverDnSetStream(SVD.d_cusolverHandle, SVD.stream);

	auto cr = cudaMallocHost((void**)&SVD.h_pfcSrc, sizeof(cuComplex) * XRES * XRES);
	cr = cudaMallocHost((void**)&SVD.h_pfcU, sizeof(cuComplex) * XRES * XRES);

	cr = cudaMalloc(reinterpret_cast<void**>(&SVD.d_fcZ), sizeof(cuComplex) * XRES * XRES);
	cr = cudaMalloc(reinterpret_cast<void**>(&SVD.d_fSvdS), sizeof(float) * XRES);
	cr = cudaMalloc(reinterpret_cast<void**>(&SVD.d_fcSvdU), sizeof(cuComplex) * XRES * XRES);
	cr = cudaMalloc(reinterpret_cast<void**>(&SVD.d_fcV), sizeof(cuComplex) * XRES * XRES);
	cr = cudaMalloc(reinterpret_cast<void**>(&SVD.d_svdInfo), sizeof(int));
	
	const int m = XRES;
	const int n = XRES;
	status = cusolverDnCgesvd_bufferSize(SVD.d_cusolverHandle, m, n, &SVD.lwork); // precision must match the cusolverDnCgesvd call below

	cr = cudaMalloc(reinterpret_cast<void**>(&SVD.d_svdWork), sizeof(cuComplex) * SVD.lwork);

	cr = cudaMalloc(reinterpret_cast<void**>(&SVD.d_svdRWork), sizeof(float) * std::min(m, n));
}

auto svdTask = [&](int iThread)
	{
		auto& SVD = vecSVD[iThread];
	
		auto cr = cudaMemcpyAsync(SVD.d_fcZ, SVD.h_pfcSrc, XRES * XRES * sizeof(cuComplex), cudaMemcpyHostToDevice, SVD.stream);

		// matrix is square so is easy
		const int m = XRES;
		const int n = XRES;
		const int lda = m;
		const int ldu = m;
		const int ldv = n;

		constexpr signed char jobu = 'A';  // all m columns of U
		constexpr signed char jobvt = 'N'; // no rows of VT
		auto status = cusolverDnCgesvd(
			SVD.d_cusolverHandle, jobu, jobvt,
			m,
			n,
			SVD.d_fcZ,
			lda,
			SVD.d_fSvdS,
			SVD.d_fcSvdU,
			ldu,
			SVD.d_fcV,
			ldv,
			SVD.d_svdWork, SVD.lwork, SVD.d_svdRWork, SVD.d_svdInfo);

		cr = cudaMemcpyAsync(SVD.h_pfcU, SVD.d_fcSvdU, XRES * XRES * sizeof(cuComplex), cudaMemcpyDeviceToHost, SVD.stream);
	};

constexpr int iNumTests = 3;

for (int iTest=0; iTest< iNumTests; ++iTest)
{
	auto start = std::chrono::high_resolution_clock::now();

	// Run as threads
	{
		std::vector<std::jthread> vecThreads;

		for (int i = 0; i < iNumThreads; ++i)
		{
			vecThreads.emplace_back(svdTask, i);
		}

	} // jthreads join here, so every worker has finished enqueueing its work

	// Synchronize: wait for the enqueued work on each stream to complete
	for (int i = 0; i < iNumThreads; ++i)
	{
		cudaStreamSynchronize(vecSVD[i].stream);
	}

	auto end = std::chrono::high_resolution_clock::now();

	std::chrono::duration<double, std::milli> duration = end - start;

	std::cout << "Time taken threaded: " << duration.count() << "ms" << std::endl;
}			

// Free
for (int i = 0; i < iNumThreads; ++i)
{
	auto& SVD = vecSVD[i];

	cudaFreeHost(SVD.h_pfcSrc);
	cudaFreeHost(SVD.h_pfcU);

	cudaFree(SVD.d_fcZ);
	cudaFree(SVD.d_fSvdS);
	cudaFree(SVD.d_fcSvdU);
	cudaFree(SVD.d_fcV);
	cudaFree(SVD.d_svdInfo);
	cudaFree(SVD.d_svdWork);
	cudaFree(SVD.d_svdRWork);

	cusolverDnDestroy(SVD.d_cusolverHandle);
	cudaStreamDestroy(SVD.stream);
}

cudaDeviceReset();

}

“have not seen any amount of overlapping”

After having a look at your snippet and executing it, I see overlapping kernels from the different host threads in Nsight Systems. This also aligns with the measured time results when I modify the number of threads. The overlapping is not perfect, but that is expected with this workaround. Could you please specify your observations and your CUDA Toolkit version? E.g., the duration with iNumThreads=1 and iNumThreads=16?
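As a side note, since gesvd launches many internal kernels, it can help to bracket each host thread’s work with an NVTX range so it is easy to locate in the Nsight Systems timeline. A minimal illustration follows; annotatedSvdTask and the placement are mine, while nvtxRangePushA/nvtxRangePop come with the CUDA Toolkit’s NVTX headers.

#include <nvtx3/nvToolsExt.h>   // header-only NVTX 3, ships with the CUDA Toolkit

// Sketch: inside svdTask, bracket the enqueue section with a named range.
// The range shows up on the originating host thread's row in Nsight Systems.
void annotatedSvdTask(int /*iThread*/)
{
	nvtxRangePushA("svdTask");
	// ... cudaMemcpyAsync H2D, cusolverDnCgesvd, cudaMemcpyAsync D2H ...
	nvtxRangePop();
}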

Thanks,
Christoph

Nsight version 2024.5.1,
RTX Driver (most recent, today)
CUDA Toolkit 12.5, cuda_12.5.1_555.85_windows (however, it looks like there is a more recent version, which I can try)

My timing results are as follows (3 “tests” run):

// iNumThreads = 1, XRES=1024
Time taken threaded: 85.5693ms
Time taken threaded: 67.1857ms
Time taken threaded: 68.1887ms

// iNumThreads = 16, XRES=1024
Time taken threaded: 80.8566ms
Time taken threaded: 77.4005ms
Time taken threaded: 77.9052ms

(note I divided total time by the number of threads)

I have also tried Nsight, and I might be confused; this is what I see for iNumThreads = 1. With 1 thread, the processing time in Nsight is about 100 ms.

When I switch to 16 threads, I do see 16 streams active, and they do appear to be overlapping, but the time for each is stretched to about 2200 ms.

I would rather check with XRES=256 to ensure the kernels are small enough to allow overlap. The relevant part of Nsight Systems for checking kernel overlap is the top row “[All Streams]”, which can be expanded. However, the main reason here may be that the input matrices are uninitialized (SVD.h_pfcSrc is never filled). I would recommend checking the return codes and the info code to ensure everything was executed as expected.
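For example, something along these lines; checkSvdResult is just an illustrative helper of mine, assuming the SVD_INSTANCE struct from the snippet above is visible at this scope.

#include <cstdio>
#include <cuda_runtime.h>
#include <cusolverDn.h>

void checkSvdResult(cusolverStatus_t status, const SVD_INSTANCE& SVD)
{
	// The library status tells you whether the call was accepted at all.
	if (status != CUSOLVER_STATUS_SUCCESS)
		std::printf("thread %d: cusolverDnCgesvd returned %d\n", SVD.id, static_cast<int>(status));

	// devInfo is written on the device: 0 = success, -i = the i-th parameter
	// is wrong, >0 = number of unconverged superdiagonals.
	int info = 0;
	cudaMemcpyAsync(&info, SVD.d_svdInfo, sizeof(int), cudaMemcpyDeviceToHost, SVD.stream);
	cudaStreamSynchronize(SVD.stream);
	if (info != 0)
		std::printf("thread %d: gesvd info = %d\n", SVD.id, info);
}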

Windows may present some challenges with concurrency on an RTX 3090 card. Try both settings of Windows hardware-accelerated GPU scheduling to see if either setting provides better concurrency. Just do a Google search on “windows hardware accelerated GPU scheduling” and take the first blog hit from Microsoft.

Thank you both. I have tried adjusting the scheduling and have not directly seen a change, but I have opted in regardless. My latest experiments do seem to show an advantage to using threads in this way; for example, with a matrix size of 1024x1024 and 8 threads, my tests show approx. 200 ms single-threaded and 120 ms threaded.

I am not able to tell exactly how the overlapping is working; the SVD function launches a lot of internal kernels, which makes the Nsight timeline difficult to examine.

The performance also seems to improve with the size of the SVD computed, which suggests maybe something else is happening.

Regardless, this seems to be about as good as I can make it, and it was a useful process which I think I can adapt into my application.
