Call to cusolverDnSgesvd() succeeds in toy example, fails in production code

When I run cusolverDnSgesvd() I get the following error:

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x7ffee4f6cc30  copy_info_kernel(int, int*)

But this does not happen if I run the exact same parameters in a minimal example. It only happens in the context of a large application. Could there be a naming conflict with something internal to the cuSOLVER library? I'm only guessing. Please help.

I am trying to decompose a 1024x1024 matrix of floats.

An illegal address could mean that on a particular invocation of cusolverDnSgesvd(), one or more of the addresses you pass is not properly allocated on the device side. That is probably going to be difficult to reproduce in a freshly written minimal example. Instead, start by trimming things from your application until the error disappears, then add the last piece back and trim something else, until you can't trim anymore.

Also, running the code with compute-sanitizer will give you a bit more info about the offending address.
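One way to test the "not properly allocated on the device side" hypothesis directly is to ask the runtime what it knows about each pointer just before the call. The sketch below is only an illustration: is_device_allocation() is a hypothetical helper, not part of CUDA or cuSOLVER, and it assumes a toolkit recent enough (CUDA 10 or later) that cudaPointerAttributes has a type field.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA or cuSOLVER API): returns true if `ptr`
// refers to device or managed memory according to the CUDA runtime.
static bool is_device_allocation(const void* ptr, const char* name)
{
	cudaPointerAttributes attr{};
	cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
	if (err != cudaSuccess) {
		// Older toolkits report an error for ordinary host pointers.
		cudaGetLastError(); // reset the error state
		printf("%s: unknown to the CUDA runtime\n", name);
		return false;
	}
	bool ok = (attr.type == cudaMemoryTypeDevice ||
	           attr.type == cudaMemoryTypeManaged);
	printf("%s: %s\n", name, ok ? "device allocation" : "NOT a device allocation");
	return ok;
}

int main()
{
	float* d_buf = nullptr;
	cudaMalloc(&d_buf, 16 * sizeof(float));
	int host_value = 0; // ordinary host stack variable

	is_device_allocation(d_buf, "d_buf");
	is_device_allocation(&host_value, "host_value");

	cudaFree(d_buf);
	return 0;
}
```

In the real application, the same check could be applied to every pointer argument of the failing call (A, S, U, VT, work, devInfo).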

Parameters A, S, U, VT, work are all allocated on the device side (in my production code).
Sizes:
A, S, U, VT = 4194304,
work =13911040
The only one I didn’t allocate was rwork because the example didn’t do so and I don’t know what size it should be.
I will try running with compute-sanitizer now.

compute-sanitizer does not really tell me anything more:

========= Invalid __global__ write of size 4 bytes
=========     at copy_info_kernel(int, int *)+0x30
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7feb337c085c is out of bounds
=========     and is 8,648,612 bytes before the nearest allocation at 0x7feb34000000 of size 16,793,600 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========         Host Frame:  [0x381257] in libcuda.so.1
=========         Host Frame:  [0x97cd55] in libcusolver.so.11
=========         Host Frame:  [0x9e3297] in libcusolver.so.11
=========         Host Frame:  [0x7bf90d] in libcusolver.so.11
=========         Host Frame:  [0x7bf990] in libcusolver.so.11
=========         Host Frame:  [0x59850e] in libcusolver.so.11
=========         Host Frame: cusolverDnSgesvd [0x59a1c9] in libcusolver.so.11

Here is the exact toy code. If I run this in its own main() function, it succeeds.
If I literally copy it verbatim into my production code, it fails.

	{
		constexpr int w = 1024;
		constexpr int h = 1024;
		constexpr int area = w * h;
		
		float* input;
		float* U;
		float* V;
		float* sv;
		cudaMalloc(&input, area * sizeof(float));
		cudaMalloc(&U, area * sizeof(float));
		cudaMalloc(&V, area * sizeof(float));
		cudaMalloc(&sv, area * sizeof(float));
		
		cusolverDnHandle_t solver_handle;
		int workspace_size, dev_info;
		float* workspace;
		cusolverDnCreate(&solver_handle);
		cusolverDnSgesvd_bufferSize(solver_handle, h, w, &workspace_size);
		printf("%d\n", workspace_size); fflush(stdout);
		cudaMalloc(&workspace, workspace_size * sizeof(float));
		cusolverDnSgesvd(solver_handle, 'N', 'N', h, w, input, h, sv, U, h, V, w, workspace, workspace_size, nullptr, &dev_info);
		cusolverDnDestroy(solver_handle);
	}

Found it: devInfo needs to be a device-side allocation. I had it as a host-side variable, like workspace_size.
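For anyone landing here later, this is a sketch of the toy code with that one change applied: devInfo is allocated with cudaMalloc and copied back to the host after the call. The names d_info and h_info, the cudaMemset of the input, and the status printout are my additions for illustration; the rest follows the snippet above. Passing nullptr for rwork is kept as in the original; as far as I understand the cuSOLVER documentation, that is allowed when the unconverged superdiagonal elements are not needed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cusolverDn.h>

int main()
{
	constexpr int w = 1024;
	constexpr int h = 1024;
	constexpr int area = w * h;

	float* input;
	float* U;
	float* V;
	float* sv;
	cudaMalloc(&input, area * sizeof(float));
	cudaMalloc(&U, area * sizeof(float));
	cudaMalloc(&V, area * sizeof(float));
	cudaMalloc(&sv, area * sizeof(float));
	cudaMemset(input, 0, area * sizeof(float)); // give the input defined contents

	cusolverDnHandle_t solver_handle;
	cusolverDnCreate(&solver_handle);

	// workspace_size is a host-side output (a count of elements, not bytes).
	int workspace_size = 0;
	cusolverDnSgesvd_bufferSize(solver_handle, h, w, &workspace_size);
	float* workspace;
	cudaMalloc(&workspace, workspace_size * sizeof(float));

	// devInfo is written by a kernel, so it must live in device memory.
	int* d_info;
	cudaMalloc(&d_info, sizeof(int));

	cusolverStatus_t status = cusolverDnSgesvd(
		solver_handle, 'N', 'N', h, w,
		input, h, sv, U, h, V, w,
		workspace, workspace_size, nullptr, d_info);

	// Copy the result code back to the host before inspecting it.
	int h_info = 0;
	cudaMemcpy(&h_info, d_info, sizeof(int), cudaMemcpyDeviceToHost);
	printf("status = %d, devInfo = %d\n", (int)status, h_info);

	cudaFree(d_info);
	cudaFree(workspace);
	cudaFree(sv);
	cudaFree(V);
	cudaFree(U);
	cudaFree(input);
	cusolverDnDestroy(solver_handle);
	return (status == CUSOLVER_STATUS_SUCCESS && h_info == 0) ? 0 : 1;
}
```

This should build with something like nvcc svd_test.cu -lcusolver (the file name is arbitrary), and it is worth running the result under compute-sanitizer as well.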

Why it seems to work as a small code but not a big code remains a mystery.

Run your small code with compute-sanitizer. That's a good general validation of any CUDA code before committing to production.

The reason it seems to work is that the run-time error checking (e.g. for out-of-bounds access) is not as tight as the checking done by compute-sanitizer. A similar observation can be made with CPU code: accessing an array one location beyond its end will generally not result in a runtime error, but running that same code with valgrind will show the error.
