Problem with read access violation for large arrays in unified memory

Hi all! I've been searching a lot, but nothing helps with my problem. Please bear with me despite the length of this post (tl;dr: you can jump straight to the kernel function and read its explanation afterwards).

I have arrays Z and Chi whose length can be anywhere from 1 to several thousand. I need to solve equations for Z and Chi iteratively (that is, I have equations Z = RHS1 and Chi = RHS2, where both right-hand sides contain Z and Chi, so the quantities are coupled), and the fixed-point method is ideal for this. The equations are of the form

Z(w) = (const.) * sum_w' [ w' Z(w') (g(w - w') - g(w + w')) * integral_x dx / ((w' Z(w'))^2 + (x + Chi(x))^2) ]

Chi(x) = (const.) * integral_x' dx' [ (x' + Chi(x')) (mu(x - x') - mu(x + x')) * sum_w 1 / ((w Z(w))^2 + (x' + Chi(x'))^2) ]

So as you can see, there are nested loops involved - the outermost loop handles a single Z/Chi value, while the inner loops compute the sum and the integral (which cannot be separated, although they can be simplified somewhat, as hinted at by how far to the right I have pushed the integral and/or the sum).

I defined double *Z, *Chi, *Z_new, *Chi_new. Additionally, I had to define double *g, *x, *mu1 and *mu2, since the equation for Z contains a function g and the one for Chi uses a function mu.

I pre-calculated a discretization of the integral in x; I call it x[e] (the array of points at which Chi is evaluated). I also precalculated the function g, whose argument is w - w', and the values of w lie on a uniform grid (so g(w - w') can be stored as g[abs(w - w')] and g(w + w') as g[w + w' + 1] - only a 1D array is needed).

I had to precalculate the function mu needed for Chi, but due to the uneven discretization of the integral I have to keep track of mu(e - e') and mu(e + e') for every combination of e and e'. So I have two arrays, mu1 and mu2: to access mu(e - e') I use mu1[M*e + e'], and mu(e + e') is mu2[M*e + e']. Maybe this sounds crazy…

If the number of w points (those in which Z is calculated) is N, array Z is of length N, number of e points is M, so array Chi is of length M. Array g is of length 2N+1 (the largest needed index is 2N as we have to calculate g[w+w’+1]). Finally, mu1 and mu2 are arrays of length M*M.

Now to the equations: the kernel for Z looks like

__device__ double __pi = 3.14159265358979;

__global__ void Iterate_Z_normal(double *Z_new, const double *Z, const double *Chi, const double *g, const double *x, const double T, const int N, const int M) {

   int index = blockIdx.x * blockDim.x + threadIdx.x;
   int stride = blockDim.x * gridDim.x;

   for (int n = index; n < N; n += stride) {
      double znew = 0.;
      double wn = (2 * n + 1)*__pi*T;

      for (int m = 0; m < N; m++) {

         double wm = (2 * m + 1)*__pi*T;
         double foo = 1. / (Z[m] * Z[m] * wm * wm);
         double faZ = (2. * wm / wn)*(g[abs(n - m)] - g[n + m + 1])*foo*Z[m];

         for (int e_prime = 1; e_prime < M; e_prime++) {

            double dx = abs(x[e_prime] - x[e_prime - 1]);

            foo = 1. / (Z[m] * Z[m] * wm * wm + (x[e_prime] + Chi[e_prime]) * (x[e_prime] + Chi[e_prime]));

            double fbZ = (2 * wm / wn)*(g[abs(n - m)] - g[n + m + 1])*foo*Z[m];
            znew += 0.5*T*dx*(faZ + fbZ);
            faZ = fbZ;
         }
      }
      Z_new[n] = 1 + znew;
   }
}

and I call it with

Iterate_Z_normal<<<numBlocks, threadsPerBlock>>>(Z_new, Z, Chi, g, x, T, N, M);

where threadsPerBlock is some multiple of 32 and numBlocks is ceil((double)(N + threadsPerBlock - 1) / threadsPerBlock).
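(For reference, the usual pure-integer way to compute a covering grid is the following; my ceil() version just over-allocates a block or two, which the n < N loop bound in the kernel absorbs.)

int threadsPerBlock = 128;                                    // any multiple of 32
int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;  // integer division, equals ceil(N / threadsPerBlock)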

Not that it matters, but after each iteration step I wait for device synchronization and call another kernel that copies Z_new to Z and Chi_new to Chi. Iterations stop when the difference between the new and old values drops below a prescribed threshold.
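(For completeness, the copy kernel is just a grid-stride copy, roughly like this - a sketch, my actual version may differ in minor details:)

__global__ void copy(double *dst, const double *src, const int len) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < len; i += stride)
        dst[i] = src[i];                // overwrite the old iterate with the new one
}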

By the way, I'm using unified memory, so I'm calling cudaMallocManaged for Z, Z_new (N*sizeof(double)), Chi, Chi_new, x (M*sizeof(double)), g ((2*N+1)*sizeof(double)), and mu1, mu2 (M*M*sizeof(double)).

Example: if N = 160 and threadsPerBlock = 32 then numBlocks = 6. This all works well if M (the size of Chi, so essentially the length of the innermost loop) is not very large. But if M = ~2000 or so, after this kernel executes (and cudaDeviceSynchronize() is called), all arrays are INVALID (even those passed as const double*), i.e. I'm getting a read access violation on cout or on any reasonable further manipulation of Z, Chi, Z_new, etc. My suspicion is that something overflows in the kernel, because the inner loop is very long and/or the arrays mu1, mu2 are very long (millions of elements). How can I reasonably tackle this problem better? How many threads per block should I request, and how do I prevent read access violations even when N and M are large (thousands, tens of thousands)?

Maybe the sheer size of the arrays mu1 and mu2 is causing some memory problem, because if M = 2000 then mu1 and mu2 each have M^2 = 4,000,000 elements. Is this the case? But mu1 and mu2 are really needed, because calculating the function mu is rather costly, so I wanted to have it precalculated.

Is it possible to distribute the inner loops over threads to alleviate the load? That could be a problem, though, as the inner loops accumulate a sum, so I would need a reduction for the znew variable (which stores the sum for the RHS of the equation for Z).
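Something like this is what I have in mind - a rough, untested sketch where each block handles one w-point, the threads split the m loop, and znew is reduced in shared memory:

__global__ void Iterate_Z_reduced(double *Z_new, const double *Z, const double *Chi,
                                  const double *g, const double *x,
                                  const double T, const int N, const int M) {
    extern __shared__ double partial[];          // one slot per thread, sized at launch
    int n = blockIdx.x;                          // one block per w-point
    if (n >= N) return;

    double wn = (2 * n + 1)*__pi*T;
    double znew = 0.;

    // the m loop is shared across the threads of the block
    for (int m = threadIdx.x; m < N; m += blockDim.x) {
        double wm = (2 * m + 1)*__pi*T;
        double foo = 1. / (Z[m] * Z[m] * wm * wm);
        double faZ = (2. * wm / wn)*(g[abs(n - m)] - g[n + m + 1])*foo*Z[m];

        for (int e_prime = 1; e_prime < M; e_prime++) {
            double dx = abs(x[e_prime] - x[e_prime - 1]);
            foo = 1. / (Z[m] * Z[m] * wm * wm + (x[e_prime] + Chi[e_prime]) * (x[e_prime] + Chi[e_prime]));
            double fbZ = (2. * wm / wn)*(g[abs(n - m)] - g[n + m + 1])*foo*Z[m];
            znew += 0.5*T*dx*(faZ + fbZ);
            faZ = fbZ;
        }
    }

    // block-wide reduction of the per-thread partial sums
    partial[threadIdx.x] = znew;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // blockDim.x assumed to be a power of two
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) Z_new[n] = 1 + partial[0];
}

// would be launched with one block per w-point and dynamic shared memory:
// Iterate_Z_reduced<<<N, threadsPerBlock, threadsPerBlock * sizeof(double)>>>(Z_new, Z, Chi, g, x, T, N, M);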

If it helps, I have an NVIDIA GeForce GTX 970 card with 1664 CUDA cores and 4 GB of video memory.

I'm very sorry, but I'm kind of a noob and I've only just started with CUDA. I've done simple CPU parallelizations with OpenMP before, and I find this GPU parallelization somewhat more intriguing.

Thanks!

P.S.: I'm emulating the "2D" arrays mu1 and mu2 of dimensions dim x dim in a 1D array like this: for array(n, m) I access element array[dim*n + m]. I hope that's clear.
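In code that's just a small helper like this (purely illustrative):

__host__ __device__ inline int idx2(int n, int m, int dim) {
    return dim * n + m;       // row-major: element (n, m) of a dim x dim array
}
// e.g. mu1[idx2(e, e_prime, M)] holds mu(e - e'), mu2[idx2(e, e_prime, M)] holds mu(e + e')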

Use a method similar to what is described here:

https://stackoverflow.com/questions/27277365/unspecified-launch-failure-on-memcpy/27278218#27278218

to localize the out-of-bounds read to a particular line of kernel code. Then figure out exactly which item is causing the out-of-bounds index. Once you have identified that, use printf, strategically placed and qualified in the kernel, to identify the calculations leading up to that out-of-bounds index.
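For example, something along these lines, qualified so that only an offending case prints (illustrative - adapt the index and bound to whatever line the tools point you at):

// inside the kernel, just before the suspect access:
int gi = n + m + 1;                      // the index about to be used for g[]
if (gi < 0 || gi >= 2 * N + 1)           // 2N+1 is your stated length of g
    printf("OOB g index: n=%d m=%d gi=%d\n", n, m, gi);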

This is (in my opinion) ordinary debugging, and the methodology is similar regardless of the structure of your code. It is (or can be) quite tedious, of course. I call this grabbing the tiger by the tail. Then work backward. Eventually you will get to the heart of the matter.

An example of what I mean is here:

https://devtalk.nvidia.com/default/topic/999733/cuda-programming-and-performance/function-only-works-correctly-with-cuda-8-0-in-release-mode-/post/5109523/#5109523

No one can really do this for you, in part because you haven’t provided a code that somebody else could run and work on.

So you think that, rather than some memory problem caused by allocating arrays that are too large, one of my indices is going out of bounds, right?

I'm trying to launch this cuda-memcheck thing, so I used the "cuda-memcheck" console command in VS, as I found on the internet, but now there's an error: cannot open source file cuda-memcheck. When I tried to google it, the most reasonable link seemed to be this: https://developer.nvidia.com/cuda-memcheck, but that just redirects me to the CUDA download (which I've already installed, since I can compile CUDA projects in VS)…

EDIT:

okay, this is VERY eerie. I learned I can use printf inside a kernel, so I used it to print the index plus the value of the array at that index. What's very, very weird is that sometimes it prints something and sometimes it doesn't (two consecutive runs: one has the results printed, the other doesn't, as if it never even entered the kernel). Moreover, the values of the arrays are zero even though the variable being assigned to them is nonzero (printing znew shows some number even when I debug with Nsight, while printing Z[n] shows zero after Z[n] = 1 + znew). Something strange is going on…

EDIT2:

what is even crazier: I now went through it with the Nsight debugger - all 2000 values - and it yields sane numbers; Z and Chi looked okay, indices in bounds. So I was curious at what point the program fails, only to find out… it doesn't. When I debug with Nsight, it gets past the critical point where it outputs the values without any problem, no read access violation. When I run it outside the Nsight debugger, it fails. Is there some aspect of all this I'm missing?

If you’re not doing rigorous error checking, you should be. Test all CUDA API return values, and do proper error checking on kernel calls. Beginners often get this wrong. If in doubt, read this:

https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api

The only problem with allocating too large an array would be an out-of-memory error, and that would be immediately evident with proper error checking. No need to wonder about it.
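For example, wrapping the managed allocations with the error-check macro from the link above would make that immediately visible (a sketch):

gpuErrchk(cudaMallocManaged(&mu1, (size_t)M * M * sizeof(double)));   // ~32 MB for M = 2000
gpuErrchk(cudaMallocManaged(&mu2, (size_t)M * M * sizeof(double)));
// an out-of-memory failure would surface here as cudaErrorMemoryAllocation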

Also, on Windows, you should make sure you are not hitting a WDDM TDR timeout. You'll get some indication of this (if you are) with proper error checking.

If you want to run cuda-memcheck, run it from an ordinary Windows command prompt, not from VS. You'll need to know where VS is depositing your executable, but that is a matter of VS knowledge, not anything to do with CUDA.
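For example (the path is illustrative - use whatever output directory your project actually builds into):

cd C:\path\to\your\project\x64\Release
cuda-memcheck your_app.exe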

Something somewhat similar can be enabled from within Nsight:

https://docs.nvidia.com/gameworks/content/developertools/desktop/nsight/use_memory_checker.htm

I just tried gpuErrchk( cudaDeviceSynchronize() ); after each kernel execution and my console wrote: GPUassert: unspecified launch failure E:/users/…/kernel.cu 146

Line 146 is the first synchronization after the first kernel call.

Here's the thing: when I debug with Nsight, nothing out of the ordinary happens. The kernels compute, synchronize, the CPU accesses the variables after synchronization, and nothing goes wrong ("During the debugging session, if the target attempts to write to an invalid memory location, the debugger triggers a breakpoint on the offending line of code, which stops execution. The CUDA Debugger stops before executing the instruction that would have written to the invalid memory location." - just nothing happens). But when I run it without debugging (Ctrl + F5), it crashes in several spots: sometimes at the first synchronization, sometimes after it when I try to cout a variable, sometimes after the second synchronization. Kind of random. While debugging with Nsight: never. The piece of code calling the kernels looks like this:

int threadsPerBlock = 128;
int numBlocks;

double d = 1;
while (d > 1e-10) {

	numBlocks = ceil((double)(N + threadsPerBlock - 1) / threadsPerBlock);
	std::cout << "(" << numBlocks << ", " << threadsPerBlock << ") " << Z[0] << std::endl;
	Iterate_Z_normal<<<numBlocks, threadsPerBlock>>>(Z_new, Z, Chi, g, x, T, N, M);

	gpuErrchk(cudaPeekAtLastError());
	gpuErrchk(cudaDeviceSynchronize());

	numBlocks = ceil((double)(M + threadsPerBlock - 1) / threadsPerBlock);
	std::cout << "(" << numBlocks << ", " << threadsPerBlock << ") " << Chi[0] << std::endl;
	Iterate_Chi_normal<<<numBlocks, threadsPerBlock>>>(Chi_new, Z, Chi, x, mu1, mu2, T, N, M);

	gpuErrchk(cudaPeekAtLastError());
	gpuErrchk(cudaDeviceSynchronize());

	d = abs(Z_new[0] - Z[0]) + abs(Chi_new[1] - Chi[1]) / x[1];

	numBlocks = ceil((double)(N + threadsPerBlock - 1) / threadsPerBlock);
	copy<<<numBlocks, threadsPerBlock>>>(Z, Z_new, N);

	numBlocks = ceil((double)(M + threadsPerBlock - 1) / threadsPerBlock);
	copy<<<numBlocks, threadsPerBlock>>>(Chi, Chi_new, M);   // length M, not N

	cudaDeviceSynchronize();

}

If that is satisfactory for you, then you’re all set.

Otherwise you need to grab the tiger by the tail. Take hold of something that fails. If necessary, run your app so that it reliably fails, then start working on it. See where it leads.

The device code compiler will generate substantially different code in a debug project as opposed to a release project. Every CUDA developer should keep these things in mind:

  • you should never do any performance analysis on a debug build
  • a debug build may show problems that a release build does not
  • a release build may show problems that a debug build does not
  • you usually don’t want to ship debug code; it usually runs slower than release code

This is not the greatest news, of course. It means you need to consider a variety of scenarios, and even possibly be prepared to debug a non-debug project. cuda-memcheck, in-kernel printf, and other techniques may be valuable for this.

Perhaps none of this bears directly on your problem at hand.

Actually, it is not satisfactory - the program is far too slow while debugging with Nsight to be of any practical use. My program fails on one of two lines (with no error checking it's a 'read access violation'; with error checking on the synchronization it fails there with an 'unknown error' - it couldn't be more mystifying, if you ask me…), but only when not running in Nsight legacy debug mode (otherwise I always run in VS Debug mode until I want to launch the program for its intended purpose - to calculate something useful).

This is new to me and I'm completely lost, as nothing this mysterious ever happened while running purely CPU code… it happens only on the GPU.

Okay guys, I put together some code that fails (without getting into details about the equations I'm solving - that's irrelevant here). Here it is:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <iostream>
#include <cmath>

__device__ double __pi = 3.14159265358979;

double pi = 3.14159265358979;

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
	if (code != cudaSuccess)
	{
		fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
		if (abort) exit(code);
	}
}

__global__ void Iterate_Z_normal(double *Z_new, const double *Z, const double *Chi, const double *g, const double *x, const double T, const int N, const int M) {

	int n = blockIdx.x * blockDim.x + threadIdx.x;

	if ((n >= 0) && (n < N)) {
		double znew = 0.;
		double wn = (2 * n + 1)*__pi*T;
		double wm;
		double foo, faZ, fbZ, dx;

		for (int m = 0; m < N; m++) {

			wm = (2 * m + 1)*__pi*T;
			foo = 1. / (Z[m] * Z[m] * wm * wm);
			faZ = (2. * wm / wn)*(g[abs(n - m)] - g[n + m + 1])*foo*Z[m];

			for (int e_prime = 1; e_prime < M; e_prime++) {

				dx = abs(x[e_prime] - x[e_prime - 1]);

				foo = 1. / (Z[m] * Z[m] * wm * wm + (x[e_prime] + Chi[e_prime]) * (x[e_prime] + Chi[e_prime]));

				fbZ = (2 * wm / wn)*(g[abs(n - m)] - g[n + m + 1])*foo*Z[m];
				znew += 0.5*T*dx*(faZ + fbZ);
				faZ = fbZ;
			}
		}
		Z_new[n] = 1 + znew;
	}
}

__global__ void Iterate_Chi_normal(double *Chi_new, const double *Z, const double *Chi, const double *x, const double *mu1, const double *mu2, const double T, const int N, const int M) {

	int e = blockIdx.x * blockDim.x + threadIdx.x;
	//int stride = blockDim.x * gridDim.x;

	if ((e >= 0) && (e < M)) {

		double chinew = 0.;
		double wm, foo, faChi, fbChi, dx;

		for (int m = 0; m < N; m++) {

			wm = (2 * m + 1)*__pi*T;
			foo = 1. / (Z[m] * Z[m] * wm * wm + (x[0] + Chi[0]) * (x[0] + Chi[0]));
			faChi = 2. * T*(mu1[M*e] - mu2[M*e])*foo*(x[0] + Chi[0]);

			for (int e_prime = 1; e_prime < M; e_prime++) {

				dx = abs(x[e_prime] - x[e_prime - 1]);

				foo = 1. / (Z[m] * Z[m] * wm * wm + (x[e_prime] + Chi[e_prime]) * (x[e_prime] + Chi[e_prime]));

				fbChi = 2. * T*(mu1[M*e + e_prime] - mu2[M*e + e_prime]) * foo*(x[e_prime] + Chi[e_prime]);
				chinew += 0.5*dx*(faChi + fbChi);
				faChi = fbChi;
			}
		}
		Chi_new[e] = chinew;
		//printf("%d %f\n", e, chinew);
	}
}

int main() {

int N = 150;
int M = 2000;
double T = 0.01;

double *Z, *Z_new, *Chi, *Chi_new, *g, *x, *mu1, *mu2;

cudaMallocManaged(&Z, N * sizeof(double));
cudaMallocManaged(&Z_new, N * sizeof(double));
cudaMallocManaged(&Chi, M * sizeof(double));
cudaMallocManaged(&Chi_new, M * sizeof(double));
cudaMallocManaged(&x, M * sizeof(double));
cudaMallocManaged(&g, (2 * N + 1) * sizeof(double));
cudaMallocManaged(&mu1, M*M * sizeof(double));
cudaMallocManaged(&mu2, M*M * sizeof(double));

for (int i = 0; i < N; i++) {
	double w = (2 * i + 1)*pi*T;
	double w2 = (2 * (N + i) + 1)*pi*T;

	w = w * w;
	w2 = w2 * w2;

	Z[i] = 1 + 1 / (1 + w);

	g[i] = 1 - w * log(1 + 1 / w);
	g[N + i] = 1 - w2 * log(1 + 1 / w2);
}

for (int i = 0; i < M; i++) {
	double x0 = 10. * i/ M;

	Chi[i] = sqrt(x0);

	x[i] = x0;

	for (int j = 0; j < M; j++) {
		double y0 = 10.*j / M;

		mu1[M*i + j] = 1 + 1 / (1 + sqrt(abs(x0 - y0)));
		mu2[M*i + j] = 1 + 1 / (1 + sqrt(abs(x0 + y0)));
	}

}

int threadsPerBlock = 128;
int numBlocks;

numBlocks = ceil((double)(N + threadsPerBlock - 1) / threadsPerBlock);
Iterate_Z_normal<<<numBlocks, threadsPerBlock>>>(Z_new, Z, Chi, g, x, T, N, M);

cudaDeviceSynchronize();

numBlocks = ceil((double)(M + threadsPerBlock - 1) / threadsPerBlock);
Iterate_Chi_normal<<<numBlocks, threadsPerBlock>>>(Chi_new, Z, Chi, x, mu1, mu2, T, N, M);

cudaDeviceSynchronize();

std::cout << Z[0] << " " << Z_new[0] << " " << Chi[1]/x[1] << " " << Chi_new[1]/x[1] << std::endl;

return 0;

}

If I compile this in debug mode and run it, it outputs nothing. If I F5 through it (debugging), the program just crashes on the line with std::cout and says: "Exception thrown: read access violation. Z was 0x203120000." Note that Z wasn't even changed in either of the kernels! It is passed as const double *Z; the kernels couldn't change it in any way :(

Moreover, if I debug with NSight Legacy, it runs for some time and then outputs 1.99901 1.45534 10 0.461093 (which are, by the way, reasonable values).

When I add errchecking at device synchronization lines, it outputs “GPUassert: unknown error” to the console for the second synchronization.

Can someone with knowledge of CUDA look into this piece of code and try to explain to me why it fails and how to fix it? Or at least tell me how this code behaves on their machine, please?

When I compile your code for release mode and run it under cuda-memcheck on CUDA 9.2 (Tesla V100, CentOS 7), it outputs:

$ cuda-memcheck ./t1387
========= CUDA-MEMCHECK
1.99901 1.45489 14.1421 0.466419
========= ERROR SUMMARY: 0 errors
$

I get identical output on the same machine if I compile with the device-debug switch (-G) and run on a Tesla K20Xm.

If I run that debug build on the Tesla K20Xm with nvprof, I get the following output:

$ CUDA_VISIBLE_DEVICES="2" nvprof ./t1387
==26677== NVPROF is profiling process 26677, command: ./t1387
1.99901 1.45489 14.1421 0.466419
==26677== Profiling application: ./t1387
==26677== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   50.36%  3.32897s         1  3.32897s  3.32897s  3.32897s  Iterate_Z_normal(double*, double const *, double const *, double const *, double const *, double, int, int)
                   49.64%  3.28173s         1  3.28173s  3.28173s  3.28173s  Iterate_Chi_normal(double*, double const *, double const *, double const *, double const *, double const *, double, int, int)
      API calls:   96.33%  6.61079s         2  3.30540s  3.28178s  3.32901s  cudaDeviceSynchronize
                    3.49%  239.17ms         8  29.897ms  8.4600us  231.26ms  cudaMallocManaged
                    0.17%  11.500ms         2  5.7502ms  180.19us  11.320ms  cudaLaunchKernel
                    0.02%  1.1125ms        96  11.588us     344ns  472.03us  cuDeviceGetAttribute
                    0.00%  222.16us         1  222.16us  222.16us  222.16us  cuDeviceTotalMem
                    0.00%  125.95us         1  125.95us  125.95us  125.95us  cuDeviceGetName
                    0.00%  9.3970us         1  9.3970us  9.3970us  9.3970us  cuDeviceGetPCIBusId
                    0.00%  7.6940us         3  2.5640us     837ns  5.0060us  cuDeviceGetCount
                    0.00%  2.7510us         2  1.3750us     620ns  2.1310us  cuDeviceGet

==26677== Unified Memory profiling result:
Device "Tesla K20Xm (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
      36  1.6966MB  4.0000KB  2.0000MB  61.07813MB  10.53005ms  Host To Device
     392  160.65KB  4.0000KB  0.9961MB  61.50000MB  9.914688ms  Device To Host
Total CPU Page faults: 195
$

Notice that the kernels are each taking more than 2 seconds to execute. Maybe you're running into a WDDM TDR timeout:

https://docs.nvidia.com/gameworks/content/developertools/desktop/nsight/timeout_detection_recovery.htm

If I build and profile the same in release mode, notice that the kernels take less than a second to execute:

$ CUDA_VISIBLE_DEVICES="2" nvprof ./t1387
==27141== NVPROF is profiling process 27141, command: ./t1387
1.99901 1.45489 14.1421 0.466419
==27141== Profiling application: ./t1387
==27141== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   78.19%  586.68ms         1  586.68ms  586.68ms  586.68ms  Iterate_Chi_normal(double*, double const *, double const *, double const *, double const *, double const *, double, int, int)
                   21.81%  163.61ms         1  163.61ms  163.61ms  163.61ms  Iterate_Z_normal(double*, double const *, double const *, double const *, double const *, double, int, int)
      API calls:   76.14%  750.30ms         2  375.15ms  163.60ms  586.70ms  cudaDeviceSynchronize
                   22.54%  222.10ms         8  27.762ms  8.5240us  214.35ms  cudaMallocManaged
                    1.13%  11.131ms         2  5.5654ms  122.37us  11.008ms  cudaLaunchKernel
                    0.12%  1.2252ms        96  12.762us     405ns  467.64us  cuDeviceGetAttribute
                    0.03%  311.01us         1  311.01us  311.01us  311.01us  cuDeviceGetName
                    0.03%  260.36us         1  260.36us  260.36us  260.36us  cuDeviceTotalMem
                    0.00%  27.034us         1  27.034us  27.034us  27.034us  cuDeviceGetPCIBusId
                    0.00%  6.9380us         3  2.3120us     497ns  4.7300us  cuDeviceGetCount
                    0.00%  2.5880us         2  1.2940us     468ns  2.1200us  cuDeviceGet

==27141== Unified Memory profiling result:
Device "Tesla K20Xm (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
      36  1.6966MB  4.0000KB  2.0000MB  61.07813MB  10.37181ms  Host To Device
     392  160.65KB  4.0000KB  0.9961MB  61.50000MB  9.855872ms  Device To Host
Total CPU Page faults: 195
$

txbob: Thank you! I increased the TDR delay to 10 seconds as instructed; however, I still get GPUassert: unspecified launch failure in both debug and release mode. How can I see the table you're seeing in my MSVS 2017?

EDIT: I finally got it. In the console I wrote cuda-memcheck name-of-my-program.exe in the directory where the exe is. Now I'm getting:

========= CUDA-MEMCHECK
========= Error: process didn’t terminate successfully
========= Program hit cudaErrorLaunchFailure (error 4) due to “unspecified launch failure” on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuModuleGetSurfRef + 0x2d5952) [0x2e34db]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Release\TEST_crash.exe (cudaDeviceSynchronize + 0xf9) [0x10f9]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Release\TEST_crash.exe (main + 0x502) [0xe2b2]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Release\TEST_crash.exe (__scrt_common_main_seh + 0x11d) [0xe8f9]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x13034]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x71431]

========= No CUDA-MEMCHECK results found

if compiled in release mode and

========= CUDA-MEMCHECK
========= Error: process didn’t terminate successfully
========= Program hit cudaErrorLaunchFailure (error 4) due to “unspecified launch failure” on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuModuleGetSurfRef + 0x2d5952) [0x2e34db]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Debug\TEST_crash.exe (cudaDeviceSynchronize + 0xf9) [0x5099]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Debug\TEST_crash.exe (main + 0x600) [0x52a20]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Debug\TEST_crash.exe (invoke_main + 0x34) [0x56bc4]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Debug\TEST_crash.exe (__scrt_common_main_seh + 0x127) [0x56a87]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Debug\TEST_crash.exe (__scrt_common_main + 0xe) [0x5694e]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Debug\TEST_crash.exe (mainCRTStartup + 0x9) [0x56be9]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x13034]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x71431]

========= Program hit cudaErrorLaunchFailure (error 4) due to “unspecified launch failure” on CUDA API call to cudaLaunchKernel.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuModuleGetSurfRef + 0x2d5952) [0x2e34db]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Debug\TEST_crash.exe (cudaLaunchKernel + 0x1fa) [0x1b30a]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Debug\TEST_crash.exe (cudaLaunchKernel + 0x8f) [0x5361f]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Debug\TEST_crash.exe (__device_stub__Z18Iterate_Chi_normalPdPKdS1_S1_S1_S1_dii + 0x32a) [0x533ca]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Debug\TEST_crash.exe (Iterate_Chi_normal + 0x85) [0x52405]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Debug\TEST_crash.exe (main + 0x734) [0x52b54]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Debug\TEST_crash.exe (invoke_main + 0x34) [0x56bc4]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Debug\TEST_crash.exe (__scrt_common_main_seh + 0x127) [0x56a87]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Debug\TEST_crash.exe (__scrt_common_main + 0xe) [0x5694e]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Debug\TEST_crash.exe (mainCRTStartup + 0x9) [0x56be9]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x13034]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x71431]

========= Program hit cudaErrorLaunchFailure (error 4) due to “unspecified launch failure” on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuModuleGetSurfRef + 0x2d5952) [0x2e34db]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Debug\TEST_crash.exe (cudaDeviceSynchronize + 0xf9) [0x5099]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Debug\TEST_crash.exe (main + 0x739) [0x52b59]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Debug\TEST_crash.exe (invoke_main + 0x34) [0x56bc4]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Debug\TEST_crash.exe (__scrt_common_main_seh + 0x127) [0x56a87]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Debug\TEST_crash.exe (__scrt_common_main + 0xe) [0x5694e]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Debug\TEST_crash.exe (mainCRTStartup + 0x9) [0x56be9]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x13034]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x71431]

========= No CUDA-MEMCHECK results found

in debug mode. Is this output worth anything useful? I’m not sure I’m doing the same thing as you do in console…

It looks to me like you're still hitting the TDR timeout. In release mode, it's the last kernel launch; in debug mode, the first kernel launch.

Your GPU may be slower than either of the ones I used. After you modify TDR settings, you may need to reboot. And you may not be running the code you have posted here.

I don't think that is the problem… the TDR delay is 10 seconds now that I've modified it and rebooted. When I launch the program from VS in release mode now, it outputs the numbers in about 3 seconds. If I run cuda-memcheck on it, this is the output:

========= CUDA-MEMCHECK
========= Error: process didn’t terminate successfully
========= Program hit cudaErrorUnknown (error 30) due to “unknown error” on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuModuleGetSurfRef + 0x2d5952) [0x2e34db]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Release\TEST_crash.exe (cudaDeviceSynchronize + 0xf9) [0x10f9]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Release\TEST_crash.exe (main + 0x502) [0xe2b2]
========= Host Frame:E:\Users\Rabatin\Documents\Visual Studio 2017\Projects\TEST_crash\x64\Release\TEST_crash.exe (__scrt_common_main_seh + 0x11d) [0xe8f9]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x13034]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x71431]

========= No CUDA-MEMCHECK results found

The outcome is too random… sometimes I get the results, sometimes I don't. I think the worst part is that even though it runs in release mode, it reports read access violation errors in debug mode, which I use… well, for debugging.

Is this CUDA thing even safe for calculating things (with this level of randomness)? Also, if TDR was the problem, how can I prevent my graphics card from stalling like that? Should I just increase that timeout to something like two minutes?

If it helps, I have an NVIDIA GeForce GTX 970 card with 1664 CUDA cores and 4 GB of video memory.

And yet you're trying to do double-precision computations with it. This particular device is 32 times slower at computing with double values than with float values.

If you can tolerate lower-precision floating point arithmetic, try it. When converting the code from double to float, take special care to change all double literals (e.g. PI) to float literals by appending an f. Otherwise you'd be triggering the use of double-precision arithmetic again.
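For example, one line of your kernel and the constant would change roughly like this (a sketch; the f suffixes and the float declarations are the important part):

// a single-precision version of the constant (the name is just an example)
__device__ float __pi_f = 3.14159265f;

// before (double precision):
//     double foo = 1. / (Z[m] * Z[m] * wm * wm);
// after (single precision - note the f suffix; Z, wm and foo are now float):
float foo = 1.f / (Z[m] * Z[m] * wm * wm);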

Christian

cuda-memcheck often slows down the execution of a CUDA kernel (even in release mode) by a factor of 10x or more.

cbuchner1: thanks, that's very useful! I will need to lower the precision then. Too bad, since I was demanding 10^-10 precision of the iterates: (Z - Z_new)/Z_new had to be at most 10^-10 for the iteration to be considered converged.

txbob: it looks like TDR is/was the culprit from the start. For some reason, Nsight doesn't make the GPU stall, so Windows doesn't reset the video card. In VS debug, once I got past the CUDA synchronization call, Windows reset the video card and somehow the whole unified memory got lost (so it must also be stored somewhere on the card). It all makes sense now! So the only solution is to raise the TDR delay time? Is there no way to keep the graphics card from stalling during the computations?

You may be able to split your single-shot computation into several calls to the same kernel, with each call converging a bit further towards a possible solution. You would need to save the current state at the end of the kernel and resume from that state in the next invocation.

This way you avoid hitting any timeouts, and you could monitor the convergence of your computation (and possibly provide a progress indication and an option to abort early).
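A minimal sketch of what I mean, reusing the Z kernel from your repro (untested; the same idea applies to the Chi kernel) - each launch only processes a slice of the w-points, so no single launch runs long enough to hit the watchdog:

__global__ void Iterate_Z_chunk(double *Z_new, const double *Z, const double *Chi,
                                const double *g, const double *x, const double T,
                                const int N, const int M,
                                const int n_start, const int n_end) {
    int n = n_start + blockIdx.x * blockDim.x + threadIdx.x;
    if (n < N && n < n_end) {
        double znew = 0.;
        double wn = (2 * n + 1)*__pi*T;
        for (int m = 0; m < N; m++) {
            double wm = (2 * m + 1)*__pi*T;
            double foo = 1. / (Z[m] * Z[m] * wm * wm);
            double faZ = (2. * wm / wn)*(g[abs(n - m)] - g[n + m + 1])*foo*Z[m];
            for (int e_prime = 1; e_prime < M; e_prime++) {
                double dx = abs(x[e_prime] - x[e_prime - 1]);
                foo = 1. / (Z[m] * Z[m] * wm * wm + (x[e_prime] + Chi[e_prime]) * (x[e_prime] + Chi[e_prime]));
                double fbZ = (2. * wm / wn)*(g[abs(n - m)] - g[n + m + 1])*foo*Z[m];
                znew += 0.5*T*dx*(faZ + fbZ);
                faZ = fbZ;
            }
        }
        Z_new[n] = 1 + znew;
    }
}

// host side: several short launches instead of one long one
int chunk = 32;   // tune so a single launch stays well under the TDR limit
for (int n_start = 0; n_start < N; n_start += chunk) {
    int blocks = (chunk + threadsPerBlock - 1) / threadsPerBlock;
    Iterate_Z_chunk<<<blocks, threadsPerBlock>>>(Z_new, Z, Chi, g, x, T, N, M,
                                                 n_start, n_start + chunk);
    gpuErrchk(cudaPeekAtLastError());
    gpuErrchk(cudaDeviceSynchronize());   // each sync returns before the watchdog fires
}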

If you're running into floating-point precision issues, you can research numerical methods that reduce the accumulation of error. I'm not really qualified to give detailed advice, but there are techniques to minimize error in the result (for example Kahan summation - I'm not sure whether it would apply here).
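For reference, the compensated (Kahan) accumulation pattern looks roughly like this - a generic sketch, not tied to your kernels, and I can't say how much it would help in your case:

__host__ __device__ float kahan_sum(const float *v, int n) {
    float sum = 0.f, c = 0.f;       // running sum and compensation term
    for (int i = 0; i < n; i++) {
        float y = v[i] - c;         // apply the compensation carried from the previous step
        float t = sum + y;          // low-order bits of y may be lost in this addition
        c = (t - sum) - y;          // recover the lost low-order part
        sum = t;                    // the next iteration starts from the compensated sum
    }
    return sum;
}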

That's exactly what I'm doing. I have a CPU function called Iterate(Z, Chi) (Z and Chi continue to exist in unified memory so the calculations can go on afterwards) which repeatedly calls Iterate_Z and Iterate_Chi and then the copy kernels. But Iterate(Z, Chi) must know when to stop - I was comparing Z_new[0] with Z[0] to find the relative difference after each iteration. If I'm only allowed float precision, then with a mere 23-bit mantissa I get about 1 + 10^-6, so the smallest "epsilon" that's reasonable to use is around 10^-5. (And yes, if I use a smaller epsilon, some loops might never end - like when I used epsilon = 10^-20 with double precision.)