Unpredictable memory behavior

I am having problems understanding how memory behaves.

What my program does is not important, but I will just say that I have developed a kernel that multiplies a banded matrix, with three diagonals on each side of the main diagonal, by a vector. Essentially, it is a custom sparse matrix-by-vector multiplication with seven nonzeros in each row.
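Just to give an idea of the structure, a minimal sketch of that kind of kernel could look like the code below (the diagonal-wise storage and the array names are only assumptions for illustration, not my actual code):

// Sketch of a 7-diagonal (banded) sparse matrix-vector product y = A*x.
// Assumed storage: 'diags' holds num_rows x 7 coefficients (row-major),
// one entry per diagonal, with offsets -3 .. +3 from the main diagonal.
__global__ void banded_spmv(const float *diags, const float *x, float *y,
                            int num_rows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows)                 // guard against extra threads
        return;

    float sum = 0.0f;
    for (int d = 0; d < 7; ++d) {
        int col = row + d - 3;           // column touched by this diagonal
        if (col >= 0 && col < num_rows)  // skip diagonals falling off the matrix
            sum += diags[row * 7 + d] * x[col];
    }
    y[row] = sum;
}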

After I had tested my code, and while I was taking some measurements, my program stopped running correctly, and the weird thing is that it stopped working only for matrices of big sizes. When I say “stopped working” I mean I was getting NaN as a result. I am talking about the same working code getting “corrupted” after some runs.

The biggest matrices I want to work with have 512000 rows (and since there are 7 nonzeros in each row, the data take 512000 x 7 x sizeof(float), which is only about 14 MB, so there is easily enough memory on a Tesla C870).

While trying to see how my code behaves for smaller matrices, I realized that it works for matrices of 4096 rows, and that if I gradually increase the size of the matrix from run to run until reaching 512000, I can “make” my program run correctly. After reaching the 512000 matrix size, my code works fine (at least for now :) ).

I had the same problem with other kernels, and, again, I found out that by gradually increasing the size of the matrices I can “make” my code run for the desired matrix size.

I free the memory at the end of the program, and I initialize the memory to the desired values.

So, I want to ask whether anybody has encountered the same problem, and whether anybody can explain this behavior.

Thanks,
Panos

P.S. I can post my code if somebody wants to look at it.

Not sure if it is relevant, but I too have been working on matrices and recently had a problem due to matrix size. My problem was that my calculations stopped working properly on an 11000x11000 matrix. I worked out that, when allocating memory for this matrix, it was taking up 11000 x 11000 x 4 (sizeof(float)) / 1024 / 1024 ≈ 461 MB.

Allocating anything bigger than 11000x11000 stopped my calculations from working simply because there was not enough space (the card was reserving the rest of the memory for screen operations).

The moral of the story is: check to make sure that the card you are using has enough free memory for a cudaMalloc of the size you need.
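If your CUDA version provides cudaMemGetInfo in the runtime API, a quick check like the sketch below (variable names made up) can tell you before the allocation whether it will fit:

#include <cstdio>
#include <cuda_runtime.h>

// Sketch: query free device memory before a big cudaMalloc.
int main()
{
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);

    size_t n = 11000ull * 11000ull;        // example: 11000x11000 floats
    size_t needed = n * sizeof(float);     // roughly 461 MB

    printf("free: %zu MB, needed: %zu MB\n", free_bytes >> 20, needed >> 20);

    float *d_matrix = NULL;
    if (needed > free_bytes ||
        cudaMalloc((void **)&d_matrix, needed) != cudaSuccess) {
        printf("not enough device memory for this matrix\n");
        return 1;
    }

    // ... run the calculation ...
    cudaFree(d_matrix);
    return 0;
}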

To me this sounds like a memory access or initialization problem. Maybe by increasing the size of the matrix each step you “initialize” memory locations that are used in the calculation but are not initialized by the run you are now doing. And if you don’t gradually increase the matrix size, you don’t get this kind of accidental initialization and therefore get NaNs (which most likely come from calculating on uninitialized values). I had a similar problem in one of my kernels and realized that my memset size was too small, because memset is a byte-wise operation. Maybe you should check your code for something like this. I hope you understand what I mean. And yes, posting your code might help others to find the problem.
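For example (buffer name made up), cudaMemset counts bytes, not elements, so a call like the first one below clears only a quarter of a float array:

// cudaMemset, like memset, takes a size in BYTES.
float *d_vec;
int n = 512000;
cudaMalloc((void **)&d_vec, n * sizeof(float));

cudaMemset(d_vec, 0, n);                   // wrong: clears only n bytes = n/4 floats
cudaMemset(d_vec, 0, n * sizeof(float));   // correct: clears all n floats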

Vrah

If you are running on Linux, compile in emulation debug mode and run your app through Valgrind: it is an invaluable tool for debugging memory initialization / out-of-bounds issues.

I used Valgrind to identify where I was accessing memory locations out of bounds, and I think I have corrected my code now. The funny thing is that Valgrind even found errors in cublas routines as well (I don’t know if these were false positives). Thanks.

But the memory behavior of CUDA is not the same as that of a CPU, and this has made me wonder, many times, what is going on.

For example, I had cases where I was writing past the bounds of a vector in global memory, and this was completely zeroing out other vectors residing in memory. And although that problem would give “Segmentation Faults” in emulation mode, it would still run on the GPU (not correctly, of course).
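To illustrate (the sizes and names here are made up, not my real code), something like the following will crash with a segmentation fault in emulation mode but can silently scribble over a neighbouring vector on the device:

#include <cuda_runtime.h>

// No bounds check: threads with i >= n write past the end of 'a'.
__global__ void bad_write(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] = 0.0f;                           // out of bounds when i >= n
}

int main()
{
    float *d_a, *d_b;                      // two vectors that may end up adjacent in global memory
    cudaMalloc((void **)&d_a, 1000 * sizeof(float));
    cudaMalloc((void **)&d_b, 1000 * sizeof(float));

    // 8 blocks x 256 threads = 2048 threads for only 1000 valid elements:
    // in emulation this segfaults, on the GPU it may just overwrite d_b.
    bad_write<<<8, 256>>>(d_a, 1000);
    cudaThreadSynchronize();

    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}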

Well, what are you expecting? I don’t think the GPU can oversee every memory access and check it against what has been initialized at this degree of parallelism. If you’re lucky, you get an “unspecified launch failure” for the next kernel after the erroneous one. And even this only happens if you manage to overwrite and kill some heaps or other sensitive data. If you write only 1 or 2 elements out of bounds, it might happen that (running on the device) you never encounter any problems, or, more likely, you see a problem the moment you least expect it ;)