cudbgprocess

hello,

this should probably be under the cuda-gdb board, but since geoffg no longer attends that board, said board’s response time moved from 1 month or so, to infinity

i am attempting to debug a rather lengthy project at the process level
it may run numerous iterations, an iteration may take a number of seconds, even a minute, to complete
at this point i am trying to debug the steering logic responsible for evaluating an iteration’s result, and setting up the next iteration
i hit about 15 iterations, without being done, before the machine runs out of memory and becomes non-responsive

i am running the project in the debugger, as i have a number of breakpoints i wish to follow
the system monitor (linux) lists the application, as well as cudbgprocess
the memory footprint of the application itself generally remains constant from iteration to iteration
the memory footprint of cudbgprocess grows at seemingly a constant rate per iteration

i am aware of the fact that the debugger seems to build and compile some form of trace or log in the background - i remember crashing the debugger once, and it took forever for the debugger to dump its log into an output file

does the above sound like a debugger-induced instance, or a potential application memory leak?
in the case of the former, is there any way to disable the debugger trace/ log building?

many thanks

running the program with valgrind does not really report errors

running the debug version outside the debugger does not inflate the memory footprint

hence, at this point my preliminary conclusion is that the debugger itself is causing the unsustainable memory footprint growth, either by leaking, or by compiling an impractically sized log/ trace/ report
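
to put numbers on the growth instead of watching the system monitor, something along these lines could log the resident memory of cudbgprocess once per iteration. a minimal sketch, assuming linux, where /proc/<pid>/status carries a VmRSS line in kB; the function names are made up for illustration, and the pid has to be looked up separately (for example with pgrep):

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Parse the "VmRSS:" line from text in /proc/<pid>/status format.
// Returns the resident set size in kB, or -1 if no usable line is found.
long parse_vm_rss_kb(std::istream& status)
{
    std::string line;
    while (std::getline(status, line)) {
        if (line.compare(0, 6, "VmRSS:") == 0) {
            std::istringstream fields(line.substr(6));
            long kb = -1;
            if (fields >> kb) return kb;
            return -1;
        }
    }
    return -1;
}

// Hypothetical convenience wrapper: resident set size of a given pid, in kB.
long rss_kb_of_pid(int pid)
{
    std::ifstream status("/proc/" + std::to_string(pid) + "/status");
    if (!status) return -1;
    return parse_vm_rss_kb(status);
}
```

calling rss_kb_of_pid once per iteration, for both the application and cudbgprocess, and printing the difference would show whether the growth per iteration is really constant.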

are you using the latest cuda version?

If so, and you can develop a fairly direct set of instructions to reproduce your observation, you may want to file a bug at developer.nvidia.com

i realize that you likely intend no harm, but i wonder whether i can respond without coming across as offensive
after this debacle, i am truly not pleased, and it is likely going to take a few days to calm my mind, and think purely rationally again

“are you using the latest cuda version?”

i use 6.5
i use double precision - with my latest project i make calls around 1e-10
maxwell thus does not seem that appealing
perhaps i would consider moving to 7.5 more strongly when allanmac stops complaining about 7.5’s apparent policy of spilling
this also assumes that there is a significant difference in the debugger of 7.5 versus that of 6.5; perhaps i should check the manual/ change log to verify this

“develop a fairly direct set of instructions”

to what end?
the project commences by reading from a database; already that is an obstacle in terms of reproduction, to some extent
the issue occurs likely because the project is minimally ‘enormous’ in scope; that too is an obstacle in terms of reproduction
even if i file a bug, there is no guarantee that it would get priority and thus attention
i have been raising issues about the debugger beforehand
scottgray too has been complaining about (or at least raising) anomalies or broken content from previous versions that are still not fixed
considering the above, i find that i have to consider the opportunity cost of my time, and that i am poorly incentivized to file a bug, keeping in mind that nvidia likely does the same: consider the opportunity cost of their time

is nvidia’s software development team over-stretched, would you say?
i think so
the software circle you (can) draw around new architectures and packages is ever-increasing; i wonder if the software development team is ever increasing

cuda maestro txbob, are you there?

i must have completely lost my mind - i actually took the time to build a reproduction case

if i sent you the files, would you test if you too can reproduce the incident?

Yes, I’m here. The reason I suggested trying the latest CUDA version is not to suggest that you must switch to it against your will, but because it is the first thing I would do and it is the first thing NVIDIA QA would do upon receipt of your bug report. If it is fixed in a newer CUDA version (bugs do get fixed all the time), then nobody is going to spend any time on it (sorry).

If you go through the bug submission process and give me the bug number, I can look at it that way. But if it is complex to setup/observe, I can’t make any guarantees about the time I can spend looking at it. Another limiting factor for me is if it requires some special setup (like it only happens on SLES10, for example). In that case, I may or may not go through the trouble of setting up a SLES10 system. And likely the end result of my inspection would just be “yes I can see the problem” or “no, I can’t see the problem”. Any more meaningful interaction would probably have to come from the NVIDIA dev team.

you could have been a lawyer with all your disclaimers

i have been dual (multi) booting for some time, but have not run multiple versions of cuda with this scheme
i have now set up 7.5 as part of another boot, and the incidence persists
fedora + cuda 6.5 vs opensuse + cuda 7.5

decide for yourself whether the following is too much:

you can drop the headers and source files in a new project
the code expects the data to be in a separate directory /s_dump/temp_bin/data
alternatively, the location of the data is specified right at the top of header file: tsd_const_v4.cuh

to observe the splendid phenomenon:
add a breakpoint to line 102

lint[2] = 0;

of

mainA.cu

the program will run for some time - a number of seconds
nothing much would happen with the program itself
however, if all goes well, if you open the system monitor, you would note the system memory consumption steadily increasing, and cudbgprocess under ‘processes’, steadily increasing its footprint

to build the reproduction case, i took (chopped off) a section of the program, and placed it in an infinite loop essentially - this is a proper approximation of the grand project
the breakpoint is pointing to a counter that resets after i think 500 runs
running the section multiple times approximates an iteration of the grand project, and running it infinite times also approximates the steering logic of the grand project, in that the latter controls termination, and may be time consuming
this should demonstrate that it is seemingly not possible to run a large project too long within the debugger
and this is hardly funny, as i now essentially need to debug from meta data - debug by data analysis

i shall forward the files in a private mail

have sent the files
for some reason i could not label the message - it auto labeled itself as ‘dd’
i find that grandiose of course

you might need to continue (f8) the project once it hits the breakpoint
every time you continue, the program will again run a number of seconds, and you can note the buildup of the memory footprint

I put the data files in a directory ./s_dump/temp_bin/data which is referenced from the directory where I have the source code. I modified the tsd_const_v4.cuh file so that each of the (6) filenames had an additional “.”, like so:

#define in_arr_dump_filenameissAn "./s_dump/temp_bin/data/issn.txt"

I put all source and header files in the same directory. I built the code like so:

nvcc -o test -lineinfo -g -G algo_main_aux.cpp algo_main_funct.cpp krnlA.cu krnlB.cu krnlC.cu mainA.cu mainB.cu tsd_main_com.cu

When I run the code, it seg faults.

If I set that breakpoint in mainA.cu, it is not hit. Line 102 of mainA.cu appears to be this:

if (lint[2] == 1000)  // this is line 102
{
        lint[2] = 0;

Here is my cuda-gdb session:

$ cuda-gdb ./test
NVIDIA (R) CUDA Debugger
7.5 release
Portions Copyright (C) 2007-2015 NVIDIA Corporation
GNU gdb (GDB) 7.6.2
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/bob/misc/junk1/V1/test...done.
(cuda-gdb) break mainA.cu:102
Breakpoint 1 at 0x406315: file mainA.cu, line 102.
(cuda-gdb) run
Starting program: /home/bob/misc/junk1/V1/./test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff701c700 (LWP 4402)]
[New Thread 0x7ffff5beb700 (LWP 4403)]

Program received signal SIGSEGV, Segmentation fault.
0x000000319b28040c in free () from /lib64/libc.so.6
(cuda-gdb)

I tried an absolute path instead of a relative path for the filenames, and it still seg faults. It seems to seg fault in the data loading area. It seems to be segfaulting on the read of sol.txt, which is the last of the files to be read. It reads the file for some time but eventually seg faults within tsd_dbl_arr_read_from_file.

The seg fault occurs upon executing the ifile.close() statement at the end of tsd_dbl_arr_read_from_file when reading the last file (sol.txt).

Running your code with valgrind provides, in part, this useful output:

==4825== Invalid write of size 8
==4825==    at 0x403804: tsd_dbl_arr_read_from_file(int&, double*, char const*) (algo_main_aux.cpp:207)
==4825==    by 0x4083D4: tsd_read_input_arr_from_file(TSD_data*, CUstream_st*) (mainA.cu:828)
==4825==    by 0x4060EA: main (mainA.cu:35)

After reviewing your tsd_dbl_arr_read_from_file function, it seems that it has no range checking. It will continue to read input elements, possibly writing beyond the end of the input buffer. This seems to be what is happening. Instrumenting the code to print out the allocated size for the buffer:

tsd_data->h_base_comb = new double[tsd_data->coeff_cnt];
printf("coeff_cnt: %d\n", tsd_data->coeff_cnt);
tsd_dbl_arr_read_from_file(lint[0],
        tsd_data->h_base_comb, in_arr_dump_filenameC);

yields an output of 49 (according to my observation), whereas the actual cnt value after the function (tsd_dbl_arr_read_from_file) completes is 210 for the given file (sol.txt). The 210 number appears to be in agreement with the number of doubles in sol.txt.

Due to this overrun of the buffer, I believe some form of data corruption is occurring, and I believe this data corruption is the proximal reason for the seg fault on ifile.close().
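
For illustration, a range-checked reader might look roughly like the following. This is a minimal sketch, not the actual tsd_dbl_arr_read_from_file: the name read_doubles_bounded, the capacity parameter, and the use of a generic input stream are all assumptions made for the example.

```cpp
#include <cassert>
#include <istream>
#include <sstream>

// Hypothetical bounded variant of a "read doubles into a buffer" helper.
// Stores at most `capacity` values in `out`; `cnt` receives the number stored.
// Returns false if the input holds more values than the buffer can take.
bool read_doubles_bounded(std::istream& in, double* out, int capacity, int& cnt)
{
    cnt = 0;
    double value;
    while (in >> value) {
        if (cnt >= capacity) {
            return false;   // would overrun the buffer: stop instead of writing
        }
        out[cnt++] = value;
    }
    return true;
}
```

With a buffer of 49 elements and a file of 210 values, a reader like this would stop after 49 stores and report the mismatch, instead of corrupting the heap and failing later in an unrelated call.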

hits [1230]!

a) reading the data from file is not core design, and only incorporated for purposes of creating a reproduction case
b) it tested fine on my machine; hence, i had no reason for concern
c) i would hardly consider manipulating strings my speciality

what are you running?
i ran it on opensuse; let me test this on fedora quickly
i could also catch errors on file.close perhaps, to make it more fail safe

i think i have found the problem

txbob, try changing mainA.cu line 826 from

tsd_data->h_base_comb = new double[tsd_data->coeff_cnt];

to

tsd_data->h_base_comb = new double[algo_dat->comb_pnt_n];

as per mainA.cu line 831

if (lint[0] != algo_dat->comb_pnt_n)

this is an artifact from the grand project

both variables point to the same value; however, you would find that the value of the former has not been set yet at that point
also, this would mean that its value would be machine dependent - depending on what the value of ‘nothing’ is
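
one way to make this kind of slip fail the same way on every machine is to give the count a defined starting value and refuse to allocate until it has been set. a minimal sketch, assuming a stand-in struct (TsdDataSketch and alloc_base_comb are made-up names, not the real TSD_data or anything in the project):

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the real TSD_data; only the relevant fields.
struct TsdDataSketch {
    int coeff_cnt = 0;                 // default member initializer: never indeterminate
    std::vector<double> h_base_comb;   // host buffer sized from coeff_cnt
};

// Allocate the host buffer only once the count has actually been set.
void alloc_base_comb(TsdDataSketch& d)
{
    if (d.coeff_cnt <= 0) {
        throw std::logic_error("coeff_cnt not set before allocation");
    }
    d.h_base_comb.assign(static_cast<std::size_t>(d.coeff_cnt), 0.0);
}
```

with this, allocating before the count is assigned throws on fedora and opensuse alike, rather than depending on whatever happens to be in memory.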

the segmentation fault originates from the lack of guards on the arrays passed to the functions reading the data from the data files
as long as the functions find valid values within the file, they would continue to append these to the input array passed to them
the sizes of the input arrays are predetermined, and assigned beforehand
a segmentation fault can then easily occur, if there is a discrepancy between valid data in the file, and the size of the input array
i could implement guards, but the phenomenon points to an input error - it essentially should not occur, such that the occurrence of a segmentation fault is helpful, and the guards are superfluous

there are 3 array sizes, and the function that reads the data files checks against these

algo_dat->issues_n = 27
algo_dat->tot_iss_pnts = 422
algo_dat->comb_pnt_n = 211

if you find these values in the function that reads the data files, you should be fine, i believe

and my reference to mainA.cu line 102 is wrong
you need a breakpoint on line 104, not 102

i apologize for the trouble

I was able to reproduce the observation, and I have filed a couple bugs internally (you may wish to reference 1708953 and 1708956 if you file your own bug at developer.nvidia.com). I don’t have anything further to report (at this time, I can’t explain it, I don’t know of any workarounds, and I can certainly agree it is undesirable behavior). If additional information comes to light through the bugs that I can share externally, I will report back here.

(I was running on fedora 20, since you asked.)

The developers seem to have confirmed a memory leak in a component associated with the debugger. This will be evident basically on a large or unbounded number of kernel launches in the application. I cannot make any commitments about schedule for a fix (being the lawyer that I am), but I would not expect a fix before CUDA 8.0. Thanks for the report.

As a possible workaround, limit the number of kernel launches that is required to satisfactorily debug your application, if possible.
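
One way to apply such a limit without editing the code for every debug session is to cap the outer iteration loop from the environment. The following is a minimal sketch under that assumption; parse_iter_cap, run_capped, and the variable name DBG_MAX_ITERS mentioned below are hypothetical and not part of CUDA or cuda-gdb.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: parse an iteration cap from text (e.g. the value of an
// environment variable). Returns `fallback` when the text is null, empty, or
// not a positive integer.
long parse_iter_cap(const char* text, long fallback)
{
    if (text == nullptr || *text == '\0') return fallback;
    char* end = nullptr;
    long v = std::strtol(text, &end, 10);
    if (*end != '\0' || v <= 0) return fallback;
    return v;
}

// Run `step` until it reports completion or the cap is reached.
// `step` receives the 1-based iteration number and returns false when done.
// Returns the number of iterations executed.
template <typename Step>
long run_capped(Step step, long cap)
{
    long done = 0;
    while (done < cap) {
        ++done;
        if (!step(done)) break;
    }
    return done;
}
```

The cap could be read with something like parse_iter_cap(std::getenv("DBG_MAX_ITERS"), some_large_default), with the kernel launches of one iteration placed inside the step callable. Whether a capped run still reaches the state you need to inspect depends on the application.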

a well-connected lawyer at that - with plenty a developer at your fingertips
developers clearly jump when you speak, but do not even blink otherwise

“limit the number of kernel launches that is required to satisfactorily debug your application”
chuckle-chuckle-chuckle
that is like saying only eat once a month
i think i am going to stick with debugging from meta data - data dumps

does this now mean you get the t-shirt with the big bug icon on it (similar to the nsight debugger icon)?

you have been a great help; many thanks, txbob

PS: you think you can get the developers to fix the moving breakpoints too?
i have mentioned this on the gdb board, but a) i am not sure geoffg officially noted a bug report, b) geoffg told me he no longer works on gdb

txbob…?

do you know whether the issue has been addressed in cuda 8.0 rc?

i can’t find a change log on c 8.0 rc so far

i now sit with an even larger program that does even more repetitive kernel launches; this time trying to debug from dumps proves to be more troublesome

i get a segmentation fault some time into the program, with little means to trace it

The two issues I mentioned (1708953 and 1708956) have not been fixed in CUDA 8RC.

If you file your own bugs, and ping those bugs when you want a status update, you may get better/faster attention. You can create fairly simple bugs just referencing this forum posting and the two bug numbers I have listed above.

noted, thanks

in your view, would it be too much to expect the mentioned bugs to be addressed in c 8.0 - the actual version?

I think it’s unlikely. I believe both are internally marked (at the moment) as fix in CUDA 8.5. This happens when the developers have a resource issue, and go through a bug prioritization process. For whatever reason, these bugs weren’t prioritized highly enough at the current time against other “must fix” bugs for CUDA 8.0.

My suggestion would be to increase the “noise” around those, if you care a lot about them, and one way to do that is to file your own bugs and ping those from time to time.

The squeaky wheel gets the grease, or so I’ve heard.

No promises, no guarantees, YMMV, I can’t predict the future, and hereafter follows the usual reading of all the lawyerly disclaimers…

Sorry for the slow progress.