"ECC error" and device to host data transfer quest

Hello,

I’m working on a code whose kernels run inside an outer do-while loop. After 270 steps, when I try to copy some of the computed arrays from the device to the host, I get the following error:

copyout Memcpy FAILED: 39<uncorrectable>

What does it mean? I haven’t found anything useful about this error on Google. In case it helps, my device is a Tesla C2050.

Also, the code needs to copy a single double precision value, calculated on the device, back to the host. If I export it only every 10 cycles, for example, that copy operation takes much longer than the single copy does when I export it every cycle. What is the reason for that behaviour? Since the copy involves the same variable, I would expect similar times to complete the operation.

One last question: is there a way to improve the transfer speed of a single double precision value? In my program it takes nearly as long as exporting an 840*1000 double precision array; I would expect it to take much less time.


Thanks in advance,

Nicola

Hi Nicola,

What does it mean?

The description in the CUDA header file “driver_types.h” states “this indicates that an uncorrectable ECC error was detected during execution.” I’m not sure what would cause such an error, so I’ve sent a note to one of my contacts at NVIDIA for help. My best guess is that there’s some type of memory corruption occurring, but I’m not sure why it would trigger an ECC error.

Have you tried turning ECC off? (“nvidia-smi -e 0”, then reboot). I doubt this would solve the problem, but it might change the behavior so the error given is more meaningful.


What is the reason for that behaviour? Since the copy involves the same variable, I would expect similar times to complete the operation.

I’m not sure. I would need an example which reproduces the problem.

is there a way to improve the transfer speed of a single double precision value? In my program it takes nearly as long as exporting an 840*1000 double precision array; I would expect it to take much less time.

This behavior is expected. There is a lot of overhead in transferring data, so it often takes about the same amount of time to copy one byte as it does to copy a megabyte.

One thing that might help is to add the “pinned” attribute to this variable’s declaration. All DMA transfers must be made from pinned memory (a physical memory location that can’t be paged out by the OS). The “pinned” attribute asks the OS to keep the variable in page-locked memory at all times. This avoids creating a pinned staging buffer and the extra host-to-host copy, thus reducing the overhead of copying the data. Note that the OS is not required to honor the request for pinned memory.
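For illustration only, here is a minimal CUDA Fortran sketch of the idea. The names are made up rather than taken from your code, and since the “pinned” attribute is used with allocatable host arrays, the single value sits in a one-element array in this sketch:

program pinned_copy_sketch
  use cudafor
  implicit none
  ! "pinned" requests page-locked host memory for this allocatable array
  real(8), allocatable, pinned :: t_host(:)
  real(8), allocatable, device :: t_dev(:)
  logical :: got_pinned

  ! the optional PINNED= specifier reports whether the OS honored the request
  allocate(t_host(1), pinned=got_pinned)
  allocate(t_dev(1))
  if (.not. got_pinned) print *, 'warning: host buffer is not page-locked'

  t_dev  = 8.7d0       ! stand-in for a value computed on the device
  t_host = t_dev       ! device-to-host copy; page-locked host memory avoids
                       ! the extra staging copy through a temporary pinned buffer
  print *, 't = ', t_host(1)

  deallocate(t_dev)
  deallocate(t_host)
end program pinned_copy_sketch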

  • Mat

Hi Nicola,

My contact at NVIDIA wrote back saying that “cudaErrorECCUncorrectable denotes a double-bit error”, which most likely means you have a bad board.

  • Mat

Thanks, Mat, for all your answers. Regarding the ECC error, I’ll try to figure out where the hardware problem is. (Yes, I tried turning off the ECC flag, but with no luck.)

As for the code needed to reproduce the problem of the copyout every n cycles, on Monday I’ll try to make a working example from some parts of the code I’m working on. (I can’t send you the whole code since it’s commercial; I’ll ask my boss what I’m allowed to share.)


Thanks again, and forgive my late answer; here in Italy there are holidays these days.


Have a nice day,

Nicola

Hi Mat,

I removed some parts of the code (so my boss is fine with sharing it) and it still shows that strange behaviour regarding the export of a single double precision value from the device to the host. In particular, in main_code.f90 you can change the value of i_dbg to increase the number of iterations:

  • If it’s equal to a small number like 3 or 5, the export operation is kinda fast;
  • If it’s equal to a greater number like 40 or 50, the export operation is really slow.

I’ve uploaded the VS2008 project on MediaFire:

http://www.mediafire.com/?w3mn543rx2sd36o


Thanks in advance,

Nicola

Hi Nicola,

Can you please give more details on the problem?

When I run your program the output is approximately 1 when i_dbg is set to 5, and 8 when it’s set to 40. Since the output gathers the total amount of kernel time over the i_dbg iterations, it makes sense that the time is 8 times larger when you increase the number of iterations by a factor of 8.

  • Mat

Actually, the time “t” displayed as output is the value computed by the program in order to predict the evolution of the solid transport. What I was talking about is the time required to copy the value of “t” from the device to the host.

I’ve uploaded a modified version of the program which includes a timing for the copyout operation (called “copyout_time”).

http://www.mediafire.com/?xaymuvavzopymup


I would expect the time required to copy a single double precision value from the device to the host to be almost the same after 5 cycles or after 40 cycles (the size of the data is the same).

Also, the times to export the variable “time_dev” are very large compared to the kernels’ execution time. Shouldn’t they be smaller? If you run the code on your PC, what times (to copy the time value from the device to the host) do you get?


Thanks again,

Nicola

Hi Nicola,

Kernel calls are asynchronous (i.e. the host code continues to execute after a kernel is launched). The host doesn’t block until it reaches the “time=time_dev” data copy. Hence, you’re not timing just the data copy, but rather the remaining kernel execution time plus the data transfer.

To fix, add a call to “cudaThreadSynchronize” before you start the data transfer timer.

if(index==i_dbg) then
        ierr = cudaThreadSynchronize()     ! wait for all queued kernels to finish
        call system_clock(start_copyout)   ! the timer now brackets only the copy

        time = time_dev                    ! device-to-host copy being timed

        call system_clock(end_copyout)

Before the change:

****** DEBUG DT *******************
 t:                             8.700117769630859     
 Time required for copyout:     16.58562088012695

After the change:

 ****** DEBUG DT *******************
 t:                             8.700117769630859     
 Time required for copyout:    2.9000000722589903E-005

A better method of determining GPU times is to use CUDA event timers or profiling. In this case, by setting the environment variable CUDA_PROFILE to 1, we can see the actual time of the data transfer.

The original code has the profile of:

method=[ memcpyDtoH ] gputime=[ 1.952 ] cputime=[ 7859871.000 ]

So the actual GPU time is only a few microseconds, but the CPU time is nearly 8 seconds. In other words, the call was blocked waiting for the kernels to finish. Note that the profiling itself periodically blocks the host code, which accounts for the ~9 second difference.

The modified code’s profile shows nearly identical GPU time, but now the CPU time covers only the data transfer.

method=[ memcpyDtoH ] gputime=[ 1.984 ] cputime=[ 16.000 ]
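For reference, the CUDA event timer approach mentioned above could look roughly like this in CUDA Fortran (a sketch with made-up names, not code from your project). Events are recorded in the GPU’s stream, so the elapsed time brackets only the copy, independent of any host-side blocking:

subroutine time_copyout(time_dev, time_host)
  use cudafor
  implicit none
  real(8), device, intent(in) :: time_dev
  real(8), intent(out)        :: time_host
  type(cudaEvent) :: startEv, stopEv
  real    :: ms      ! elapsed time in milliseconds
  integer :: istat

  istat = cudaEventCreate(startEv)
  istat = cudaEventCreate(stopEv)

  istat = cudaEventRecord(startEv, 0)   ! record on the default stream
  time_host = time_dev                  ! the device-to-host copy being timed
  istat = cudaEventRecord(stopEv, 0)

  istat = cudaEventSynchronize(stopEv)  ! wait for the stop event to complete
  istat = cudaEventElapsedTime(ms, startEv, stopEv)
  print *, 'copyout time (ms): ', ms

  istat = cudaEventDestroy(startEv)
  istat = cudaEventDestroy(stopEv)
end subroutine time_copyout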

Hope this explains things,
Mat

Hi Mat,

thanks for your answers. Since the slow-down is not due to the device-to-host transfer, I tried to profile the code.

The PGI Profiler is meant for codes developed with Accelerator clauses, so I downloaded the NVIDIA CUDA Visual Profiler. I tried several times to profile the .exe obtained in “Release mode” (VS2008) with no success. In particular, I did the following steps:

  • Compile the code;
  • Put the .exe file inside a folder with the input data and cudart64_32_16.dll (if that file is missing, I get an error);
  • Start the Profiler;
  • Either “Create…” or “Profile Application…”;
  • Locate the .exe file in the “Launch:” field;
  • Remove the “Max Execution Time” value.

What I got is:

  • Most of the time, the video drivers crash (I’m on Windows 7 x64 SP1 with the 276 drivers);
  • Once or twice I managed to complete the 15 steps (even though the video driver crashed 2 or 3 times), but only the first 4 or 5 kernels got profiled.

Where did I go wrong? I uploaded the compiled file on MediaFire:

http://www.mediafire.com/?6sp678c2ct6ojjm

Instead of the NVIDIA profiler, should I use some CUDA events in order to profile the program?


Thanks (again) in advance,

Nicola

Hi Nicola,

The PGI Profiler is meant for codes developed with Accelerator clauses

No, the profiler can be used with CUDA Fortran. How are you collecting the profile information? From a PGI shell, try running the “pgcollect” utility with the “-cuda” flag (see the sample invocation after the help output below).

From pgcollect -help:

Profiling of Accelerator/GPU Events:
-cuda[=gmem|branch|cfg:<cfgpath>|cc10|cc11|cc12|cc13|cc20|list]
                    Collect performance data from CUDA-enabled GPU
    gmem            Global memory access statistics
    branch          Branching and Warp statistics
    cfg:<cfgpath>   Specifies <cfgpath> as CUDA profile config file
    cc10            Use counters for compute capability 1.x
    cc11            Use counters for compute capability 1.x
    cc12            Use counters for compute capability 1.x
    cc13            Use counters for compute capability 1.x
    cc20            Use counters for compute capability 2.0
    list            List cuda event names used in profile config file
-cudainit           Initialize CUDA driver to eliminate overhead from profile
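For example, from the PGI command shell the collection step would be something along these lines (the executable name is just a placeholder, and “cc20” is chosen because the Tesla C2050 is compute capability 2.0):

pgcollect -cuda=cc20 my_program.exe

The resulting profile data can then be opened in PGPROF.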



so I downloaded the nVidia CUDA Visual Profiler.

Sorry, I’ve never used NVIDIA’s profiler, so I don’t really know how to solve this problem. Hopefully someone else can step in.

  • Mat