Cuda application crashes works fine for small data and crashes for big data

Hello,

I would like to get some help with the following problem.

I am writing an application for gpu using NVidia cuda (nvcc compiler).

The problem is the following:

I have a computational grid with a number of nodes and points. For each point I need to
compute a matrix which is related to nlimit closest nodes, where nlimit is
a user defined integer number.

The number of points can be realy big.

For each point I need to invert a matrix which is the product of several
matrix - vector and matrix - matrix operations. After the inversion is done I
do further operations with matrices. The size of the matrices is small NxN where
N is in the range [3-300]. In the application every thread is associated with a single
or more points which means that every thread performs a matrix inversion of small size.

To do these matrix operations additional memory (work) memory is allocated on the device and
sent to kernel.

However I am unable to operate on big number of data. When I do it my screen
flickers (flushes or something) and I obtain an “unspecified launch failure”.

I used cuda-memcheck but it didn’t find any error in the code.

The biggest ammount of data that I have reached are 200 points and nodes and the limit
is about 50. For limit 150 the application fails with the aforementioned flickering of the screen.
However the memory requirements for limit 150 are very small (only 4.12445 mb).

I am posting the code that fails to run properly. It consists of two files,
kernel.cu and main.cpp.
Please notice that no local memory is allocated within the kernel. Working memory is supplied as argument.

I am using windows, the available ram is 32.0 Gb and from
task manager I have the following numbers:

Dedicated GPU is 3.0 Gb
GPU memory is 19.0Gb
Shared GPU memory 16.0Gb.

The identification of the graphics card is the following:

Device 0: “GeForce GTX 670MX”
CUDA Driver Version / Runtime Version 10.1 / 10.1
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 3072 MBytes (3221225472 bytes)
( 5) Multiprocessors, (192) CUDA Cores/MP: 960 CUDA Cores
GPU Max Clock rate: 601 MHz (0.60 GHz)
Memory Clock rate: 1400 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 393216 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 1
Result = PASS

the cuda-memcheck gives me the following diagnostic. This diagnostic does not appear when nlimit = 50;

“c:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin\cuda-memcheck.exe” --leak-check full cuda_post.exe
========= CUDA-MEMCHECK
allocated memory on device 3.69648e+07 bytes 36098.4 kb 35.2524 mb 0.0344262 gb
unspecified launch failure
cudaStatus != cudaSuccess (failure) in file ./cuda_post/cuda_post/cuda_post/kernel.cu line 170
========= Program hit cudaErrorLaunchFailure (error 719) due to “unspecified launch failure” on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuMemcpy2DAsync + 0x2fa194) [0x3088cb]
========= Host Frame:d:\esell\dev\tests\cuda_post\cuda_post\x64\Debug\cuda_post.exe (cudaDeviceSynchronize + 0xf8) [0x3c838]
========= Host Frame:d:\esell\dev\tests\cuda_post\cuda_post\x64\Debug\cuda_post.exe (kernel_interface + 0x547) [0x93d57]
========= Host Frame:d:\esell\dev\tests\cuda_post\cuda_post\x64\Debug\cuda_post.exe (calc_rbf + 0x3e5) [0x94425]
========= Host Frame:d:\esell\dev\tests\cuda_post\cuda_post\x64\Debug\cuda_post.exe (run_data + 0x19b) [0x9dbcb]
========= Host Frame:d:\esell\dev\tests\cuda_post\cuda_post\x64\Debug\cuda_post.exe (main + 0x2f) [0x9df5f]
========= Host Frame:d:\esell\dev\tests\cuda_post\cuda_post\x64\Debug\cuda_post.exe (invoke_main + 0x34) [0x9f654]
========= Host Frame:d:\esell\dev\tests\cuda_post\cuda_post\x64\Debug\cuda_post.exe (__scrt_common_main_seh + 0x12e) [0x9f53e]

Any help is appreciated.

main.cpp and kernel.cu files follow:

#include <cstdlib>
#include <iostream>
#include <iomanip>
#include <cstdio>
#include <vector>
static void run_data();
void calc_rbf(const size_t &numBlock, const size_t &numThread, const size_t &nsp, const size_t &num_npoint, const double *points, const size_t &num_node, const std::vector<double> &nodes, const size_t &nlimit, size_t &act_node, double **shp);
int main() {
    run_data();
}
static void calc_grid(const double &len, const size_t &nsp, const size_t &npt, double *pt) {
    size_t i, j, k, lnd;
    double dl;
    dl = len / (npt - 1.);
    for (i=0;i!=npt;i++) { // x
        for (j=0;j!=npt;j++) { // y
            for (k=0;k!=npt;k++) { // z
                lnd = (i * npt + j) * npt + k;
                pt[lnd * nsp + 0] = i * dl;
                pt[lnd * nsp + 1] = j * dl;
                pt[lnd * nsp + 2] = k * dl;
            }
        }
    }
}
static void run_data() {
    const size_t LIMIT_PRINT = 3;
    const size_t NSP = 3;
    const size_t NUM_NODES = 200, TOT_NODES = NUM_NODES * NUM_NODES * NUM_NODES;
    const size_t NUM_NPOINT = 200, TOT_NPOINT = NUM_NPOINT * NUM_NPOINT * NUM_NPOINT;
    const double LEN = 1.0;
    size_t i, j, k, nlimit, lnd, act_node, ipt, ind;
    size_t numBlock, numThread;
    std::vector<double> nodes(TOT_NODES * NSP);
    double *points = 0, *shp = 0;
    points = new double [TOT_NPOINT * NSP ];
    nlimit = 50;
    calc_grid(LEN, NSP, NUM_NODES, &nodes[ 0 ]);
    calc_grid(LEN, NSP, NUM_NPOINT, points);
    numBlock = 1;
    numThread = 256;
    calc_rbf(numBlock, numThread, NSP, NUM_NPOINT, points, NUM_NPOINT, nodes, nlimit, act_node, &shp);
    std::cout<<std::setprecision(8)<<std::scientific;
    for (ipt=0;ipt!=NUM_NPOINT;ipt++) {
        if ( ipt <LIMIT_PRINT || NUM_NPOINT - ipt < LIMIT_PRINT ) {
            for (ind=0;ind!=act_node;ind++) {
                if ( ind <LIMIT_PRINT || act_node - ind < LIMIT_PRINT ) {
                    std::cout<<std::setw(4)<<ipt<<std::setw(4)<<ind<<std::setw(16)<<shp[ipt * act_node + ind]<<"\n";
                }
            }
        }
    }
    delete [] shp;
    delete [] points;
}
#include <vector>
#include <cstdlib>
#include <cstdio>
#include <iostream>
#include <cmath>
#include <algorithm>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
__device__ double wfunc(const size_t &nsp, const double *a, const double *b) {
    const double dk = 2.0, dcon = 0.01, dpw = 0.5;
    double wf, d;
    size_t i;
    d = 0.0;
    for (i=0;i!=nsp;i++) d += pow(a[ i ] - b[ i ], 2.0);
    d = sqrt(d);
    wf = pow( (pow(d, dk) + dcon), dpw);
    return wf;
}
__global__ void rbfKernel(const size_t nsp, const size_t num_npoint, const double *point, const size_t act_node, const double *nd_coord, double *wrk_mat, double *wrk_vec, double *shp) {
    size_t ipt, offset, ind, jnd;
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (ipt=index;ipt <num_npoint;ipt+=stride) {
        offset = ipt * act_node;
        double *wmat = &wrk_mat[ offset ];
        for (ind=0;ind!=act_node;ind++) {
            for (jnd=0;jnd!=act_node;jnd++) {
                wmat[ind + jnd * act_node] = wfunc(nsp, &nd_coord[(ipt * act_node + ind) * nsp], &nd_coord[(ipt * act_node + jnd) * nsp]);
            }
        }
        for (ind=0;ind!=act_node;ind++) {
            for (jnd=0;jnd!=act_node;jnd++) {
                wmat[ind + jnd * act_node] = 1.0 / wmat[ind + jnd * act_node];
            }
        }
        double *wvec = &wrk_vec[ offset ];
        for (ind=0;ind!=act_node;ind++) {
            wvec[ ind ] = wfunc(nsp, &point[ipt * nsp], &nd_coord[(ipt * act_node + ind) * nsp]);
        }
        double *wshp = &shp[ offset ];
        for (ind=0;ind!=act_node;ind++) wshp[ ind ] = 0.0;
        for (ind=0;ind!=act_node;ind++) {
            for (jnd=0;jnd!=act_node;jnd++) {
                wshp[ ind ] += wvec[ jnd ] * wmat[ind + jnd * act_node];
            }
        }
    }
}
void check_cuda_status(const cudaError_t &cudaStatus, const char *file, const size_t &line) {
    if (cudaStatus != cudaSuccess) {
        std::cout<<cudaGetErrorString(cudaStatus)<<"\n";
        std::cout<<" cudaStatus != cudaSuccess (failure) in file "<<file<<" line "<<line<<"\n"; 
        exit(1);
    }
}
void kernel_interface(const size_t &numBlock, const size_t &numThread, const size_t &nsp, const size_t &num_npoint, const double *points, const size_t &act_node, const double *nd_coord, double *shp) {
    size_t req_nd, req_pt, req_mat, req_vec, req_shp;
    float tot_mem;
    double *dev_point = 0, *dev_nd_coord = 0, *dev_wrk_mat, *dev_wrk_vec, *dev_shp;
    req_pt = num_npoint * nsp * sizeof(double);
    check_cuda_status( cudaMalloc((void **)&dev_point, req_pt), __FILE__, __LINE__);
    req_nd = num_npoint * act_node * nsp * sizeof(double);
    check_cuda_status( cudaMalloc((void **)&dev_nd_coord, req_nd), __FILE__, __LINE__);
    req_shp = num_npoint * act_node * sizeof(double);
    check_cuda_status( cudaMalloc((void **)&dev_shp, req_shp), __FILE__, __LINE__);
    req_mat = num_npoint * act_node * act_node * sizeof(double);
    check_cuda_status( cudaMalloc((void **)&dev_wrk_mat, req_mat), __FILE__, __LINE__);
    req_vec = num_npoint * act_node * sizeof(double);
    check_cuda_status( cudaMalloc((void **)&dev_wrk_vec, req_vec), __FILE__, __LINE__);
    tot_mem = req_pt + req_nd + req_mat + req_vec;
    std::cout<<"allocated memory on device "<<tot_mem<<" bytes "<<(tot_mem / 1024.)<<" kb "<<(tot_mem / (1024. * 1024.))<<" mb "<<(tot_mem / (1024. * 1024. * 1024.))<<" gb\n";
    check_cuda_status( cudaMemcpy( dev_point,    points,   req_pt, cudaMemcpyHostToDevice), __FILE__, __LINE__);
    check_cuda_status( cudaMemcpy( dev_nd_coord, nd_coord, req_nd, cudaMemcpyHostToDevice), __FILE__, __LINE__);
    rbfKernel<<<numBlock, numThread>>>(nsp, num_npoint, dev_point, act_node, dev_nd_coord, dev_wrk_mat, dev_wrk_vec, dev_shp);
    check_cuda_status( cudaGetLastError(), __FILE__, __LINE__ );
    check_cuda_status( cudaDeviceSynchronize(), __FILE__, __LINE__ );
    check_cuda_status( cudaMemcpy(shp, dev_shp, req_shp, cudaMemcpyDeviceToHost), __FILE__, __LINE__ );
    check_cuda_status( cudaFree(dev_point), __FILE__, __LINE__);
    check_cuda_status( cudaFree(dev_nd_coord), __FILE__, __LINE__);
    check_cuda_status( cudaFree(dev_shp), __FILE__, __LINE__);
    check_cuda_status( cudaFree(dev_wrk_mat), __FILE__, __LINE__);
    check_cuda_status( cudaFree(dev_wrk_vec), __FILE__, __LINE__);
    cudaDeviceReset();
}
double distance(const size_t &nsp, const double *a, const double *b) {
    size_t i;
    double d;
    d = 0.0;
    for (i=0;i!=nsp;i++) d += pow(a[ i ] - b[ i ], 2.0);
    d = sqrt( d );
    return d;
}
struct closest_node {
    size_t knd_;
    double dist_;
};
bool sort_criterio(const closest_node &a, const closest_node &b) {return a.dist_ < b.dist_;}
void calc_rbf(const size_t &numBlock, const size_t &numThread, const size_t &nsp, const size_t &num_npoint, const double *points, const size_t &num_node, const std::vector<double> &nodes, const size_t &nlimit, size_t &act_node, double **shp) {
    size_t ipt, ind, isp;
    double d;
    std::vector<closest_node> neigh_node;
    double *nd_coord = 0;
    act_node = std::min(num_node, nlimit);
    nd_coord = new double [num_npoint * act_node * nsp ];
    for (ipt=0;ipt!=num_npoint;ipt++) {
        neigh_node.clear();
        for (ind=0;ind!=num_node;ind++) {
            d = distance(nsp, &points[ipt * nsp], &nodes[ipt * nsp]);
            closest_node cln;
            cln.knd_ = ind;
            cln.dist_ = d;
            neigh_node.push_back( cln );
        }
        std::sort(neigh_node.begin(), neigh_node.end(), sort_criterio);
        for (ind=0;ind!=act_node;ind++) {
            for (isp=0;isp!=nsp;isp++) {
                nd_coord[(ipt * act_node + ind) * nsp + isp] = nodes[ neigh_node[ ind ].knd_ * nsp + isp ];
            }
        }
    }
    *shp = new double[ num_npoint * act_node ];
    kernel_interface(numBlock, numThread, nsp, num_npoint, points, act_node, nd_coord, *shp);
    delete [] nd_coord;
}

This is a fairly typical “signature” for a WDDM TDR timeout.

It worked. Thanks.