CUDA application works fine for small data but crashes for big data

Hello,

I would like to get some help with the following problem.

I am writing an application for the GPU using NVIDIA CUDA (nvcc compiler).

The problem is the following:

I have a computational grid with a number of nodes and points. For each point I need to
compute a matrix which is related to its nlimit closest nodes, where nlimit is
a user-defined integer.

The number of points can be really big.

For each point I need to invert a matrix which is the product of several
matrix-vector and matrix-matrix operations. After the inversion is done I
do further operations with the matrices. The matrices are small, N x N with
N in the range [3-300]. In the application every thread is associated with one
or more points, which means that every thread performs a matrix inversion of small size.
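
To make this concrete, here is a stripped-down sketch of the mapping I am describing (per_point_sketch, n and work are made-up names for illustration only; the real kernel is in kernel.cu further down):

#include <cstddef>
#include "cuda_runtime.h"

__global__ void per_point_sketch(size_t num_points, size_t n, double *work)
{
    // grid-stride loop: every thread handles one or more points
    size_t index  = blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = blockDim.x * gridDim.x;
    for (size_t ipt = index; ipt < num_points; ipt += stride) {
        // each point gets its own n x n slice of a preallocated work buffer
        double *my_mat = &work[ipt * n * n];
        (void)my_mat; // placeholder: build the matrix, invert it, do the products
    }
}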

To perform these matrix operations, additional work memory is allocated on the device and
passed to the kernel as an argument.
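
The host side then looks roughly like this (again just an illustrative sketch with assumed names, reusing per_point_sketch from the sketch above; the actual code is kernel_interface in kernel.cu):

#include <cstddef>
#include "cuda_runtime.h"

__global__ void per_point_sketch(size_t num_points, size_t n, double *work); // defined in the sketch above

// host-side sketch: one work buffer large enough for all points is allocated
// once with cudaMalloc and handed to the kernel as an argument; nothing is
// allocated inside the kernel itself
void launch_sketch(size_t num_points, size_t n, size_t numBlock, size_t numThread)
{
    double *dev_work = 0;
    size_t work_bytes = num_points * n * n * sizeof(double); // one n x n matrix per point
    cudaMalloc((void **)&dev_work, work_bytes);
    per_point_sketch<<<numBlock, numThread>>>(num_points, n, dev_work);
    cudaDeviceSynchronize();
    cudaFree(dev_work);
}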

However, I am unable to operate on a big amount of data. When I do, my screen
flickers (flashes or something) and I obtain an “unspecified launch failure”.

I used cuda-memcheck but it didn’t find any error in the code.

The biggest amount of data that I have reached is 200 points and nodes with a limit
of about 50. For limit 150 the application fails with the aforementioned flickering of the screen,
even though the memory requirements are still very small (about 35 MB for limit 150, as reported in the cuda-memcheck output below, versus 4.12445 MB for limit 50).
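
For reference, the device memory requirement can be estimated from the cudaMalloc sizes used in kernel_interface below. A small helper sketch (device_work_bytes is just an illustrative name, not part of the posted code):

#include <cstddef>

// mirrors the cudaMalloc sizes in kernel_interface below; dev_shp (another
// npoint * act_node * sizeof(double) bytes) is not counted, which matches
// the "allocated memory on device" printout
size_t device_work_bytes(size_t npoint, size_t act_node, size_t nsp)
{
    size_t pt  = npoint * nsp * sizeof(double);                  // dev_point
    size_t nd  = npoint * act_node * nsp * sizeof(double);       // dev_nd_coord
    size_t mat = npoint * act_node * act_node * sizeof(double);  // dev_wrk_mat
    size_t vec = npoint * act_node * sizeof(double);             // dev_wrk_vec
    return pt + nd + mat + vec;
}
// device_work_bytes(200, 50, 3)  == 4,324,800 bytes  (about 4.12 MB)
// device_work_bytes(200, 150, 3) == 36,964,800 bytes (about 35.25 MB)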

I am posting the code that fails to run properly. It consists of two files,
kernel.cu and main.cpp.
Please note that no local memory is allocated within the kernel; the working memory is supplied as an argument.

I am using Windows, the available RAM is 32.0 GB, and
Task Manager shows the following numbers:

Dedicated GPU memory: 3.0 GB
GPU memory: 19.0 GB
Shared GPU memory: 16.0 GB.

The identification of the graphics card is the following:

Device 0: "GeForce GTX 670MX"
CUDA Driver Version / Runtime Version 10.1 / 10.1
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 3072 MBytes (3221225472 bytes)
( 5) Multiprocessors, (192) CUDA Cores/MP: 960 CUDA Cores
GPU Max Clock rate: 601 MHz (0.60 GHz)
Memory Clock rate: 1400 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 393216 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 1
Result = PASS

cuda-memcheck gives me the following diagnostic. This diagnostic does not appear when nlimit = 50:

"c:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin\cuda-memcheck.exe" --leak-check full cuda_post.exe
========= CUDA-MEMCHECK
allocated memory on device 3.69648e+07 bytes 36098.4 kb 35.2524 mb 0.0344262 gb
unspecified launch failure
cudaStatus != cudaSuccess (failure) in file ./cuda_post/cuda_post/cuda_post/kernel.cu line 170
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\nvcuda.dll (cuMemcpy2DAsync + 0x2fa194) [0x3088cb]
========= Host Frame:d:\esell\dev\tests\cuda_post\cuda_post\x64\Debug\cuda_post.exe (cudaDeviceSynchronize + 0xf8) [0x3c838]
========= Host Frame:d:\esell\dev\tests\cuda_post\cuda_post\x64\Debug\cuda_post.exe (kernel_interface + 0x547) [0x93d57]
========= Host Frame:d:\esell\dev\tests\cuda_post\cuda_post\x64\Debug\cuda_post.exe (calc_rbf + 0x3e5) [0x94425]
========= Host Frame:d:\esell\dev\tests\cuda_post\cuda_post\x64\Debug\cuda_post.exe (run_data + 0x19b) [0x9dbcb]
========= Host Frame:d:\esell\dev\tests\cuda_post\cuda_post\x64\Debug\cuda_post.exe (main + 0x2f) [0x9df5f]
========= Host Frame:d:\esell\dev\tests\cuda_post\cuda_post\x64\Debug\cuda_post.exe (invoke_main + 0x34) [0x9f654]
========= Host Frame:d:\esell\dev\tests\cuda_post\cuda_post\x64\Debug\cuda_post.exe (__scrt_common_main_seh + 0x12e) [0x9f53e]

Any help is appreciated.

main.cpp file follows:

#include <cstdlib>
#include <iostream>
#include <iomanip>
#include <cstdio>
#include <vector>
static void run_data();
void calc_rbf(const size_t &numBlock, const size_t &numThread, const size_t &nsp, const size_t &num_npoint, const double *points, const size_t &num_node, const std::vector<double> &nodes, const size_t &nlimit, size_t &act_node, double **shp);
int main() {
    run_data();
}
static void calc_grid(const double &len, const size_t &nsp, const size_t &npt, double *pt) {
    size_t i, j, k, lnd;
    double dl;
    dl = len / (npt - 1.);
    for (i=0;i!=npt;i++) { // x
        for (j=0;j!=npt;j++) { // y
            for (k=0;k!=npt;k++) { // z
                lnd = (i * npt + j) * npt + k;
                pt[lnd * nsp + 0] = i * dl;
                pt[lnd * nsp + 1] = j * dl;
                pt[lnd * nsp + 2] = k * dl;
            }
        }
    }
}
static void run_data() {
    const size_t LIMIT_PRINT = 3;
    const size_t NSP = 3;

    // PURPOSE: Dimensionality
    const size_t NUM_NODES = 200, TOT_NODES = NUM_NODES * NUM_NODES * NUM_NODES;
    // PURPOSE: Number of nodes per direction, total number of nodes
    const size_t NUM_NPOINT = 200, TOT_NPOINT = NUM_NPOINT * NUM_NPOINT * NUM_NPOINT;
    // PURPOSE: Number of points per direction, total number of points.
    const double LEN = 1.0;
    // PURPOSE: Dimension of the grid (lx=ly=lz)
    size_t i, j, k, nlimit, lnd, act_node, ipt, ind;
    // PURPOSE: (nlimit) maximum number of closest nodes in the vicinity of each point.
    size_t numBlock, numThread;
    // PURPOSE: Number of blocks in the grid, and number of threads per block
    std::vector<double> nodes(TOT_NODES * NSP);
    double *points = 0, *shp = 0;
    // PURPOSE: This memory will be copied to device memory.
    // Allocate memory for points
    points = new double [TOT_NPOINT * NSP ];
    nlimit = 50;
    // Calculate node grid
    calc_grid(LEN, NSP, NUM_NODES, &nodes[ 0 ]);
    // Calculate point grid
    calc_grid(LEN, NSP, NUM_NPOINT, points);
    //
    numBlock = 1;
    numThread = 256;
    calc_rbf(numBlock, numThread, NSP, NUM_NPOINT, points, NUM_NPOINT, nodes, nlimit, act_node, &shp);
    //
    std::cout<<std::setprecision(8)<<std::scientific;
    for (ipt=0;ipt!=NUM_NPOINT;ipt++) {
        if ( ipt < LIMIT_PRINT || NUM_NPOINT - ipt < LIMIT_PRINT ) {
            for (ind=0;ind!=act_node;ind++) {
                if ( ind < LIMIT_PRINT || act_node - ind < LIMIT_PRINT ) {
                    std::cout<<std::setw(4)<<ipt<<std::setw(4)<<ind<<std::setw(16)<<shp[ipt * act_node + ind]<<"\n";
                }
            }
        }
    }
    //
    delete [] shp;
    //
    delete [] points;
}

The kernel.cu file follows:
#include <cstdlib>
#include <cmath>
#include <iostream>
#include <iomanip>
#include <vector>
#include <algorithm>
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
__device__ double wfunc(const size_t &nsp, const double *a, const double *b) {
    const double dk = 2.0, dcon = 0.01, dpw = 0.5;
    double wf, d;
    size_t i;
    d = 0.0;
    for (i=0;i!=nsp;i++) d += pow(a[ i ] - b[ i ], 2.0);
    d = sqrt(d);
    wf = pow( (pow(d, dk) + dcon), dpw);
    return wf;
}
__global__ void rbfKernel(const size_t nsp, const size_t num_npoint, const double *point, const size_t act_node, const double *nd_coord, double *wrk_mat, double *wrk_vec, double *shp) {
    size_t ipt, offset, ind, jnd;
    //
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    //
    for (ipt=index;ipt < num_npoint;ipt+=stride) {
        // Fill the matrix: each point owns its own act_node x act_node slice of wrk_mat
        double *wmat = &wrk_mat[ ipt * act_node * act_node ];
        // and its own act_node slice of wrk_vec and shp
        offset = ipt * act_node;
        // PURPOSE: Column storage mode (mimics fortran's matrices)
        for (ind=0;ind!=act_node;ind++) {
            for (jnd=0;jnd!=act_node;jnd++) {
                wmat[ind + jnd * act_node] = wfunc(nsp, &nd_coord[(ipt * act_node + ind) * nsp], &nd_coord[(ipt * act_node + jnd) * nsp]);
            }
        }
        // Invert matrix {}
        // Here I will do a fake invert
        for (ind=0;ind!=act_node;ind++) {
            for (jnd=0;jnd!=act_node;jnd++) {
                wmat[ind + jnd * act_node] = 1.0 / wmat[ind + jnd * act_node];
            }
        }
        // calculate
        double *wvec = &wrk_vec[ offset ];
        // PURPOSE: vector of rbf values between the point and its closest nodes
        for (ind=0;ind!=act_node;ind++) {
            wvec[ ind ] = wfunc(nsp, &point[ipt * nsp], &nd_coord[(ipt * act_node + ind) * nsp]);
        }
        // Compute shape (dgemv)
        double *wshp = &shp[ offset ];
        // PURPOSE: shape values for this point (matrix-vector product)
        for (ind=0;ind!=act_node;ind++) wshp[ ind ] = 0.0;
        for (ind=0;ind!=act_node;ind++) {
            for (jnd=0;jnd!=act_node;jnd++) {
                wshp[ ind ] += wvec[ jnd ] * wmat[ind + jnd * act_node];
            }
        }
    }
}
void check_cuda_status(const cudaError_t &cudaStatus, const char *file, const size_t &line) {
    if (cudaStatus != cudaSuccess) {
        std::cout<<cudaGetErrorString(cudaStatus)<<"\n";
        std::cout<<" cudaStatus != cudaSuccess (failure) in file "<<file<<" line "<<line<<"\n";
        exit(1);
    }
}
void kernel_interface(const size_t &numBlock, const size_t &numThread, const size_t &nsp, const size_t &num_npoint, const double *points, const size_t &act_node, const double *nd_coord, double *shp) {
    size_t req_nd, req_pt, req_mat, req_vec, req_shp;
    float tot_mem;
    double *dev_point = 0, *dev_nd_coord = 0, *dev_wrk_mat, *dev_wrk_vec, *dev_shp;
    // PURPOSE: Allocate device memory, copy input data, launch the kernel and copy the result back.
    // Allocate device memory
    req_pt = num_npoint * nsp * sizeof(double);
    check_cuda_status( cudaMalloc((void **)&dev_point, req_pt), __FILE__, __LINE__);
    req_nd = num_npoint * act_node * nsp * sizeof(double);
    check_cuda_status( cudaMalloc((void **)&dev_nd_coord, req_nd), __FILE__, __LINE__);
    // rbf functions as many as points * act_node
    req_shp = num_npoint * act_node * sizeof(double);
    check_cuda_status( cudaMalloc((void **)&dev_shp, req_shp), __FILE__, __LINE__);
    // Allocate work memory on device.
    // matrices as many as points, with dimension [node x node]
    req_mat = num_npoint * act_node * act_node * sizeof(double);
    check_cuda_status( cudaMalloc((void **)&dev_wrk_mat, req_mat), __FILE__, __LINE__);
    // vectors as many as points, with dimension [node]
    req_vec = num_npoint * act_node * sizeof(double);
    check_cuda_status( cudaMalloc((void **)&dev_wrk_vec, req_vec), __FILE__, __LINE__);
    //
    // Total allocated memory on device (dev_shp is not included in this sum)
    tot_mem = req_pt + req_nd + req_mat + req_vec;
    std::cout<<"allocated memory on device "<<tot_mem<<" bytes "<<(tot_mem / 1024.)<<" kb "<<(tot_mem / (1024. * 1024.))<<" mb "<<(tot_mem / (1024. * 1024. * 1024.))<<" gb\n";
    //
    // copy memory
    check_cuda_status( cudaMemcpy( dev_point, points, req_pt, cudaMemcpyHostToDevice), __FILE__, __LINE__);
    check_cuda_status( cudaMemcpy( dev_nd_coord, nd_coord, req_nd, cudaMemcpyHostToDevice), __FILE__, __LINE__);
    //
    // Launch kernel on the GPU
    rbfKernel<<<numBlock, numThread>>>(nsp, num_npoint, dev_point, act_node, dev_nd_coord, dev_wrk_mat, dev_wrk_vec, dev_shp);
    //
    check_cuda_status( cudaGetLastError(), __FILE__, __LINE__ );
    check_cuda_status( cudaDeviceSynchronize(), __FILE__, __LINE__ );
    check_cuda_status( cudaMemcpy(shp, dev_shp, req_shp, cudaMemcpyDeviceToHost), __FILE__, __LINE__ );
    //
    // Deallocate device memory
    check_cuda_status( cudaFree(dev_point), __FILE__, __LINE__);
    check_cuda_status( cudaFree(dev_nd_coord), __FILE__, __LINE__);
    check_cuda_status( cudaFree(dev_shp), __FILE__, __LINE__);
    check_cuda_status( cudaFree(dev_wrk_mat), __FILE__, __LINE__);
    check_cuda_status( cudaFree(dev_wrk_vec), __FILE__, __LINE__);
    //
    cudaDeviceReset();
}
double distance(const size_t &nsp, const double *a, const double *b) {
    size_t i;
    // PURPOSE: loop counter
    double d;
    // PURPOSE: Euclidean distance between a and b
    d = 0.0;
    for (i=0;i!=nsp;i++) d += pow(a[ i ] - b[ i ], 2.0);
    d = sqrt( d );
    return d;
}
struct closest_node {
    size_t knd_;
    // PURPOSE: node index
    double dist_;
    // PURPOSE: distance of the node from the current point
};
bool sort_criterio(const closest_node &a, const closest_node &b) {return a.dist_ < b.dist_;}
// REQUIRE: two closest_node entries
// PROMISE: Sorting criterion (by increasing distance)
void calc_rbf(const size_t &numBlock, const size_t &numThread, const size_t &nsp, const size_t &num_npoint, const double *points, const size_t &num_node, const std::vector<double> &nodes, const size_t &nlimit, size_t &act_node, double **shp) {
    size_t ipt, ind, isp;
    // PURPOSE: loop counters
    double d;
    // PURPOSE: point-node distance
    std::vector<closest_node> neigh_node;
    // PURPOSE: Closest nodes
    double *nd_coord = 0;
    // PURPOSE: coordinates of the act_node closest nodes of each point
    act_node = std::min(num_node, nlimit);
    nd_coord = new double [num_npoint * act_node * nsp ];
    for (ipt=0;ipt!=num_npoint;ipt++) {
        // Loop over all nodes and collect the nlimit closest ones.
        neigh_node.clear();
        for (ind=0;ind!=num_node;ind++) {
            d = distance(nsp, &points[ipt * nsp], &nodes[ind * nsp]);
            closest_node cln;
            // PURPOSE: candidate node and its distance
            cln.knd_ = ind;
            cln.dist_ = d;
            neigh_node.push_back( cln );
        }
        std::sort(neigh_node.begin(), neigh_node.end(), sort_criterio);
        // Copy coordinates
        for (ind=0;ind!=act_node;ind++) {
            for (isp=0;isp!=nsp;isp++) {
                nd_coord[(ipt * act_node + ind) * nsp + isp] = nodes[ neigh_node[ ind ].knd_ * nsp + isp ];
            }
        }
    }
    //
    *shp = new double[ num_npoint * act_node ];
    kernel_interface(numBlock, numThread, nsp, num_npoint, points, act_node, nd_coord, *shp);
    //
    delete [] nd_coord;
}