[solved] Array not fully transfered to device memory

Hi everyone,

I need to analyze some data, and I want to use GPU to speed up the process, which is otherwise quite long.
I need to compute an intermediate scattering function using atoms trajectories from molecular dynamics simulation.
For this, I have a 3D array ‘atomPos’ (dim0 -> atoms, dim1 -> trajectory frame, dim2 -> xyz coordinates), a 3D array of scattering q vectors ‘qVecs’ (dim0 -> vector amplitude, dim1 -> random vectors of given amplitude, dim2 -> xyz coordinates).
Finally everything is stored in a 2D array ‘out’ (dim0 -> number of q vectors amplitude used, dim1 -> number of time interval computed (number of trajectory frames) ).
All arrays are flattened.

The code below is compiled with nvcc using ‘nvcc -lib -o fileName.lib fileName.cu’, then I use .cpp file as a wrapper together with .pyx file, such that everything is cythonize and can be called from a Python software.
The idea is to use one thread per atom, and run all the loops for each of them independently. Then the result is added in the corresponding position (q vector amplitude and trajectory frame) in the ‘out’ array using atomic operations.

When I try to run this, I can see it starts, because I can print the state of execution from the kernel using printf. However, it quickly crashes, with an unspecified kernel launch failure.
Looking at memory allocation, I can see that not all my first array was loaded, because it should use 720 MB, but only 200 MB is used on GPU during the execution as seen using the windows resources monitor.

Is there some limitation on the amount of data that can be loaded on GPU at once?

I’m using a GeForce 940 MX on a Asus Laptop.

Thanks in advance for your help.

The compIntScatFunc.cu file:

#include <stdio.h>

#include <cuda.h>
#include <cuda_runtime.h>

#define BLOCK_SIZE 512

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

__global__
void compIntScatFunc(float *atomPos, int atomPos_dim0, int atomPos_dim1, float *out,
                     float *qVecs, int qVecs_dim0, int qVecs_dim1, 
                     int maxFrames, int nbrTimeOri) 
{
    int atomId = blockDim.x * blockIdx.x + threadIdx.x;

    if( atomId < atomPos_dim0 )
    {
        extern __shared__ float s_qVecs[];

        for(int i=0; i < 3*qVecs_dim0*qVecs_dim1; ++i)
            s_qVecs[i] = qVecs[i];

        __syncthreads();

// Starts computation of intermediate scattering function
        for(int qValId=0; qValId < qVecs_dim0; ++qValId)
        {
            for(int dt=0; dt < maxFrames; ++dt)
            {
                int timeIncr = (int)( atomPos_dim1 - dt) / nbrTimeOri; 

                for(int t0=0; t0 < nbrTimeOri; ++t0)
                {
                    for(int qVecId=0; qVecId < qVecs_dim1; ++qVecId)
                    {
                        // Gets indices
                        int atom_tf_idx = 3 * (atomId*atomPos_dim1 + t0*timeIncr + dt); 
                        int atom_t0_idx = 3 * (atomId*atomPos_dim1 + t0*timeIncr);

                        int qVec_idx = 3 * (qValId * qVecs_dim1 + qVecId);

                        // Computes distances for given timestep and atom
                        float dist_0 = atomPos[atom_tf_idx] - atomPos[atom_t0_idx];
                        float dist_1 = atomPos[atom_tf_idx+1] - atomPos[atom_t0_idx+1];
                        float dist_2 = atomPos[atom_tf_idx+2] - atomPos[atom_t0_idx+2];

                        float re = cos( s_qVecs[qVec_idx] * dist_0 
                                        + s_qVecs[qVec_idx+1] * dist_1
                                        + s_qVecs[qVec_idx+2] * dist_2 );

                        float im = sin( s_qVecs[qVec_idx] * dist_0 
                                        + s_qVecs[qVec_idx+1] * dist_1
                                        + s_qVecs[qVec_idx+2] * dist_2 );

                        atomicAdd( &out[2*(qValId*maxFrames + dt)], re / (nbrTimeOri*qVecs_dim1) );
                        atomicAdd( &out[2*(qValId*maxFrames + dt) + 1], im / (nbrTimeOri*qVecs_dim1) );

} // q vectors loop
                } // time origins loop 
            } // time increments loop
        } // qVals loop
    } // condition on atom index
}

void cu_compIntScatFunc_wrapper(float *atomPos, int atomPos_dim0, int atomPos_dim1, int atomPos_dim2, 
                                float *qVecs, int qVecs_dim0, int qVecs_dim1, int qVecs_dim2, 
                                float *out, int maxFrames, int nbrTimeOri)
{
    // Copying atomPos matrix on GPU memory
    float *cu_atomPos;
    size_t size_atomPos = atomPos_dim0 * atomPos_dim1 * atomPos_dim2 * sizeof(float);
    cudaMalloc(&cu_atomPos, size_atomPos);
    cudaMemcpy(cu_atomPos, atomPos, size_atomPos, cudaMemcpyHostToDevice);

    // Copying qVecs matrix on GPU memory
    float *cu_qVecs;
    size_t size_qVecs = qVecs_dim0 * qVecs_dim1 * qVecs_dim2 * sizeof(float);
    cudaMalloc(&cu_qVecs, size_qVecs);
    cudaMemcpy(cu_qVecs, qVecs, size_qVecs, cudaMemcpyHostToDevice);

    // Copying out matrix on GPU memory
    float *cu_out;
    size_t size_out = 2 * qVecs_dim0 * maxFrames * sizeof(float);
    cudaMalloc(&cu_out, size_out);
    cudaMemcpy(cu_out, out, size_out, cudaMemcpyHostToDevice);

int nbrBlocks = ceil(atomPos_dim0 / BLOCK_SIZE);
    int sharedMemSize = sizeof(float) * 3 * (qVecs_dim0*qVecs_dim1); 

    compIntScatFunc<<<nbrBlocks, BLOCK_SIZE, sharedMemSize>>>(cu_atomPos, atomPos_dim0, atomPos_dim1, cu_out,
                                                cu_qVecs, qVecs_dim0, qVecs_dim1, maxFrames, nbrTimeOri);
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );

    // Averages with the number of atoms
    for(int i=0; i < 2*qVecs_dim0*maxFrames; ++i)
    {
        cu_out[i] /= atomPos_dim0;
    }

cudaMemcpy(out, cu_out, size_out, cudaMemcpyDeviceToHost);

    cudaFree(cu_atomPos);
    cudaFree(cu_qVecs);
    cudaFree(cu_out);
}

And the cuda_setup.py file:

from Cython.Build import cythonize
import numpy as np

import os

from setuptools import setup
from setuptools import Extension
from setuptools import Command
from setuptools.command.build_ext import build_ext
import distutils

with open('../README.md', 'r') as f:
    description = f.read()

pyxPath = "NAMDAnalyzer/lib/cython_pyx/"
srcPath = "NAMDAnalyzer/lib/openmp/src/" 
cudaSrcPath = "NAMDAnalyzer/lib/cuda/src/"

try:
    cudaPath = os.environ['CUDA_PATH'] #_Using the default installation key in Path variable
    if os.environ['OS'] == 'Windows_NT':
        cudaInclude = cudaPath + "\include"
        cudaLib     = cudaPath + "\lib\x64"
    else:
        cudaInclude = cudaPath + "/include"
        cudaLib     = cudaPath + "/lib64"
except KeyError:
    print("\n\nError: Couldn't locate CUDA path, please intall it or add it to PATH variable\n\n")

#_The following is used to compile with openmp with both mingGW and msvc
copt =  {'msvc'     : ['/openmp', '/Ox', '/fp:fast'],
         'mingw32'  : ['-fopenmp','-O3','-ffast-math','-march=native'] }
lopt =  {'mingw32'  : ['-fopenmp']}

def preprocessNVCC(path):
    #_Used to process .cu file and create static libraries for GPU part of the program

    for f in os.listdir(path):
        if f[-3:] == '.cu':
            os.system("nvcc -o %s.lib -lib %s" % ('NAMDAnalyzer/lib/cuda/' + f[:-3], path + f))

#_Used by setup function to define compile and link extra arguments
class build_ext_subclass( build_ext ):
    def build_extensions(self):
        c = self.compiler.compiler_type
        if c in copt.keys():
           for e in self.extensions:
               e.extra_compile_args = copt[ c ]
        if c in lopt.keys():
            for e in self.extensions:
                e.extra_link_args = lopt[ c ]

        #_Simply uses execute function to make sure .cu files are processed first with nvcc
        self.execute(preprocessNVCC, [cudaSrcPath])

        build_ext.build_extensions(self)

packagesList = [    'NAMDAnalyzer.dataManipulation',
                    'NAMDAnalyzer.dataParsers',
                    'NAMDAnalyzer.dataAnalysis',
                    'NAMDAnalyzer.lib',
                    'NAMDAnalyzer.helpersFunctions' ]

#_Defines extensions
pycompIntScatFunc_ext   = Extension( "NAMDAnalyzer.lib.pycompIntScatFunc", 
                                   [cudaSrcPath + "compIntScatFunc.cpp", 
                                    pyxPath + "pycompIntScatFunc.pyx"],
                                   library_dirs=["NAMDAnalyzer/lib/cuda", cudaLib],
                                   libraries=['compIntScatFunc', 'cuda', 'cudart'],
                                   language='c++',
                                   include_dirs=[cudaSrcPath, np.get_include(), cudaInclude])

pygetDistances_ext      = Extension( "NAMDAnalyzer.lib.pygetDistances", 
                                   [pyxPath + "pygetDistances.pyx", 
                                    srcPath + "getDistances.cpp"],
                                   include_dirs=[srcPath,  np.get_include()],
                                   language='c++')

pygetWithin_ext         = Extension( "NAMDAnalyzer.lib.pygetWithin", 
                                   [pyxPath + "pygetWithin.pyx", srcPath + "getWithin.cpp"],
                                   include_dirs=[srcPath,  np.get_include()],
                                   language='c++')

pygetCOM_ext            = Extension( "NAMDAnalyzer.lib.pygetCenterOfMass", 
                                    [pyxPath + "pygetCenterOfMass.pyx"],
                                    include_dirs=[np.get_include()] ) 

pysetCOMAligned_ext     = Extension( "NAMDAnalyzer.lib.pysetCenterOfMassAligned", 
                                    [pyxPath + "pysetCenterOfMassAligned.pyx"],
                                    include_dirs=[np.get_include()] ) 

setup(  name='NAMDAnalyzer',
        version='alpha',
        description=description,
        author='Kevin Pounot',
        author_email='kpounot@hotmail.fr',
        url='github.com/kpounot/NAMDAnalyzer',
        py_modules=['NAMDAnalyzer.Dataset'],
        packages=packagesList,
        ext_modules=cythonize( [pygetWithin_ext,
                                pygetDistances_ext,
                                pygetCOM_ext,
                                pysetCOMAligned_ext,
                                pycompIntScatFunc_ext]),
        cmdclass={'build_ext': build_ext_subclass})

Occam’s Razor suggests there is a bug in your code somewhere.

I don’t see any error checking for CUDA API calls and would suggest adding that. Alternatively, try running under control of cuda-memcheck.

Double check the source data that goes into the computation of size_atomPos, size_qVecs, and size_out. Incorrect data could overflow ‘int’.

Disclaimer: I don’t know what I’m doing, but enjoy reading code for learning. @njuffa is a bit of a saint here on these forums.

“unspecified kernel launch failure” is an infuriatingly frustrating error to get.

There’s some information in this post on SO: https://stackoverflow.com/questions/27277365/unspecified-launch-failure-on-memcpy/27278218#27278218

I’ve run into issues allocating more than 2GB at a time. I use Eclipse, which was a real pain to get working, but had some code samples that provided some utility and since it looks like you’re doing everything from the command line, I’m including the protection blocks that the samples provided.

There is a method called cudaMalloc3d that I’ve never used, but the documentation leads me to believed that the pitched memory performance may be better for 3d arrays. http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/online/group__CUDART__MEMORY_g04a7553c90322aef32f8544d5c356a10.html

These are the macros that I use from the sample code. Include them in your cuda calls and see if they give you any more information:

static void CheckCudaErrorAux (const char *, unsigned, const char *, cudaError_t);
#define CUDA_CHECK_RETURN(value) CheckCudaErrorAux(__FILE__,__LINE__, #value, value)

/**
 * Check the return value of the CUDA runtime API call and exit
 * the application if the call has failed.
 */
static void CheckCudaErrorAux (const char *file, unsigned line, const char *statement, cudaError_t err)
{
	if (err == cudaSuccess)
		return;
	std::cerr << statement<<" returned " << cudaGetErrorString(err) << "("<<err<< ") at "<<file<<":"<<line << std::endl;
	exit (1);
}

Example:

CUDA_CHECK_RETURN(cudaDeviceSynchronize());

Also, and I say this with minimal confidence (I did preface saying that I don’t know what I’m doing) and may live up to my screen name in making such a suggestion, and I’m sure you’re just trying to get your code to work right now and will optimize later. There are some things that stood out to me within your loops that you can probably optimize the hell out of. The array index access within the inner loops, if the compiler doesn’t optimize them out, may cost you a bunch of performance. From what I’ve read, but not tested myself, using registers and performing those array reads as few times as possible will be much more efficient as accessing external memory is very expensive. Also, the loading of your shared array has a bit of a smell to it. It looks like you’ll be accessing the same address repeatedly from multiple threads. There might be some optimization realized by using a different iterator:

for(int i=threadIdx.x; i < 3*qVecs_dim0*qVecs_dim1; i+=blockDim.x)
            s_qVecs[i] = qVecs[i];

you may simply be running into a kernel timeout

https://docs.nvidia.com/gameworks/content/developertools/desktop/nsight/timeout_detection_recovery.htm

Thanks for your answers.

Actually, I was using a subset of atoms, so apparently, everything is correctly loaded on the GPU.

But I still get this failure, even with 30 seconds for this TDR. However, it doesn’t seem that the GPU is really doing anything as the GPU usage always remains at 0 %.
If I simply remove the TDR, it just runs forever without computing anything…

Here is the result of a cuda-memcheck within the IPython interpreter:

GPUassert: unspecified launch failure NAMDAnalyzer/lib/cuda/src/compIntScatFunc.cu 116
========= Program hit cudaErrorLaunchFailure (error 719) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize
.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\Windows\system32\nvcuda.dll (cuMemcpyPeer_ptds + 0x3043ac) [0x313beb]
=========     Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.1\bin\cudart64_101.dll (cudaDeviceSynchronize + 0xf8)
[0x1f4b8]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\lib\site-packages\namdanalyzer-alpha-py3.7-win-a
md64.egg\NAMDAnalyzer\lib\pycompIntScatFunc.cp37-win_amd64.pyd [0x13cc]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\lib\site-packages\namdanalyzer-alpha-py3.7-win-a
md64.egg\NAMDAnalyzer\lib\pycompIntScatFunc.cp37-win_amd64.pyd (PyInit_pycompIntScatFunc + 0x1325) [0x8e15]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\lib\site-packages\namdanalyzer-alpha-py3.7-win-a
md64.egg\NAMDAnalyzer\lib\pycompIntScatFunc.cp37-win_amd64.pyd (PyInit_pycompIntScatFunc + 0x195e) [0x944e]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyMethodDef_RawFastCallKeywords +
0x355) [0x2ecf5]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyMethodDef_RawFastCallKeywords +
0xc17) [0x2f5b7]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalFrameDefault + 0x4af) [
0x2fc7f]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalCodeWithName + 0x1a6) [
0x16f36]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyMethodDef_RawFastCallKeywords +
0xccc) [0x2f66c]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalFrameDefault + 0x403) [
0x2fbd3]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalCodeWithName + 0x1a6) [
0x16f36]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyMethodDef_RawFastCallKeywords +
0xccc) [0x2f66c]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalFrameDefault + 0x403) [
0x2fbd3]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalCodeWithName + 0x1a6) [
0x16f36]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyMethodDef_RawFastCallKeywords +
0xccc) [0x2f66c]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalFrameDefault + 0xffa) [
0x307ca]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalCodeWithName + 0x1a6) [
0x16f36]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyErr_Clear + 0x1a3) [0x37053]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyErr_Clear + 0x79) [0x36f29]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyMethodDef_RawFastCallKeywords +
0xb0) [0x2ea50]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyMethodDef_RawFastCallKeywords +
0xc17) [0x2f5b7]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalFrameDefault + 0x4af) [
0x2fc7f]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_SaveThread + 0x670) [0x1f0e
0]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyErr_NoMemory + 0x2049a) [0x7e87e
]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_SaveThread + 0x670) [0x1f0e
0]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyErr_NoMemory + 0x2049a) [0x7e87e
]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_SaveThread + 0x670) [0x1f0e
0]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyMethodDef_RawFastCallKeywords +
0xd1) [0x2ea71]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyMethodDef_RawFastCallKeywords +
0xa5c) [0x2f3fc]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalFrameDefault + 0x403) [
0x2fbd3]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyMethodDef_RawFastCallKeywords +
0xbb3) [0x2f553]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalFrameDefault + 0x4af) [
0x2fc7f]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyMethodDef_RawFastCallKeywords +
0xbb3) [0x2f553]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalFrameDefault + 0x403) [
0x2fbd3]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalCodeWithName + 0x1a6) [
0x16f36]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyMethodDef_RawFastCallKeywords +
0xccc) [0x2f66c]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalFrameDefault + 0xffa) [
0x307ca]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalCodeWithName + 0x1a6) [
0x16f36]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyMethodDef_RawFastCallKeywords +
0xccc) [0x2f66c]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalFrameDefault + 0x403) [
0x2fbd3]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalCodeWithName + 0x1a6) [
0x16f36]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyMethodDef_RawFastCallKeywords +
0xccc) [0x2f66c]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalFrameDefault + 0x403) [
0x2fbd3]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyMethodDef_RawFastCallKeywords +
0xbb3) [0x2f553]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalFrameDefault + 0x403) [
0x2fbd3]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalCodeWithName + 0x1a6) [
0x16f36]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyFunction_FastCallDict + 0x1ba) [
0x16bba]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyMethodDef_RawFastCallDict + 0x3a
a) [0x15b3a]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyObject_SetAttr + 0x2ec) [0x12f2c
]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalFrameDefault + 0x1175)
[0x30945]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalCodeWithName + 0x1a6) [
0x16f36]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyMethodDef_RawFastCallKeywords +
0xccc) [0x2f66c]
=========     Host Frame:c:\users\kevin pounot\appdata\local\programs\python\python37\python37.dll (PyEval_EvalFrameDefault + 0x4af) [
0x2fc7f]
=========
========= ERROR SUMMARY: 1 error

So you are hitting a TDR. Without it your code runs “forever”.

Time to debug that. It is certainly not strange. You have 4 nested loops, after all. Maybe your loop extents are large and you are running many blocks and threads.

Thanks fr your answer.
To give some figures, atomPos dimensions are (12212, 1000, 3), ‘qVecs’ are (9, 15, 3), and ‘out’ is (15, 1900).
Also nbrTimeOri was set to 50 here.

So I have 9 iterations for the first loop, 1000 for the second, 50 for the third, and 15 for the last one.

I shall try to reduce the number of ‘dt’. It will probably make the subsequent Fourier transform a bit less accurate but it should be fine…

Your indicated size for out as defined by the code (cu_out allocation) didn’t seem to match this:

size_t size_out = 2 * qVecs_dim0 * maxFrames * sizeof(float);

so I used your code definition instead.

you do have a defect in the code you have posted here. This is not correct:

gpuErrchk( cudaDeviceSynchronize() );

    // Averages with the number of atoms
    for(int i=0; i < 2*qVecs_dim0*maxFrames; ++i)
    {
        cu_out[i] /= atomPos_dim0;  // cu_out is a device pointer, and you are attempting to dereference it in host code
    }

cudaMemcpy(out, cu_out, size_out, cudaMemcpyDeviceToHost);

It is illegal to dereference a device pointer (cu_out) in host code.

Instead you should do something like this:

gpuErrchk( cudaDeviceSynchronize() );
    cudaMemcpy(out, cu_out, size_out, cudaMemcpyDeviceToHost);

    // Averages with the number of atoms
    for(int i=0; i < 2*qVecs_dim0*maxFrames; ++i)
    {
        out[i] /= atomPos_dim0;   //modify data on the host instead
    }

When I make those changes and build a simple test harness:

$ cat test.cu
#include <stdio.h>

#include <cuda.h>
#include <cuda_runtime.h>

#define BLOCK_SIZE 512

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

__global__
void compIntScatFunc(float *atomPos, int atomPos_dim0, int atomPos_dim1, float *out,
                     float *qVecs, int qVecs_dim0, int qVecs_dim1,
                     int maxFrames, int nbrTimeOri)
{
    int atomId = blockDim.x * blockIdx.x + threadIdx.x;

    if( atomId < atomPos_dim0 )
    {
        extern __shared__ float s_qVecs[];

        for(int i=0; i < 3*qVecs_dim0*qVecs_dim1; ++i)
            s_qVecs[i] = qVecs[i];

        __syncthreads();

// Starts computation of intermediate scattering function
        for(int qValId=0; qValId < qVecs_dim0; ++qValId)
        {
            for(int dt=0; dt < maxFrames; ++dt)
            {
                int timeIncr = (int)( atomPos_dim1 - dt) / nbrTimeOri;

                for(int t0=0; t0 < nbrTimeOri; ++t0)
                {
                    for(int qVecId=0; qVecId < qVecs_dim1; ++qVecId)
                    {
                        // Gets indices
                        int atom_tf_idx = 3 * (atomId*atomPos_dim1 + t0*timeIncr + dt);
                        int atom_t0_idx = 3 * (atomId*atomPos_dim1 + t0*timeIncr);

                        int qVec_idx = 3 * (qValId * qVecs_dim1 + qVecId);

                        // Computes distances for given timestep and atom
                        float dist_0 = atomPos[atom_tf_idx] - atomPos[atom_t0_idx];
                        float dist_1 = atomPos[atom_tf_idx+1] - atomPos[atom_t0_idx+1];
                        float dist_2 = atomPos[atom_tf_idx+2] - atomPos[atom_t0_idx+2];

                        float re = cos( s_qVecs[qVec_idx] * dist_0
                                        + s_qVecs[qVec_idx+1] * dist_1
                                        + s_qVecs[qVec_idx+2] * dist_2 );

                        float im = sin( s_qVecs[qVec_idx] * dist_0
                                        + s_qVecs[qVec_idx+1] * dist_1
                                        + s_qVecs[qVec_idx+2] * dist_2 );

                        atomicAdd( &out[2*(qValId*maxFrames + dt)], re / (nbrTimeOri*qVecs_dim1) );
                        atomicAdd( &out[2*(qValId*maxFrames + dt) + 1], im / (nbrTimeOri*qVecs_dim1) );

} // q vectors loop
                } // time origins loop
            } // time increments loop
        } // qVals loop
    } // condition on atom index
}

void cu_compIntScatFunc_wrapper(float *atomPos, int atomPos_dim0, int atomPos_dim1, int atomPos_dim2,
                                float *qVecs, int qVecs_dim0, int qVecs_dim1, int qVecs_dim2,
                                float *out, int maxFrames, int nbrTimeOri)
{
    // Copying atomPos matrix on GPU memory
    float *cu_atomPos;
    size_t size_atomPos = atomPos_dim0 * atomPos_dim1 * atomPos_dim2 * sizeof(float);
    cudaMalloc(&cu_atomPos, size_atomPos);
    cudaMemcpy(cu_atomPos, atomPos, size_atomPos, cudaMemcpyHostToDevice);

    // Copying qVecs matrix on GPU memory
    float *cu_qVecs;
    size_t size_qVecs = qVecs_dim0 * qVecs_dim1 * qVecs_dim2 * sizeof(float);
    cudaMalloc(&cu_qVecs, size_qVecs);
    cudaMemcpy(cu_qVecs, qVecs, size_qVecs, cudaMemcpyHostToDevice);

    // Copying out matrix on GPU memory
    float *cu_out;
    size_t size_out = 2 * qVecs_dim0 * maxFrames * sizeof(float);
    cudaMalloc(&cu_out, size_out);
    cudaMemcpy(cu_out, out, size_out, cudaMemcpyHostToDevice);

int nbrBlocks = ceil(atomPos_dim0 / BLOCK_SIZE);
    int sharedMemSize = sizeof(float) * 3 * (qVecs_dim0*qVecs_dim1);

    compIntScatFunc<<<nbrBlocks, BLOCK_SIZE, sharedMemSize>>>(cu_atomPos, atomPos_dim0, atomPos_dim1, cu_out,
                                                cu_qVecs, qVecs_dim0, qVecs_dim1, maxFrames, nbrTimeOri);
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );

    cudaMemcpy(out, cu_out, size_out, cudaMemcpyDeviceToHost);

    // Averages with the number of atoms
    for(int i=0; i < 2*qVecs_dim0*maxFrames; ++i)
    {
        out[i] /= atomPos_dim0;
    }

cudaFree(cu_atomPos);
    cudaFree(cu_qVecs);
    cudaFree(cu_out);
}

int main(){

  float *atomPos, *qVecs, *out;
  int atomPos_dim0 = 12212;
  int atomPos_dim1 = 1000;
  int atomPos_dim2 = 3;
  int qVecs_dim0   = 9;
  int qVecs_dim1   = 15;
  int qVecs_dim2   = 3;
  int maxFrames    = 1000;
  int out_dim      = 2*qVecs_dim0*maxFrames;
  int nbrTimeOri   = 50;
  atomPos = (float *)malloc(atomPos_dim0*atomPos_dim1*atomPos_dim2*sizeof(atomPos[0]));
  qVecs   = (float *)malloc(qVecs_dim0*qVecs_dim1*qVecs_dim2*sizeof(qVecs[0]));
  out     = (float *)malloc(out_dim*sizeof(out[0]));
  cu_compIntScatFunc_wrapper(atomPos, atomPos_dim0, atomPos_dim1, atomPos_dim2,
                             qVecs, qVecs_dim0, qVecs_dim1, qVecs_dim2,
                             out, maxFrames, nbrTimeOri);
  return 0;
}

your kernel runs for me in about 30 seconds on a Tesla V100 (a newer, large, fast GPU):

$ nvcc -o test test.cu
$ cuda-memcheck ./test
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$ nvprof ./test
==27311== NVPROF is profiling process 27311, command: ./test
==27311== Profiling application: ./test
==27311== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.87%  29.5245s         1  29.5245s  29.5245s  29.5245s  compIntScatFunc(float*, int, int, float*, float*, int, int, int, int)
                    0.13%  39.462ms         3  13.154ms  1.8880us  39.451ms  [CUDA memcpy HtoD]
                    0.00%  7.1040us         1  7.1040us  7.1040us  7.1040us  [CUDA memcpy DtoH]
      API calls:   98.82%  29.5245s         1  29.5245s  29.5245s  29.5245s  cudaDeviceSynchronize
                    1.01%  302.00ms         3  100.67ms  9.8970us  301.73ms  cudaMalloc
                    0.13%  39.922ms         4  9.9805ms  18.963us  39.731ms  cudaMemcpy

and in about 800 seconds on a Quadro K2000 (a very small, old, slow GPU):

$ nvprof ./t44
==1244== NVPROF is profiling process 1244, command: ./t44
==1244== Profiling application: ./t44
==1244== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.99%  788.953s         1  788.953s  788.953s  788.953s  compIntScatFunc(float*, int, int, float*, float*, int, int, int, int)
                    0.01%  44.362ms         3  14.787ms  1.4390us  44.347ms  [CUDA memcpy HtoD]
                    0.00%  11.551us         1  11.551us  11.551us  11.551us  [CUDA memcpy DtoH]
      API calls:   99.97%  788.953s         1  788.953s  788.953s  788.953s  cudaDeviceSynchronize
                    0.02%  155.26ms         3  51.754ms  188.90us  154.87ms  cudaMalloc
                    0.01%  44.581ms         4  11.145ms  13.725us  44.441ms  cudaMemcpy

My guess is that with the fix for the illegal code you have (after the kernel) you should witness some kernel execution duration between those two values on your GPU. However in my experience, although you can disable the TDR for minutes or longer, the windows GUI doesn’t like to be held off that long, and I often experience machine instability when I do that. If you want to run code like this, my best suggestion would be to get off your windows laptop and do it in a more serious setting for scientific computing. Or simply reduce your problem scale (e.g. number of atoms) for test purposes on your laptop.

When running on the GPU, this code only seems to use about 165MB of GPU memory.

You have some loop-invariant operations in your inner loop here:

for(int qVecId=0; qVecId < qVecs_dim1; ++qVecId)
                    {
                        // Gets indices
                        int atom_tf_idx = 3 * (atomId*atomPos_dim1 + t0*timeIncr + dt);  // loop-invariant
                        int atom_t0_idx = 3 * (atomId*atomPos_dim1 + t0*timeIncr);       // loop-invariant

                        int qVec_idx = 3 * (qValId * qVecs_dim1 + qVecId);

                        // Computes distances for given timestep and atom
                        float dist_0 = atomPos[atom_tf_idx] - atomPos[atom_t0_idx];     // loop-invariant
                        float dist_1 = atomPos[atom_tf_idx+1] - atomPos[atom_t0_idx+1]; // loop-invariant
                        float dist_2 = atomPos[atom_tf_idx+2] - atomPos[atom_t0_idx+2]; // loop-invariant

by hoisting those out of the innermost loop, I was able to cut the kernel duration almost by a factor of 2 on my Tesla V100:

$ cat test.cu
#include <stdio.h>

#include <cuda.h>
#include <cuda_runtime.h>

#define BLOCK_SIZE 512

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

__global__
void compIntScatFunc(float *atomPos, int atomPos_dim0, int atomPos_dim1, float *out,
                     float *qVecs, int qVecs_dim0, int qVecs_dim1,
                     int maxFrames, int nbrTimeOri)
{
    int atomId = blockDim.x * blockIdx.x + threadIdx.x;

    if( atomId < atomPos_dim0 )
    {
        extern __shared__ float s_qVecs[];

        for(int i=0; i < 3*qVecs_dim0*qVecs_dim1; ++i)
            s_qVecs[i] = qVecs[i];

        __syncthreads();

// Starts computation of intermediate scattering function
        for(int qValId=0; qValId < qVecs_dim0; ++qValId)
        {
            for(int dt=0; dt < maxFrames; ++dt)
            {
                int timeIncr = (int)( atomPos_dim1 - dt) / nbrTimeOri;

                for(int t0=0; t0 < nbrTimeOri; ++t0)
                {
                        // Gets indices
                        int atom_tf_idx = 3 * (atomId*atomPos_dim1 + t0*timeIncr + dt);
                        int atom_t0_idx = 3 * (atomId*atomPos_dim1 + t0*timeIncr);
                        // Computes distances for given timestep and atom
                        float dist_0 = atomPos[atom_tf_idx] - atomPos[atom_t0_idx];
                        float dist_1 = atomPos[atom_tf_idx+1] - atomPos[atom_t0_idx+1];
                        float dist_2 = atomPos[atom_tf_idx+2] - atomPos[atom_t0_idx+2];
                    for(int qVecId=0; qVecId < qVecs_dim1; ++qVecId)
                    {
                        // Gets indices
                        //int atom_tf_idx = 3 * (atomId*atomPos_dim1 + t0*timeIncr + dt);
                        //int atom_t0_idx = 3 * (atomId*atomPos_dim1 + t0*timeIncr);

                        int qVec_idx = 3 * (qValId * qVecs_dim1 + qVecId);

                        // Computes distances for given timestep and atom
                        //float dist_0 = atomPos[atom_tf_idx] - atomPos[atom_t0_idx];
                        //float dist_1 = atomPos[atom_tf_idx+1] - atomPos[atom_t0_idx+1];
                        //float dist_2 = atomPos[atom_tf_idx+2] - atomPos[atom_t0_idx+2];

float re = cos( s_qVecs[qVec_idx] * dist_0
                                        + s_qVecs[qVec_idx+1] * dist_1
                                        + s_qVecs[qVec_idx+2] * dist_2 );

                        float im = sin( s_qVecs[qVec_idx] * dist_0
                                        + s_qVecs[qVec_idx+1] * dist_1
                                        + s_qVecs[qVec_idx+2] * dist_2 );

                        atomicAdd( &out[2*(qValId*maxFrames + dt)], re / (nbrTimeOri*qVecs_dim1) );
                        atomicAdd( &out[2*(qValId*maxFrames + dt) + 1], im / (nbrTimeOri*qVecs_dim1) );

} // q vectors loop
                } // time origins loop
            } // time increments loop
        } // qVals loop
    } // condition on atom index
}

void cu_compIntScatFunc_wrapper(float *atomPos, int atomPos_dim0, int atomPos_dim1, int atomPos_dim2,
                                float *qVecs, int qVecs_dim0, int qVecs_dim1, int qVecs_dim2,
                                float *out, int maxFrames, int nbrTimeOri)
{
    // Copying atomPos matrix on GPU memory
    float *cu_atomPos;
    size_t size_atomPos = atomPos_dim0 * atomPos_dim1 * atomPos_dim2 * sizeof(float);
    cudaMalloc(&cu_atomPos, size_atomPos);
    cudaMemcpy(cu_atomPos, atomPos, size_atomPos, cudaMemcpyHostToDevice);

    // Copying qVecs matrix on GPU memory
    float *cu_qVecs;
    size_t size_qVecs = qVecs_dim0 * qVecs_dim1 * qVecs_dim2 * sizeof(float);
    cudaMalloc(&cu_qVecs, size_qVecs);
    cudaMemcpy(cu_qVecs, qVecs, size_qVecs, cudaMemcpyHostToDevice);

    // Copying out matrix on GPU memory
    float *cu_out;
    size_t size_out = 2 * qVecs_dim0 * maxFrames * sizeof(float);
    cudaMalloc(&cu_out, size_out);
    cudaMemcpy(cu_out, out, size_out, cudaMemcpyHostToDevice);

int nbrBlocks = ceil(atomPos_dim0 / BLOCK_SIZE);
    int sharedMemSize = sizeof(float) * 3 * (qVecs_dim0*qVecs_dim1);

    compIntScatFunc<<<nbrBlocks, BLOCK_SIZE, sharedMemSize>>>(cu_atomPos, atomPos_dim0, atomPos_dim1, cu_out,
                                                cu_qVecs, qVecs_dim0, qVecs_dim1, maxFrames, nbrTimeOri);
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );

    cudaMemcpy(out, cu_out, size_out, cudaMemcpyDeviceToHost);

    // Averages with the number of atoms
    for(int i=0; i < 2*qVecs_dim0*maxFrames; ++i)
    {
        out[i] /= atomPos_dim0;
    }

cudaFree(cu_atomPos);
    cudaFree(cu_qVecs);
    cudaFree(cu_out);
}

int main(){

  float *atomPos, *qVecs, *out;
  int atomPos_dim0 = 12212;
  int atomPos_dim1 = 1000;
  int atomPos_dim2 = 3;
  int qVecs_dim0   = 9;
  int qVecs_dim1   = 15;
  int qVecs_dim2   = 3;
  int maxFrames    = 1000;
  int out_dim      = 2*qVecs_dim0*maxFrames;
  int nbrTimeOri   = 50;
  atomPos = (float *)malloc(atomPos_dim0*atomPos_dim1*atomPos_dim2*sizeof(atomPos[0]));
  qVecs   = (float *)malloc(qVecs_dim0*qVecs_dim1*qVecs_dim2*sizeof(qVecs[0]));
  out     = (float *)malloc(out_dim*sizeof(out[0]));
  memset(out,0, out_dim*sizeof(out[0]));
  cu_compIntScatFunc_wrapper(atomPos, atomPos_dim0, atomPos_dim1, atomPos_dim2,
                             qVecs, qVecs_dim0, qVecs_dim1, qVecs_dim2,
                             out, maxFrames, nbrTimeOri);
  return 0;
}
$ nvcc -arch=sm_70 -o test test.cu
$ nvprof ./test
==30181== NVPROF is profiling process 30181, command: ./test
==30181== Profiling application: ./test
==30181== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.76%  16.5079s         1  16.5079s  16.5079s  16.5079s  compIntScatFunc(float*, int, int, float*, float*, int, int, int, int)
                    0.24%  39.023ms         3  13.008ms  1.8560us  39.013ms  [CUDA memcpy HtoD]
                    0.00%  7.5850us         1  7.5850us  7.5850us  7.5850us  [CUDA memcpy DtoH]
      API calls:   97.80%  16.5079s         1  16.5079s  16.5079s  16.5079s  cudaDeviceSynchronize
                    1.90%  320.03ms         3  106.68ms  9.5310us  319.78ms  cudaMalloc
                    0.23%  39.442ms         4  9.8605ms  17.236us  39.291ms  cudaMemcpy
                    0.03%  5.4108ms         4  1.3527ms  688.79us  3.3278ms  cuDeviceTotalMem
                    0.03%  4.9659ms       388  12.798us     337ns  520.84us  cuDeviceGetAttribute
                    0.00%  717.06us         3  239.02us  13.428us  494.40us  cudaFree
                    0.00%  452.62us         4  113.16us  102.36us  136.64us  cuDeviceGetName
                    0.00%  61.000us         1  61.000us  61.000us  61.000us  cudaLaunchKernel
                    0.00%  25.141us         4  6.2850us  3.2550us  11.006us  cuDeviceGetPCIBusId
                    0.00%  18.264us         3  6.0880us  1.5980us  13.711us  cuDeviceGetCount
                    0.00%  7.9940us         8     999ns     470ns  1.7490us  cuDeviceGet
                    0.00%  2.7780us         4     694ns     630ns     854ns  cuDeviceGetUuid
                    0.00%  1.4850us         1  1.4850us  1.4850us  1.4850us  cudaPeekAtLastError
$

interestingly, replacing the sin and cos calculations with sincosf did not help execution time, nor did removing the division operations that were scaling the atomicAdd quantities. These several datapoints suggest that your kernel is memory-bound, which is not surprising.

Following the loop-invariant theme, we can note that the output location index calculation:

atomicAdd( &out[2*(qValId*maxFrames + dt)], re / (nbrTimeOri*qVecs_dim1) );
atomicAdd( &out[2*(qValId*maxFrames + dt) + 1], im / (nbrTimeOri*qVecs_dim1) );
                ^^^^^^^^^^^^^^^^^^^^^^^^^

is invariant over the innermost 2 loops. We can exploit this to remove the actual atomic add operations out of the innermost 2 loops, and instead keep a per-thread running sum during the innermost 2 loops, updating the global variable at the end of those 2 inner loops. At this point we have removed enough memory-bounded-ness that my previous two suggestions can be implemented for benefit. The kernel has reduced from ~30s duration to less than 2s duration on my tesla V100:

$ cat test.cu
#include <stdio.h>

#include <cuda.h>
#include <cuda_runtime.h>

#define BLOCK_SIZE 512

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

__global__
void compIntScatFunc(float *atomPos, int atomPos_dim0, int atomPos_dim1, float *out,
                     float *qVecs, int qVecs_dim0, int qVecs_dim1,
                     int maxFrames, int nbrTimeOri)
{
    int atomId = blockDim.x * blockIdx.x + threadIdx.x;

    if( atomId < atomPos_dim0 )
    {
        extern __shared__ float s_qVecs[];

        for(int i=0; i < 3*qVecs_dim0*qVecs_dim1; ++i)
            s_qVecs[i] = qVecs[i];

        __syncthreads();

// Starts computation of intermediate scattering function
        for(int qValId=0; qValId < qVecs_dim0; ++qValId)
        {
            for(int dt=0; dt < maxFrames; ++dt)
            {
                int timeIncr = (int)( atomPos_dim1 - dt) / nbrTimeOri;
                float out_re = 0;
                float out_im = 0;

                for(int t0=0; t0 < nbrTimeOri; ++t0)
                {
                        // Gets indices
                        int atom_tf_idx = 3 * (atomId*atomPos_dim1 + t0*timeIncr + dt);
                        int atom_t0_idx = 3 * (atomId*atomPos_dim1 + t0*timeIncr);
                        // Computes distances for given timestep and atom
                        float dist_0 = atomPos[atom_tf_idx] - atomPos[atom_t0_idx];
                        float dist_1 = atomPos[atom_tf_idx+1] - atomPos[atom_t0_idx+1];
                        float dist_2 = atomPos[atom_tf_idx+2] - atomPos[atom_t0_idx+2];
                    for(int qVecId=0; qVecId < qVecs_dim1; ++qVecId)
                    {
                        // Gets indices
                        //int atom_tf_idx = 3 * (atomId*atomPos_dim1 + t0*timeIncr + dt);
                        //int atom_t0_idx = 3 * (atomId*atomPos_dim1 + t0*timeIncr);

                        int qVec_idx = 3 * (qValId * qVecs_dim1 + qVecId);

                        // Computes distances for given timestep and atom
                        //float dist_0 = atomPos[atom_tf_idx] - atomPos[atom_t0_idx];
                        //float dist_1 = atomPos[atom_tf_idx+1] - atomPos[atom_t0_idx+1];
                        //float dist_2 = atomPos[atom_tf_idx+2] - atomPos[atom_t0_idx+2];

                        float re,im;

                        sincosf( s_qVecs[qVec_idx] * dist_0
                                        + s_qVecs[qVec_idx+1] * dist_1
                                        + s_qVecs[qVec_idx+2] * dist_2,&re,&im);

                        out_re += re;
                        out_im += im;
                        //atomicAdd( &out[2*(qValId*maxFrames + dt)], re / (nbrTimeOri*qVecs_dim1) );
                        //atomicAdd( &out[2*(qValId*maxFrames + dt) + 1], im / (nbrTimeOri*qVecs_dim1) );

} // q vectors loop
                } // time origins loop
                        atomicAdd( &out[2*(qValId*maxFrames + dt)],     out_re / (nbrTimeOri*qVecs_dim1) );
                        atomicAdd( &out[2*(qValId*maxFrames + dt) + 1], out_im / (nbrTimeOri*qVecs_dim1) );
            } // time increments loop
        } // qVals loop
    } // condition on atom index
}

void cu_compIntScatFunc_wrapper(float *atomPos, int atomPos_dim0, int atomPos_dim1, int atomPos_dim2,
                                float *qVecs, int qVecs_dim0, int qVecs_dim1, int qVecs_dim2,
                                float *out, int maxFrames, int nbrTimeOri)
{
    // Copying atomPos matrix on GPU memory
    float *cu_atomPos;
    size_t size_atomPos = atomPos_dim0 * atomPos_dim1 * atomPos_dim2 * sizeof(float);
    cudaMalloc(&cu_atomPos, size_atomPos);
    cudaMemcpy(cu_atomPos, atomPos, size_atomPos, cudaMemcpyHostToDevice);

    // Copying qVecs matrix on GPU memory
    float *cu_qVecs;
    size_t size_qVecs = qVecs_dim0 * qVecs_dim1 * qVecs_dim2 * sizeof(float);
    cudaMalloc(&cu_qVecs, size_qVecs);
    cudaMemcpy(cu_qVecs, qVecs, size_qVecs, cudaMemcpyHostToDevice);

    // Copying out matrix on GPU memory
    float *cu_out;
    size_t size_out = 2 * qVecs_dim0 * maxFrames * sizeof(float);
    cudaMalloc(&cu_out, size_out);
    cudaMemcpy(cu_out, out, size_out, cudaMemcpyHostToDevice);

int nbrBlocks = ceil(atomPos_dim0 / BLOCK_SIZE);
    int sharedMemSize = sizeof(float) * 3 * (qVecs_dim0*qVecs_dim1);

    compIntScatFunc<<<nbrBlocks, BLOCK_SIZE, sharedMemSize>>>(cu_atomPos, atomPos_dim0, atomPos_dim1, cu_out,
                                                cu_qVecs, qVecs_dim0, qVecs_dim1, maxFrames, nbrTimeOri);
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );

    cudaMemcpy(out, cu_out, size_out, cudaMemcpyDeviceToHost);

    // Averages with the number of atoms
    for(int i=0; i < 2*qVecs_dim0*maxFrames; ++i)
    {
        out[i] /= atomPos_dim0;
    }

cudaFree(cu_atomPos);
    cudaFree(cu_qVecs);
    cudaFree(cu_out);
}

int main(){

  float *atomPos, *qVecs, *out;
  int atomPos_dim0 = 12212;
  int atomPos_dim1 = 1000;
  int atomPos_dim2 = 3;
  int qVecs_dim0   = 9;
  int qVecs_dim1   = 15;
  int qVecs_dim2   = 3;
  int maxFrames    = 1000;
  int out_dim      = 2*qVecs_dim0*maxFrames;
  int nbrTimeOri   = 50;
  atomPos = (float *)malloc(atomPos_dim0*atomPos_dim1*atomPos_dim2*sizeof(atomPos[0]));
  qVecs   = (float *)malloc(qVecs_dim0*qVecs_dim1*qVecs_dim2*sizeof(qVecs[0]));
  out     = (float *)malloc(out_dim*sizeof(out[0]));
  memset(out,0, out_dim*sizeof(out[0]));
  cu_compIntScatFunc_wrapper(atomPos, atomPos_dim0, atomPos_dim1, atomPos_dim2,
                             qVecs, qVecs_dim0, qVecs_dim1, qVecs_dim2,
                             out, maxFrames, nbrTimeOri);
  return 0;
}
$ nvcc -arch=sm_70 -o test test.cu
$ nvprof ./test
==31380== NVPROF is profiling process 31380, command: ./test
==31380== Profiling application: ./test
==31380== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   97.90%  1.82299s         1  1.82299s  1.82299s  1.82299s  compIntScatFunc(float*, int, int, float*, float*, int, int, int, int)
                    2.10%  39.065ms         3  13.022ms  1.8560us  39.054ms  [CUDA memcpy HtoD]
                    0.00%  7.4890us         1  7.4890us  7.4890us  7.4890us  [CUDA memcpy DtoH]
      API calls:   83.40%  1.82302s         1  1.82302s  1.82302s  1.82302s  cudaDeviceSynchronize
                   14.20%  310.39ms         3  103.46ms  9.4380us  310.13ms  cudaMalloc
                    1.81%  39.494ms         4  9.8734ms  18.425us  39.345ms  cudaMemcpy
                    0.29%  6.3763ms       388  16.433us     327ns  1.0507ms  cuDeviceGetAttribute
                    0.25%  5.4088ms         4  1.3522ms  678.82us  3.3299ms  cuDeviceTotalMem
                    0.03%  711.37us         3  237.12us  13.217us  491.27us  cudaFree
                    0.02%  457.97us         4  114.49us  101.89us  141.75us  cuDeviceGetName
                    0.00%  60.574us         1  60.574us  60.574us  60.574us  cudaLaunchKernel
                    0.00%  23.825us         4  5.9560us  3.2940us  10.578us  cuDeviceGetPCIBusId
                    0.00%  7.9490us         8     993ns     550ns  1.6350us  cuDeviceGet
                    0.00%  5.7940us         3  1.9310us     542ns  3.6520us  cuDeviceGetCount
                    0.00%  2.7430us         4     685ns     560ns     810ns  cuDeviceGetUuid
                    0.00%  1.6270us         1  1.6270us  1.6270us  1.6270us  cudaPeekAtLastError
$

This improved kernel may possibly run in a minute or less on your GPU. It may be worth a try with TDR disabled for ~90 seconds or so.

There are probably additional optimizations that might be made. For example, you have a somewhat inefficient array-of-structures (AoS) storage methodology, which leads to somewhat inefficient memory access for lines such as this:

float dist_0 = atomPos[atom_tf_idx] - atomPos[atom_t0_idx];
float dist_1 = atomPos[atom_tf_idx+1] - atomPos[atom_t0_idx+1];
float dist_2 = atomPos[atom_tf_idx+2] - atomPos[atom_t0_idx+2];

by reworking for example your atomPos storage to SoA, there might be some gains to be had. Once all the obvious optimization strategies have been tested, then you could use a profiler to focus your efforts for the next round of optimizations on this kernel.

Your global memory load patterns are also inefficient due to your indexing:

int atomId = blockDim.x * blockIdx.x + threadIdx.x;
int atom_t0_idx = 3 * (atomId*atomPos_dim1 + t0*timeIncr);
                  ^^^^^^^^^^^^^^^^^^^^^^^^

which the profiler will confirm. But before tackling any particular item, I would use the profiler to identify the most significant limiter to your kernel.

Another possible performance limiter in this code is the fact that you are only launching 23 blocks of 512 threads each. That is enough to “saturate” a relatively small GPU of 6 SMs or less, but for larger GPUs it will not come close to enough to saturate the GPU.

It should be relatively straightforward to distribute the first loop across multiple threadblocks, leaving 3 nested loops with 9*23 threadblocks total. This is still not enough to saturate a V100, but it may be an improvement, depending on the GPU you are running on.

It seems there may be another error in your code, in terms of sizing the grid:

int nbrBlocks = ceil(atomPos_dim0 / BLOCK_SIZE);
                     ^^^^^^^^^^^^^^^^^^^^^^^^^
                         integer division

In order to get your desired round-up intent here, one of the internal arguments would first need to be cast to floating-point, before the division operation.

And while we are at it, your use of __syncthreads() is potentially illegal also, for one block in the grid.

Wow, thanks a lot for this detailed help.

Indeed, my mistake, ‘out’ shape is (9, 1900).

Now it’s running fine, I’ve made some of your suggested modifications. Also I’ve removed the cudaMemcpy for ‘out’, from host to device, as it is useless actually.
Now I’m using a nbrTS variable, that basically gives the number of time increment to be used, and thus the second dimension of ‘out’ array.
It runs better if I take the first loop out of the kernel as you said.

This might still need some optimizations. But it is already running way faster than my CPU only version. I’ve put the nvprof result below. It takes around 8-9 seconds, using nbrTS = 100, nbrTimeOri = 50 and the array shapes given before.

Now I can quicly get very nice neutron scattering spectra from my simulations.

==3732== Profiling application: c:\users\kevin pounot\appdata\local\programs\python\python37\python.exe  C:\Users\Kevin Pounot\AppData
\Local\Programs\Python\Python37\Scripts\ipython.exe
==3732== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   98.70%  8.05955s         9  895.51ms  892.96ms  900.15ms  compIntScatFunc(float*, int, int, float*, float*, int, in
t, int, int, int)
                    1.30%  106.54ms         2  53.270ms  1.1840us  106.54ms  [CUDA memcpy HtoD]
                    0.00%  2.4640us         1  2.4640us  2.4640us  2.4640us  [CUDA memcpy DtoH]
      API calls:   96.70%  8.06073s         9  895.64ms  893.08ms  900.32ms  cudaDeviceSynchronize
                    2.00%  166.71ms         3  55.570ms  21.800us  165.73ms  cudaMalloc
                    1.27%  106.23ms         3  35.410ms  138.60us  105.50ms  cudaMemcpy
                    0.01%  1.1582ms         3  386.07us  78.600us  872.00us  cudaFree
                    0.01%  461.50us         9  51.277us  43.100us  61.100us  cudaLaunchKernel
                    0.00%  355.30us        97  3.6620us     200ns  240.20us  cuDeviceGetAttribute
                    0.00%  37.300us         1  37.300us  37.300us  37.300us  cuDeviceTotalMem
                    0.00%  9.8000us         1  9.8000us  9.8000us  9.8000us  cuDeviceGetPCIBusId
                    0.00%  4.6000us         2  2.3000us     400ns  4.2000us  cuDeviceGet
                    0.00%  2.0000us         3     666ns     300ns  1.1000us  cuDeviceGetCount
                    0.00%  1.4000us         1  1.4000us  1.4000us  1.4000us  cuDeviceGetName
                    0.00%     500ns         1     500ns     500ns     500ns  cuDeviceGetLuid
                    0.00%     400ns         1     400ns     400ns     400ns  cuDeviceGetUuid

Here is the updated code:

#include <stdio.h>

#include <cuda.h>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

__global__
void compIntScatFunc(float *atomPos, int atomPos_dim0, int atomPos_dim1, float *out,
                     float *qVecs, int qVecs_dim0, int qVecs_dim1, 
                     int nbrTS, int nbrTimeOri, int qValId) 
{
    int atomId  = blockDim.x * blockIdx.x + threadIdx.x;
    int TSIncr  = (atomPos_dim1 / nbrTS);

    if( atomId < atomPos_dim0 )
    {
        extern __shared__ float s_qVecs[];

        for(int i=0; i < 3*qVecs_dim0*qVecs_dim1; ++i)
            s_qVecs[i] = qVecs[i];

        __syncthreads();


        for(int dt=0; dt < nbrTS; ++dt)
        {
            int timeIncr = (float)(atomPos_dim1 - dt*TSIncr) / nbrTimeOri; 

            float sum_re = 0;
            float sum_im = 0;

            for(int t0=0; t0 < nbrTimeOri; ++t0)
            {
                // Gets indices
                int atom_tf_idx = 3 * (atomId*atomPos_dim1 + t0*timeIncr + dt*TSIncr); 
                int atom_t0_idx = 3 * (atomId*atomPos_dim1 + t0*timeIncr);

                // Computes distances for given timestep and atom
                float dist_0 = atomPos[atom_tf_idx] - atomPos[atom_t0_idx];
                float dist_1 = atomPos[atom_tf_idx+1] - atomPos[atom_t0_idx+1];
                float dist_2 = atomPos[atom_tf_idx+2] - atomPos[atom_t0_idx+2];

                for(int qVecId=0; qVecId < qVecs_dim1; ++qVecId)
                {

                    int qVec_idx = 3 * (qValId * qVecs_dim1 + qVecId);

                    float re = cos( s_qVecs[qVec_idx] * dist_0 
                                    + s_qVecs[qVec_idx+1] * dist_1
                                    + s_qVecs[qVec_idx+2] * dist_2 );

                    float im = sin( s_qVecs[qVec_idx] * dist_0 
                                    + s_qVecs[qVec_idx+1] * dist_1
                                    + s_qVecs[qVec_idx+2] * dist_2 );

                    sum_re += re;
                    sum_im += im;

                } // q vectors loop
            } // time origins loop 

            atomicAdd( &(out[2*(qValId*nbrTS + dt)]), sum_re / (nbrTS*qVecs_dim1) );
            atomicAdd( &(out[2*(qValId*nbrTS + dt) + 1]), sum_im / (nbrTS*qVecs_dim1) );

        } // time increments loop

    } // condition on atom index

}





void cu_compIntScatFunc_wrapper(float *atomPos, int atomPos_dim0, int atomPos_dim1, int atomPos_dim2, 
                                float *qVecs, int qVecs_dim0, int qVecs_dim1, int qVecs_dim2, 
                                float *out, int nbrTS, int nbrTimeOri)
{
    // Copying atomPos matrix on GPU memory
    float *cu_atomPos;
    size_t size_atomPos = atomPos_dim0 * atomPos_dim1 * atomPos_dim2 * sizeof(float);
    gpuErrchk( cudaMalloc(&cu_atomPos, size_atomPos) );
    gpuErrchk( cudaMemcpy(cu_atomPos, atomPos, size_atomPos, cudaMemcpyHostToDevice) );

    // Copying qVecs matrix on GPU memory
    float *cu_qVecs;
    size_t size_qVecs = qVecs_dim0 * qVecs_dim1 * qVecs_dim2 * sizeof(float);
    gpuErrchk( cudaMalloc(&cu_qVecs, size_qVecs) );
    gpuErrchk( cudaMemcpy(cu_qVecs, qVecs, size_qVecs, cudaMemcpyHostToDevice) );

    // Copying out matrix on GPU memory
    float *cu_out;
    size_t size_out = 2 * qVecs_dim0 * nbrTS * sizeof(float);
    gpuErrchk( cudaMalloc(&cu_out, size_out) );

    int nbrBlocks = ceil((float)atomPos_dim0 / BLOCK_SIZE);
    int sharedMemSize = sizeof(float) * 3 * qVecs_dim0 * qVecs_dim1; 


    // Starts computation of intermediate scattering function
    for(int qValId=0; qValId < qVecs_dim0; ++qValId)
    {
        compIntScatFunc<<<nbrBlocks, BLOCK_SIZE, sharedMemSize>>>(cu_atomPos, atomPos_dim0, 
                                                            atomPos_dim1, cu_out, cu_qVecs, 
                                                            qVecs_dim0, qVecs_dim1, nbrTS, 
                                                            nbrTimeOri, qValId);
        gpuErrchk( cudaDeviceSynchronize() );
    }



    cudaMemcpy(out, cu_out, size_out, cudaMemcpyDeviceToHost);

    for(int i=0; i < 2*nbrTS*qVecs_dim0; ++i)
    {
        out[i] /= atomPos_dim0;
    }

    cudaFree(cu_atomPos);
    cudaFree(cu_qVecs);
    cudaFree(cu_out);
}

Actually you have introduced a bug by doing that.

Your kernel only adds to the out variable. It never initializes it (and it shouldn’t!)

This host code does not initialize out either:

gpuErrchk( cudaMalloc(&cu_out, size_out) );

So you have nothing to initialize out (i.e. cu_out). It may contain garbage. When you do an atomicAdd to it, you are adding to garbage.

You must initialize the device out variable before running your kernel. If you don’t need to initialize it from any host data, i.e. you just want to set it to zero, you can do

gpuErrchk( cudaMemset(cu_out, 0, size_out) );

right after the allocation line I indicated above, but before the kernel launch.

And you still have the issue with __syncthreads():

if( atomId < atomPos_dim0 )
    {
        extern __shared__ float s_qVecs[];

        for(int i=0; i < 3*qVecs_dim0*qVecs_dim1; ++i)
            s_qVecs[i] = qVecs[i];

        __syncthreads();  // there may be a block in which not all threads participate, due to if statement

for(int dt=0; dt < nbrTS; ++dt)

You can fix it like this:

extern __shared__ float s_qVecs[];
    if( atomId < atomPos_dim0 )
    {

for(int i=0; i < 3*qVecs_dim0*qVecs_dim1; ++i)
            s_qVecs[i] = qVecs[i];
    }

    __syncthreads();

    if( atomId < atomPos_dim0 )
    {

        for(int dt=0; dt < nbrTS; ++dt)

Thanks, I didn’t think about these ‘silent’ threads that might cause problems with __syncthreads(), I’ve updated the code.

I’ve corrected also the array initialization, but it’s a bit strange though.
The ‘original’ array is initialized in Python using numpy.zeros() function.
Then, I call the wrapper function from cythonized .cpp/.pyx whose only role is to call the function in the pre-processed (that is, the created .lib file) .cu file.

If I use a cudaMemcpy from this ‘out’ to the ‘cu_out’, I got very strange result, with intermediate scattering function that looks like nothing, and a lot of inf, nan or values to the power 37.
It looks like some strange stuff happens when Python arrays are passed to C/CUDA functions…

However, by removing it, I get the expected behavior, quite similar to the experimental measurement.
And I get the same result with the cudaMemset (I didn’t try to check the memory adresses though).

what are you using as the datatype for the numpy.zeros() function? Your CUDA code is mostly using 32-bit quantities (integer and float) whereas numpy/python may use 64-bit by default. If your array is all zeroes, I’m not sure this should matter, but I don’t have your full python code to look at.

Anyway, if you don’t do anything else, use the cudaMemset. Your code is invalid if you don’t initialize cu_out array at all before doing atomicAdd to it.

Sure, the cudaMemset line is there now.

For the python array, I specified the numpy.float32 as dtype.