Error code 716 keeps popping up in the project

I was going through the code at - project link while following the article “accelerated ray tracing in one weekend using CUDA”. I am having problems running my code at higher numbers of samples per pixel. (In the project, the samples per pixel is denoted by “ns” in the ./ file.)

int main() {
int nx = 1200;
int ny = 800;
int ns = 10;

Sometimes the application fails to run and error 716 (misaligned address) pops up. I investigated this error and found that at higher samples per pixel (try ns = 100) the code almost never works.

So is this error happening because I am overloading my threads? If so, can someone point me to good material on how object initialization works in __device__ functions? After launching the render<<<>>> kernel we call a lot of device functions, each of which initializes many objects, and I want to know how CUDA handles memory allocation for such variables.

Also, if I want to run this program at a very high number of samples per pixel, can someone point me in the direction I should be working towards?

I am using an NVIDIA GeForce GTX 1660 Ti and CUDA version 11.5.

OK, so I investigated the problem further with Compute Sanitizer and got the following error message:

========= Invalid __local__ write of size 4 bytes
=========     at 0x3e60 in D:/CudaProjects/raytracinginoneweekendincuda-ch12_where_next_cuda/material.h:52:render(vec3 *, int, int, int, camera **, hitable **, curandStateXORWOW *)
=========     by thread (4,1,0) in block (128,1,0)
=========     Address 0xfffb7a is misaligned
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:cuEventRecordWithFlags [0x7ffbc0ecc7b5]
=========                in 
=========     Host Frame: [0x1d46]
=========                in 
-- More  --
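
As a small sanity check on the report above: a 4-byte write needs an address that is a multiple of 4, and 0xfffb7a leaves a remainder of 2. The sketch below only reproduces that arithmetic; the names is_aligned, reported_address and check_reported_address are hypothetical helpers, not part of the project or of CUDA, and the check says nothing about how the address came to be wrong.

```cpp
#include <cassert>
#include <cstdint>

// True when `address` is a multiple of `size`.
// Assumption: `size` is a power of two (4 for the write in the report).
constexpr bool is_aligned(std::uint64_t address, std::uint64_t size) {
    return (address & (size - 1)) == 0;
}

// Address taken from the compute-sanitizer message.
constexpr std::uint64_t reported_address = 0xfffb7a;

// Hypothetical helper that restates the sanitizer's finding as assertions.
void check_reported_address() {
    assert(reported_address % 4 == 2);          // two bytes past a 4-byte boundary
    assert(!is_aligned(reported_address, 4));   // hence "misaligned" for a 4-byte write
    assert(is_aligned(reported_address - 2, 4)); // 0xfffb78 would have been fine
}
```

Calling check_reported_address() from any main() passes silently, which matches the sanitizer's verdict for this address.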

The reason I think I am overloading the threads is that this error does not appear at lower numbers of samples per pixel (ns), i.e. fewer iterations per pixel. Line 52 of material.h, where the error is reported, creates a ray object:

scattered = ray(rec.p, target-rec.p);

That is why I was concerned with the life cycle of objects initialized in a device function. I wanted to know whether the memory occupied by those objects is deallocated once they go out of scope, i.e. after the function has finished executing.

You may be running into a kernel timeout error.

As for object lifetime, the behavior should be consistent with C++.
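
To illustrate what "consistent with C++" means for the lifetime question, here is a minimal host-side sketch. It assumes a hypothetical Tracked type standing in for the project's ray and vec3 classes, and a hypothetical scatter() that mirrors the shape of the statement at material.h:52. It is plain C++, not device code, so it shows the language rule only and says nothing about where a GPU thread physically stores such objects.

```cpp
#include <cassert>

// Hypothetical stand-in for the project's types: it only counts how many
// instances are alive at any moment.
struct Tracked {
    static int live;
    Tracked() { ++live; }
    Tracked(const Tracked&) { ++live; }
    Tracked& operator=(const Tracked&) = default;
    ~Tracked() { --live; }
};
int Tracked::live = 0;

// Same shape as `scattered = ray(rec.p, target-rec.p);`: a temporary is
// built, copied into the caller's object, then destroyed.
void scatter(Tracked& scattered) {
    Tracked local;              // automatic object, destroyed when scatter() returns
    scattered = Tracked();      // temporary is destroyed at the end of this statement
    assert(Tracked::live == 2); // the caller's object plus `local`
}

// Returns the number of objects still alive after one full call.
int live_after_scatter() {
    {
        Tracked out;                // owned by the caller
        scatter(out);
        assert(Tracked::live == 1); // only `out` survives the call
    }
    return Tracked::live;           // nothing lingers once the scopes close
}
```

In this sketch live_after_scatter() returns 0 on every call, so nothing accumulates from one iteration to the next no matter how many times the function runs.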

Thank you for the reply. I will certainly look into the possibility of a timeout, but I would like to point out that in a few rare instances, roughly 1 in 100, the program has completed successfully at 100 samples per pixel. What I cannot understand is the sporadic behaviour of the program: the same error also appears at smaller numbers of samples per pixel, only much less frequently.