How to fix and debug misalignment errors that pop up sporadically?

Please refer to this github repo to gain some more insights on the error that I keep on getting - Accelerated ray tracing using cuda

Ok I know that just giving out a link to a repo is not the right way to ask your doubts , but I will try to localize the error as much as I can in my question -

So I have been enhancing and trying to run this code on my laptop - (GTX - 1660ti and Cuda v- 12) , well for most of the parts the code works perfectly fine but for larger no of iterations - i.e. for larger images or for higher no of samples per pixels , I used to get this sporadic error which gave off an Error - 716 code , now on running with compute-sanitizer at that time this kind of message popped up -

========= COMPUTE-SANITIZER
========= Invalid __local__ write of size 4 bytes
=========     at 0x3e60 in D:/CudaProjects/raytracinginoneweekendincuda- 
ch12_where_next_cuda/material.h:52:render(vec3 *, int, int, int, camera 
**, hitable **, curandStateXORWOW *)
=========     by thread (4,1,0) in block (128,1,0)=========     Address 
0xfffb7a is misaligned
=========     Saved host backtrace up to driver entry point at kernel 
launch time
=========     Host Frame:cuEventRecordWithFlags [0x7ffbc0ecc7b5]
=========                in 
C:\WINDOWS\system32\DriverStore\FileRepository
\nvam.inf_amd64_4c9ded46d0fbe1 
f8\nvcuda64.dll
=========     Host Frame: [0x1d46]
=========                in 
D:\CudaProjects\raytracinginoneweekendincuda-ch12_where_next_cuda\a.exe
-- More  --

Now this line in the repo is actually an initializtion of a ray object -

scattered = ray(rec.p, target-rec.p);

Well at that time after asking around a lot , and getting no where near to the root of the problem , I decided to move on to adding more features to the project , like adding more textures , materials new shapes , but the misalignment error never left me , It was always lurking in the background popping up here and there for higher no of iterations.

Now that I am reaching the end of my project I again tried to look into this error and found some very weird patterns -

Here is an error code which traces itself to the ray.direction() function in the code which was raised in the hit function of a traingle renderer that I coded (this may sound very annoying but I cannot share the whole code)
Here the grid size is - 512X512
and the block size is - 8X8
(For the render kernel)

========= COMPUTE-SANITIZER
Started creation and pre processing of data 
Pre processing and creation of the world took 0.077
Rendering a 512x512 image
========= Invalid __local__ read of size 8 bytes
=========     at 0x145a0 in C:/Users/sonas/Documents/Capstone 2022-23/raytracingcuda/ray.cuh:14:ray::direction() const
=========     by thread (4,1,0) in block (135,0,0)
=========     Address 0xfffae2 is misaligned
=========     Device Frame:C:/Users/sonas/Documents/Capstone 2022-23/raytracingcuda/triangle.cuh:49:render(vec3 *, int, int, int, camera **, hittable **, curandStateXORWOW *) [0x145a0]
=========     Device Frame:C:/Users/sonas/Documents/Capstone 2022-23/raytracingcuda/hittable_list.cuh:35:render(vec3 *, int, int, int, camera **, hittable **, curandStateXORWOW *) [0x86c0]
=========     Device Frame:C:/Users/sonas/Documents/Capstone 2022-23/raytracingcuda/main.cu:33:color(const ray &, hittable **, curandStateXORWOW *, vec3 &) [0xf40]
=========     Device Frame:C:/Users/sonas/Documents/Capstone 2022-23/raytracingcuda/main.cu:114:render(vec3 *, int, int, int, camera **, hittable **, curandStateXORWOW *) [0xa90]

Now this hit function is very similar to the hit function of the sphere renderer defined in the repo that I shared , and the line where this error popped up simply said -

vec3 r_d = r.direction();

Now it baffles me how a simple initialization can lead to a misaligment error and too an error which is sporadic in behaviour. now I tried multiple times with multiple shapes and found some interesting patterns -

  • This error always happens in the thread no (4,1,0) the block varies everytime but since it says “invalid local read” I guess it has something to do with the local memory defined for each thread?

  • This error always pops up where I am either initializing a ray object , or assigning a ray object or calling any of its data members - like r.direction()

  • This one is very weird but I tried calling r.origin() first to see if its all the data members that are acting up , but still the error popped up from the r.direction() function

Here are some of the things that I tried to fix this error -

  • Someone told me that in the vec3 class as I am using float[3] array , and as cuda supports reading of words in 1,2,4,8 or 16 bytes this mismatch may be causing the error , I tried changing this to float3 first and then just using three float variables, but none seemed to work for me and the error still stayed with similar messages

  • I tried using the align(16) notation before each class decleration to somehow fix this sporadic error(this was my very desperate move) and voila it still didnt work

  • I tried eliminating the member functions entirely and just tried to access the data object directly but still faced the same issues

Now as you can see by these methods of mine , I literally have no idea now to how actually even try to fix this problem , whichever article or answer i refer too, has some very clear misalignment issues like casting a variable to some other type of variable and on top of that they are not sporadic, I do not wish for a very clear cut answer but if someone could point me out to some resource or some article which can help me to learn and understand more about this error , I will forever be grateful.

I know it was a long question and thank you for reading it to the end , cheers!

the topic of natural alignment and what it means is discussed in the programming guide. You should start by understanding what it means. This error means you violated that requirement.

You already know how to use compute sanitizer to localize the fault to a particular line of code.

So the next step is to manually test in your code, whether the pointer you are about to dereference is properly aligned. If you start by understanding what the requirement is, you should be able to write some code to test the pointer before you use it. Put that code immediately prior to the faulting line. If you detect misalignment in your test code, then don’ allow the faulting line to actually execute. Instead, print out some data and exit, and use that to determine your next debug direction.

Follow the trail backwards until you find out where the misaligned pointer originated.

You may also wish to use the various sub-tools of compute-sanitizer, for example, to verify that you are not using any uninitialized global data.

For example,

Address 0xfffae2

could never be properly aligned for a 8-byte read. It’s a trivial matter to test that pointer and discover that, before it causes an actual fault. Once you have detected the problem, use in-kernel printf to print out whatever data you need to start the backward-tracing debug process. And exit at that point.

Thank you so much for your prompt reply @Robert_Crovella , I looked further as you instructed and would like to share my findings , well what suprised me is that when I used initcheck with compute-sanitizer for every thread I am getting this kind of error -

========= Uninitialized __global__ memory read of size 8 bytes
=========     at 0x110 in C:/Users/sonas/Documents/Capstone 2022-23/raytracingcuda/main.cu:100:render(vec3 *, int, int, int, camera **, hittable **, curandStateXORWOW *)
=========     by thread (2,5,0) in block (0,0,0)
=========     Address 0x71be00800
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:cuEventRecordWithFlags [0x7ffdc514e5c8]
=========                in C:\WINDOWS\system32\DriverStore\FileRepository\nvamui.inf_amd64_91804b01bce922dd\nvcuda64.dll
=========     Host Frame: [0x2a873]
=========                in C:\Users\sonas\Documents\Capstone 2022-23\raytracingcuda\exefiles\triangle_demo.exe
=========     Host Frame: [0x2a736]
=========                in C:\Users\sonas\Documents\Capstone 2022-23\raytracingcuda\exefiles\triangle_demo.exe
=========     Host Frame: [0x2a094]
=========                in C:\Users\sonas\Documents\Capstone 2022-23\raytracingcuda\exefiles\triangle_demo.exe
=========     Host Frame: [0x216eb]
=========                in C:\Users\sonas\Documents\Capstone 2022-23\raytracingcuda\exefiles\triangle_demo.exe
=========     Host Frame: [0x1e6f4]
=========                in C:\Users\sonas\Documents\Capstone 2022-23\raytracingcuda\exefiles\triangle_demo.exe

I checked and in main.cu::100 I am just derefrencing the curandState pointer to access a memory at some index - It goes like -

curandState local_state = rand_state[rand_state_index];

So the memory that I am pointing with rand_state I am initializing in a separate kernel and then just using the same pointer in my render kernel to access the curand_states (very similar to how in the repo its done)
For your refrence I am attaching the render_init kernel and the render kernel -
(I have first assigned memory to the rand_state pointer using cudamalloc and then just called render_init and render kernel simultaneously and have used the same rand_state pointer for both the kernels and also the dimensions of the render_init and render kernel are the same)

__global__ void render_init(int max_x, int max_y, curandState *rand_state) {
    int total_no_threads_in_a_block = blockDim.x*blockDim.y;
    int total_no_threads_in_a_row = total_no_threads_in_a_block*gridDim.x;
    int pixel_index = threadIdx.x + threadIdx.y*blockDim.x + total_no_threads_in_a_block*blockIdx.x+
    total_no_threads_in_a_row*blockIdx.y;
    curand_init(1984+pixel_index, 0, 0, &rand_state[pixel_index]);
}

__global__ void render(vec3 *anti_alias_pixel_arr,int nx,int ny,int ns,camera **d_camera
, hittable **d_world, curandState *rand_state)
{    
    if(blockIdx.x>=nx||blockIdx.y>=ny)
        return;
    int total_no_threads_in_a_block = blockDim.x*blockDim.y;
    int total_no_threads_in_a_row = total_no_threads_in_a_block*gridDim.x;
    int anti_alias_pixel_index = threadIdx.x + threadIdx.y*blockDim.x + total_no_threads_in_a_block*blockIdx.x+
    total_no_threads_in_a_row*blockIdx.y;
    int rand_state_index = anti_alias_pixel_index;


    //the error pops here
    curandState local_state = rand_state[rand_state_index];
    
    vec3 col(0,0,0);
    float u,v;
    ray r;
    
    u = float(blockIdx.x+curand_uniform(&local_state))/float(nx);
    v = float(blockIdx.y+curand_uniform(&local_state))/float(ny);
    
    vec3 background(0.70, 0.80, 1.00);
    
    // the ray object 
    r = (*d_camera)->get_ray(u,v,&local_state);
    col=color(r,d_world,&local_state,background);
    
    rand_state[rand_state_index] = local_state;
    anti_alias_pixel_arr[anti_alias_pixel_index] = col;
    
}

Now I have no idea what “uninitialized” global data really means? If its really uninitialized why am I able to use the “local_state” variable further on with my code? Am I missing something?

Also I looked into your suggesstion of further backtracing the problem to the creation of the misaligned “pointer” the thing is , the ray object that I am using is not a pointer but a simple ray object variable that I am explicitly initializing in my camera.h file (very similar to how its done in the repo , you can refer that I am also putting the code here) -

__device__ ray get_ray(float s, float t,curandState *rand_state) const {
            vec3 rd = lens_radius * random_in_unit_sphere(rand_state);
            vec3 offset = u * rd.x() + v * rd.y();

            return ray(
                origin + offset,
                lower_left_corner + s*horizontal + t*vertical - origin - offset
                ,random_double(rand_state,time0,time1)
            );
        }

Now the first time that this problem popped up was with local write and that too happened on a explicit constructor wherein I was trying to initialize my ray object. So is the problem has something to do with explicit constructors in the kernel ? Cause I saw that constructors never really run on the gpu , the memory is first assigned in the cpu and then copied back to our device for our usage.

Thank you again , for atleast giving me some hint on where I could be going wrong , it was of huge help ! cheers!

Hmmm.

Do you know what “global” means as far as GPU memory map is concerned? Data in the global space is typically data in a region that you allocate with cudaMalloc or cudaMallocManaged. It could also be pinned host memory and a few other cases.

Do you know what “uninitialized” means? It means you allocated an item, but never set it to anything. The first thing you did after you allocated the item is read from it. That makes no sense and could not possibly be a useful programming strategy.

I’m not going to spend any time trying to explain the behavior of code that is reading and using uninitialized data. The proper approach is to find out why it is uninitialized and fix it.

Thank you @Robert_Crovella for replying , sorry but it was never my intention to waste anyone’s time with any of my questions , but it is just that I don’t know that’s why I am asking.

Anyways to mitigate these errors I completely removed the rand_state pointer and started initiating my own local version of rand_state in the render kernel although I am still not sure why this works because before I was initiating my global rand_state in the render_init kernel and then using it in my render kernel and that is why I was suprised why it gave off an error of unintialized __global__ memory read, well after this too I was getting a lot of uninitialized __global__memory reads/writes anyways I tried to reproduce my errors by simulating a similar scenario as what is happening in my actual project -I am attaching the code -

#include <iostream>
#include <curand_kernel.h>

class __align__(16) vec3{
    int x,y,z;

    public:
        __device__ vec3(int a,int b,int c):
        x(a),y(b),z(c){};
};



class __align__(16) material{
    public:
        __device__ virtual bool scatter(vec3 &attentuation)const = 0;

};

struct __align__(16) record{
    material *ma;
};

class __align__(16) hittable{
    public:
        __device__ virtual bool hit(curandState *local_state,record &rec)const=0;
};

class __align__(16) textures{
    public:
    __device__ virtual vec3 value() const = 0;
};

class __align__(16) solid_color:public textures{
    vec3 color_value;

    public:

        __device__ solid_color(vec3 c):
        color_value(c){}
        

        __device__ virtual vec3 value()const override{
            return color_value;
        }


};

class __align__(16) lambertian : public material{
    textures *a;

    public:
        __device__ lambertian(vec3 tex):a(new solid_color(tex)){}

        __device__ virtual bool scatter(vec3 &attentuation)const override{
            attentuation = a->value();
        }

    
};


class __align__(16) triangle:public hittable{
    public:
    vec3 point_x,point_y,point_z;
    material *ma;
        __device__ triangle(vec3 p_x,vec3 p_y,vec3 p_z,material *m):
        point_x(p_x),point_y(p_y),point_z(p_z),ma(m){}
        __device__ virtual bool hit(curandState *local_state,record &rec)const override{
            rec.ma = ma;
            return true;
        }
};

class __align__(16) collection_of_triangles:public hittable{
    hittable **list;
    int size;

    public:
        __device__ collection_of_triangles(hittable **l,int s):
        list(l),size(s){}

        __device__ virtual bool hit(curandState *local_state,record &r)const override{
            int index = curand_uniform(local_state)*size;
            int ans=1;
            list[index]->hit(local_state,r);
            return true;
        }
};


__global__ void init_render(hittable **list,
hittable **c_o_t,int size)
{
    for(int i=0;i<size;i++)
    {
        list[i] = new triangle(vec3(1,1,1),vec3(1,1,1),vec3(1,1,1),
        new lambertian(vec3(1,1,1)));
    }

    *c_o_t = new collection_of_triangles(list,size);
}

__global__ void render_hit(hittable **c_o_t)
{
    int total_no_threads_in_a_block = blockDim.x*blockDim.y;
    int total_no_threads_in_a_row = total_no_threads_in_a_block*gridDim.x;
    int pixel_index = threadIdx.x + threadIdx.y*blockDim.x + total_no_threads_in_a_block*blockIdx.x+
    total_no_threads_in_a_row*blockIdx.y;
    curandState local_state;
    curand_init(1984+pixel_index, 0, 0, &local_state);
    vec3 ans(0,0,0);
    for(int i=0;i<51;i++)
    {
        record rec;
        if((*c_o_t)->hit(&local_state,rec))
        {
            rec.ma->scatter(ans);
        }
    }

}

int main(){
    hittable **triangle_list;
    hittable **collection;
    int size = 500;
    cudaMalloc(&triangle_list,sizeof(hittable)*size);
    cudaMalloc(&collection,sizeof(hittable));
    init_render<<<1,1>>>(triangle_list,collection,size);
    cudaDeviceSynchronize();
    render_hit<<<dim3(512,512),dim3(12,12)>>>(collection);
}

You can check with the initcheck tool that even though I am initializing all my data in the init_render kernel , I am still getting the uninitialized __global__ memory read error

I found that in my version of the code I was using __align__(16) with all of my class declarations(I saw in the cuda programming guide that this could probably help in any misalignment errors), and I am not sure why but removing it got me rid of any further uninitialized __global__ reads or write errors.

Your inputs were extremely valuable , and I would just like to ask you one more question , can uninitialized __global___ data especially in my case when I was forcing alignment using __align__(16) could cause misalignment errors?

Thank you for reading till the end and again I am extremely sorry If my queries and observations were too trivial or nonsensical in any sense, cheers!

I haven’t studied your code.

If you are reading uninitialized global data, and that data is somehow involved in an indexing calculation, then yes, it could definitely result in an alignment error.

You should be able to use the same method you already use to localize such an error to a specific line of code. You can then follow the instructions I already gave. And its fairly easy to look at a pointer in an error report and identify whether it is properly aligned for a specific data type, or not.

I will keep that in mind in the future , thank you !

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.