Accelerated Ray Tracing in One Weekend in CUDA

Originally published at: https://developer.nvidia.com/blog/accelerated-ray-tracing-cuda/

Recent announcements of NVIDIA’s new Turing GPUs, RTX technology, and Microsoft’s DirectX Ray Tracing have spurred a renewed interest in ray tracing. These technologies vastly simplify writing applications that use ray tracing. But what if you’re curious about how ray tracing actually works? One way to learn is to code your own ray tracing…

Great post! Just wanted to point out a typo you made in Chapter 12.

"Using a new GeForce RTX 1080, the image renders in just 4.7 seconds for a 19x speedup!"

Was kinda funny :)

Good catch! We fixed it in the text, thanks.

What is fb? You keep referencing fb in your code but it is undefined when I try to run the code. I am just trying to get the code to run and show graphics.

Thank you for your question. It was an oversight to not show how I allocate the frame buffer (aka "fb") that holds the pixel RGB values. I’ll work with the editor to update the post. This is shown in the Chapter 1 branch of the reference code here: https://github.com/rogerall...
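To summarize here: fb is allocated as unified memory with cudaMallocManaged, so the GPU can render into it and the CPU can read it back to write out the image. A minimal sketch, matching the allocation that appears in the full main.cu later in this thread (nx and ny are the image dimensions):

// frame buffer of nx*ny RGB values, accessible from both host and device
int num_pixels = nx*ny;
size_t fb_size = num_pixels*sizeof(vec3);
vec3 *fb;
checkCudaErrors(cudaMallocManaged((void **)&fb, fb_size));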

Thank you for your contribution. It is greatly appreciated by the CUDA community!

I was able to run your code successfully! On my CPU it takes 30 minutes to generate the image, but with GPU assistance using CUDA I'm down to approximately 3 minutes. That's a really big deal! I have an NVIDIA GeForce GTX 1050 Ti graphics card.

One last thing regarding the code. In the later chapters, the CUDA code uses the M_PI constant in camera.h. Please add the following snippet before the camera.h include so there are no build-time errors:

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif
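For example, in main.cu the guard goes right above the include (MSVC's math.h only defines M_PI when _USE_MATH_DEFINES is set, so this keeps the build portable):

// define M_PI ourselves; MSVC's <math.h> omits it by default
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif
#include "camera.h"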

Again, thanks for your contribution. I will tinker with the code so I can learn ray tracing better. Have a great day!

-CUDA Education

I'm happy to hear you were successful in translating the code, and your suggestion is a good one. I have found that I need that #define when I try out the code on Windows. Thanks!

Great post!
Somehow, on my GTX 1050 with CUDA 9.2, my machine freezes completely at the end of the rendering. After a lot of trials I figured out it happens when accessing fb on the CPU while writing the image to disk. Copying the buffer to a host array resolved the issue for me.
Any idea what could be causing this?

The only thing I can think of is that you somehow removed the cudaDeviceSynchronize() after the render call and before the first read from fb. That could cause an issue like this.

If that wasn't it, suggest you try to find the simplest version of the code that doesn't work (try going back to Chapter 1 to see if that passes) and we could attempt to debug via an Issue on the github repository.
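For reference, the pattern that must stay intact looks like this (a sketch based on the main.cu shown later in this thread):

// launch the render kernel; the GPU writes fb asynchronously
render<<<blocks, threads>>>(fb, nx, ny,
                            vec3(-2.0, -1.0, -1.0),
                            vec3(4.0, 0.0, 0.0),
                            vec3(0.0, 2.0, 0.0),
                            vec3(0.0, 0.0, 0.0),
                            d_world);
checkCudaErrors(cudaGetLastError());
// block until the GPU has finished writing fb; reading fb on the host
// before this returns is a race that can hang or crash the machine
checkCudaErrors(cudaDeviceSynchronize());
// from here on it is safe to read fb on the host
int ir = int(255.99*fb[pixel_index].r());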

Let me try to simplify the code while still reproducing the issue; I'll open a GitHub issue when I do. Thanks again!

Thanks very much for this!! I was hoping someone could give some insight into some issues I'm having running this with VS 2017 on Windows 10. I basically started a blank CUDA 9.1 Runtime project in VS 2017, then copied and pasted the header files and main.cu from the Chapter 12 code. I also selected the VS 2015 v140 platform toolset, and under CUDA C/C++/Device in the Configuration Properties changed Code Generation to "compute_60,sm_60" to presumably match my GTX 1080 Ti. I then ran the debugger, and aside from the IntelliSense warnings for the kernel launcher ("<<< >>>") and for not recognizing "blockIdx", it ran fine. I sent the output to a .ppm file, dragged that into GIMP, and the rendered image looked fine.

My problem is that it seems it never utilized the GTX 1080 Ti, since the render (10 samples) took about 46 seconds, and Task Manager showed no significant GPU activity on either the 3D or CUDA (or any other) engines. Instead, it seemed to bounce between two of my CPU cores (8-core Ryzen).

My other concern is that it never opened a window showing the GPU render progressing in real time; it only printed the RGB values in the command window.

I'm guessing I missed a step in configuring the GPU in Visual Studio. Anyone have any insights? Thanks much.

Oh, and if anyone has any additional insight into adding a simple UI to this, like Dear ImGui, so I can manually tweak settings during the render, I'd be thrilled...

I am not sure what could be happening with your VS 2017 project. If the code makes CUDA calls, it will use the GPU; there's no way around that. I've done something similar and gotten sub-10-second renders with a GTX 1070. Maybe debug string output is super slow? One suggestion: make sure you build a Release executable.
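If you build from the command line rather than Visual Studio, the same Debug/Release distinction shows up in the nvcc flags (a rough sketch; -G disables device code optimizations for debugging and can slow kernels down dramatically):

nvcc -G  main.cu -o cudart   # debug: device optimizations off, kernels much slower
nvcc -O3 main.cu -o cudart   # release: optimized device code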

Note that there is no GUI for this project. It only outputs to the console.

BINGO!!! 4.5 seconds!!! It was the Release vs. Debug build, as you mentioned. DOH!!! Thanks much.

Fantastic!

Dear Roger Allen,

Thank you very much for your excellent implementation of ray tracing/casting in CUDA. I was really surprised that you managed to use virtual functions and the abstract class hitable to create other classes. You managed to derive the hitable_list and sphere classes from the hitable abstract class. The CUDA manuals published by NVIDIA claim that it is impossible to use virtual functions and abstract classes to create other classes for use in global kernel functions. Following your steps, I derived other classes from the hitable class, like Plane, Instance (for moving and rotation), and Compound (for drawing complex objects). All of them can be nested in each other. I even managed to implement the class Grid for accelerating computations with regular grids, as described in the book Ray Tracing from the Ground Up by Kevin Suffern. You employed the global kernel create_world to fill GPU memory with objects. But when the number of objects was greater than 4000 spheres packed in the acceleration class Grid, I received the error:

CUDA error = 700 at main5_add_obj_grid_init.cu:256 'cudaDeviceSynchronize()'
(program exited with code: 99)

Then I decided to use unified memory, cudaMallocManaged((void**)&d_world, sizeof(hitable *));, and employed a host function instead of the kernel function create_world, and immediately received the error message:

number of objects=34
CUDA error = 700 at main5_add_obj_grid_init.cu:270 'cudaDeviceSynchronize()'

I have a Quadro P4000 card with 8 gigabytes of memory. My host machine has 32 gigabytes. I use Ubuntu and CUDA 10.1. Potentially, one could use host functions to access and modify unified memory and then use global kernel functions to process that unified memory. If I am right, my computer has 40 gigabytes of unified memory in total. I was able to ray trace even several million spheres on my modest laptop with 8 gigabytes and no CUDA card at all. So, what was wrong in my case?

Best Regards,
Valery Pavlov,
Spain

Hi Valery,

I am going to have to look more closely at my usage of virtual functions and the CUDA documentation, and I will try to find some internal experts to help me understand. It was an oversight, not my intention to go against a CUDA rule; it looks like I may have stumbled into some undefined behavior. So it may well be the case that my usage does not trigger errors while your more elaborate usage does, as you seem to have found.

Now that I have read some posts like this one, I could see how the second case of creating objects on the host and copying them to the device would lose virtual function information and cause crashes.
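To illustrate the vtable problem (a hypothetical sketch, not code from the post):

// HYPOTHETICAL sketch: why constructing a polymorphic object on the host
// breaks device-side calls. 'new' on the host stores a vtable pointer that
// refers to host-side function addresses:
hitable **d_list;
checkCudaErrors(cudaMallocManaged((void **)&d_list, sizeof(hitable *)));
*d_list = new sphere(vec3(0,0,-1), 0.5);   // host vtable pointer inside object

// A later virtual call from device code goes through that host vtable:
// (*d_list)->hit(r, t_min, t_max, rec);   // device call via host vtable -> error 700

// The working pattern constructs the object inside a kernel instead, so the
// vtable pointer refers to device code:
// create_world<<<1,1>>>(d_list, d_world);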

For debugging your first issue, which creates the objects on the device: since you have made significant changes, I think the best plan would be for you to post to the CUDA Programming forum so that others can try to reproduce and investigate the error.

Hi Roger, I created the following topic:

https://devtalk.nvidia.com/...

in which I provide a small modification of the code you provided for Chapter 5:

Dear all,

I have read a very interesting post, Accelerated Ray Tracing in One Weekend in CUDA, by Roger Allen (https://devblogs.nvidia.com.... In this post, virtual functions were employed to create ray-traced pictures of spheres. I modified the code provided for Chapter 5 (https://github.com/rogerall... a little bit in order to create_world on the host instead of the device. In other words, I replaced __device__ functions with __host__ __device__ functions in the headers ray.h, hitable.h, hitable_list.h, and sphere.h, and finally I added a host function in the main.cu file:

void create_world_host(hitable **d_list, hitable **d_world) {
    *(d_list)   = new sphere(vec3(0,0,-1), 0.5);
    *(d_list+1) = new sphere(vec3(0,-100.5,-1), 100);
    *d_world    = new hitable_list(d_list,2);
}

and used unified memory:

// make our world of hitables, but using unified memory
hitable **d_list;
checkCudaErrors(cudaMallocManaged((void **)&d_list, 2*sizeof(hitable *)));
hitable **d_world;
checkCudaErrors(cudaMallocManaged((void **)&d_world, sizeof(hitable *)));

I found it is impossible to fill the unified memory using my host function create_world_host instead of the similar __global__ void create_world. I have a Quadro P4000 card with the Pascal architecture. Why does using the host function create_world_host instead of __global__ void create_world lead to the error:

Rendering a 1200x600 image in 8x8 blocks.
CUDA error = 700 at main.cu:114 'cudaDeviceSynchronize()'

My version of ray.h is:

#ifndef RAYH
#define RAYH

#include "vec3.h"

class ray
{
public:
    __host__ __device__ ray() {}
    __host__ __device__ ray(const vec3& a, const vec3& b) { A = a; B = b; }
    __host__ __device__ vec3 origin() const { return A; }
    __host__ __device__ vec3 direction() const { return B; }
    __host__ __device__ vec3 point_at_parameter(float t) const { return A + t*B; }

    vec3 A;
    vec3 B;
};

#endif

My version of hitable.h is:

#ifndef HITABLEH
#define HITABLEH

#include "ray.h"

struct hit_record
{
    float t;
    vec3 p;
    vec3 normal;
};

class hitable {
public:
    __host__ __device__ virtual bool hit(const ray& r, float t_min, float t_max, hit_record& rec) const = 0;
};

#endif

My version of hitable_list.h is:

#ifndef HITABLELISTH
#define HITABLELISTH

#include "hitable.h"

class hitable_list: public hitable {
public:
    __host__ __device__ hitable_list() {}
    __host__ __device__ hitable_list(hitable **l, int n) { list = l; list_size = n; }
    __host__ __device__ virtual bool hit(const ray& r, float tmin, float tmax, hit_record& rec) const;

    hitable **list;
    int list_size;
};

__host__ __device__ bool hitable_list::hit(const ray& r, float t_min, float t_max, hit_record& rec) const {
    hit_record temp_rec;
    bool hit_anything = false;
    float closest_so_far = t_max;
    for (int i = 0; i < list_size; i++) {
        if (list[i]->hit(r, t_min, closest_so_far, temp_rec)) {
            hit_anything = true;
            closest_so_far = temp_rec.t;
            rec = temp_rec;
        }
    }
    return hit_anything;
}

#endif

My version of sphere.h is:

#ifndef SPHEREH
#define SPHEREH

#include "hitable.h"

class sphere: public hitable {
public:
    __host__ __device__ sphere() {}
    __host__ __device__ sphere(vec3 cen, float r) : center(cen), radius(r) {};
    __host__ __device__ virtual bool hit(const ray& r, float tmin, float tmax, hit_record& rec) const;

    vec3 center;
    float radius;
};

__host__ __device__ bool sphere::hit(const ray& r, float t_min, float t_max, hit_record& rec) const {
    vec3 oc = r.origin() - center;
    float a = dot(r.direction(), r.direction());
    float b = dot(oc, r.direction());
    float c = dot(oc, oc) - radius*radius;
    float discriminant = b*b - a*c;
    if (discriminant > 0) {
        float temp = (-b - sqrt(discriminant))/a;
        if (temp < t_max && temp > t_min) {
            rec.t = temp;
            rec.p = r.point_at_parameter(rec.t);
            rec.normal = (rec.p - center) / radius;
            return true;
        }
        temp = (-b + sqrt(discriminant)) / a;
        if (temp < t_max && temp > t_min) {
            rec.t = temp;
            rec.p = r.point_at_parameter(rec.t);
            rec.normal = (rec.p - center) / radius;
            return true;
        }
    }
    return false;
}

#endif

and finally my version of main.cu:

#include <iostream>
#include <time.h>
#include <float.h>
#include "vec3.h"
#include "ray.h"
#include "sphere.h"
#include "hitable_list.h"

// limited version of checkCudaErrors from helper_cuda.h in CUDA examples
#define checkCudaErrors(val) check_cuda( (val), #val, __FILE__, __LINE__ )

void check_cuda(cudaError_t result, char const *const func, const char *const file, int const line) {
    if (result) {
        std::cerr << "CUDA error = " << static_cast<unsigned int>(result) << " at " <<
            file << ":" << line << " '" << func << "' \n";
        // Make sure we call CUDA Device Reset before exiting
        cudaDeviceReset();
        exit(99);
    }
}

__device__ vec3 color(const ray& r, hitable **world) {
    hit_record rec;
    if ((*world)->hit(r, 0.0, FLT_MAX, rec)) {
        return 0.5f*vec3(rec.normal.x()+1.0f, rec.normal.y()+1.0f, rec.normal.z()+1.0f);
    }
    else {
        vec3 unit_direction = unit_vector(r.direction());
        float t = 0.5f*(unit_direction.y() + 1.0f);
        return (1.0f-t)*vec3(1.0, 1.0, 1.0) + t*vec3(0.5, 0.7, 1.0);
    }
}

__global__ void render(vec3 *fb, int max_x, int max_y,
                       vec3 lower_left_corner, vec3 horizontal, vec3 vertical, vec3 origin,
                       hitable **world) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    if((i >= max_x) || (j >= max_y)) return;
    int pixel_index = j*max_x + i;
    float u = float(i) / float(max_x);
    float v = float(j) / float(max_y);
    ray r(origin, lower_left_corner + u*horizontal + v*vertical);
    fb[pixel_index] = color(r, world);
}

__global__ void create_world(hitable **d_list, hitable **d_world) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        *(d_list)   = new sphere(vec3(0,0,-1), 0.5);
        *(d_list+1) = new sphere(vec3(0,-100.5,-1), 100);
        *d_world    = new hitable_list(d_list,2);
    }
}

void create_world_host(hitable **d_list, hitable **d_world) {
    *(d_list)   = new sphere(vec3(0,0,-1), 0.5);
    *(d_list+1) = new sphere(vec3(0,-100.5,-1), 100);
    *d_world    = new hitable_list(d_list,2);
}

__global__ void free_world(hitable **d_list, hitable **d_world) {
    delete *(d_list);
    delete *(d_list+1);
    delete *d_world;
}

int main() {
    int nx = 1200;
    int ny = 600;
    int tx = 8;
    int ty = 8;

    std::cerr << "Rendering a " << nx << "x" << ny << " image ";
    std::cerr << "in " << tx << "x" << ty << " blocks.\n";

    int num_pixels = nx*ny;
    size_t fb_size = num_pixels*sizeof(vec3);

    // allocate FB
    vec3 *fb;
    checkCudaErrors(cudaMallocManaged((void **)&fb, fb_size));

    // make our world of hitables, but using unified memory
    hitable **d_list;
    checkCudaErrors(cudaMallocManaged((void **)&d_list, 2*sizeof(hitable *)));
    hitable **d_world;
    checkCudaErrors(cudaMallocManaged((void **)&d_world, sizeof(hitable *)));

    //***Creation of world on device using unified memory works***//
    create_world<<<1,1>>>(d_list,d_world);
    //***Creation of world on host using unified memory fails***//
    //create_world_host(d_list,d_world);
    checkCudaErrors(cudaGetLastError());
    checkCudaErrors(cudaDeviceSynchronize());

    clock_t start, stop;
    start = clock();

    // Render our buffer
    dim3 blocks(nx/tx+1,ny/ty+1);
    dim3 threads(tx,ty);
    render<<<blocks, threads>>>(fb, nx, ny,
                                vec3(-2.0, -1.0, -1.0),
                                vec3(4.0, 0.0, 0.0),
                                vec3(0.0, 2.0, 0.0),
                                vec3(0.0, 0.0, 0.0),
                                d_world);
    checkCudaErrors(cudaGetLastError());
    checkCudaErrors(cudaDeviceSynchronize());
    stop = clock();
    double timer_seconds = ((double)(stop - start)) / CLOCKS_PER_SEC;
    std::cerr << "took " << timer_seconds << " seconds.\n";

    // Output FB as Image
    std::cout << "P3\n" << nx << " " << ny << "\n255\n";
    for (int j = ny-1; j >= 0; j--) {
        for (int i = 0; i < nx; i++) {
            size_t pixel_index = j*nx + i;
            int ir = int(255.99*fb[pixel_index].r());
            int ig = int(255.99*fb[pixel_index].g());
            int ib = int(255.99*fb[pixel_index].b());
            std::cout << ir << " " << ig << " " << ib << "\n";
        }
    }

    // clean up
    checkCudaErrors(cudaDeviceSynchronize());
    free_world<<<1,1>>>(d_list,d_world);
    checkCudaErrors(cudaGetLastError());
    checkCudaErrors(cudaFree(d_list));
    checkCudaErrors(cudaFree(d_world));
    checkCudaErrors(cudaFree(fb));

    // useful for cuda-memcheck --leak-check full
    cudaDeviceReset();
}

I'm happy to see this is already getting replies. I suggest further follow-ups go there for clarity.

To clarify: I discussed this a bit with some internal experts and found for certain that my code is not relying on any undefined behavior (UB).

I was worried about this statement:

It is not allowed to pass as an argument to a __global__ function an object of a class with virtual functions.

from https://docs.nvidia.com/cud...

Passing an object by value could trigger that UB, but since I am passing a pointer to an object that is both created and used on the device, the code works as expected. Internally, we are also discussing how best to update the programming guide for clarity.
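To make the distinction concrete (a sketch; bad_kernel is a hypothetical name used only for illustration):

// Disallowed by the programming guide: passing a polymorphic object BY VALUE.
// The copy sent to the device carries a host vtable pointer:
// __global__ void bad_kernel(hitable_list world);   // UB when launched

// What this code actually does: pass a POINTER to an object that is both
// created and used on the device, so its vtable is a device vtable:
// create_world<<<1,1>>>(d_list, d_world);        // constructed on the device
// render<<<blocks, threads>>>(fb, ..., d_world); // only dereferenced on the device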

Further discussion should probably go on the devtalk forum post that Valery created in his response to the previous post.