Do any gurus here have advice for how to write this kernel? (math related)

http://en.wikipedia.org/wiki/Line–plane_intersection#Parametric_form

This is the equation I’m trying to solve. It’s the one at the end of that section, solving for { t, u, v }.

How do I write efficient CUDA to solve that? In the past, I’ve written “naive” implementations where I’ll just brute force write out a 3x3 matrix inverse and then hand-write the multiplications, etc. Is that a good way of coding? It didn’t feel right when I was writing it. Like, if I was to write a publication, I’d be a little scared.

Oh, I can’t believe I almost forgot. How I’m loading this data is important too!

Okay, so this should be a rough breakdown of the kernel:

struct tetrahedron { int v[4]; /* possibly non-aggregate stuff */};

__global__
void line_plane_intersection(/* ... */)
{
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if (tid >= n || tid == 0) // skip tid 0 so that tid - 1 below is valid
        return;

    if (hash_128bit[tid - 1] == hash_128bit[tid])
    {
        const int ta_id = which_tetrahedron[tid];
        const int4 t = ((int4 *) mesh)[ta_id];

        const int po_id = point_opposite[tid];

        float3 p0, p1, p2;
        float3 la, lb;

        switch (po_id)
        {
            /* cases 0 through 3 assign values to p0, p1, p2, la, lb */
            case 0 :
                p0 = ((float3 *) points)[t.w];
                p1 = ((float3 *) points)[t.z];
                p2 = ((float3 *) points)[t.y];
                la = ((float3 *) points)[t.x];
                // lb's assignment is more complex than I thought...
                break;

            /* ... */
        }

        /* guess math should go here */
    }
}
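For what it's worth, the `/* guess math should go here */` step — solving the parametric system for { t, u, v } — can be written with Cramer's rule as three triple products instead of a hand-written 3x3 inverse. Here's a host-side C++ sketch of that arithmetic (the `vec3` helpers are stand-ins for `float3`, not the actual mesh types):

```cpp
#include <cassert>
#include <cmath>

struct vec3 { float x, y, z; };

static vec3 sub(vec3 a, vec3 b)   { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static vec3 cross(vec3 a, vec3 b) { return { a.y * b.z - a.z * b.y,
                                             a.z * b.x - a.x * b.z,
                                             a.x * b.y - a.y * b.x }; }
static float dot(vec3 a, vec3 b)  { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Solve  la + (lb - la) t  =  p0 + (p1 - p0) u + (p2 - p0) v  for { t, u, v }.
// Returns false when the line is (nearly) parallel to the plane.
static bool line_plane_tuv(vec3 la, vec3 lb, vec3 p0, vec3 p1, vec3 p2,
                           float &t, float &u, float &v)
{
    const vec3 lab = sub(la, lb);   // first column of the 3x3 system
    const vec3 p01 = sub(p1, p0);   // second column
    const vec3 p02 = sub(p2, p0);   // third column
    const vec3 rhs = sub(la, p0);   // right-hand side

    const vec3 n = cross(p01, p02); // face normal, reused for det and t
    const float det = dot(n, lab);
    if (std::fabs(det) < 1e-12f)
        return false;

    // Cramer's rule, each determinant written as a scalar triple product.
    t = dot(n, rhs) / det;
    u = dot(cross(p02, lab), rhs) / det;
    v = dot(cross(lab, p01), rhs) / det;
    return true;
}
```

The same arithmetic should drop into the kernel unchanged with `float3`: three crosses, four dots, and one divide, with no explicit matrix inverse to store.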

Generally speaking, using a solver is numerically superior to multiplying by the inverse matrix, but depending on the application, the latter may be OK.

I would suggest taking a look at the batched operations in cuBLAS. As far as I recall, there are at least batched LU decomposition, batched matrix inversion, and batched GEMM. The purpose of batched operations is the efficient handling of large numbers of small matrices.

the topic reads “do any gurus…” naturally, that rules me out

still, i am curious; what kind of dimensions do you normally use with __global__ void line_plane_intersection() - i.e. how many intersections - or simply ‘thingies’ - do you ‘interrogate’ at a time?

and why is this:
“lb’s assignment is more complex than I thought…”

because int4 has 4 ints and you require 5…?

It’s because lb comes from a different tetrahedron. I’m indexing each face of a tetrahedron using the index of the point opposite. Internally, I store things as v[0], etc. so having a point_opposite[tid] value of 0 means the opposite face is points 1, 2, 3.

As of now, I’m planning on using a small tuple { hash_128bit, which_tetrahedron, point_opposite }.

Accessing the point for lb I think would be much more inefficient. I only need the value of the tetrahedron at the point_opposite value (point_opposite[i] is always between 0 and 3).

I guess I could maybe write

lb = ((float3 *) points)[((int *) mesh)[4 * which_tetrahedron[tid - 1] + point_opposite[tid - 1]]];

Hmm… The number of faces I need to test is variable. I need code that can scale up, though, as the number of line-plane intersections will increase exponentially in this algorithm.

I was hoping for a 1 thread per face and line combo. Oh right, it’s because this is for the mesh code I posted on here. We have a tetrahedral mesh and I’m trying right now to calculate for every face in the mesh, the result of the intersection formed by the face and the two points opposite it.

The math seems simple but now I’m not so sure if one thread per line-plane test is worthwhile.

“{ hash_128bit, which_tetrahedron, point_opposite }”

why the hash?

“which_tetrahedron”

the tetrahedrons are sequential, such that memory reads are more ordered?

const int ta_id = which_tetrahedron[tid];
const int4 t = ((int4* ) mesh)[ta_id];

In order to both detect errors in the mesh and simplify linking the faces, I figured it’d be best to hash each face and then sort the arrays so this way, identical hashes are next to each other in the array.

Using the tuple, I can determine which two faces match and if they do, I have data for which tetrahedron it is in the mesh (an integer representing location in the mesh buffer) and which face of the tetrahedron it is (an integer between 0 and 3).

If two hashes in a row are the same in the hash buffer, we have a matching face. Anything higher means the mesh code broke down somewhere, because more than two tetrahedra sharing the same face is mathematically impossible.

{ 0123, 0, 1 }
{ 0123, 1, 3 }

Consider the above segment of the array of tuples. We see the hashes match. From this, we can say that face 1 of tetrahedron 0 links to face 3 of tetrahedron 1.
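That adjacent-hash pairing scan can be prototyped on the host in a few lines. A sketch, with a hypothetical `face_tuple` type (and a 64-bit hash standing in for the 128-bit one), assuming the tuples are already sorted by hash:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct face_tuple {
    uint64_t hash;        // stand-in for the real 128-bit face hash
    int      tetrahedron; // index into the mesh buffer
    int      face;        // index of the point opposite the face, 0..3
};

struct face_link { face_tuple a, b; };

// Scan a hash-sorted tuple array: two equal hashes in a row form a link;
// three or more in a row mean the mesh broke down somewhere.
static bool pair_faces(const std::vector<face_tuple> &t,
                       std::vector<face_link> &links)
{
    for (std::size_t i = 0; i < t.size(); ) {
        std::size_t run = 1;
        while (i + run < t.size() && t[i + run].hash == t[i].hash)
            ++run;
        if (run > 2)
            return false;              // > 2 tetrahedra share a face: error
        if (run == 2)
            links.push_back({ t[i], t[i + 1] });
        i += run;                      // run == 1 is an unmatched boundary face
    }
    return true;
}
```

On the two-tuple example above this yields exactly one link: face 1 of tetrahedron 0 to face 3 of tetrahedron 1.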

So far, all my mesh generation has produced “quality” tetrahedra in the sense that none are flat and I have no face errors so it’s a “real” mesh. What I’m trying to solve now is something built on top of the mesh and has the potential to be a huge performance bottleneck.

the usage of hashes makes sense then; but does the use of hashes on the device?
is this really something you would have the device concern itself with, or something the host can help the device with?

i may very well be wrong, but it seems as if the tetrahedrons pushed (to the device) for (line-plane) processing are unordered/ non-sequential, and this could negatively impact global memory access
i can understand if the tetrahedrons populate a mesh, thus disturbing any order; but, at the same time, this should be something the host can help the device with, particularly when done in batches and when using streams, such that the device is hardly held up by the host
when you have the necessary data, tetrahedrons are processed individually in terms of line-plane intersections, outside of any mesh context, i gather
preparing the tetrahedrons for processing could help performance then

jimmy, you got me really thinking about access patterns now.

Plus, that topic I had about coalescing finally linked me to some really good material about how global memory works.

Fundamentally, I’m only testing a subset of the mesh. I won’t go into too much detail about it (I could talk for awhile about this) but I’m really only testing some “bad” tetrahedra in the mesh.

So I was thinking that I could make a private copy of the points and then I’d have darn near perfect access.

Each tetrahedron is 4 vertices. Each vertex is 3 floats, so 12 bytes. I also want the point opposite the face, so that’s 5 points total = 60 bytes, which means I can load two complete 5-tuples of points per 128-byte transaction, putting me at 120/128 for global load efficiency.

The only problem is that this can potentially eat up a lot of memory but the subset of “bad” tetrahedra in the mesh will hopefully diminish as I repair them.

Would I be able to somehow used shared memory for this also? This is one of the first times I’ve ever really thought about memory accesses in CUDA.
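The 60-byte arithmetic above can be checked with a packed layout along these lines (the names are made up; the point is the `sizeof` math):

```cpp
#include <cassert>
#include <cstddef>

struct vec3 { float x, y, z; };              // 12 bytes, same layout as float3

// One compacted work item: the 4 vertices of the tetrahedron
// plus the point opposite the face under test.
struct packed_tet {
    vec3 v[4];
    vec3 opposite;
};

static_assert(sizeof(vec3) == 12, "vec3 should be 3 floats");
static_assert(sizeof(packed_tet) == 60, "5 points * 12 bytes");

// Two whole tuples fit in one 128-byte transaction: 120 of 128 bytes used.
static_assert(2 * sizeof(packed_tet) <= 128, "two tuples per transaction");
```

As for shared memory: with this compacted layout each thread touches its own tuple exactly once, so staging it through shared memory probably buys nothing; shared memory pays off when multiple threads in a block reuse the same points.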

i would:

a) have the host extract the subset to help the device with global memory access

now, the device should be able to access data like this:
const int4 t = ((int4 *) mesh)[thread_id];

instead of:

const int4 t = ((int4 *) mesh)[ta_id];

because:

const int ta_id = which_tetrahedron[tid];

is ‘non-ideal’

b) have the host tend to hashes, as part of extraction, such that the device can skip this:

if (hash_128bit[tid - 1] == hash_128bit[tid])

c) have the host arrange and forward the necessary data to the device like this:

float* opposite_point;
float4* the_vertices;

i do not see this wasting anything

both arrays are accessed by the device, like this: array[threadId]

d) process the subset of tetras in batches, to limit the device waiting for the host:

host:

given a batch size and the subset size:
remaining_count = subset_size;
while remaining_count > 0
{
extract and compact a batch - sequentially load the array to be passed to the device; offset as necessary;
issue a kernel in stream s;
increment the stream; roll over/ reset the stream if stream == max_stream;
remaining_count -= batch_size;
}

now, the device can start processing the moment a batch is ready, rather than waiting for the host to pre-process all tetras from the subset first
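the shape of that host loop, with the cuda calls reduced to comments, would be something like this (batch/stream counts are placeholders; only the bookkeeping is real):

```cpp
#include <cassert>
#include <vector>

struct batch { int offset, count, stream; };

// Chop a subset of n_subset tetrahedra into batches, assigning streams
// round-robin. Each batch would be compacted on the host and then launched
// in its stream (the async copy + kernel launch go where the comments are).
static std::vector<batch> plan_batches(int n_subset, int batch_size, int n_streams)
{
    std::vector<batch> plan;
    int remaining = n_subset, offset = 0, stream = 0;
    while (remaining > 0) {
        const int count = remaining < batch_size ? remaining : batch_size;
        // host: extract and compact tetrahedra [offset, offset + count)
        // device: async copy + kernel launch in stream `stream`
        plan.push_back({ offset, count, stream });
        offset    += count;
        remaining -= count;
        if (++stream == n_streams)     // roll over / reset the stream
            stream = 0;
    }
    return plan;
}
```

e.g. a subset of 10 with batch size 4 over 2 streams gives three launches: 4 in stream 0, 4 in stream 1, then the 2-element tail back in stream 0.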

Jimmy, you’re a genius!

You’ve opened me up to a brand new way of thinking that I had either previously ignored or just not been exposed to.

I think I can use this to solve so many problems. The CPU is busy arranging and preparing data to exploit the GPU, and batching keeps both pieces of hardware occupied. Holy poop. This… This changes the game!