Coalesced VBO Access

Hi,

In order to manipulate vertices in an OpenGL VBO, every thread in my kernel reads 3 floating-point values from global device memory (the x, y and z coordinates of a vertex).
Unfortunately these values are stored in N3F_V3F format (3 floats for the normal, followed by 3 floats for the vertex position). As you can imagine, this results in pretty poor memory performance, since the access is completely uncoalesced.
Does anyone have an idea how to generate a coalesced access from a Vertex Buffer Object?
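To illustrate, my access pattern is roughly this (a minimal sketch; the names are placeholders):

__global__ void ReadVertices(float *pfVBO /* mapped N3F_V3F buffer */)
{
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     float x = pfVBO[6 * i + 3];   // three separate 4-byte reads per thread,
     float y = pfVBO[6 * i + 4];   // 24 bytes apart between neighbouring threads:
     float z = pfVBO[6 * i + 5];   // completely uncoalesced on my hardware
     // ... manipulate x, y, z and write them back ...
}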

You might want to try stuffing it in a float4. It’s done quite a bit in the SDK examples.

That's possible, but not practical I guess, because every time I want to render the manipulated VBO I would need to convert the float4 array back to N3F_V3F and copy it into the VBO, which is a performance nightmare.

In which SDK example is this done?

It also seems that cudaGraphicsUnmapResources is extremely slow (it needs many seconds for a few thousand VBOs). Furthermore, cudaGraphicsMapResources always returns cudaErrorUnknown when I try to map more than 8179 VBOs :(

Is anything known about that issue?
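For reference, I'm calling it roughly like this (a sketch; the resources are registered elsewhere with cudaGraphicsGLRegisterBuffer, and the names here are placeholders):

#include <cuda_gl_interop.h>

// apvResources: array of iNumVBOs cudaGraphicsResource_t handles, each previously
// registered with cudaGraphicsGLRegisterBuffer
cudaError_t eErr = cudaGraphicsMapResources(iNumVBOs, apvResources, 0);
// eErr is where cudaErrorUnknown shows up for me when iNumVBOs > 8179

float  *pfMem  = NULL;
size_t  ulSize = 0;
cudaGraphicsResourceGetMappedPointer((void**)&pfMem, &ulSize, apvResources[0]);

// ... launch kernels on pfMem ...

cudaGraphicsUnmapResources(iNumVBOs, apvResources, 0);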

As our friend said, you can try a float4 array; it works, and you don't need to convert.
However, independent of that, you need to change the design of your VBO to the following, so all threads access their data right next to each other:
V1 V2 V3 V4 … Vm N1 N2 N3 N4 … Nm

You can leave the extra “w” coordinate at 1.0f if you want to render with 4-component vertices.
And when rendering the normals you will want to set something like this:
example:
glNormalPointer(GL_FLOAT, 4 * sizeof(float), (const GLvoid*)(num_vertex * sizeof(float4)));

With that stride you effectively read one normal every 4 floats.
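For context, the full client-state setup could look roughly like this (just a sketch; vbo, num_vertex and GL_TRIANGLES are placeholders, and filling the buffer with the packed layout is assumed to have happened already):

glBindBuffer(GL_ARRAY_BUFFER, vbo);
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_NORMAL_ARRAY);
// vertices: num_vertex packed float4s at the start of the buffer
glVertexPointer(4, GL_FLOAT, 4 * sizeof(float), (const GLvoid*)0);
// normals: num_vertex packed float4s right behind the vertices; the stride skips the unused 4th float
glNormalPointer(GL_FLOAT, 4 * sizeof(float), (const GLvoid*)(num_vertex * sizeof(float4)));
glDrawArrays(GL_TRIANGLES, 0, num_vertex);
glDisableClientState(GL_NORMAL_ARRAY);
glDisableClientState(GL_VERTEX_ARRAY);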

i hope it helped,
best regards!
Cristobal

I have not directly worked with vertex buffers, but if the data is stored as 6 consecutive floats per vertex, the usual approach is to stage through shared memory. Just cast the VBO pointer to (float*) and:

__shared__ float stage[6*BS];

// each block stages the 6*BS floats belonging to its BS vertices
for (int i = 0; i < 6; i++)
  stage[i*BS + threadIdx.x] = ((float*)VBO)[(6*blockIdx.x + i)*BS + threadIdx.x];

__syncthreads();   // the values read below were written by other threads

float3 vertex;
vertex.x = stage[6*threadIdx.x + 3];
vertex.y = stage[6*threadIdx.x + 4];
vertex.z = stage[6*threadIdx.x + 5];

Note that you will have some warp serialization due to the stride 6 shared memory access, but this should be a much smaller hit than uncoalesced access.
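The write path works the same way in reverse; a sketch under the same assumptions (reusing stage, vertex, BS and the (float*) cast of VBO from above):

// put the modified vertex back into the staging buffer (the normals stay untouched)
stage[6*threadIdx.x + 3] = vertex.x;
stage[6*threadIdx.x + 4] = vertex.y;
stage[6*threadIdx.x + 5] = vertex.z;

__syncthreads();   // make sure all threads have updated their slots

// coalesced store of the whole 6*BS-float block back to global memory
for (int i = 0; i < 6; i++)
  ((float*)VBO)[(6*blockIdx.x + i)*BS + threadIdx.x] = stage[i*BS + threadIdx.x];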

Unfortunately that's not possible :( OpenGL only has a fixed set of predefined interleaved array formats for drawing vertex buffers. None of them separates the vertices from the normals the way you suggest, and N3F_V3F is actually the only one that fits our software.

That's really bad, since it results in a misaligned starting address and 12-byte reads (one float3 per thread, the 3V part) spaced 24 bytes apart.

(OGL Array structures: GL_V2F, GL_V3F, GL_C4UB_V2F, GL_C4UB_V3F, GL_C3F_V3F, GL_N3F_V3F, GL_C4F_N3F_V3F, GL_T2F_V3F, GL_T4F_V4F, GL_T2F_C4UB_V3F, GL_T2F_C3F_V3F, GL_T2F_N3F_V3F, GL_T2F_C4F_N3F_V3F, GL_T4F_C4F_N3F_V4F)

Since one thread reads a full vertex (3 float values), I do it like this:

__global__ void TransSingleVBO(FLOAT *pfMem, ULONG ulAnz, TRAFO_MAT sTM)
{
     ULONG lulPosX = blockIdx.x*blockDim.x + threadIdx.x;

     if (lulPosX < ulAnz)
     {
          LOCAL float4 lf4Tmp = make_float4(pfMem[lulPosX*6+3], pfMem[lulPosX*6+4], pfMem[lulPosX*6+5], 1.f);

          pfMem[lulPosX*6+3] = lf4Tmp.x * sTM.x.x + lf4Tmp.y * sTM.x.y + lf4Tmp.z * sTM.x.z + lf4Tmp.w * sTM.x.w;
          pfMem[lulPosX*6+4] = lf4Tmp.x * sTM.y.x + lf4Tmp.y * sTM.y.y + lf4Tmp.z * sTM.y.z + lf4Tmp.w * sTM.y.w;
          pfMem[lulPosX*6+5] = lf4Tmp.x * sTM.z.x + lf4Tmp.y * sTM.z.y + lf4Tmp.z * sTM.z.z + lf4Tmp.w * sTM.z.w;
     }
}

This could be optimized, I think, by reading a float3 and casting the pointer (however :S) … maybe that's the point to step in.
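A rough sketch of what I mean (reusing pfMem and lulPosX from above); note that a float3 load still ends up as separate 32-bit loads on compute capability 1.x, so I don't expect it to fix the coalescing by itself:

// view the N3F_V3F buffer as pairs of float3: (normal, vertex)
float3 *pf3Mem  = (float3*)pfMem;
float3  lf3Vert = pf3Mem[lulPosX * 2 + 1];   // the vertex is the second float3 of each pair
float4  lf4Tmp  = make_float4(lf3Vert.x, lf3Vert.y, lf3Vert.z, 1.f);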

Maybe we misunderstood each other; I was referring to packing the vertices this way (it works in OpenGL):

[x1, y1, z1, w1][x2, y2, z2, w2]…[xn, yn, zn, wn]

That leads to 16-byte alignment and is coalesced, and each thread accesses one 16-byte word.

Then for the normals you can do the same, 4 coordinates too, but ignore the 4th one when rendering in OpenGL by using the stride parameter.

I haven't tested the x1x2x3…xn y1y2y3…yn z1z2z3…zn packing format, but I wouldn't discard it before experimenting myself with different stride patterns.

best regards

Cristobal

Thanks, now I finally know what you mean. I was somewhat fixated on the glInterleavedArrays function, but glVertexPointer and glNormalPointer work as well.

Maybe the second choice is the best one: when every thread in a half-warp reads the x-coordinate from memory, that is a single 16 * 4 byte = 64-byte coalesced access, while the x1y1z1w1…xnynznwn layout results in a 16 * 16 byte = 256-byte access, i.e. two coalesced accesses (128 bytes is the maximum, right?).

Mhh, I just tried version 1 ([x1, y1, z1, w1][x2, y2, z2, w2]…[xn, yn, zn, wn]) … the Visual Profiler claims that there is no coalesced access at all on my 9800 GTX (CC 1.1). Why ;(

Let me check on the 9800 GTX+ I have at a friend's house; I'll test that GPU and let you know. But it should be coalesced, as far as I know float4 vectors packed together give memory coalescence, someone correct me if I'm wrong.

We should test the second option too.

Well, the second option is hard to render: as you know, glVertexPointer only takes size, type, stride and a pointer as parameters … which means the x, y and z values of each vertex have to follow each other.

I already have an idea why it could be misaligned (and therefore uncoalesced) … the half-warp which processes the last vertices of one VBO also processes the first vertices of the next VBO (I split all VBOs I want to process into float4 blocks and passed the block addresses to the kernel to avoid multiple kernel launches) … that can result in a misaligned starting address, but I can avoid it by padding with empty blocks.
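Something like this is what I have in mind for the padding (just a sketch; 16 float4 entries correspond to one half-warp, the names are mine):

// round each VBO's float4 count up to a multiple of 16 (one half-warp),
// so every VBO starts at a 16 * 16 = 256-byte aligned position within the batch
ULONG ulPaddedCount = ((ulFloat4Count + 15) / 16) * 16;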

I will try that on Monday.

But even if that fixes it, it does not explain why I have ZERO coalesced memory accesses … at least the first VBO should be accessed coalesced with the current code :(

Actually this is only optimistic tuning, because even though the kernel runtime is about 20 milliseconds in hard cases, I end up with a framerate of only 2 or 3 fps: mapping, unmapping and the calculation of my transformation matrix on the host take too much time :D

After doing several tests with the 9800 GTX+, this is what I found:

  • The 32-bit (x1x2x3…xn y1y2y3…yn z1z2z3…zn) layout renders wrong geometry (using the fixed OpenGL pipeline), so we need to stick to the packed float4 mode.

  • After testing my VBO program with packed float4 vectors, I realized that I am also getting uncoalesced accesses … this is strange.

  • The example from the SDK (simpleGL), which uses packed float4, shows uncoalesced accesses too!!! With any GPU of compute capability 1.1 (I cannot say anything about the other architectures).

  • However, I made a simple example (inspired by vecAdd from the SDK) that does not use OpenGL, where I just sum two arrays A and B whose elements are float4 vectors, and the profiler shows that this is coalesced (see the sketch after this list).
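The test kernel looks roughly like this (a sketch, the names are mine):

__global__ void addFloat4(const float4 *A, const float4 *B, float4 *C, int n)
{
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n)
     {
          float4 a = A[i];   // one 16-byte load per thread, reported as coalesced
          float4 b = B[i];
          C[i] = make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
     }
}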

I really do not understand what is going on. Is OpenGL responsible for the uncoalesced problem?
Anyone who can help is welcome.

Well, so it's time for some specialists to solve this.
The starting addresses of the OpenGL pointers are absolutely well aligned; there is nothing that speaks against coalesced access.
Maybe it's some kind of mapping in the cudaMalloc function that separates a VBO from a common buffer.

At least the nbody example from the SDK is coalesced, so the solution to the problem must be in there.

ill let you know if i find the exact difference,

best regards

Cristobal

snowball, I found the problems. They are of 3 different types.

  • First, the simpleGL example had a bad block size; you can change it to (16, 16, 1) and you'll see fully coalesced writes.

  • Second, sometimes the compiler optimizer does bad things for memory coalescence. So if you use a float4 per thread, make sure you use all of its coordinates in the code; otherwise the compiler detects that one coordinate is unused and breaks the coalesced access into several independent ones. Another way of solving this is to declare your local variables as "volatile"; this way you force the compiler to actually keep them in registers and not optimize based on how much of the float4 struct you end up using. For example like this:

__global__ void dummy_kernel( float4 *global_data ){
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  volatile float4 reg_vec = global_data[i];
  //...continue working with the variable
  //for writing, this works coalesced:
  global_data[i] = make_float4( reg_vec.x, reg_vec.y, reg_vec.z, reg_vec.w );
}

  • Third, and this must be the worst one: the last important tip, which I found after reading the coalescence documentation in the NVIDIA programming guide extensively.

The thing is that for compute capability 1.0 and 1.1 there is a HUGE limitation when reading global memory at an offset into the data. I mean this: when we have all the vertices packed at the beginning, followed by the normals and finally by the colors, all behind one big pointer, then the offset needed to reach the normals and colors will most of the time cause uncoalesced memory transactions. This, I think, is because on 1.0/1.1 the starting address of each half-warp's access has to be aligned to the transaction segment (e.g. 64 bytes for 32-bit words), so an arbitrary offset into the buffer usually breaks coalescing; correct me here if I'm wrong.

The solution is to keep the data behind different pointers. So you will want one pointer for the vertices, one for the normals, and one for the colors. I have to make this change in my code too.
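In kernel terms the idea is simply this (a sketch with made-up names; each attribute lives behind its own pointer, so every half-warp starts at an aligned base address instead of a big offset):

__global__ void transform_split(float4 *vertices, float4 *normals, float4 *colors, int n)
{
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n)
     {
          float4 v   = vertices[i];   // coalesced: no offset into one shared buffer
          float4 nrm = normals[i];
          float4 c   = colors[i];
          // ... transform v / nrm here ...
          vertices[i] = v;
          normals[i]  = nrm;
          colors[i]   = c;
     }
}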

Good luck dude, I hope this helped.

best regards

Cristobal

That's kind of amazing, how did you acquire all that knowledge?

Big thanks! … some of these facts should be written in bold in the Programming Guide or the Best Practices Guide.