Getting parts of a vertex buffer

Hi

I have a DirectX 10 application that does the rendering for my program. I want to do collision detection on the triangles, so I need the vertex and index buffers of the scene. I could memcpy my host-side data to the device, but that is slow; it should be much faster to read the data directly from the vertex and index buffers.

The problem is that my vertex buffer uses this struct:

struct Vertex
{
  float3 position;
  float3 color;
  float3 normal;
};

When I map the resource in CUDA to get access to the vertex buffer, I want to read only the position part of the structure. Is that possible? The color and normal are not needed; they would just cause unnecessary global memory accesses.

I have looked at the CUDA C Programming Guide, but its example is not what I want to achieve.

I also wonder: is the vertex data received in transformed state, or will one always get the raw data? If the latter, is there a fast way to get transformed vertices into CUDA?

THX


AFAIK (student, no expert) the vertex buffer is stored in sequential memory with position, color and normal interleaved in the same order as defined in the struct. You want to use a stride when computing the offset into the memory so you can skip the color and normal information. You could also copy the relevant (position) values into shared memory for your blocks to use.

The vertices would be untransformed. Instead of applying the world transform to each vertex again, you could transform the other object into model space using the inverse world transform. Which solution is faster depends on what type of objects you are colliding.


Yeah, thanks for replying. For now I just did a host copy of all triangles, pre-multiplied by the world matrix, to the device. Using the index and vertex buffers mapped as resources would be quick, but I believe it hurts coalescing, because the vertex buffer also contains color and normal data. When reading position data there will be "gaps" in device memory (the color and normal data) that get transferred too, because the hardware cannot shrink the size of its memory transactions. So on lower compute capabilities this produces many transactions per warp, because of the added offset between subsequent position elements of the vertex buffer. On Fermi this might not be a very big deal because of the L1 and L2 caches, but unneeded data is still transferred, which I think is bad. I also think the index buffer's indices need not be close to each other: the indices for triangle i+1 might refer to vertices far from those of triangle i. That pretty much throws away coalescing on lower compute capabilities, I think (I might be wrong).

By manually transferring the triangle data from host to device, one can arrange the data to be more coalescing-friendly, I think, for example so that subsequent threads pick subsequent triangles to process. Of course the data is not mapped then, so the transfer itself is slower. I think a speedup can be achieved by using page-locked host memory plus asynchronous memcpy. I'm not sure whether that burdens the host side too much, since I need to transfer from 100K triangles (4 × 3 × 100K bytes of float data) up to millions of triangles. Maybe not, since that is only around 1 MB of page-locked host memory.

These are my thoughts; I might be wrong, but these are my ideas from reading the CUDA programming guide.

THX
