How to do this (and be fast)

I have a kernel in which I have to do some copying based on importance. I will calculate an array as big as my data that contains, for each element, the index of the old data that has to become the new data. Something along the lines of:

float x[1024];
float y[1024];
float z[1024];
float vx[1024];
float vy[1024];
float vz[1024];
float importance[1024];
int index[1024];

calculate_indices(index, importance);

for (int k = 0; k < 1024; k++) {
    x[k] = x[index[k]];
    y[k] = y[index[k]];
    z[k] = z[index[k]];
    vx[k] = vx[index[k]];
    vy[k] = vy[index[k]];
    vz[k] = vz[index[k]];
}

It will happen quite often that there are only a few distinct values in index, sorted from low to high, e.g. index = {1, 1, 1, 1, 123, 123, 123, 123, 123, 123, 123, 123, 123, …}

So I will have trouble with uncoalesced access. Is the only way to do this fast to bind x, y, z, vx, vy, vz to textures and sample the textures? That will likely give me a lot of benefit, since many threads in a block will want the same value from the texture. As far as I understood, accessing the same index of a global array from multiple threads in a warp leads to serialization. Is that correct?

Hi Denis -

You realize that you’ll be modifying the data you are reading, right? Are you guaranteed never to read an element after it has been written? From your example it looks like you are not. This means that your result could be incorrect if you read from texture (the texture cache is not kept coherent with global memory writes made during the same kernel launch).

If you could guarantee it, then texture would be a good solution.

Brian

Yes, I do realize. In reality I will either:

  • make sure that if e.g. 312 appears as a value in my index array, the value at position 312 is also 312, so I never overwrite input data that is still needed (aka the elegant, hard solution), or
  • copy my x, y, z, vx, vy, vz arrays and bind the textures to the copies (but this doubles my global memory usage; I will have to see whether that is a problem in practice. This I call my dumb solution :D). A rough sketch of this is below.
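
Roughly, the dumb solution would be something like this (an untested sketch just to show the idea; d_x stands for the live device array for x, and the same would be done for y, z, vx, vy, vz):

// texture reference, declared at file scope
texture<float, 1, cudaReadModeElementType> tex_x;

// keep a read-only copy of the data and bind the texture to the copy
float *d_x_copy;
cudaMalloc((void**)&d_x_copy, 1024 * sizeof(float));
cudaMemcpy(d_x_copy, d_x, 1024 * sizeof(float), cudaMemcpyDeviceToDevice);
cudaBindTexture(0, tex_x, d_x_copy, 1024 * sizeof(float));

// ... launch the gather kernel, which reads through tex_x and writes into d_x ...

cudaUnbindTexture(tex_x);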

Use simple tex1Dfetch fetches for the read. The writes should be trivial to coalesce. The warp serializations will be a non-issue; you are going to be memory bound with this operation.
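
For example, the gather could look roughly like this (untested sketch; the kernel and texture names are made up, only x and y are shown, and the textures are assumed to be bound to read-only copies of the arrays so the reads don't alias the writes):

texture<float, 1, cudaReadModeElementType> tex_x;
texture<float, 1, cudaReadModeElementType> tex_y;
// ... same for z, vx, vy, vz

__global__ void gather(float *x, float *y, const int *index, int n)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < n) {
        int src = index[k];              // coalesced read of the index array
        x[k] = tex1Dfetch(tex_x, src);   // texture fetch: cached, handles many threads reading the same element
        y[k] = tex1Dfetch(tex_y, src);
        // ... the writes x[k], y[k], ... are coalesced
    }
}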

I realize that this might change your data structures elsewhere in your code, but reading float4 textures (one for position, one for velocity) is faster than doing the same accesses as individual float texture reads for x, y, z.
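
For comparison, a float4 variant of the same gather might look like this (again just a sketch with made-up names; it assumes positions and velocities are each packed into a float4 array, with the fourth component unused or carrying some other per-particle value):

texture<float4, 1, cudaReadModeElementType> tex_pos;   // bound to a copy of the packed position array
texture<float4, 1, cudaReadModeElementType> tex_vel;   // bound to a copy of the packed velocity array

__global__ void gather4(float4 *pos, float4 *vel, const int *index, int n)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < n) {
        int src = index[k];
        pos[k] = tex1Dfetch(tex_pos, src);  // one fetch instead of three separate float fetches
        vel[k] = tex1Dfetch(tex_vel, src);
    }
}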

Also, I’m curious: what is this for? I have this exact same operation happening in my code, though importance only has unique values in it. I use it to reshuffle the order of particles in memory to improve memory throughput in other algorithms.

I have indeed been thinking of switching to float4’s. The only trouble is that elsewhere in my program (another kernel) I also need to read the values that are written here. So I will have to profile which is faster:

  • 1x tex1Dfetch of float4 + 1x coalesced read (only) of float4
  • 4x tex1Dfetch of float + 4x coalesced read (only) of float

You probably have a good guess (given your testing in this field) whether float2’s might even be an option, because as far as I remember float2’s were not too bad for coalesced accesses, while float4’s were ‘slow’.

I am using it in a filter.

So the output of this kernel will be used as input for this same kernel (making float4 interesting), but before that as input to another kernel that reads the values coalesced (making float4 not so nice to use).

My solution is just to always read the float4 texture, even when I could use a coalesced read. Unfortunately this is not feasible when you need one kernel to work with many float4 arrays.

However, I do think this one is due for some testing. I made my choice between 4 tex1Dfetch float reads vs 1 float4 read way back in CUDA 0.8. Let’s see if things have changed since then. I’ll try out my key random memory access kernel with the switched data structure and post the results later today.

So, I’ve been busy and didn’t get a chance to try it out yet. I’m on vacation for the next week, but I will do some benchmarking when I get back.

Well, I will also be occupied with traveling most of next week, so I hope I will be able to implement it Monday morning. If I am able, I will report what is fastest for me.