I want to sum some of the elements of a vector (its size is a power of two and it has been sorted beforehand) and write the sum to another vector. The number of elements to sum in each case is not known in advance; they are the values that match a condition (the elements are index-value pairs). Let me give a simple example:
The input vector holds (index, value) pairs and contains this information for several points (each subvector has size 4 in this example). So, for every point, I need to compute the sum over its subvector, taking only the elements whose index _i matches. The subvectors are sorted by index using bitonic sort, which is very fast.
Right now I launch as many threads as I have points (2 threads in the example); each one looks for the index and writes the result, but I want to do this more efficiently. Do you have any ideas I could put into practice?
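For reference, the one-thread-per-point approach described above might look roughly like the sketch below. This is only an illustration under assumptions: the names `sumMatchingPerThread` and `sumMatchingBaseline`, and the `targets` array (one index to match per point), are hypothetical and not part of the original code, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumed layout: `pairs` holds numPoints consecutive subvectors of
// pairsPerPoint (index, value) pairs, stored as float2 (x = index, y = value).
// `targets[p]` is the index to match for point p (a hypothetical input).
__global__ void sumMatchingPerThread(const float2 *pairs, const float *targets,
                                     float *sums, int numPoints, int pairsPerPoint)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per point
    if (p >= numPoints) return;

    float acc = 0.0f;
    for (int k = 0; k < pairsPerPoint; ++k) {
        float2 e = pairs[p * pairsPerPoint + k];
        if (e.x == targets[p]) acc += e.y;          // sum only matching pairs
    }
    sums[p] = acc;
}

// Host helper (hypothetical): copies data to the GPU, runs the kernel,
// and returns one sum per point. CUDA error checks are left out.
std::vector<float> sumMatchingBaseline(const std::vector<float2> &pairs,
                                       const std::vector<float> &targets,
                                       int pairsPerPoint)
{
    int numPoints = static_cast<int>(targets.size());
    float2 *dPairs = nullptr;
    float *dTargets = nullptr, *dSums = nullptr;
    cudaMalloc(&dPairs, pairs.size() * sizeof(float2));
    cudaMalloc(&dTargets, numPoints * sizeof(float));
    cudaMalloc(&dSums, numPoints * sizeof(float));
    cudaMemcpy(dPairs, pairs.data(), pairs.size() * sizeof(float2),
               cudaMemcpyHostToDevice);
    cudaMemcpy(dTargets, targets.data(), numPoints * sizeof(float),
               cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (numPoints + threads - 1) / threads;
    sumMatchingPerThread<<<blocks, threads>>>(dPairs, dTargets, dSums,
                                              numPoints, pairsPerPoint);

    std::vector<float> sums(numPoints);
    cudaMemcpy(sums.data(), dSums, numPoints * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dPairs);
    cudaFree(dTargets);
    cudaFree(dSums);
    return sums;
}
```

Each thread walks its own subvector serially, so the work per point is not shared between threads; that is the part a block-level reduction would try to parallelize.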
I’ve been looking at the examples, but I don’t understand some things or how to adapt them to my particular problem… If you can help I would be very glad.
The question is: suppose I take this code from the reduction example:
__global__ void reduction(float *g_idata, float *g_odata)
{
    SharedMemory< float > smem;
    float *sdata = smem.getPointer();

    // load shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    // do reduction in shared mem
    for(unsigned int s=blockDim.x/2; s>0; s>>=1)
    {
        if (tid < s)
        {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
Obviously it computes the sum of all the values for each index, doesn’t it? But I want to sum only the values that meet the condition. I work with pairs (float2), so I want to add A and B only if A.x == B.x, and in that case store the result A.y + B.y in global memory. How can I adapt this? I tried something like the following, but it didn’t work:
__global__ void reduction_mine(float2 *g_idata, float2 *g_odata)
{
    SharedMemory< float2 > smem;
    float2 *sdata = smem.getPointer();

    // load shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    // do reduction in shared mem
    for(unsigned int s=blockDim.x/2; s>0; s>>=1)
    {
        if (tid < s)
        {
            if (sdata[tid].x == sdata[tid + s].x)
                sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
if (sdata[tid].x == sdata[tid + s].x)
    sdata[tid] += sdata[tid + s];
else
    sdata[tid] = 0; // I guess you should zero it since you don't want it to go into the summation.
Yes, I think you’re right, although the sum is in .y…
if (sdata[tid].x == sdata[tid + s].x)
    sdata[tid].y += sdata[tid + s].y;
else
    sdata[tid].y = 0; // I guess you should zero it since you don't want it to go into the summation.
So, this kernel computes the sum of the required values for one position of the output vector (blockIdx.x). Do I need to launch as many blocks as there are elements in my vector, or as many threads as there are elements?
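On the launch question: in the reduction kernel each thread loads one input element and each block writes one output element, so the total number of threads equals the number of input elements and the number of blocks equals the number of outputs. Below is a minimal sketch of one way to combine that with the condition, assuming one block per point, a block size equal to the subvector size (a power of two), and a hypothetical `targets` array giving the index to match for each point. The names `sumMatchingPerBlock` and `sumMatchingBlocks` are made up for illustration. The sketch uses plain `extern __shared__` memory instead of the SDK's `SharedMemory` helper, and it applies the condition once at load time rather than inside the reduction loop, because zeroing `sdata[tid]` inside the loop would also discard partial sums accumulated in earlier steps.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One block per point; blockDim.x must equal the subvector size,
// which is assumed to be a power of two.
__global__ void sumMatchingPerBlock(const float2 *pairs, const float *targets,
                                    float *sums)
{
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Apply the condition at load time: non-matching pairs contribute 0.
    float2 e = pairs[i];
    sdata[tid] = (e.x == targets[blockIdx.x]) ? e.y : 0.0f;
    __syncthreads();

    // Unconditional tree reduction, same structure as the sample kernel.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // One result per block, i.e. per point.
    if (tid == 0) sums[blockIdx.x] = sdata[0];
}

// Host helper (hypothetical): blocks = number of points,
// threads per block = pairs per point. CUDA error checks are left out.
std::vector<float> sumMatchingBlocks(const std::vector<float2> &pairs,
                                     const std::vector<float> &targets,
                                     int pairsPerPoint)
{
    int numPoints = static_cast<int>(targets.size());
    float2 *dPairs = nullptr;
    float *dTargets = nullptr, *dSums = nullptr;
    cudaMalloc(&dPairs, pairs.size() * sizeof(float2));
    cudaMalloc(&dTargets, numPoints * sizeof(float));
    cudaMalloc(&dSums, numPoints * sizeof(float));
    cudaMemcpy(dPairs, pairs.data(), pairs.size() * sizeof(float2),
               cudaMemcpyHostToDevice);
    cudaMemcpy(dTargets, targets.data(), numPoints * sizeof(float),
               cudaMemcpyHostToDevice);

    size_t sharedBytes = pairsPerPoint * sizeof(float);
    sumMatchingPerBlock<<<numPoints, pairsPerPoint, sharedBytes>>>(
        dPairs, dTargets, dSums);

    std::vector<float> sums(numPoints);
    cudaMemcpy(sums.data(), dSums, numPoints * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dPairs);
    cudaFree(dTargets);
    cudaFree(dSums);
    return sums;
}
```

With only 4 pairs per point, a block of 4 threads leaves most of a 32-thread warp idle, so for very small subvectors the one-thread-per-point version may well be just as fast; the block-per-point layout is more likely to pay off when the subvectors are larger.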