According to this post by Nathan Bell, vivekv80’s code snippet should be working fine.
Yeah, you’re right. I was looking at the documentation http://thrust.googlecode.com/svn/tags/1.2…p__extrema.html which uses an int * directly and so obviously returns an int *. But even then I was wrong, because subtracting the two pointers directly will give you the correct offset; you’ll only get the offset in bytes if you cast the pointers to size_t first.
@erike and @eelsen: Somehow Nathan’s code snippet did not work; I get run-time errors. But his snippet works if I copy the vector to the host and then do the thrust::max_element there. I am missing a step that can do this on the device, thereby avoiding the unwanted memcpy.
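For reference, here is a minimal sketch of what the device-side Thrust call would look like (this is my reading of Nathan's suggestion, not code from this thread; d_vec is an assumed thrust::device_vector<float> that already holds the data on the GPU):

#include <thrust/device_vector.h>
#include <thrust/extrema.h>

// max_element runs on the device when given device_vector iterators
thrust::device_vector<float>::iterator it =
    thrust::max_element(d_vec.begin(), d_vec.end());
int   max_index = it - d_vec.begin();  // iterator difference gives the element offset, not bytes
float max_val   = *it;                 // dereferencing copies just this one element back to the host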
I tried finding the index using this kernel:
#define nblocks 25
#define blocksize 16
__global__ void getmax(cufftComplex *pArray, cufftComplex *pMaxResults, int *x_shift, int *y_shift, int *fftindx, int *fftindy, int *index)
{
    // shared memory
    __shared__ cufftComplex max[16]; // max value
    __shared__ int pos[16];          // position

    int arrayIndex = nblocks*blocksize*blockIdx.y + blocksize*blockIdx.x + threadIdx.x;
    max[threadIdx.x].x = pArray[arrayIndex].x;
    pos[threadIdx.x] = arrayIndex;
    __syncthreads();

    int nTotalThreads = blockDim.x; // Total number of active threads
    cufftComplex temp;
    int idx = 0;
    int idy = 0;

    while (nTotalThreads > 1)
    {
        int halfPoint = (nTotalThreads >> 1); // divide by two
        if (threadIdx.x < halfPoint)
        {
            temp.x = max[threadIdx.x + halfPoint].x;
            pos[threadIdx.x] = threadIdx.x + halfPoint;
            if (temp.x > max[threadIdx.x].x)
            {
                max[threadIdx.x].x = temp.x;
                pos[threadIdx.x] = threadIdx.x + halfPoint;
                idx = blockIdx.x+blockDim.x+threadIdx.x;
                idy = blockIdx.y+threadIdx.x;
            }
        }
        __syncthreads();
        nTotalThreads = (nTotalThreads >> 1); // divide by two
    }

    if (threadIdx.x == 0)
    {
        pMaxResults[25*blockIdx.y + blockIdx.x].x = max[threadIdx.x].x;
        index[25*blockIdx.y + blockIdx.x] = idx; // pos[blockDim.x+blockIdx.x];
        x_shift[0] = idx;
        y_shift[0] = idy;
    }
}
Calling this twice gives the maximum value:
getmax <<< nblocks,blocksize >>> (out1_d, pmax1, x_shift1, y_shift1, fftindx, fftindy, index);
getmax <<< 1,blocksize >>> (pmax1, pmax2, x_shift1, y_shift1, fftindx, fftindy, index);
I still get the indexes wrong. Any suggestions?
This kernel can find the maximum in 46 usecs, which is much better than what I started with. If only the indexes were right too…
Reference for this code snippet: http://supercomputingblog.com/cuda/cuda-tu…-communication/
Any suggestions/feedback on saving the index correctly?
You are unconditionally setting the value of pos[threadIdx.x] at every iteration, which is probably wrong. That code will be many times slower than any of the other solutions you have been offered in this thread so far.
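For what it’s worth, here is a minimal, untested sketch of what that fix could look like in the reduction loop above (my reading of the advice, not the poster’s actual code): only touch pos when the neighbour really is larger, and copy the neighbour’s stored global index instead of its local offset.

while (nTotalThreads > 1)
{
    int halfPoint = (nTotalThreads >> 1);
    if (threadIdx.x < halfPoint)
    {
        if (max[threadIdx.x + halfPoint].x > max[threadIdx.x].x)
        {
            max[threadIdx.x].x = max[threadIdx.x + halfPoint].x;
            pos[threadIdx.x]   = pos[threadIdx.x + halfPoint]; // carry the winner's global array index along
        }
    }
    __syncthreads();
    nTotalThreads = halfPoint;
}
// thread 0 now holds the block maximum in max[0].x and its global index in pos[0]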
I was asked to add some code to find the index.
Unfortunately I’m currently away from my hardware, so I’ve only done this in emulation debug mode. Not much testing has been done, and I cannot guarantee that this code is bug-free or has great performance, for that matter.
You can give it a spin:
[attachment=18252:my_min_max_w_index.cu]
@avidday: The index needs to be saved when the maximum value is found. So in the first call, the kernel finds 25 maximum values (because there are 25 blocks), and the second call finds the maximum of those 25 values. However, if I can write the threadIdx corresponding to each blockIdx, the problem is solved. I tried debugging using cuda-gdb, but the indexes get corrupted.
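A sketch of how the index bookkeeping could flow across the two calls (the kernel and buffer names here are made up for illustration, not the actual code): the first pass writes one maximum and one global index per block, and the second pass only has to pick the winning block and copy its stored index, so no threadIdx arithmetic is needed in the second call.

// pass 1: 25 blocks -> 25 block maxima, plus the global index of each one
getmax_per_block<<<25, 16>>>(d_in, d_block_max, d_block_idx);
// pass 2: one block reduces the 25 maxima; when entry b wins, the final answer is
// value = d_block_max[b] and index = d_block_idx[b] (simply copied, never recomputed)
getmax_of_blocks<<<1, 32>>>(d_block_max, d_block_idx, d_max, d_max_idx);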
@jimmypettersson: I am trying out your snippet and will post results once I have it running. Thanks in advance!
@jimmypetterson: using your code snippet, I can determine the max value and its index (for 400 elements) in 80 us (as per CUDA profiler).
I also tried running 50 blocks with 400 threads each, modifying the warp_reduce_max kernel. With this modification, the maximum value is correct only for 48 blocks, and the index is correct only for the first 18 blocks.
const int threads = 400;

// min_max with reduction
__device__ void warp_reduce_max(float smem[threads])
{
    smem[threadIdx.x] = smem[threadIdx.x+256] > smem[threadIdx.x] ? smem[threadIdx.x+256] : smem[threadIdx.x]; __syncthreads();
    smem[threadIdx.x] = smem[threadIdx.x+128] > smem[threadIdx.x] ? smem[threadIdx.x+128] : smem[threadIdx.x]; __syncthreads();
    smem[threadIdx.x] = smem[threadIdx.x+64]  > smem[threadIdx.x] ? smem[threadIdx.x+64]  : smem[threadIdx.x]; __syncthreads();
    smem[threadIdx.x] = smem[threadIdx.x+32]  > smem[threadIdx.x] ? smem[threadIdx.x+32]  : smem[threadIdx.x]; __syncthreads();
    smem[threadIdx.x] = smem[threadIdx.x+16]  > smem[threadIdx.x] ? smem[threadIdx.x+16]  : smem[threadIdx.x]; __syncthreads();
    smem[threadIdx.x] = smem[threadIdx.x+8]   > smem[threadIdx.x] ? smem[threadIdx.x+8]   : smem[threadIdx.x]; __syncthreads();
    smem[threadIdx.x] = smem[threadIdx.x+4]   > smem[threadIdx.x] ? smem[threadIdx.x+4]   : smem[threadIdx.x]; __syncthreads();
    smem[threadIdx.x] = smem[threadIdx.x+2]   > smem[threadIdx.x] ? smem[threadIdx.x+2]   : smem[threadIdx.x]; __syncthreads();
    smem[threadIdx.x] = smem[threadIdx.x+1]   > smem[threadIdx.x] ? smem[threadIdx.x+1]   : smem[threadIdx.x]; __syncthreads();
}
I took your code as a starting point and did this:
__global__ void getmax(float *in1, float *out1, int *index)
{
    // Declare arrays to be in shared memory.
    __shared__ float max[threads];

    int nTotalThreads = blockDim.x; // Total number of active threads
    float temp;
    float max_val;
    int max_index;
    int arrayIndex;

    // Calculate which element this thread reads from memory
    arrayIndex = gridDim.x*blockDim.x*blockIdx.y + blockDim.x*blockIdx.x + threadIdx.x;
    max[threadIdx.x] = in1[arrayIndex];
    max_val = max[threadIdx.x];
    max_index = blockDim.x*blockIdx.x + threadIdx.x;
    __syncthreads();

    while (nTotalThreads > 1)
    {
        int halfPoint = (nTotalThreads >> 1);
        if (threadIdx.x < halfPoint)
        {
            temp = max[threadIdx.x + halfPoint];
            if (temp > max[threadIdx.x])
            {
                max[threadIdx.x] = temp;
                max_val = max[threadIdx.x];
            }
        }
        __syncthreads();
        nTotalThreads = (nTotalThreads >> 1); // divide by two.
    }

    if (threadIdx.x == 0)
    {
        out1[num_blocks*blockIdx.y + blockIdx.x] = max[threadIdx.x];
    }

    if (max[blockIdx.x] == max_val)
    {
        index[blockIdx.x] = max_index;
    }
}
__global__ void getmax_dynamic(float *in1, float *out1, int *index)
{
    // Declare arrays to be in shared memory.
    __shared__ float max[16];

    int nTotalThreads = blockDim.x; // Total number of active threads
    float temp;
    float max_val;
    int max_index;

    // Calculate which element this thread reads from memory
    int arrayIndex = gridDim.x*blockDim.x*blockIdx.y + blockDim.x*blockIdx.x + threadIdx.x;
    max[threadIdx.x] = in1[arrayIndex];
    max_val = max[threadIdx.x];
    max_index = arrayIndex;
    __syncthreads();

    while (nTotalThreads > 1)
    {
        int halfPoint = (nTotalThreads >> 1);
        if (threadIdx.x < halfPoint)
        {
            temp = max[threadIdx.x + halfPoint];
            if (temp > max[threadIdx.x])
            {
                max[threadIdx.x] = temp;
                max_val = max[threadIdx.x];
            }
        }
        __syncthreads();
        nTotalThreads = (nTotalThreads >> 1); // divide by two.
    }

    if (threadIdx.x == 0)
    {
        out1[blockIdx.x] = max[threadIdx.x];
    }

    if (max[blockIdx.x] == max_val)
    {
        if (max_index >= 2*num_blocks)
        {
            index[blockIdx.x] = max_index;
        }
        else
        {
            int indx = index[max_index];
            index[blockIdx.x] = indx;
        }
    }
}
I did this because:
- In one scenario, I have 20K elements which I need to reduce using 400 threads and 50 blocks. This results in 50 values, which are the per-block maxima. But then I get the indexes wrong; the indexes are always a multiple of 16 :(
- I am not sure what needs to be changed if I use a thread count that is not a power of two.
Let me know your suggestions.
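In case it helps, below is a minimal, untested sketch (my own, not Jimmy's code) of a block-level max reduction that carries the index along in a second shared array and tolerates a thread count that is not a power of two, by rounding the stride up and bounds-checking the partner. The buffer names and the dynamic-shared-memory layout are assumptions.

#include <cfloat>  // FLT_MAX

__global__ void block_max_with_index(const float *in, float *block_max, int *block_idx, int n)
{
    // dynamic shared memory: blockDim.x floats followed by blockDim.x ints
    extern __shared__ float s_val[];
    int *s_idx = (int *)&s_val[blockDim.x];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    s_val[threadIdx.x] = (gid < n) ? in[gid] : -FLT_MAX;  // pad out-of-range threads
    s_idx[threadIdx.x] = gid;
    __syncthreads();

    // round the stride up so a non-power-of-two blockDim.x is still fully covered
    for (int stride = (blockDim.x + 1) / 2; stride > 0; stride = (stride > 1) ? (stride + 1) / 2 : 0)
    {
        if (threadIdx.x < stride && threadIdx.x + stride < blockDim.x)
        {
            if (s_val[threadIdx.x + stride] > s_val[threadIdx.x])
            {
                s_val[threadIdx.x] = s_val[threadIdx.x + stride];
                s_idx[threadIdx.x] = s_idx[threadIdx.x + stride];  // index travels with the value
            }
        }
        __syncthreads();
    }

    if (threadIdx.x == 0)
    {
        block_max[blockIdx.x] = s_val[0];  // per-block maximum
        block_idx[blockIdx.x] = s_idx[0];  // global index of that maximum
    }
}

For the 20K-element case this might be launched as block_max_with_index<<<50, 400, 400*(sizeof(float)+sizeof(int))>>>(d_in, d_block_max, d_block_idx, 20000); the 50 per-block results could then be reduced by a second, single-block launch that just copies the winning block's stored index, as in the earlier sketch.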
Why do you need to use specifically 400 threads and 50 blocks? Are you trying to fuse it together with some other operations?
My suggestion would be to use the code as it is or motivate why you need to change it.
One issue here is that I wrote the code with big problems (1M+ elements) in mind; 20K isn’t that massive, and I believe you would want to change the els_per_block parameter to something considerably smaller.