[beginner] contains in shared memory array

Hello all,

I’m a beginner in CUDA programming, and new to this forum :). I was wondering if anyone can help
me with my problem. I have a kernel that goes:

kernel {
extern __shared__ int results[];	// blockDim.x ints, sized at launch
extern __device__ int eliminated[];

results[threadIdx.x] = myfunction(...);

__syncthreads();

if (results contains 1) {
	eliminated[blockIdx.x] = 1;
} else {
	eliminated[blockIdx.x] = 0;
}

}

I’d like to know whether it’s possible to write code that checks whether my “results” array contains a 1. If so, how?

Thanks

For compute capability 2.0 or greater (Fermi), appendix B.6 of the programming guide says:

int __syncthreads_or(int predicate);

is identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for any of them.
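
Applied to the kernel from the question, that could look roughly like the sketch below. It assumes a device of compute capability 2.0 or higher. The original post elides the arguments of `myfunction`, so the version here is a hypothetical stand-in that just returns an input value, and `launchContainsOne` is a hypothetical host helper added only so the kernel can be exercised.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the poster's myfunction(...): returns the input
// value for this thread, so the test data decides which blocks contain a 1.
__device__ int myfunction(const int *input, int idx) {
    return input[idx];
}

// eliminated[b] becomes 1 if any thread of block b computed 1, else 0.
// Needs compute capability 2.0 or higher for __syncthreads_or.
__global__ void computeEliminatedArray(const int *input, int *eliminated) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int result = myfunction(input, idx);

    // Barrier plus block-wide OR of the predicate; every thread of the
    // block receives the same return value.
    int any = __syncthreads_or(result == 1);

    if (threadIdx.x == 0) {
        eliminated[blockIdx.x] = any ? 1 : 0;
    }
}

// Hypothetical host helper: input holds numBlocks * threadsPerBlock values.
std::vector<int> launchContainsOne(const std::vector<int> &input,
                                   int numBlocks, int threadsPerBlock) {
    int *d_input = nullptr, *d_eliminated = nullptr;
    cudaMalloc((void **)&d_input, input.size() * sizeof(int));
    cudaMalloc((void **)&d_eliminated, numBlocks * sizeof(int));
    cudaMemcpy(d_input, input.data(), input.size() * sizeof(int),
               cudaMemcpyHostToDevice);

    computeEliminatedArray<<<numBlocks, threadsPerBlock>>>(d_input, d_eliminated);

    std::vector<int> eliminated(numBlocks);
    cudaMemcpy(eliminated.data(), d_eliminated, numBlocks * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_input);
    cudaFree(d_eliminated);
    return eliminated;
}
```

No shared array is needed in this variant, since the predicate is combined inside the barrier itself.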

Hi,

Thanks for your reply, HappyJack. I use a GTX 260, though, which is compute capability 1.3.
Is there a workaround for this?

__device__ int contains(int *results, int dimension, int number) {
	// Linear scan; returns 1 as soon as a match is found.
	for (int i = 0; i < dimension; i++) {
		if (results[i] == number) return 1;
	}
	return 0;
}

__global__ void computeEliminatedArray(double *d_fr_iteration, int nDim, int *eliminated) {
	extern __shared__ int results[];

	results[threadIdx.x] = myfunction(...);

	__syncthreads();

	if (contains(results, blockDim.x, 1)) {
		eliminated[blockIdx.x] = 1;
	} else {
		eliminated[blockIdx.x] = 0;
	}
}

This is what I currently use. Would this work?

Thanks

Seems like it would work, but there’s a faster solution:

__global__ void computeEliminatedArray(double *d_fr_iteration, int nDim, int *eliminated) {
	__shared__ int flag;

	flag = 0;	// every thread writes the same value
	__syncthreads();

	if (myfunction(...) == 1)
		flag = 1;	// any thread that computes 1 sets the flag
	__syncthreads();

	if (threadIdx.x==0)
		eliminated[blockIdx.x] = flag;	// one global write per block
}

Or, if there are more 1’s than 0’s, you might want to do “nand not” instead of “or”. A parallel reduction might also be faster.
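
For reference, the parallel-reduction variant could be sketched as below. It uses only `__syncthreads()` and shared memory, so it should also run on compute capability 1.3. It assumes `blockDim.x` is a power of two. The `myfunction` shown is a hypothetical stand-in for the elided original, and `launchReduceContainsOne` is a hypothetical host helper.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the poster's myfunction(...).
__device__ int myfunction(const int *input, int idx) {
    return input[idx];
}

// eliminated[b] becomes 1 if any thread of block b computed 1, else 0.
__global__ void computeEliminatedReduce(const int *input, int *eliminated) {
    extern __shared__ int results[];  // blockDim.x ints, sized at launch

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Store a 0/1 predicate rather than the raw result.
    results[threadIdx.x] = (myfunction(input, idx) == 1) ? 1 : 0;
    __syncthreads();

    // Tree reduction with OR; assumes blockDim.x is a power of two.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            results[threadIdx.x] |= results[threadIdx.x + stride];
        }
        __syncthreads();  // reached by all threads, outside the if
    }

    if (threadIdx.x == 0) {
        eliminated[blockIdx.x] = results[0];
    }
}

// Hypothetical host helper: input holds numBlocks * threadsPerBlock values.
std::vector<int> launchReduceContainsOne(const std::vector<int> &input,
                                         int numBlocks, int threadsPerBlock) {
    int *d_input = nullptr, *d_eliminated = nullptr;
    cudaMalloc((void **)&d_input, input.size() * sizeof(int));
    cudaMalloc((void **)&d_eliminated, numBlocks * sizeof(int));
    cudaMemcpy(d_input, input.data(), input.size() * sizeof(int),
               cudaMemcpyHostToDevice);

    // Third launch parameter sizes the extern __shared__ array.
    computeEliminatedReduce<<<numBlocks, threadsPerBlock,
                              threadsPerBlock * sizeof(int)>>>(d_input, d_eliminated);

    std::vector<int> eliminated(numBlocks);
    cudaMemcpy(eliminated.data(), d_eliminated, numBlocks * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_input);
    cudaFree(d_eliminated);
    return eliminated;
}
```

Whether this beats the shared-flag version depends on the block size and on how expensive `myfunction` is, so it is worth timing both.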

Thanks guys, really appreciate the help.

I have another question: I want to count the items in the eliminated array that have a value of 0.

How do I do that? Here is what I have, which doesn’t work.

The threads seem to always count from zero…

[codebox]
__global__ void insertResults(…, int *eliminated, int *d_count) {
	if (eliminated[blockIdx.x] == 0) {
		...
		(*d_count)++;
	}
}
[/codebox]

On the host, I have the following code launching the kernel:

[codebox]int *count;
int h_count = 0;

cudaMalloc((void **) &count, sizeof(int));
cudaMemcpy(count, &h_count, sizeof(int), cudaMemcpyHostToDevice);

insertResults<<<z,n>>> (…, eliminated, count);[/codebox]

Thanks

The plain increment is a non-atomic read-modify-write, so concurrent threads overwrite each other's updates. Replacing (*d_count)++; with atomicAdd(d_count, 1); would make it work, but be very slow. See the reduction example from the SDK for how to make it faster. The basic idea is to have each thread increment a register (which is a fast operation), and then use slow operations to add the sums from each thread only at the end.
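
A rough sketch of that idea, written as a separate counting kernel rather than as a change to `insertResults` (whose body is elided in the post): each thread counts zeros into a register, the block sums those counts in shared memory, and only thread 0 issues one `atomicAdd` per block. The names `countZeros` and `countZerosOnDevice` are hypothetical, and the shared-memory step assumes `blockDim.x` is a power of two.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Counts the entries of eliminated[0..n) that are 0 and adds the total to *d_count.
__global__ void countZeros(const int *eliminated, int n, int *d_count) {
    extern __shared__ int partial[];  // blockDim.x ints, sized at launch

    // Step 1: each thread accumulates into a register (grid-stride loop).
    int local = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        if (eliminated[i] == 0) {
            local++;
        }
    }
    partial[threadIdx.x] = local;
    __syncthreads();

    // Step 2: sum the per-thread counts within the block.
    // Assumes blockDim.x is a power of two.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        }
        __syncthreads();
    }

    // Step 3: one atomic per block. d_count is already a pointer,
    // so it is passed to atomicAdd directly.
    if (threadIdx.x == 0) {
        atomicAdd(d_count, partial[0]);
    }
}

// Hypothetical host helper that returns the number of zeros.
int countZerosOnDevice(const std::vector<int> &eliminated,
                       int numBlocks, int threadsPerBlock) {
    int n = static_cast<int>(eliminated.size());
    int *d_eliminated = nullptr, *d_count = nullptr;
    cudaMalloc((void **)&d_eliminated, n * sizeof(int));
    cudaMalloc((void **)&d_count, sizeof(int));
    cudaMemcpy(d_eliminated, eliminated.data(), n * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(int));  // counter starts at zero

    countZeros<<<numBlocks, threadsPerBlock,
                 threadsPerBlock * sizeof(int)>>>(d_eliminated, n, d_count);

    int h_count = 0;
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_eliminated);
    cudaFree(d_count);
    return h_count;
}
```

If the count has to stay inside `insertResults`, note that `eliminated` is indexed by `blockIdx.x` there, so presumably only one thread per block (for example `threadIdx.x == 0`) should do the `atomicAdd`; otherwise each qualifying block is counted once per thread.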
