Some questions... __syncthreads, qualifiers...

Hello there.

I have some questions and I’d appreciate it if someone could help me with them.

I’ve read somewhere that “the __host__ and __device__ qualifiers can be combined”. In what kind of situation would that be needed?

Another naive question: why does the code below work (it reverses an array) even without __syncthreads()? Do the threads in the same block execute sequentially?

[codebox]#include <stdio.h>

#define TAM 15

__global__ void kernel(int *d_a) {
  int idx = threadIdx.x;
  int val = d_a[idx];

  __syncthreads();

  d_a[blockDim.x - idx - 1] = val;
}

int main(void) {
  int h_a[TAM];
  for (int i = 0; i < TAM; i++) h_a[i] = i + 1;

  int *d_a;
  cudaMalloc((void **)&d_a, TAM * sizeof(int));
  cudaMemcpy(d_a, h_a, TAM * sizeof(int), cudaMemcpyHostToDevice);

  kernel<<<1, TAM>>>(d_a);

  cudaMemcpy(h_a, d_a, TAM * sizeof(int), cudaMemcpyDeviceToHost);

  for (int i = 0; i < TAM; i++) printf("%d\n", h_a[i]);

  return 0;
}[/codebox]

Cheers, John.

It can be useful if you want to avoid code duplication and have the same function available to both the CPU and the GPU. I sometimes mark utility functions with both __host__ and __device__.
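For instance, something like this hypothetical clamp helper (just an illustration, not from your code) compiles once and can be called from host code and from inside kernels:

[codebox]// A utility function marked with both qualifiers: usable on the CPU and the GPU.
__host__ __device__ inline float clampf(float x, float lo, float hi) {
  return x < lo ? lo : (x > hi ? hi : x);
}

__global__ void saturate(float *d_out, int n) {
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i < n) d_out[i] = clampf(d_out[i], 0.0f, 1.0f);  // device-side call
}

void saturate_on_host(float *h_out, int n) {
  for (int i = 0; i < n; i++) h_out[i] = clampf(h_out[i], 0.0f, 1.0f);  // same function on the CPU
}[/codebox]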

Actually, this code has an inherent race condition with or without __syncthreads(), and only works by accident (probably because you are only running 15 threads, which is less than the warp size). __syncthreads() is a barrier that is only useful for preventing read-after-write hazards when multiple threads in a block access the same elements of shared memory. It offers no synchronization when accessing device memory (d_a).
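For what it’s worth, here is a minimal, untested sketch of what a correct single-block version of your kernel could look like: stage the data through shared memory, which is where __syncthreads() actually gives you the guarantee you need.

[codebox]#define TAM 15

__global__ void kernel(int *d_a) {
  __shared__ int s[TAM];
  int idx = threadIdx.x;

  s[idx] = d_a[idx];  // each thread stages one element into shared memory
  __syncthreads();    // wait until the whole block has finished loading

  d_a[idx] = s[blockDim.x - idx - 1];  // now reading the mirrored element is safe
}[/codebox]

That only scales to a single block, of course.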

One correct data-parallel way to do an array reversal in CUDA would be to have each thread be responsible for swapping a pair of elements, rather than copying only one element to its new location. Here’s an untested example of what I’m talking about:

[codebox]// kernel: n = number of elements in array, d_a = pointer to first element of array
__global__ void kernel(int n, int *d_a) {
  int left_idx = blockDim.x * blockIdx.x + threadIdx.x;
  int right_idx = n - left_idx - 1;

  // The left index must be in the first half of the array.
  // Using an if statement here allows us to use good block sizes, like 256 threads,
  // even if n is not a multiple of the block size. Excess threads will do nothing.
  if (left_idx < n / 2) {
    int left = d_a[left_idx];    // this read will be coalesced
    int right = d_a[right_idx];  // this read will not be coalesced unless you are using a GT200 GPU

    d_a[left_idx] = right;  // this write will be coalesced
    d_a[right_idx] = left;  // this write will not be coalesced unless you are using a GT200 GPU
  }
}[/codebox]

Then you would launch this kernel with the following configuration (assuming int n holds the number of elements in the array on the host side as well):

[codebox]int total_threads = n / 2;    // rounding down is OK: if n is odd, the center element does not need to be swapped with itself
int threads_per_block = 256;  // anything will work, but 256 is a nice block size if you have no other constraints and your array is long
int blocks = (total_threads - 1) / threads_per_block + 1;  // handy trick: integer division that rounds up rather than down like the C default

kernel<<<blocks, threads_per_block>>>(n, d_a);[/codebox]

As mentioned in the comments, I have not dealt with the memory coalescing issue here. Before the GTX 200 series, threads reading a chunk of memory in reverse order would not be coalesced into one memory transaction, resulting in a HUGE loss of effective bandwidth. Fixing this would require using shared memory as a staging area. (It’s a nice exercise, actually: have your threads read a block of memory from both the left and right halves of the array into shared memory, do the reversal in shared memory, and then write the resulting blocks back out to device memory in left-to-right order.)
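To make that concrete, here is a rough, untested sketch of the exercise. For simplicity it assumes n is a positive multiple of 2 * TILE, so all the edge handling is omitted.

[codebox]#define TILE 256

// Each block owns one TILE-sized chunk in the left half of the array and the
// mirrored chunk in the right half, and swaps them via shared memory.
__global__ void reverse_tiled(int n, int *d_a) {
  __shared__ int left_tile[TILE];
  __shared__ int right_tile[TILE];

  int t = threadIdx.x;
  int left_base  = blockIdx.x * TILE;            // first element of this block's left chunk
  int right_base = n - (blockIdx.x + 1) * TILE;  // first element of the mirrored right chunk

  // Both reads sweep memory left-to-right, so they coalesce.
  left_tile[t]  = d_a[left_base + t];
  right_tile[t] = d_a[right_base + t];

  // Every thread must finish loading before any thread reads another
  // thread's shared-memory entry below.
  __syncthreads();

  // Write each chunk back to the opposite side, reversed within the chunk.
  // The global writes are again left-to-right, hence coalesced.
  d_a[left_base + t]  = right_tile[TILE - 1 - t];
  d_a[right_base + t] = left_tile[TILE - 1 - t];
}

// Host-side launch, still assuming n is a multiple of 2 * TILE:
//   reverse_tiled<<<n / (2 * TILE), TILE>>>(n, d_a);[/codebox]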

Thank you seibert, I got it.

Thank you again, this opened my mind to a lot of new things!

Cheers, John.