How to write a large number of per-thread results to global memory effectively in CUDA

Hi colleagues.
My kernel launches around 1000000 threads, and each thread either produces a result in an int3 variable or not, according to an if condition.
Usually, over half of all threads meet the if condition and produce a result. Currently, the results are written to a global array variable.
My problem is that the kernel's run time is very long. According to profiling, this is due to writing to global memory many times.
How can I speed up my kernel, and how can I use global memory effectively? Alternatively, what other scheme could I use to write the results?
Thanks in advance.

what are you writing to global memory, and why are you writing it to global memory…?

“over half of all threads meet the if condition and produce a result”

what is “the result” and in what format is “the result”…?

is “the result” equally “the final result”…?

I am writing the result from each thread to global memory.
The result is an int3 variable. Each thread computes an int3, and according to the if condition, if it is true, the output (the int3 variable) must be saved.
After the kernel finishes, those results are used in subsequent processing.
So I save the results in global memory. The code framework follows.

__device__ int3 gvFeatures[1000000];
__device__ int gnFeatureNum = 0;

// device code

__global__ void kernel()
{
    int3 vOutput;
    ...
    if (...)
    {
        // critical section: only one thread at a time may execute
        // these two lines, otherwise threads overwrite each other's slot
        gvFeatures[gnFeatureNum] = vOutput;
        gnFeatureNum++;
        // end critical section
    }
}

// host code

kernel<<<10000, 100>>>();

I think this is a very common scenario in CUDA kernels.
Please help me find an efficient approach.

The critical section is probably what is slowing you down.

Since you’re writing output for ~50% of the threads anyway, the fastest approach would probably be to dispense with the critical section and atomics altogether, and just write to a data array that has one int3 per thread, plus a flag array to indicate whether the thread wrote its data or not. After all, 1000000 int3 quantities is only a few (12) megabytes of data. The flag array would require another 4 megabytes (one int per thread).
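
A minimal sketch of that approach, reusing the array names from your framework (the condition and the output computation here are placeholders):

__device__ int3 gvFeatures[1000000];   // one slot per thread, ~12 MB
__device__ int  gvValid[1000000];      // flag array, ~4 MB

__global__ void kernel()
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int3 vOutput = make_int3(tid, tid, tid);   // placeholder computation
    bool bHit = (tid & 1) != 0;                // placeholder condition

    gvValid[tid] = bHit ? 1 : 0;
    if (bHit)
        gvFeatures[tid] = vOutput;   // slot tid belongs to this thread alone:
                                     // no atomics, no critical section needed
}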

Then do stream compaction later, if you need to.
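
For the compaction step, Thrust can do the work. Here is a sketch, assuming d_features, d_valid, and d_compact are device pointers to the arrays above plus a separate output array (the names are made up; pointers to __device__ symbols can be obtained with cudaGetSymbolAddress):

#include <thrust/copy.h>
#include <thrust/device_ptr.h>
#include <thrust/functional.h>

// host code: keep only the entries whose flag is nonzero
thrust::device_ptr<int3> features(d_features);
thrust::device_ptr<int>  valid(d_valid);
thrust::device_ptr<int3> compact(d_compact);

thrust::device_ptr<int3> end =
    thrust::copy_if(features, features + 1000000,
                    valid,                      // stencil: the flag array
                    compact,                    // destination
                    thrust::identity<int>());   // predicate: flag != 0
int nFeatureNum = (int)(end - compact);         // number of valid results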

The next fastest approach (combining the writing of data and stream compaction) would be to have an index variable in global memory that you atomically increment per thread that needs to actually write its data.

With the returned value from the atomic increment, the thread has an index into an output array of int3 quantities that it can write into, without worrying about other threads (no atomics, no critical section for that write).
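
A sketch of that second method, again with a placeholder condition and output:

__device__ int3 gvFeatures[1000000];
__device__ int  gnFeatureNum = 0;   // shared output index

__global__ void kernel()
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int3 vOutput = make_int3(tid, tid, tid);   // placeholder computation

    if ((tid & 1) != 0)                        // placeholder condition
    {
        // reserve exactly one slot; the pre-increment value returned by
        // atomicAdd is this thread's private index into the output array
        int my_idx = atomicAdd(&gnFeatureNum, 1);
        gvFeatures[my_idx] = vOutput;          // ordinary, non-atomic write
    }
}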

The slowest approach would be the use of a critical section, probably.
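
For reference, the “critical section” placeholder in your framework typically expands to a spin lock along these lines (a sketch; gLock and appendFeature are made-up names). Every writing thread is serialized through the lock, which is what makes this approach slow:

__device__ int gLock = 0;

__device__ void appendFeature(int3 vOutput)
{
    bool done = false;
    while (!done)
    {
        if (atomicCAS(&gLock, 0, 1) == 0)   // try to acquire the lock
        {
            gvFeatures[gnFeatureNum] = vOutput;
            gnFeatureNum++;
            __threadfence();                // make the writes visible first
            atomicExch(&gLock, 0);          // release the lock
            done = true;
        }
    }
}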

Thanks txbob.
I suspected the critical section at first, so I tested the code without it.
But there was no effect on speed, and the critical section I implemented is a fast one anyway.
Also, when each thread produces a large structure rather than one int3, I cannot define the global array because of its size.
Following your suggestion, it would be sizeof(structure) * 1000000.
So I need another scheme to save the results.
Please guide me.

Then do the following modification of the write+stream compaction method:

  1. Have a global unstructured buffer, and a data index.
  2. When each thread is ready to write to the global buffer, it will decide how much data it needs to write.
  3. It will then do an atomicAdd on the data index, to “request” a chunk of that size from the global buffer.
  4. The value returned from the atomicAdd represents the index into the global storage that is reserved for that thread/request.
__device__ unsigned char glb_buff[MAX_DATA_SIZE];
__device__ unsigned glb_idx = 0;

__global__ void mykernel(...)
{
    my_struct my_local_data[mysize];

    unsigned my_bytes = mysize * sizeof(my_struct);

    // "request" a my_bytes-sized chunk of the buffer; the value returned by
    // atomicAdd marks the start of the region reserved for this thread alone
    unsigned my_idx = atomicAdd(&glb_idx, my_bytes);

    // then copy my_local_data to glb_buff starting at my_idx
    memcpy(glb_buff + my_idx, reinterpret_cast<unsigned char *>(my_local_data), my_bytes);

    ...
}

Note that copying data from a single thread in an unstructured fashion like this could be a source of inefficiency. If that is the case, it will help to do a specialized copy that copies as much as possible (16 bytes) per copy iteration, by recasting pointers or what have you, rather than just the raw memcpy.
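
A sketch of such a copy, assuming every thread's my_bytes is a multiple of 16 so that the reserved regions (and hence both pointers) stay 16-byte aligned (copy16 is a made-up helper; it would replace the memcpy call above):

// move one uint4 (16 bytes) per iteration instead of byte-by-byte;
// assumes dst and src are 16-byte aligned and nbytes is a multiple of 16
__device__ void copy16(unsigned char *dst, const unsigned char *src, unsigned nbytes)
{
    uint4 *d = reinterpret_cast<uint4 *>(dst);
    const uint4 *s = reinterpret_cast<const uint4 *>(src);
    for (unsigned i = 0; i < nbytes / sizeof(uint4); i++)
        d[i] = s[i];
}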

OK, thanks. That sounds like a good solution.
Thank you very much.