I am running calculations in parallel across multiple thread blocks (hence the use of CUDA). Some of them produce viable results and others do not. Right now I flag viable results in a separate boolean array, which is later searched on the host side, and the viable results are saved to storage.
To improve processing speed and minimize host CPU load, I would like to pass a vector into the kernel and let individual thread blocks push data into it when and if a viable solution is found. I do not care about the order in which results are pushed into the vector, as long as they make it there and I do not have to search through a very large set of arrays on the CPU.
I have been looking at the Thrust vector library (I have already found plenty of reasons why std::vector will not work), but I do not see any code examples where a thrust::device_vector is actually passed into a kernel and then resized (pushed/popped). Rather, most examples focus on passing a pointer to the vector (reduced to little more than an array) together with its size, and then performing calculations on individual array elements.
Here is an example of the code I was trying to achieve (obviously, it does not compile). Let me know if anything like that is even possible within a CUDA kernel …
__global__ void cudaVectorFill(thrust::device_vector<int> & H)
{
    // calculate the global element number assuming 128 threads per block
    long long unsigned int varElementNumber = 128 * blockIdx.x + threadIdx.x;

    // push the element number into the vector (this is the call that does not compile)
    H.push_back(varElementNumber);
}

// host side: launch the kernel and check for errors
cudaVectorFill <<<2, 128>>> (H);
cudaError_t varCudaError1 = cudaGetLastError();
if (varCudaError1 != cudaSuccess)
    std::cout << "Failed to launch cudaVectorFill kernel (error code: " << cudaGetErrorString(varCudaError1) << ")!" << std::endl;
for (unsigned int iVar = 0; iVar < H.size(); iVar++)
std::cout << "Value: " << H[iVar] << ", position: " << iVar << std::endl;
Pushing into the same vector from thousands of GPU threads executing concurrently doesn't look like a good idea; that's why these operations aren't implemented.
You can do it manually by using an atomic increment (atomicAdd) on the vector index (and, of course, allocating a large enough vector beforehand).
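A minimal sketch of that idea (the names and the viability test are illustrative, not from the original thread): preallocate a device buffer and a counter, and let each thread that finds a viable result claim a unique slot with atomicAdd:

```cuda
__device__ int d_count;                  // number of results pushed so far
__device__ bool isViable(int idx);       // hypothetical viability test, defined elsewhere

__global__ void findViable(int *d_results, int capacity)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (isViable(idx))
    {
        int slot = atomicAdd(&d_count, 1);  // claim a unique slot in the buffer
        if (slot < capacity)                // guard against overflowing the preallocation
            d_results[slot] = idx;
    }
}
```

After the kernel, the host copies back d_count and reads only that many elements, so no full-array search is needed.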
Another possibility is to extend your current boolean-flags code with on-GPU compacting:
Thanks for the quick response! That saves me the time of looking for a solution that clearly does not exist :)
I am not sure whether it is a good idea or not, but when used with caution, I can see how it could be useful. I do understand, however, the complexity of keeping one shared global size and scaling the allocation at the pace of all thread blocks running in parallel. Seems like a challenge …
The example you shared is interesting, but it assumes that the data is already stored in Thrust vector constructs. I am operating on more or less static arrays. To use the filtering capability, I would need to convert all calculations to Thrust vectors, I assume. Otherwise, I would have to create Thrust vectors from the arrays after the calculations are done, and only then filter them accordingly. That sounds like a lot of operations, to be honest. Am I missing some optimization along the way?
thrust::device_vector methods are not usable in device code.
Here's an example implementation of push_back, along the lines of what BulatZiganshin was saying:
Whether it's fast enough for your test case, I can't say. Atomic performance varies widely; it is both GPU-architecture- and data-dependent.
And as already mentioned, stream compaction or reduction techniques can be used in place of atomics.
Thank you, I will take a look at both approaches and report back once I am done with my test code
This is also a place where warp-aggregated atomics might be useful:
In fact, if you do an intra-threadblock stream compaction and reduce your vector updates to one atomic per threadblock, that would probably be more efficient.
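For reference, a warp-aggregated increment along the lines of NVIDIA's published technique can be sketched as follows (one atomicAdd per active warp instead of one per thread; requires CUDA 9+ for the _sync primitives):

```cuda
__device__ int atomicAggInc(int *counter)
{
    unsigned mask   = __activemask();             // lanes currently active in this warp
    int      lane   = threadIdx.x & 31;
    int      leader = __ffs(mask) - 1;            // lowest active lane is the leader
    int      base;

    if (lane == leader)
        base = atomicAdd(counter, __popc(mask));  // one atomic for the whole warp
    base = __shfl_sync(mask, base, leader);       // broadcast the base index

    // each lane's slot is base + its rank among the active lanes below it
    return base + __popc(mask & ((1u << lane) - 1));
}
```

A thread that finds a viable result would then write to d_results[atomicAggInc(&d_count)], cutting atomic traffic by up to 32x.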
thrust iterators aren’t limited to thrust vectors, you can easily build iterator from any device pointer
About atomicAdd performance: you will have a single variable shared among all threads, residing in the L2 cache. AFAIK, L2 cache latency is about 100-200 GPU cycles, so if you target 1e6 updates per second or less, it may be fast enough, but I've never tried it. If you need more performance, you can allocate N independent vectors and split the work among them. If you have even more work, using one vector per thread block and placing the atomic counter in shared memory may be optimal. And finally, warp-aggregated atomics will be optimal when the probability of a value being included in the answer is more than 1-10%.
But at that point, you will need a lot of work to bring all the collected data back to the CPU, so the simple stream compaction technique looks more attractive.
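The per-block variant mentioned above (one counter in shared memory, one global atomic per block) can be sketched roughly like this; the buffer size and the viability test are assumptions for illustration:

```cuda
__device__ bool isViable(int idx);   // hypothetical viability test, defined elsewhere

__global__ void collectPerBlock(int *d_out, int *d_globalCount, int capacity)
{
    __shared__ int s_buf[128];       // per-block staging buffer (assumes 128 threads/block)
    __shared__ int s_count;          // per-block counter kept in fast shared memory
    __shared__ int s_base;           // this block's reserved slice of the global buffer

    if (threadIdx.x == 0) s_count = 0;
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (isViable(idx))
        s_buf[atomicAdd(&s_count, 1)] = idx;   // cheap shared-memory atomic
    __syncthreads();

    // a single global atomic per block reserves a contiguous slice of d_out
    if (threadIdx.x == 0)
        s_base = atomicAdd(d_globalCount, s_count);
    __syncthreads();

    if (threadIdx.x < s_count && s_base + threadIdx.x < capacity)
        d_out[s_base + threadIdx.x] = s_buf[threadIdx.x];
}
```

Results within a block land contiguously, so global atomic traffic drops from one per result to one per block.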
To clarify my statement:
I was referring to the use of push_back in the OP's original code (which, as used there, is a device_vector method and is not workable). I was not referring to any mention of custom iterators, which are potentially usable in device code and are not device_vector methods (apart from the built-in iterators like .begin(), .end(), .rbegin(), etc.).
As an aside, unless you are using managed memory correctly, kernel reference parameters are not workable in CUDA either.
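In practice that means passing a raw device pointer by value rather than a device_vector by reference. A hedged sketch of such a rewrite of the OP's launch (the two-parameter kernel signature here is an assumption, not the OP's actual code):

```cuda
// host side: extract a raw pointer from the device_vector and pass it by value
thrust::device_vector<int> H(256);
int *d_H = thrust::raw_pointer_cast(H.data());

// the kernel would be declared as: __global__ void cudaVectorFill(int *H, int size)
cudaVectorFill<<<2, 128>>>(d_H, (int)H.size());
```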
I answered to "The example you shared is interesting but it assumes that data is stored already in thrust vector constructs". As I can guess, the topic starter looked at this:
and concluded that he would need Thrust vectors to fill in all those InputIterator/OutputIterator parameters. But:
- thrust::copy_if is called from host code
- thrust::copy_if happily accepts pointers to device memory
Examples should be available in the Thrust guide shipped with CUDA.
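A sketch of that usage, compacting a plain device array with thrust::copy_if called from host code; the predicate here is purely illustrative:

```cuda
#include <thrust/copy.h>
#include <thrust/execution_policy.h>

struct IsViable                      // illustrative predicate; replace with the real test
{
    __host__ __device__ bool operator()(int x) const { return (x & 1) == 0; }
};

// d_in and d_out are raw device pointers (e.g. from cudaMalloc), n is the element count;
// no device_vector is required on either side
int *d_end     = thrust::copy_if(thrust::device, d_in, d_in + n, d_out, IsViable());
int  numViable = d_end - d_out;      // number of surviving elements
```

So the existing static arrays can be filtered in place on the GPU without first converting the calculations to Thrust containers.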