Here is what I have done so far - each thread in a block performs a calculation and, depending on the outcome, sometimes I get no results of interest, so the calculation results can be disregarded and do not need to be processed by the CPU and saved into storage.
To improve routine performance, I wanted to find a way to signal from within the CUDA kernel back to the CPU-side code that at least one interesting solution was found, in which case the results need to be parsed by the CPU. Otherwise, the results can be skipped and the next calculation cycle can be executed. Here is the high-level pseudo-code:
__global__ void subKernel(long long unsigned int *varCounter)
{
    // do some calculations for the given thread block
    if (varResult > 0)
        (*varCounter)++; // to signal to the CPU that a positive result was found in at least one thread block
}

int main()
{
    // initial setup
    for (int i = 0; i < varCycleCount; i++)
    {
        *varCounter = 0;
        subKernel<<<(parameters)>>>(varCounter);
        if (*varCounter > 0)
        {
            // process results, save into storage
        }
    }
}
Needless to say, when I try to execute something like this, the kernel launch fails with the message “an illegal memory access was encountered”, indicating the line where varCounter is incremented.
varCounter is allocated in managed memory, shared between the CPU and GPU:
long long unsigned int *varCounter = NULL;
cudaMallocManaged(&varCounter, sizeof(long long unsigned int));
but that does not seem to have helped. I am passing it into the kernel as a pointer and then dereferencing it to increment the value.
I was looking for similar topics/problems, but most solutions revolve around creating an array and examining each and every position outside of the kernel, something that relies on the CPU and slows down execution substantially. I am looking for a way to signal something back from the GPU to the CPU.
Do you perform error checking on the call to cudaMallocManaged()? What GPU are you using? Have you worked through the introductory blog post and example code for unified memory use here:
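On the first question, the kind of check meant is roughly the following (a sketch only, reusing the variable names from the pseudo-code above, not your actual code):
#include <cstdio>

int main()
{
    long long unsigned int *varCounter = NULL;
    cudaError_t err = cudaMallocManaged(&varCounter, sizeof(long long unsigned int));
    if (err != cudaSuccess) {
        printf("cudaMallocManaged failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // ... also check cudaGetLastError() after the kernel launch
    // and the return value of cudaDeviceSynchronize() ...
    cudaFree(varCounter);
    return 0;
}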
Error checking is done both on malloc (no error there, it is just a single variable) and the call to subKernel (this is where the error is signalled).
I am working on a GTX 1080 at this time - I did find the atomicAdd option, but whenever I add it to the code, execution stalls, irrespective of the compilation options (I am already setting the target architecture to be compute 6.1 compatible).
When you dump out device properties, what is the value for the property concurrentManagedAccess? I am reasonably sure that this must report as ‘1’ or ‘true’ for your code to work.
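For reference, a minimal way to dump that property would be something like this (a sketch, assuming device 0):
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // device 0 assumed
    printf("Device 0: %s\n", prop.name);
    printf("concurrentManagedAccess: %d\n", prop.concurrentManagedAccess);
    return 0;
}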
If your claim is correct, it should require minimal effort to convert this from pseudo-code to a real, complete code example that is not much longer than what you have shown here and demonstrates the issue. If you can do that, you can probably get some help.
If you choose to do so, please indicate the platform (Windows/Linux) and CUDA version, as well as the compile command line. All of that matters for decoding expected managed memory behavior.
The debugging is simple (IMO): I disable the one line in which varCounter is incremented, execute the code under full mem-check to verify there are no memory leaks, and everything runs correctly - no issues at all. Next, I uncomment the line in which varCounter is incremented, compile, run again, and the indicated error takes place. mem-check does not really add any clarity beyond the “illegal memory access” information. Given that it is a single variable shared across all threads, and not an array where a position needs to be calculated, I am only led to conclude there is some race condition where multiple threads attempt to write to the memory location at the same time. That kind of makes sense, but then what is the way out?
Windows 10, with Visual C++ 2017 (15.4.5) and CUDA 9.1/9.0 (both tested, NOT making any difference).
I will share the compile parameters when the PC unlocks - seems like I need a hard reset to get it out of the loop ;(
Thanks! I did not connect the dots - cudaGetDeviceProperties is something I know about and already use, it just did not click together for some reason :)
Having checked it, concurrent managed access is set to 0 at this time. That does not seem to affect operation once the fixes per your suggestions are in - below I can see that I have a non-zero number of saved candidate solutions, so the suggested fixes dealt with the problem correctly. It was dumb of me NOT to have noticed the issue beforehand :(
.\delimiter.exe all-delimiters 32 15 2 3 512 2048 1 out 0x00
GPU Device 0: "GeForce GTX 1080" with compute capability 6.1
Device 0 GeForce GTX 1080 with Compute 6.1 capabilities will be used; concurrent memory access is set to: 0
GPU memory usage: used = 1.3943 GB, free = 6.6057 GB, total = 8.0000 GB
CUDA kernel launch (initialize delimiters) with 2048 blocks and 512 threads per block, thread block multiplier 1:
. completed
GPU memory usage: used = 1.4141 GB, free = 6.5859 GB, total = 8.0000 GB
CUDA kernel launch (delimiter calculations) with 2048 blocks and 512 threads per block, thread block multiplier 1, target iteration cycles 4096
Examined: 1072693248 sequences, cycle: 1023, saved: 72, last candidate: 11111111111111111111111111111100
Examined: 2146435072 sequences, cycle: 2047, saved: 127, last candidate: 11111111111111111111111111111110
Examined: 3220176896 sequences, cycle: 3071, saved: 171, last candidate: 11111111111111111111111111111101
Examined: 4293918720 sequences, cycle: 4095, saved: 252, last candidate: 11111111111111111111111111111111
Examined: 4294967296 sequences, cycle: last, saved: 252, last candidate: 11111111111111111111111111111111
===================================
Last Examined Delimiter:
--- hex: 0xffffffff
--- bin: 11111111111111111111111111111111
Next Start Delimiter:
--- hex: 0x00000000
--- bin: 00000000000000000000000000000000
Execution Statistics:
--- blocks: 2048
--- threads: 512
--- thread block size: 1048576
--- elapsed: 10 seconds
--- delimiters examined: 4294967296
--- delimiters / second: 429496729
--- delimiters saved: 252
--- delimiters saved to files:
>>> out.DEL.32.HAM.15.DC.2.RUN.3.TH.512.BK.2048.CPUTH.1.txt
>>> out.DEL.32.HAM.15.DC.2.RUN.3.TH.512.BK.2048.CPUTH.2.txt
>>> out.DEL.32.HAM.15.DC.2.RUN.3.TH.512.BK.2048.CPUTH.3.txt
>>> out.DEL.32.HAM.15.DC.2.RUN.3.TH.512.BK.2048.CPUTH.4.txt
===================================
Concurrent managed access would be expected to be zero. CUDA 9.1 with Windows and a Pascal GPU means you will be operating in the pre-Pascal UM regime:
One implication of this is that concurrent managed access is not possible in this regime, therefore the property will be zero. Furthermore, in this regime, host-code access to managed memory after a kernel call must be preceded by a cudaDeviceSynchronize() call, which is why I inserted that:
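In outline, it would look something like this (a sketch reconstructed from the pseudo-code earlier in the thread, not the exact code; the per-thread calculation, cycle count and launch configuration are placeholders):
#include <cstdio>

__global__ void subKernel(long long unsigned int *varCounter)
{
    // placeholder for the real per-thread calculation
    int varResult = (blockIdx.x == 0 && threadIdx.x == 0) ? 1 : 0;
    if (varResult > 0)
        (*varCounter)++; // see the note below about unprotected access
}

int main()
{
    long long unsigned int *varCounter = NULL;
    cudaMallocManaged(&varCounter, sizeof(long long unsigned int));

    const int varCycleCount = 4; // placeholder cycle count
    for (int i = 0; i < varCycleCount; i++) {
        *varCounter = 0;                     // host access: no kernel is in flight here
        subKernel<<<2048, 512>>>(varCounter);
        cudaDeviceSynchronize();             // must complete before the host touches varCounter again
        if (*varCounter > 0)
            printf("cycle %d: candidate(s) found\n", i);
    }
    cudaFree(varCounter);
    return 0;
}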
As an aside, multi-thread unprotected access to a single memory location may lead to unpredictable behavior:
(*varCounter)++;
If you are expecting that variable to track the actual number of times device code attempted to increment it (i.e. to actually keep an accurate count), that is unreliable; simply expecting it to be non-zero when one or more threads touch it is probably safe (barring overflow). So if you do not actually need to keep count, but only use it as a flag that one or more threads indicated a “candidate find”, then this is safe and predictable:
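For illustration only (a sketch; the per-thread calculation is a placeholder, and subKernelCounted is just a hypothetical name for the counted variant):
// As a flag only: every writer stores the same value, so the race is benign
__global__ void subKernel(int *varFlag)
{
    int varResult = /* per-thread calculation */ 0; // placeholder
    if (varResult > 0)
        *varFlag = 1;
}

// If an accurate count were ever needed, an atomic update would be used instead
// (64-bit atomicAdd on unsigned long long is supported on your cc 6.1 device)
__global__ void subKernelCounted(long long unsigned int *varCounter)
{
    int varResult = /* per-thread calculation */ 0; // placeholder
    if (varResult > 0)
        atomicAdd(varCounter, 1ULL);
}
In the flag case the host simply resets the value to zero before each launch and tests it for non-zero after the cudaDeviceSynchronize(), as in the loop sketch above.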
I understand the number might NOT be accurate, and that is fine. I started off with the notion of just a simple boolean flag set to true whenever a result needs to be examined by the CPU and saved to storage. I guess I will convert the long long unsigned int back to a bool and follow your suggestion - that is ultimately what I set out to do anyway :)