I am back and still struggling with memory access errors for some reason. For varGpuBlockSize = 256 * 512 = 131072 (256 blocks, 512 threads per block), with varDelimiterSize = 66 (length of the bit sequence) and parallelisation factor varGpuBlockMultiplier = 64, I allocate 64 * 131072 * 66 = 553,648,128 bool entries (one per bit of the sequences); since sizeof(bool) is 1 byte, that is about 528 MB of managed memory.
bool * varSequenceDelimiter = NULL;
cudaMallocManaged(&varSequenceDelimiter, varGpuBlockMultiplier * varGpuBlockSize * varDelimiterSize * sizeof(bool));
subIntializeDelimiters <<<varGpuBlockCount, varGpuThreadCount>>> (varSequenceDelimiter, varDelimiterSize, varSpaceSizeReserved, varGpuBlockCount, varGpuThreadCount, varGpuBlockSize, varGpuBlockMultiplier);
cudaDeviceSynchronize();
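For reference, this is how I would wrap those calls with return-code checks (a minimal sketch; the CHECK_CUDA macro name is just something I made up for this post, and it assumes it runs inside a function returning int):

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: print the CUDA error string and bail out if a
// runtime call does not return cudaSuccess.
#define CHECK_CUDA(call) \
    do { \
        cudaError_t varErr = (call); \
        if (varErr != cudaSuccess) { \
            printf("CUDA error '%s' at %s:%d\n", cudaGetErrorString(varErr), __FILE__, __LINE__); \
            return 1; \
        } \
    } while (0)

CHECK_CUDA(cudaMallocManaged(&varSequenceDelimiter, varGpuBlockMultiplier * varGpuBlockSize * varDelimiterSize * sizeof(bool)));
subIntializeDelimiters <<<varGpuBlockCount, varGpuThreadCount>>> (varSequenceDelimiter, varDelimiterSize, varSpaceSizeReserved, varGpuBlockCount, varGpuThreadCount, varGpuBlockSize, varGpuBlockMultiplier);
CHECK_CUDA(cudaGetLastError());      // catches launch-time errors
CHECK_CUDA(cudaDeviceSynchronize()); // catches errors raised while the kernel runs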
The kernel is very simple: each thread just loops over the parallel passes, computes its element number and position start, and prints them; it does not modify the memory contents at all.
__global__ void subIntializeDelimiters(bool * varSequenceDelimiter, int varDelimiterSize, int varSpaceSizeReserved, int varGpuBlockCount, int varGpuThreadCount, int varGpuBlockSize, int varGpuBlockMultiplier)
{
    // repeat the process once per parallel pass (varGpuBlockMultiplier passes in total)
    for (int varGpuBlock = 0; varGpuBlock < varGpuBlockMultiplier; varGpuBlock++)
    {
        // global element number for this thread within this pass
        unsigned long long int varElementNumber = varGpuThreadCount * blockIdx.x + threadIdx.x + varGpuBlockSize * varGpuBlock;
        // starting position of this element within the allocation
        unsigned long long int varPositionStart = varDelimiterSize * varElementNumber;
        printf("\n[B%d,T%d,BC%d] - position start: %llu for element: %llu", blockIdx.x, threadIdx.x, varGpuBlock, varPositionStart, varElementNumber);
    }
}
There are 256 * 512 = 131072 threads in the grid, so the thread number range is [0;131071], calculated as: varGpuThreadCount * blockIdx.x + threadIdx.x
thread number min => 512 * 0 + 0 = 0 (OK)
thread number max => 512 * 255 + 511 = 131071 (OK)
The varElementNumber (element number) is calculated as a function of the thread number and the parallelisation pass varGpuBlock: (varGpuThreadCount * blockIdx.x + threadIdx.x) + varGpuBlockSize * varGpuBlock. With a parallelisation factor of 64, there are 131072 * 64 = 8,388,608 elements in total to examine, ranging over [0;8,388,607], as the per-pass checks below (and the small CPU sketch after them) confirm:
pass 1 start => (512 * 0 + 0) + 131072 * 0 = 0 (OK)
pass 1 end => (512 * 255 + 511) + 131072 * 0 = 131071 (OK)
pass 2 start => (512 * 0 + 0) + 131072 * 1 = 131072 (OK)
pass 2 end => (512 * 255 + 511) + 131072 * 1 = 262143 (OK)
pass 3 start => (512 * 0 + 0) + 131072 * 2 = 262144 (OK)
pass 3 end => (512 * 255 + 511) + 131072 * 2 = 393215 (OK)
…
pass 64 end => (512 * 255 + 511) + 131072 * 63 = 8,388,607 (OK)
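To double-check this numbering off the GPU, here is a throwaway CPU-side snippet (my own sanity check, not part of the actual program) that reproduces the same formula for the extreme cases:

#include <cstdio>

int main()
{
    const long long varGpuThreadCount = 512;
    const long long varGpuBlockCount = 256;
    const long long varGpuBlockSize = varGpuThreadCount * varGpuBlockCount; // 131072
    const long long varGpuBlockMultiplier = 64;

    // first thread of the first pass and last thread of the last pass
    long long varMinElement = (varGpuThreadCount * 0 + 0) + varGpuBlockSize * 0;
    long long varMaxElement = (varGpuThreadCount * (varGpuBlockCount - 1) + (varGpuThreadCount - 1))
                            + varGpuBlockSize * (varGpuBlockMultiplier - 1);

    printf("elements range [%lld;%lld]\n", varMinElement, varMaxElement); // prints [0;8388607]
    return 0;
}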
Each element is 66 entries long (one bool per bit of the sequence), so the starting position for each element in each pass is varDelimiterSize * varElementNumber; the per-pass extremes (with a bound check after the list) are:
pass 1 start => 66 * [(512 * 0 + 0) + 131072 * 0] = 0 (OK)
pass 1 end => 66 * [(512 * 255 + 511) + 131072 * 0] = 8,650,686 (OK)
pass 2 start => 66 * [(512 * 0 + 0) + 131072 * 1] = 8,650,752 (OK)
pass 2 end => 66 * [(512 * 255 + 511) + 131072 * 1] = 17,301,438 (OK)
pass 3 start => 66 * [(512 * 0 + 0) + 131072 * 2] = 17,301,504 (OK)
pass 3 end => 66 * [(512 * 255 + 511) + 131072 * 2] = 25,952,190 (OK)
…
pass 64 end => 66 * [(512 * 255 + 511) + 131072 * 63] = 553,648,062 (OK)
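So the last element starts at 553,648,062 and spans 66 entries, ending at index 553,648,127, which is exactly the last valid index of the 553,648,128-entry allocation. A one-line compile-time check of that bound (again just my own sanity check):

// highest index touched = start of last element + 65; must equal
// allocation size - 1 for the range to fit exactly
static_assert(66LL * 8388607LL + 65LL == 553648128LL - 1LL, "element range fits the allocation");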
The math for calculating positions within the 553,648,128-entry block allocated for the kernel therefore works out fine. However, when running the code with Nsight and the memory checker enabled, this is all I get:
CUDA context created : 1dd4917bea0
CUDA module loaded: 1dd5dd77b20 kernel.cu
CUDA grid launch failed: CUcontext: 2049925693088 CUmodule: 2050273803040 Function: _Z22subIntializeDelimitersPbiiiiii
CUDART error: cudaLaunch returned cudaErrorLaunchFailure
CUDART error: cudaDeviceSynchronize returned cudaErrorLaunchFailure
CUDART error: cudaGetLastError returned cudaErrorLaunchFailure
Here is the output from cout:
=========================
GPU Device 0: "GeForce GTX 1080" with compute capability 6.1
Device 0 GeForce GTX 1080 with Compute 6.1 capabilities will be used
CUDA kernel launch (initialize delimiters) with 256 blocks and 512 threads per block, thread block multiplier 64
[B39,T32,BC0] - position start: 1320000 for element: 20000
[B39,T33,BC0] - position start: 1320066 for element: 20001
(… truncated, long output … )
[B26,T157,BC0] - position start: 888954 for element: 13469
[B26,T158,BC0] - position start: 889020 for element: 13470
[B26,T159,BC0] - position start: 889086 for element: 13471
The position start and element number calculations are correct, and since the kernel does not even directly access the memory allocated to it, there should be no overrun problem at all. However, the kernel still returns a launch failure for some reason.
Any ideas what the issue might be?
Thanks!