This can be seen in the top-right corner of each screenshot.
First screenshot: Grid(32,1,1), BlockDim(64,1,1)
Second screenshot: Grid(896,1,1), BlockDim(64,1,1)
Hmm, yes, you are right, it is incorrect to discuss the problem using empty kernels. OK, I switched to Debug mode and ran my full code:
__global__ void processIIR(float* samples, IIRBandFactors* bandFactors, IIRBandData* bandData, float* afc)
{
    int blockId = blockIdx.x;
    int threadId = threadIdx.x;
    // One thread per filter band.
    int bandId = blockId*IIR_THREAD_NUM + threadId;

    // Each thread copies its slice of the input samples into shared memory.
    __shared__ float sampleBlock[IIR_SAMPLE_BUFFER_SIZE];
    for (int i = 0; i < IIR_THREADS_SHARED_COPY_SIZE; i++)
    {
        int idx = threadId*IIR_THREADS_SHARED_COPY_SIZE + i;
        sampleBlock[idx] = samples[idx];
    }
    __syncthreads();

    // Load this band's coefficients and per-cascade filter state.
    IIRBandFactors bf = bandFactors[bandId];
    IIRBandData bda[IIR_CASCADES_NUM];
    for (int i = 0; i < IIR_CASCADES_NUM; i++)
    {
        int idx = bandId*IIR_CASCADES_NUM + i;
        bda[i] = bandData[idx];
    }

    // Run every sample through the cascade of second-order sections.
    for (int s = 0; s < IIR_SAMPLE_BUFFER_SIZE; s++)
    {
        float sample = sampleBlock[s];
        float sampler[IIR_CASCADES_NUM + 1];
        sampler[0] = sample;
        for (int c = 0; c < IIR_CASCADES_NUM; c++)
        {
            float result = bf.fFC[0] * sampler[c] + bda[c].dd0;
            bda[c].dd0 = bf.fFC[1] * sampler[c] - bf.fFC[3] * result + bda[c].dd1;
            bda[c].dd1 = bf.fFC[2] * sampler[c] - bf.fFC[4] * result;
            sampler[c + 1] = result;
        }
        int oid = bandId*IIR_SAMPLE_BUFFER_SIZE + s;
        afc[oid] = sampler[IIR_CASCADES_NUM];
    }

    // Store the updated filter state for the next call.
    for (int i = 0; i < IIR_CASCADES_NUM; i++)
    {
        int idx = bandId*IIR_CASCADES_NUM + i;
        bandData[idx] = bda[i];
    }
}
Host run command:
processIIR<<<IIR_BLOCK_NUM, IIR_THREAD_NUM>>>(cuda_iir_sampleBuffer, cuda_iir_factors, cuda_iir_data, cuda_iir_afc);
Defines:
#define IIR_SAMPLE_BUFFER_SIZE 1024
#define IIR_CASCADES_NUM 4
#define IIR_BANDS_NUM 1024
#define IIR_BLOCK_NUM 16
#define IIR_THREAD_NUM 64
#define IIR_THREADS_SHARED_COPY_SIZE 16
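To show how these defines relate to each other, here is a small host-side sketch in plain C++. The names kSharedBytesPerBlock and kWarpsPerBlock are mine (they are not in my project), and it assumes a 4-byte float and a warp size of 32:

```cpp
#include <cstddef>

#define IIR_SAMPLE_BUFFER_SIZE 1024
#define IIR_CASCADES_NUM 4
#define IIR_BANDS_NUM 1024
#define IIR_BLOCK_NUM 16
#define IIR_THREAD_NUM 64
#define IIR_THREADS_SHARED_COPY_SIZE 16

// The launch must create exactly one thread per band.
static_assert(IIR_BLOCK_NUM * IIR_THREAD_NUM == IIR_BANDS_NUM,
              "grid size * block size must equal the number of bands");

// The threads of one block together must copy the whole sample buffer.
static_assert(IIR_THREAD_NUM * IIR_THREADS_SHARED_COPY_SIZE == IIR_SAMPLE_BUFFER_SIZE,
              "shared-memory copy must cover the whole sample buffer");

// Hypothetical helper constants (assumption: sizeof(float) == 4, warp size 32).
constexpr std::size_t kSharedBytesPerBlock = IIR_SAMPLE_BUFFER_SIZE * sizeof(float);
constexpr int kWarpsPerBlock = IIR_THREAD_NUM / 32;
```

With these values each block uses 4096 bytes of shared memory and contains only 2 warps, so the whole launch is 16 blocks of 2 warps each.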
Structs:
struct IIRBandFactors
{
    float fFC[5];
};
struct IIRBandData
{
    float dd0 = 0.0f;
    float dd1 = 0.0f;
};
- Should I align these structures in memory somehow? Or, if a structure contains only floats, will everything be aligned well automatically? :-)
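To make the alignment question concrete, here is a host-side sketch (plain C++, assuming a 4-byte float) of what the compiler gives these structs by default. IIRBandDataAligned is a hypothetical variant that is not in my code; in CUDA the same thing would be spelled with `__align__(8)` instead of `alignas(8)`:

```cpp
#include <cstddef>

struct IIRBandFactors
{
    float fFC[5];
};

struct IIRBandData
{
    float dd0 = 0.0f;
    float dd1 = 0.0f;
};

// Hypothetical variant (not in the original code): same fields, 8-byte aligned.
struct alignas(8) IIRBandDataAligned
{
    float dd0 = 0.0f;
    float dd1 = 0.0f;
};

// By default both structs only get the alignment of their float members.
static_assert(sizeof(IIRBandFactors) == 5 * sizeof(float), "no padding expected");
static_assert(alignof(IIRBandFactors) == alignof(float), "float alignment only");
static_assert(sizeof(IIRBandData) == 2 * sizeof(float), "no padding expected");
static_assert(alignof(IIRBandData) == alignof(float), "float alignment only");

// The explicit specifier raises the alignment without changing the size.
static_assert(sizeof(IIRBandDataAligned) == sizeof(IIRBandData), "same size");
static_assert(alignof(IIRBandDataAligned) == 8, "8-byte aligned");
```

So the floats themselves are aligned automatically, but only to 4 bytes. As far as I understand, an 8-byte-aligned struct of two floats may let the compiler load it with one 64-bit access instead of two 32-bit ones; I have not measured whether that matters here.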
This is an Infinite Impulse Response (IIR) filter from audio processing software. In the kernel, “float* samples” is just a block of 1024 samples (float values), and every filter band must process all of them (int bandId = blockId*IIR_THREAD_NUM + threadId;). That is, each band works independently on the entire set of samples.
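For reference, the per-band work described above can be sketched on the host in plain C++. This is only a minimal single-band version of the same cascade the kernel runs, useful for checking GPU output; processBandReference and kCascades are names I made up for the sketch:

```cpp
#include <cstddef>
#include <vector>

constexpr int kCascades = 4;  // mirrors IIR_CASCADES_NUM

struct IIRBandFactors
{
    float fFC[5];
};

struct IIRBandData
{
    float dd0 = 0.0f;
    float dd1 = 0.0f;
};

// Hypothetical CPU reference: runs all samples through one band's cascade,
// updating the per-cascade state in place, exactly as one GPU thread does.
std::vector<float> processBandReference(const std::vector<float>& samples,
                                        const IIRBandFactors& bf,
                                        IIRBandData (&state)[kCascades])
{
    std::vector<float> out(samples.size());
    for (std::size_t s = 0; s < samples.size(); ++s)
    {
        float x = samples[s];
        for (int c = 0; c < kCascades; ++c)
        {
            // Same three update equations as in the kernel.
            float y = bf.fFC[0] * x + state[c].dd0;
            state[c].dd0 = bf.fFC[1] * x - bf.fFC[3] * y + state[c].dd1;
            state[c].dd1 = bf.fFC[2] * x - bf.fFC[4] * y;
            x = y;  // output of this section feeds the next one
        }
        out[s] = x;
    }
    return out;
}
```

Calling it once per band with that band's factors and state should reproduce what the kernel writes to afc for that band, up to floating-point differences between CPU and GPU.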
Then I switched back to Release (it makes no sense to measure the time of the full code in Debug :-) ) and got the following results on my single GeForce GTX 1080 Ti:
- So, what’s wrong with occupancy? (or with me :-) )

Thank you for any help!