2-D Memory Allocation Issues

I’m having issues with CUDA allocating the wrong number of blocks to a grid. The code gives me the right answer, but when I run it, the profiler ignores my “N/2 x N/2” grid specification and sets up a 4x1 grid of blocks instead. Of course, this means the code runs VERY slowly (about 5 MFLOPS on a GTS 120). Any ideas?

Profiler output is attached as a JPEG. Thanks in advance for any answers.

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

__global__ void FILENAME(float *VAR_device, float *ANS_device, size_t pitch1, size_t pitch2, unsigned int stride, int N)
{
    extern __shared__ float data[];

    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;

    // Copy the left boundary column (x = 0) through unchanged
    for (x = 0; x < 1; x++)
    {
        for (y = 0; y < N; y++)
        {
            ANS_device[y*stride + x] = VAR_device[y*stride + x];
        }
    }

    // Copy the right boundary column (x = N-1) through unchanged
    for (x = N-1; x < N; x++)
    {
        for (y = 0; y < N; y++)
        {
            ANS_device[y*stride + x] = VAR_device[y*stride + x];
        }
    }

    // Apply the (0.25, 0.5, 0.25) stencil across the interior columns
    for (x = 1; x < N-1; x++)
    {
        for (y = 0; y < N; y++)
        {
            ANS_device[y*stride + x] = 0.25*VAR_device[y*stride + x - 1] + 0.5*VAR_device[y*stride + x] + 0.25*VAR_device[y*stride + x + 1];
        }
    }
}

int main()
{
    float *ANS_device, *VAR_device;
    int N = 32;
    int nIterations = 5000;
    float VAR_host[N][N], ANS_host[N][N];
    int dimA = N*N;
    int numThreadsPerBlock = 256;
    int numBlocks = dimA/numThreadsPerBlock;
    int sharedMemSize = numThreadsPerBlock;

    // Pitched 2-D allocations: rows are padded out to the returned pitch
    size_t memSize = N*sizeof(float);
    size_t pitch;
    cudaMallocPitch((void **)(&ANS_device), &pitch, memSize, N);
    cudaMallocPitch((void **)(&VAR_device), &pitch, memSize, N);

    // Convert the byte pitch into an element stride for indexing
    unsigned int stride = pitch/sizeof(float);

    // Initialize column 0 to a ramp; zero the rest of the matrix
    for (int j = 0; j < 1; j++)
    {
        for (int i = 0; i < N; i++)
        {
            VAR_host[i][j] = ((float)i+1)*5;
        }
    }

    for (int j = 1; j < N; j++)
    {
        for (int i = 0; i < N; i++)
        {
            VAR_host[i][j] = 0;
        }
    }

    cudaMemcpy2D(ANS_device, pitch, VAR_host, memSize, memSize, N, cudaMemcpyHostToDevice);

    dim3 dimGrid(N/2, N/2);
    dim3 dimBlock(2, 2);

    for (int k = 0; k < nIterations; k++)
    {
        FILENAME <<< numBlocks, dimBlock, sharedMemSize >>> (ANS_device, VAR_device, pitch, pitch, stride, N);

        // Swap the input and output buffers between iterations
        float *temp = ANS_device;
        ANS_device = VAR_device;
        VAR_device = temp;
    }

    cudaMemcpy2D(ANS_host, memSize, VAR_device, pitch, memSize, N, cudaMemcpyDeviceToHost);

    cudaFree(VAR_device);
    cudaFree(ANS_device);
}[/codebox]

Can you show more of the code (the actual kernel invocation would be good) and the profiler output?

Done. Thanks for checking this out.

Try changing numBlocks to dimGrid in your kernel invocation.
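Something like this, reusing the dimGrid you already declared (untested, but it should restore the 2-D launch you intended):

[codebox]dim3 dimGrid(N/2, N/2);   // a 2x2 block for every 2x2 tile of the NxN matrix
dim3 dimBlock(2, 2);

FILENAME <<< dimGrid, dimBlock, sharedMemSize >>> (ANS_device, VAR_device, pitch, pitch, stride, N);[/codebox]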

That worked… sort of. I changed numBlocks to dimGrid as suggested, then experimented with the values for dimGrid and dimBlock. The fastest configuration by far was:

dimBlock set to “dim3 dimBlock(256, 3)” (since I’ve read that the maximum number of threads per multiprocessor is 768), and

dimGrid set to “dim3 dimGrid(2, 2)” (since I have four 8-core multiprocessors on the GTS 120).
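(Rather than hard-coding those limits, I figure something like the following, dropped into main(), should read them straight off the card; it only uses the standard cudaGetDeviceProperties call, so treat it as an untested sanity-check sketch.)

[codebox]cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);                  // query device 0
printf("multiprocessors:       %d\n", prop.multiProcessorCount);
printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
printf("shared mem per block:  %u bytes\n", (unsigned)prop.sharedMemPerBlock);[/codebox]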

The apparent result is that a 64x64 matrix runs through 5000 iterations in 0.13 seconds, so I’ve just gotten 2 GFLOPS or so out of this code.

But then I hit new problems. First, the code outputs the same answer no matter how many iterations I call for (though the computation time does scale appropriately).

Second, though the code generally gives me a sensible answer when I change the number of blocks (though I don’t know how many iterations it actually completes to get that answer), it outputs garbage once “N” gets past the 60s/70s range. I understand that shared memory is limited to 16,384 bytes in some capacity, but a single-precision 72x72 matrix, for example, is definitely not going to overtax that storage space.

So… I’m scratching my head and chewing on this some more. If anyone can tell me what I’m doing wrong, it’d be much appreciated.
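In the meantime, I’m going to wrap the kernel launch in explicit error checks to see whether it’s silently failing at larger N. Something along these lines (a minimal sketch; the cudaThreadSynchronize is only there to surface errors that occur while the kernel runs):

[codebox]FILENAME <<< dimGrid, dimBlock, sharedMemSize >>> (ANS_device, VAR_device, pitch, pitch, stride, N);

// Catches launch-configuration problems (bad grid/block dims, too much shared mem)
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

// Waits for the kernel and catches errors raised during execution
err = cudaThreadSynchronize();
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));[/codebox]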