About the architecture of the NVIDIA GeForce 940MX?

Hi, all:

 The GPU in my laptop is an NVIDIA GeForce 940MX. Can anyone tell me how many multiprocessors (MPs) it has, how many blocks fit on one MP, and the maximum number of threads in one block? Also, please tell me how to get this information on my laptop.

Thanks a lot in advance!

Dawn

Google can tell you that.

Also, there is a CUDA sample code, deviceQuery, that reports a lot of useful information.
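If you'd rather query those limits programmatically than run the sample, the CUDA runtime API exposes the same numbers through cudaGetDeviceProperties. A minimal sketch (assuming device 0 is the 940MX):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // query device 0
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("Multiprocessors (SMs):          %d\n", prop.multiProcessorCount);
    printf("Max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block:          %d\n", prop.maxThreadsPerBlock);
    printf("Compute capability:             %d.%d\n", prop.major, prop.minor);
    return 0;
}
```

Note that the maximum number of resident blocks per SM is not a struct field; deviceQuery derives it from the compute capability.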

Hi,

  Thank you very much for your instructions. After running deviceQuery, I got these results:

    ( 4) Multiprocessors, (128) CUDA Cores/MP:     512 CUDA Cores
    Maximum number of threads per multiprocessor:  2048
    Maximum number of threads per block:           1024


  Does this mean each MP can hold 2 blocks?


  If yes, I want to use those 4 MPs with a total of 8 blocks to access a matrix, x[101][50]. The relevant part of the code is:

#defube size 101*50
#define thrds 992
#define blocks size/thrds

__global__ void initzero(cufftComplex *X) {

int idx = blockIdx.x * blockDim.x + threadIdx.x;	


while (idx < size ) {

	X[idx].x = 0.0f;
	X[idx].y = 0.0f;
	idx += blockDim.x*gridDim.x;		
}

}

cufftComplex *f_peak;
checkCudaErrors(cudaMalloc((void **)&f_peak, sizeof(cufftComplex)*size));

initzero <<<blocks, thrds >>> (f_peak);


My result: only the first 631 elements of the matrix are initialized to "0", while the rest all show the same uninitialized value.

What is wrong?


Thank you very much one more time!

Dawn

Yes, to get the maximum thread complement on a SM, it is necessary to have more than 1 block resident on the SM.

This isn’t valid code:

#defube size 101*50

This doesn’t give you 8 blocks:

#define blocks size/thrds

(just do the math on your calculator - I get 5)

I wouldn’t be able to tell what else may be wrong without seeing the complete code.

Is it that hard to provide the complete code? You’ve provided most of it already; I think it would only require an extra 5-10 posted lines. Why not just provide that?

Hi,

  Thank you for your guidance and instructions. Here is the complete code:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include "cufft.h"

#include <stdio.h>

#define xsize 101
#define ysize 50
#define size xsize*ysize
#define thrds 1024
#define blocks 5
#define grid 1

typedef float2 Complex;

__global__ void initzero(cufftComplex *X) {

int idx = blockIdx.x * blockDim.x + threadIdx.x;	

while (idx < size) {

	X[idx].x = 0.0f;
	X[idx].y = 0.0f;
	idx += blockDim.x*gridDim.x;		
}

}

int main() {

Complex data[101][50];
cufftComplex *d_fft;

cudaMalloc((void **)&d_fft, sizeof(cufftComplex)* size);

initzero <<<blocks, thrds >>> (d_fft);

cudaMemcpy(data, d_fft, size, cudaMemcpyDeviceToHost);

// check initialized data
for (int i = 0; i < xsize; i++) {
	printf("i=%d \n", i);
	for (int j = 0; j < ysize; j++) {
		printf("j=%d  %e ", j, data[i][j].x);
	}
	printf("\n");
}

getchar();

return 0;

}


  I am trying to use the 1D array d_fft on the device and copy its initialized values to the 2D array data on the host. Both d_fft and data have the same number of elements.

  The results show that the first 650 elements (13 rows of 50 columns), plus the first 32 columns of the 14th row, 682 in total, are initialized to "0", while the rest all show the same uninitialized value.


   I think I might be using the wrong thread and block counts to launch the kernel.


   Thank you very much!

Dawn

This line of code is incorrect:

cudaMemcpy(data, d_fft, size, cudaMemcpyDeviceToHost);
                        ^^^^

The size argument to cudaMemcpy is always given in bytes. Therefore, to transfer size complex quantities, you must multiply size by the size of each element:

cudaMemcpy(data, d_fft, size*sizeof(cufftComplex), cudaMemcpyDeviceToHost);

When I make that change, your code displays all zeroes for me.

OK, it works for me.

Thank you very much for your guidance and instructions one more time!

Dawn