About the architecture of the NVIDIA GeForce 940MX?

Hi, all:

 The GPU in my laptop is an NVIDIA GeForce 940MX. Can anyone tell me how many multiprocessors (MPs) it has, how many blocks fit on one MP, and the maximum number of threads in one block? Also, please tell me how to get this information on my laptop.

Thanks a lot in advance!

Dawn

Google can tell you that.

Also, there is a CUDA sample code, deviceQuery, that reports a lot of useful information.
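If you'd rather query those limits programmatically than run the sample, the CUDA runtime API exposes the same numbers through cudaGetDeviceProperties. A minimal sketch (assuming device 0 is the 940MX):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // query device 0
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("Multiprocessors (SMs):          %d\n", prop.multiProcessorCount);
    printf("Max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block:          %d\n", prop.maxThreadsPerBlock);
    printf("Compute capability:             %d.%d\n", prop.major, prop.minor);
    return 0;
}
```

Note that the maximum number of resident blocks per SM is not a struct field; deviceQuery derives it from the compute capability.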

Hi,

  Thank you very much for your instructions. After running deviceQuery, I got these results:

    ( 4) Multiprocessors, (128) CUDA Cores/MP:     512 CUDA Cores
    Maximum number of threads per multiprocessor:  2048
    Maximum number of threads per block:           1024


  Does this mean each MP can hold 2 blocks?


  If yes, I want to use those 4 MPs with a total of 8 blocks to access a matrix, x[101][50]. The relevant part of the code is:

#defube size 101*50
#define thrds 992
#define blocks size/thrds

__global__ void initzero(cufftComplex *X) {

int idx = blockIdx.x * blockDim.x + threadIdx.x;	


while (idx < size ) {

	X[idx].x = 0.0f;
	X[idx].y = 0.0f;
	idx += blockDim.x*gridDim.x;		
}

}

cufftComplex *f_peak;
checkCudaErrors(cudaMalloc((void **)&f_peak, sizeof(cufftComplex)*size));

initzero <<<blocks, thrds >>> (f_peak);


My result: only the first 631 elements of the matrix are initialized to "0", while the rest all show the same uninitialized value.

What is wrong?


Thank you very much one more time!

Dawn

Yes, to get the maximum thread complement on a SM, it is necessary to have more than 1 block resident on the SM.

This isn’t valid code:

#defube size 101*50

This doesn’t give you 8 blocks:

#define blocks size/thrds

(just do the math on your calculator - I get 5)

I wouldn’t be able to tell what else may be wrong without seeing the complete code.

Is it that hard to provide the complete code? You’ve provided most of it already; I think it would only require an extra 5-10 posted lines. Why not just provide that?

Hi,

  Thank you for your guidance and instructions. Here is the complete code:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include "cufft.h"

#include <stdio.h>

#define xsize 101
#define ysize 50
#define size xsize*ysize
#define thrds 1024
#define blocks 5
#define grid 1

typedef float2 Complex;

__global__ void initzero(cufftComplex *X) {

int idx = blockIdx.x * blockDim.x + threadIdx.x;	

while (idx < size) {

	X[idx].x = 0.0f;
	X[idx].y = 0.0f;
	idx += blockDim.x*gridDim.x;		
}

}

int main() {

Complex data[101][50];
cufftComplex *d_fft;

cudaMalloc((void **)&d_fft, sizeof(cufftComplex)* size);

initzero <<<blocks, thrds >>> (d_fft);

cudaMemcpy(data, d_fft, size, cudaMemcpyDeviceToHost);

// check initialized data
for (int i = 0; i < xsize; i++) {
	printf("i=%d \n", i);
	for (int j = 0; j < ysize; j++) {
		printf("j=%d  %e ", j, data[i][j].x);
	}
	printf("\n");
}

getchar();

return 0;

}


  I am trying to use the 1D array d_fft on the device and copy its initialized values to the 2D array data on the host. Both d_fft and data have the same number of elements.

  The results show that the first 650 elements (13 rows of 50 columns), plus the first 32 columns of the 14th row, 682 in total, are initialized to "0", while the rest all show the same uninitialized value.


   I think I might be using the wrong thread and block counts to launch the kernel.


   Thank you very much!

Dawn

This line of code is incorrect:

cudaMemcpy(data, d_fft, size, cudaMemcpyDeviceToHost);
                        ^^^^

The size argument to cudaMemcpy is always given in bytes. Therefore, to transfer size complex quantities, you must multiply size by the size of each element:

cudaMemcpy(data, d_fft, size*sizeof(cufftComplex), cudaMemcpyDeviceToHost);

When I make that change, your code displays all zeroes for me.

OK, it works for me.

Thank you very much for your guidance and instructions one more time!

Dawn