The timings I got by commenting out subsequent code show that assigning the automatic variable count to item_cnt[threadIdx.x] causes a terrible performance drop, which should not happen, because count is just an automatic variable and I don't think it should be spilled to local memory. Even if it were placed in local memory, there should not be such a huge drop in terms of access coalescing. It seems like a register problem, but I can't figure it out, so I hope someone can help me.
item_id[threadIdx.x] = item_idx;
item_cnt_rec[threadIdx.x] = 0;
__syncthreads();
// takes 13.629280 ms when count is assigned to item_cnt[threadIdx.x]
// takes 0.431072 ms when zero is assigned to item_cnt[threadIdx.x]
It takes 0.429952 ms if I assign item_idx to item_cnt[threadIdx.x], similar to assigning zero!
Both item_idx and count are automatic variables; the difference between them is that one is initialized when it is defined and never changes, while the other is modified frequently. I have no idea why there is such a remarkable difference.
There is quite a lot of optimization going on at compile time and if you don’t use a variable, it’s quite possible that the code that produced that variable is discarded.
But if the code which uses count never writes anything back to global memory, then the compiler can eliminate it along with all the code which uses it.
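For example, here is a hypothetical sketch of that effect (not your kernel; dce_demo and d_out are made-up names):

// hypothetical kernel to illustrate dead-code elimination (names are made up)
__global__ void dce_demo(unsigned int* d_out)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    unsigned int count = 0;
    for (unsigned int i = 0; i < 232; i++)
        for (unsigned int j = 0; j < 512; j++)
            count++;              // expensive work ...

    d_out[idx] = idx;             // ... whose result is never stored anywhere,
                                  // so the compiler is free to delete the whole loop
    // writing 'd_out[idx] = count;' instead forces the loop (or at least its
    // folded result) to survive
}

One way to check is to compile with nvcc -ptx and look at whether the loop is still present in the generated PTX.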
According to your advice, I rewrote a small kernel in which variables like 'count' are written back to global memory and ran some tests. The timings I got are very interesting: they show that the more an automatic variable is modified, the more expensive accessing it becomes. I don't know much about the details of how a CUDA program is compiled and executed, so I don't know why this happens, but I really want to figure it out, because it is a performance bottleneck in my program.
Here is the code and the timings I got. You can simply copy it and run it on your device to confirm what I said; let's share our experiments and try to find out the cause.
extern __shared__ int tran_buffer[];
extern __shared__ int item_cnt[]; // item_cnt[3*blockDim.x]
int* item_id = &item_cnt[blockDim.x];
int* item_cnt_rec = &item_id[blockDim.x];
unsigned int item_idx = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int count = 0, address = 0;
unsigned int i, j;
for(i = 0; i < 232; i++)
    for(j = 0; j < 512; j++)
        count++;

d_test_count[item_idx] = address; // alternatives tried: j; i; 0; item_idx; count;
// takes 6.296416 ms when count is assigned to d_test_count[item_idx]
// takes 6.292064 ms when j is assigned to d_test_count[item_idx]
// takes 1.016544 ms when i is assigned to d_test_count[item_idx]
// takes 0.307712 ms when item_idx is assigned to d_test_count[item_idx]
// takes 0.308480 ms when zero is assigned to d_test_count[item_idx]
// takes 0.308384 ms when address is assigned to d_test_count[item_idx]
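For reference, this is roughly the harness I run it in (the kernel name test_count and the 4 x 256 launch configuration are only placeholders for this sketch, and the shared-memory arrays are left out because they do not matter for this particular test):

// self-contained sketch of the test (names and launch configuration are assumed)
__global__ void test_count(unsigned int* d_test_count)
{
    unsigned int item_idx = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int count = 0;
    unsigned int i, j;
    for (i = 0; i < 232; i++)
        for (j = 0; j < 512; j++)
            count++;
    d_test_count[item_idx] = count; // swap in j / i / item_idx / 0 here to reproduce the timings
}

int main()
{
    const int num_blocks = 4, num_threads = 256;   // 1024 threads in total (assumed split)
    const int n = num_blocks * num_threads;

    unsigned int* d_test_count;
    cudaMalloc((void**)&d_test_count, n * sizeof(unsigned int));

    test_count<<<num_blocks, num_threads>>>(d_test_count);

    unsigned int h_test_count[n];                  // the cudaMemcpy below also synchronizes
    cudaMemcpy(h_test_count, d_test_count, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_test_count);
    return 0;
}

Changing only the value stored on the last line of the kernel reproduces the timing differences listed above.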
The write to the global variable will probably be issued as soon as the variable stops changing (the write is moved forward for values that are assigned before the double for loop).
I must say that it is completely unclear what you are trying to do, and what you are trying to optimize. I am afraid you are ‘profiling’ in a completely unreliable way, so you draw conclusions that are incorrect.
Maybe you are right, but it is very abnormal that it takes about 6 ms to write 1024 automatic variables that were frequently modified beforehand to global memory, compared with less than 0.1 ms to write back 1024 unchanged automatic variables, or about 0.3 ms to run a 232*512 for loop.
There is a performance problem in my program, so I ran some tests and found that the prime suspect seems to be assigning a frequently modified automatic variable to another automatic variable. (The same thing happens when I write a frequently modified automatic variable to global or shared memory.) The version of the program that contains such an operation runs much slower on the GPU than the original CPU version, if I don't do anything more.
Maybe I am not using CUDA correctly, or maybe there is a bug. Anyway, I just want to find out why this happens; that is what I am doing.
I time my CUDA program using the cudaEvent API. I can't confirm whether it is reliable; if you have any idea how to verify it, please give me your suggestion.
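Roughly, this is the pattern I use (a simplified, self-contained sketch; dummy_kernel and the 4 x 256 launch are just placeholders for the real kernel):

#include <cstdio>

__global__ void dummy_kernel(unsigned int* d_out)  // stand-in for the real kernel
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    d_out[idx] = idx;
}

int main()
{
    const int n = 1024;
    unsigned int* d_out;
    cudaMalloc((void**)&d_out, n * sizeof(unsigned int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    dummy_kernel<<<4, 256>>>(d_out);      // the launch being timed
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);           // block until the kernel and the stop event complete

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}

As far as I understand, recording the events right around the launch like this should measure only the kernel execution time, but please correct me if that is not reliable.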