Tuesday, December 2, 2014

Resolution of performance measurement using RDTSC, CPUID, and RDTSCP

Description

  • Based on the article How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures
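
The article's core idea is to bracket the measured code with CPUID + RDTSC at the start and RDTSCP + CPUID at the end, so that out-of-order execution cannot leak work into or out of the timed region. Below is a minimal sketch of that pattern. It assumes x86-64, GCC or Clang inline assembly, and user space; the logs in this post come from a kernel module, where preemption and interrupts can additionally be disabled. The names tsc_begin, tsc_end, and min_overhead are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#if !defined(__x86_64__)
#error "This sketch assumes an x86-64 CPU with RDTSCP support"
#endif

/* Start of the timed region: CPUID serializes, then RDTSC reads the counter. */
static inline uint64_t tsc_begin(void) {
    uint32_t hi, lo;
    __asm__ __volatile__(
        "xor %%eax, %%eax\n\t"   /* CPUID leaf 0; only the serialization matters */
        "cpuid\n\t"
        "rdtsc\n\t"
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        : "=r"(hi), "=r"(lo)
        :
        : "rax", "rbx", "rcx", "rdx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* End of the timed region: RDTSCP waits for earlier instructions to finish,
 * and the trailing CPUID keeps later instructions from starting early. */
static inline uint64_t tsc_end(void) {
    uint32_t hi, lo;
    __asm__ __volatile__(
        "rdtscp\n\t"
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        "xor %%eax, %%eax\n\t"
        "cpuid\n\t"
        : "=r"(hi), "=r"(lo)
        :
        : "rax", "rbx", "rcx", "rdx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Smallest cost of an empty timed region over `reps` tries, in TSC ticks. */
static uint64_t min_overhead(unsigned reps) {
    assert(reps > 0);
    uint64_t best = UINT64_MAX;
    for (unsigned i = 0; i < reps; i++) {
        uint64_t t0 = tsc_begin();
        uint64_t t1 = tsc_end();
        if (t1 - t0 < best) {
            best = t1 - t0;
        }
    }
    return best;
}
```

Calling min_overhead(1000) gives the cost of the timing pair itself, which is presumably what the loop_size:0 rows in the results report as min time.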

Download

Results

  • Each loop run 1000 times, with most features that could affect the measurement disabled
    Loading hello module...
    loop_size:0 >>>> variance(cycles): 3; max_deviation: 8 ;min time: 44
    loop_size:1 >>>> variance(cycles): 3; max_deviation: 28 ;min time: 44
    loop_size:2 >>>> variance(cycles): 3; max_deviation: 12 ;min time: 44
    loop_size:3 >>>> variance(cycles): 5; max_deviation: 40 ;min time: 44
    loop_size:4 >>>> variance(cycles): 4; max_deviation: 32 ;min time: 44
    loop_size:5 >>>> variance(cycles): 5; max_deviation: 32 ;min time: 44
    loop_size:6 >>>> variance(cycles): 6; max_deviation: 48 ;min time: 44
    loop_size:7 >>>> variance(cycles): 1; max_deviation: 32 ;min time: 48
    loop_size:8 >>>> variance(cycles): 4; max_deviation: 20 ;min time: 48
    loop_size:9 >>>> variance(cycles): 7; max_deviation: 48 ;min time: 48
    loop_size:10 >>>> variance(cycles): 5; max_deviation: 32 ;min time: 48
    loop_size:11 >>>> variance(cycles): 10; max_deviation: 84 ;min time: 48
    .........
    .........
    loop_size:994 >>>> variance(cycles): 1922; max_deviation: 1388 ;min time: 2028
    loop_size:995 >>>> variance(cycles): 0; max_deviation: 0 ;min time: 2032
    loop_size:996 >>>> variance(cycles): 1923; max_deviation: 1388 ;min time: 2032
    loop_size:997 >>>> variance(cycles): 0; max_deviation: 0 ;min time: 2036
    loop_size:998 >>>> variance(cycles): 3; max_deviation: 4 ;min time: 2036
    loop_size:999 >>>> variance(cycles): 1815; max_deviation: 1348 ;min time: 2040

    total number of spurious min values = 0
    total variance = 2520492
    absolute max deviation = 1144364
    variance of variances = 17554753199565
    variance of minimum values = 335594

  • Each loop run 1000000 times, with most features that could affect the measurement disabled
    Loading hello module...
    loop_size:0 >>>> variance(cycles): 809; max_deviation: 23816 ;min time: 44
    loop_size:1 >>>> variance(cycles): 405; max_deviation: 19300 ;min time: 44
    loop_size:2 >>>> variance(cycles): 41; max_deviation: 4992 ;min time: 44
    loop_size:3 >>>> variance(cycles): 13; max_deviation: 1920 ;min time: 44
    loop_size:4 >>>> variance(cycles): 6300; max_deviation: 65320 ;min time: 44
    loop_size:5 >>>> variance(cycles): 378; max_deviation: 19012 ;min time: 44
    loop_size:6 >>>> variance(cycles): 2512; max_deviation: 46956 ;min time: 44
    loop_size:7 >>>> variance(cycles): 14308; max_deviation: 109424 ;min time: 48
    loop_size:8 >>>> variance(cycles): 128449; max_deviation: 357728 ;min time: 48
    loop_size:9 >>>> variance(cycles): 1696; max_deviation: 40980 ;min time: 48
    loop_size:10 >>>> variance(cycles): 834; max_deviation: 22336 ;min time: 48
    loop_size:11 >>>> variance(cycles): 4143; max_deviation: 63780 ;min time: 48
    .........
    .........
    loop_size:994 >>>> variance(cycles): 914214; max_deviation: 668016 ;min time: 2028
    loop_size:995 >>>> variance(cycles): 1596810; max_deviation: 728892 ;min time: 2032
    loop_size:996 >>>> variance(cycles): 1775690; max_deviation: 866988 ;min time: 2032
    loop_size:997 >>>> variance(cycles): 2589904; max_deviation: 984516 ;min time: 2036
    loop_size:998 >>>> variance(cycles): 957907; max_deviation: 677884 ;min time: 2036
    loop_size:999 >>>> variance(cycles): 1254143; max_deviation: 748936 ;min time: 2040

    total number of spurious min values = 4
    total variance = 2631291
    absolute max deviation = 246593400
    variance of variances = 17487031211352
    variance of minimum values = 335929

  • Each loop run 1000000 times, with most features that could affect the measurement enabled
    Loading hello module...
    loop_size:0 >>>> variance(cycles): 2425; max_deviation: 49056 ;min time: 42
    loop_size:1 >>>> variance(cycles): 20; max_deviation: 3444 ;min time: 42
    loop_size:2 >>>> variance(cycles): 26; max_deviation: 2697 ;min time: 42
    loop_size:3 >>>> variance(cycles): 97; max_deviation: 4395 ;min time: 42
    loop_size:4 >>>> variance(cycles): 40; max_deviation: 2826 ;min time: 42
    loop_size:5 >>>> variance(cycles): 1437; max_deviation: 27309 ;min time: 42
    loop_size:6 >>>> variance(cycles): 30; max_deviation: 2802 ;min time: 42
    loop_size:7 >>>> variance(cycles): 6; max_deviation: 2541 ;min time: 42
    loop_size:8 >>>> variance(cycles): 13; max_deviation: 2433 ;min time: 45
    loop_size:9 >>>> variance(cycles): 60; max_deviation: 3594 ;min time: 42
    loop_size:10 >>>> variance(cycles): 35; max_deviation: 2661 ;min time: 45
    loop_size:11 >>>> variance(cycles): 31; max_deviation: 3534 ;min time: 45
    .........
    .........
    loop_size:994 >>>> variance(cycles): 32588; max_deviation: 46620 ;min time: 1935
    loop_size:995 >>>> variance(cycles): 11208; max_deviation: 22932 ;min time: 1935
    loop_size:996 >>>> variance(cycles): 9178; max_deviation: 15753 ;min time: 1938
    loop_size:997 >>>> variance(cycles): 11525; max_deviation: 55938 ;min time: 1938
    loop_size:998 >>>> variance(cycles): 62386; max_deviation: 229224 ;min time: 1941
    loop_size:999 >>>> variance(cycles): 7847; max_deviation: 6255 ;min time: 1944

    total number of spurious min values = 4
    total variance = 103398
    absolute max deviation = 1191852
    variance of variances = 132114077117
    variance of minimum values = 306145
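
For readers who want to reproduce the per-loop columns in these logs, here is a rough sketch of how min time, max_deviation, and variance could be derived from the raw samples of one loop size. It assumes max_deviation means the largest sample minus the smallest, and that variance is the population variance truncated to an integer; the module's actual arithmetic may differ. The sample_stats type and summarize function are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Summary of the timing samples collected for one loop size. */
typedef struct {
    uint64_t min;            /* smallest sample ("min time") */
    uint64_t max_deviation;  /* assumed: largest sample minus smallest */
    uint64_t variance;       /* assumed: population variance, truncated */
} sample_stats;

static sample_stats summarize(const uint64_t *samples, size_t n) {
    assert(samples != NULL && n > 0);

    uint64_t min = samples[0];
    uint64_t max = samples[0];
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (samples[i] < min) min = samples[i];
        if (samples[i] > max) max = samples[i];
        sum += (double)samples[i];
    }

    /* Second pass: mean of squared distances from the mean. */
    double mean = sum / (double)n;
    double squared = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)samples[i] - mean;
        squared += d * d;
    }

    sample_stats s;
    s.min = min;
    s.max_deviation = max - min;
    s.variance = (uint64_t)(squared / (double)n);
    return s;
}
```

A single slow sample (an interrupt, for example) raises max_deviation and variance sharply but leaves min untouched, which is one reason the min time column stays stable across the runs above.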

Notes

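  • Running each loop 1000 times and 1000000 times gives broadly the same results (the min time values in the rows shown are identical), with only slight differences in a few places. This suggests that an acceptable result does not require a large number of measurements. With all features enabled and the machine in use, the measured times are slightly lower than with everything disabled, possibly because more cores are available or because turbo mode is on. I am not sure whether RDTSC produces odd results on a multi-core CPU, but the data at least looks normal: the more instructions are run, the longer it takes. Also, these measurements were taken after a reboot, so the numbers look cleaner. As I recall, before the reboot the total number of spurious min values was around 100, although the errors were all under 20 cycles. Since I only need relative rather than absolute speed, this should not matter.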
  •  0000000000000000 <measured_loop>:
       0: 31 c0                 xor    eax,eax
       2: 85 c9                 test   ecx,ecx
       4: 74 17                 je     1d <measured_loop+0x1d>
       6: 66 2e 0f 1f 84 00 00  nop    WORD PTR cs:[rax+rax*1+0x0]
       d: 00 00 00 
      10: 83 c0 01              add    eax,0x1
      13: c7 02 01 00 00 00     mov    DWORD PTR [rdx],0x1
      19: 39 c8                 cmp    eax,ecx
      1b: 75 f3                 jne    10 <measured_loop+0x10>
      1d: f3 c3                 repz ret 
      1f: 90                    nop
    
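  • According to the assembly code, each iteration executes four instructions: add, mov, cmp, and jne. With no features enabled, the earlier results show that from the 68th loop onward, every two additional iterations cost 4 cycles. With all features enabled, from the 112th loop onward, every 65 iterations cost 114 cycles, and this is very consistent.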
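
The per-iteration cost can be read off the logs as the slope between two (loop_size, min time) rows. The helper below is only an illustration of that arithmetic, and cycles_per_iteration is a made-up name. Taking the rows loop_size:995 (min time 2032) and loop_size:999 (min time 2040) from the first run gives (2040 - 2032) / (999 - 995) = 2 cycles per iteration, which matches "4 cycles per two iterations".

```c
#include <assert.h>
#include <stdint.h>

/* Slope between two log rows: extra cycles per extra loop iteration.
 * Row a must have the smaller loop size. */
static double cycles_per_iteration(uint64_t size_a, uint64_t min_a,
                                   uint64_t size_b, uint64_t min_b) {
    assert(size_b > size_a);
    assert(min_b >= min_a);
    return (double)(min_b - min_a) / (double)(size_b - size_a);
}
```

Because min time moves in steps of 4 in that run, the two rows should be chosen an even number of iterations apart; otherwise the step size skews the slope. By the same arithmetic, the figure quoted for the run with all features enabled, 114 cycles per 65 iterations, works out to about 1.75 cycles per iteration.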

Resources
