Growing server load, high iowait

Posted: 27 June 2011, 19:56
author: seba123
Hello.

A friend asked me for help with a server. The load keeps climbing and, from what I can see, iowait is very high.

I installed iostat; here are the results:

Code:

Time: 19:10:37
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.66    0.00    0.17   40.34    0.00   58.82

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda               1.20         5.20         6.40         52         64
sda1              0.80         0.00         3.60          0         36
sda2              0.40         5.20         2.80         52         28
sda3              0.00         0.00         0.00          0          0
sdb               0.30         0.00         8.80          0         88
sdb1              0.10         0.00         7.20          0         72
sdb2              0.20         0.00         1.60          0         16
sdb3              0.00         0.00         0.00          0          0
md2               0.40         0.00         1.60          0         16
md1               1.80         0.00         7.20          0         72

As you can see, reads and writes are negligible, yet the time spent waiting on the disk is long. Every dozen or so minutes a smartctl process shows up on its own:

Code:

root     13610  0.0  0.0  17344  1324 ?        S    19:08   0:00 sh -c smartctl   -i \/dev\/sda 2>&1
and:

Code:

root      1822  0.0  0.0      0     0 ?        S    18:08   0:02 [md2_raid1]
Or, for example:

Code:

 1824 root      20   0     0     0     0 D  0.0  0.0  0:00.92 md2_resync
CPU usage is practically zero; an example from htop:

Code:

  1  [#                                    0.6%]     Tasks: 228 total, 1 running
  2  [*                                    0.7%]     Load average: 8.56 12.05 12.52
  3  [                                     0.0%]     Uptime: 01:07:12
  4  [                                     0.0%]
  Mem[|||#**                         517/7990MB]
  Swp[                                0/31997MB]

  PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
13919 root      20   0 19424  1360  1000 R  0.0  0.0  0:00.51 htop
14014 apache    20   0  133M 12964  5588 S  0.0  0.2  0:00.05 /usr/sbin/httpd -k start -DSSL
 4274 root      20   0  131M  9644  4500 S  0.0  0.1  0:00.87 /usr/sbin/httpd -k start -DSSL
    8 root      20   0     0     0     0 S  0.0  0.0  0:00.13 kworker/1:0
14010 apache    20   0  132M 12072  5528 S  0.0  0.1  0:00.02 /usr/sbin/httpd -k start -DSSL
 4446 nobody    20   0 79652 17728   684 S  0.0  0.2  0:03.30 /usr/local/bin/memcached -d -m 1024 -p 11
 4448 nobody    20   0 79652 17728   684 S  0.0  0.2  0:00.83 /usr/local/bin/memcached -d -m 1024 -p 11
13388 apache    20   0  132M 12564  5756 S  0.0  0.2  0:00.15 /usr/sbin/httpd -k start -DSSL
14011 apache    20   0  132M 12432  5624 S  0.0  0.2  0:00.04 /usr/sbin/httpd -k start -DSSL
 4185 bind      20   0  150M 23680  2652 S  0.0  0.3  0:01.17 /usr/sbin/named -u bind
13921 apache    20   0  132M 12664  5860 S  0.0  0.2  0:00.07 /usr/sbin/httpd -k start -DSSL
14015 apache    20   0  133M 12832  5552 S  0.0  0.2  0:00.02 /usr/sbin/httpd -k start -DSSL
13437 apache    20   0  133M 13196  5840 S  0.0  0.2  0:00.07 /usr/sbin/httpd -k start -DSSL
14012 apache    20   0  132M 12380  5576 S  0.0  0.2  0:00.03 /usr/sbin/httpd -k start -DSSL
13295 apache    20   0  132M 12668  5608 S  0.0  0.2  0:00.09 /usr/sbin/httpd -k start -DSSL
 9852 mysql     20   0  542M 79560  5300 S  0.0  1.0  0:05.70 /usr/local/mysql/bin/mysqld --basedir=/us
13450 apache    20   0  133M 13020  5704 S  0.0  0.2  0:00.16 /usr/sbin/httpd -k start -DSSL
 4449 nobody    20   0 79652 17728   684 S  0.0  0.2  0:00.76 /usr/local/bin/memcached -d -m 1024 -p 11
 4447 nobody    20   0 79652 17728   684 S  0.0  0.2  0:00.69 /usr/local/bin/memcached -d -m 1024 -p 11
13173 apache    20   0  132M 13036  5976 S  0.0  0.2  0:00.08 /usr/sbin/httpd -k start -DSSL
F1Help  F2Setup F3SearchF4InvertF5Tree  F6SortByF7Nice -F8Nice +F9Kill  F10Quit
Judging by the SMART results, the disks look fine. Any ideas?
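
On Linux the load average also counts tasks in uninterruptible sleep (state D), which is the state md2_resync shows in the top line above. A minimal sketch for listing such tasks; the function name `d_state_tasks` is made up for illustration, and it only filters lines whose first column is the process state:

```shell
# Hypothetical helper: keep only the lines whose first field (the process
# state) starts with D, i.e. uninterruptible sleep, typically waiting on I/O.
# Expects lines of the form "STATE PID COMMAND" on stdin.
d_state_tasks() {
    awk '$1 ~ /^D/'
}

# Usage on a live system (procps ps, headers suppressed with "="):
#   ps -eo state=,pid=,comm= | d_state_tasks
```

If this keeps printing the same kernel thread while the CPUs sit idle, the wait is likely on the storage side rather than in any user process.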

Posted: 27 June 2011, 20:28
author: lessmian2

Code:

cat /proc/mdstat
?

Posted: 27 June 2011, 22:44
author: seba123

Code:

server:~#  cat  /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6]  [raid5] [raid4] [multipath] [faulty]
md1 : active raid1 sdb1[1]  sda1[0]
     102398912 blocks [2/2] [UU]
       resync=DELAYED

md2  : active raid1 sdb2[1] sda2[0]
     857974720 blocks [2/2]  [UU]
     [>....................]  resync =  0.3% (3170304/857974720)  finish=14534.6min speed=979K/sec
Indeed. Is there any way to speed this operation up?
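
The finish estimate in the md2 line can be roughly reproduced as remaining blocks divided by the current speed. A sketch under the assumption that the block counts in /proc/mdstat are in 1 KiB units and the speed is in K/sec; `resync_eta_min` is an illustrative name, not an mdadm tool:

```shell
# Hypothetical helper: estimate the remaining resync time in minutes.
# Arguments: blocks already synced, total blocks (1 KiB units), speed in K/sec.
resync_eta_min() {
    awk -v synced="$1" -v total="$2" -v speed="$3" \
        'BEGIN { printf "%.1f\n", (total - synced) / speed / 60 }'
}

# Values taken from the md2 line above.
resync_eta_min 3170304 857974720 979
```

With these numbers the result is about 14552 minutes, roughly ten days, which is close to the 14534.6 min reported by the kernel (the kernel bases its figure on a recent speed window, so a small difference is expected).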

Posted: 27 June 2011, 22:55
author: Unit
seba123 wrote: Indeed. Is there any way to speed this operation up?
Try:

Code:

echo "50000" >  /proc/sys/dev/raid/speed_limit_min
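
Before changing the value it can help to see what the limits currently are. A minimal sketch, assuming the standard md tunables `speed_limit_min` and `speed_limit_max` under /proc/sys/dev/raid; the function name `show_raid_limits` and its optional directory argument are made up here so the snippet can also be tried against a scratch directory:

```shell
# Hypothetical helper: print the md resync speed limits found in a directory
# (defaults to the live kernel tunables).
show_raid_limits() {
    dir="${1:-/proc/sys/dev/raid}"
    for name in speed_limit_min speed_limit_max; do
        if [ -r "$dir/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
        else
            printf '%s=unavailable\n' "$name"
        fi
    done
}

# Usage: show_raid_limits
```

Raising the minimum is only a target for the kernel; it cannot make the resync faster than the disks are able to deliver.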

Posted: 27 June 2011, 22:57
author: seba123
Does any service need to be restarted?

Posted: 27 June 2011, 23:28
author: Unit
seba123 wrote: Does any service need to be restarted?
No. After running the command you should see the synchronization speed increase.

Posted: 27 June 2011, 23:29
author: seba123
Unfortunately, no luck.

Code:

server:/etc/init.d# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid1 sdb1[1] sda1[0]
      102398912 blocks [2/2] [UU]
        resync=DELAYED

md2 : active raid1 sdb2[1] sda2[0]
      857974720 blocks [2/2] [UU]
      [>....................]  resync =  0.7% (6544768/857974720) finish=8798.2min speed=1612K/sec

Posted: 27 June 2011, 23:31
author: Unit
Show:

Code:

smartctl -a /dev/sda; smartctl -a /dev/sdb 

Posted: 28 June 2011, 01:05
author: seba123

Code:

smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST31000524AS
Serial Number:    6VPBVLS1
Firmware Version: JC45
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Jun 28 00:25:27 2011 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:          ( 600) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      ( 176) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x103f)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       145725627
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       34
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       23063161
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1765
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       34
183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   060   050   045    Old_age   Always       -       40 (Lifetime Min/Max 40/44)
194 Temperature_Celsius     0x0022   040   050   000    Old_age   Always       -       40 (0 16 0 0)
195 Hardware_ECC_Recovered  0x001a   024   017   000    Old_age   Always       -       145725627
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       246385093904193
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       3311924303
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       2335362556

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%         5         -
# 2  Short offline       Completed without error       00%         5         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST31000524AS
Serial Number:    5VP7BY69
Firmware Version: JC45
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Jun 28 00:27:03 2011 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:          ( 600) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      ( 176) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x103f)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   108   099   006    Pre-fail  Always       -       220332414
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       35
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       23010274
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1765
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       35
183 Unknown_Attribute       0x0032   073   073   000    Old_age   Always       -       27
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   088   088   000    Old_age   Always       -       12
188 Unknown_Attribute       0x0032   099   087   000    Old_age   Always       -       141737918551
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   059   049   045    Old_age   Always       -       41 (Lifetime Min/Max 41/45)
194 Temperature_Celsius     0x0022   041   051   000    Old_age   Always       -       41 (0 18 0 0)
195 Hardware_ECC_Recovered  0x001a   029   020   000    Old_age   Always       -       220332414
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       102336185763657
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       3833739856
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       4252604982

SMART Error Log Version: 1
ATA Error Count: 12 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 12 occurred at disk power-on lifetime: 1765 hours (73 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 d5 40 0a 00  Error: UNC at LBA = 0x000a40d5 = 671957

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 d0 40 0a e0 00      04:26:48.391  READ DMA
  27 00 00 00 00 00 e0 00      04:26:48.390  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      04:26:48.381  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00      04:26:48.381  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      04:26:48.361  READ NATIVE MAX ADDRESS EXT

Error 11 occurred at disk power-on lifetime: 1765 hours (73 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 d5 40 0a 00  Error: UNC at LBA = 0x000a40d5 = 671957

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 d0 40 0a e0 00      04:26:44.736  READ DMA
  27 00 00 00 00 00 e0 00      04:26:44.735  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      04:26:44.726  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00      04:26:44.698  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      04:26:44.698  READ NATIVE MAX ADDRESS EXT

Error 10 occurred at disk power-on lifetime: 1765 hours (73 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 d5 40 0a 00  Error: UNC at LBA = 0x000a40d5 = 671957

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 d0 40 0a e0 00      04:26:40.723  READ DMA
  27 00 00 00 00 00 e0 00      04:26:40.722  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      04:26:40.713  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00      04:26:40.713  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      04:26:40.627  READ NATIVE MAX ADDRESS EXT

Error 9 occurred at disk power-on lifetime: 1765 hours (73 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 d5 40 0a 00  Error: UNC at LBA = 0x000a40d5 = 671957

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 d0 40 0a e0 00      04:26:36.819  READ DMA
  27 00 00 00 00 00 e0 00      04:26:36.818  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      04:26:36.809  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00      04:26:36.808  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      04:26:36.681  READ NATIVE MAX ADDRESS EXT

Error 8 occurred at disk power-on lifetime: 1765 hours (73 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 d5 40 0a 00  Error: UNC at LBA = 0x000a40d5 = 671957

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 d0 40 0a e0 00      04:26:33.106  READ DMA
  27 00 00 00 00 00 e0 00      04:26:33.105  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 a0 00      04:26:33.096  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 00      04:26:33.095  SET FEATURES [Set transfer mode]
  27 00 00 00 00 00 e0 00      04:26:33.034  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Interrupted (host reset)      90%      1757         -
# 2  Short offline       Completed without error       00%         5         -
# 3  Short offline       Completed without error       00%         5         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Here you go.

Posted: 28 June 2011, 10:15
author: Unit
SMART Error Log Version: 1
ATA Error Count: 12 (device log contains only the most recent five errors)
This looks to me like a problem with the /dev/sdb disk.
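
The difference between the two disks is easy to miss in the full dumps. A small sketch that filters an attribute table (as printed by `smartctl -A`) for a few error-related counters with a non-zero raw value; the choice of attribute IDs and the name `smart_suspects` are assumptions made for illustration, not an official health check:

```shell
# Hypothetical helper: read a SMART attribute table on stdin and print
# "ID NAME RAW" for selected error counters whose raw value is not zero.
# Field 10 is the first token of the RAW_VALUE column in smartctl's table.
smart_suspects() {
    awk '$1 ~ /^(5|187|197|198)$/ && $10 + 0 != 0 { print $1, $2, $10 }'
}

# Usage: smartctl -A /dev/sdb | smart_suspects
```

Fed the /dev/sdb table above it prints only `187 Reported_Uncorrect 12`; for the /dev/sda table it prints nothing.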