This Blog is to share our knowledge and expertise on Linux System Administration and VMware Administration

Showing posts with label Linux Troubleshooting. Show all posts

Tuesday, December 19, 2017

Recover RPM DB from a corrupted RPM database in RHEL 7


Rebuilding corrupted rpm database in RHEL7

Situation:
Despite every precaution, the RPM database can become corrupt and unusable. This usually happens when the filesystem holding the rpm db suddenly becomes inaccessible (full, remounted read-only, an unexpected reboot, and so on).

Solution:
1. Start by creating a backup of your corrupt rpm db, as follows:
[root@nsk ~]# tar zcvf rpm-db.tar.gz /var/lib/rpm/*
tar: Removing leading `/' from member names
/var/lib/rpm/Basenames
/var/lib/rpm/Conflictname
/var/lib/rpm/__db.001
/var/lib/rpm/__db.002
/var/lib/rpm/__db.003
/var/lib/rpm/Dirnames
/var/lib/rpm/Group
/var/lib/rpm/Installtid
/var/lib/rpm/Name
/var/lib/rpm/Obsoletename
/var/lib/rpm/Packages
/var/lib/rpm/Providename
/var/lib/rpm/Requirename
/var/lib/rpm/Sha1header
/var/lib/rpm/Sigmd5
/var/lib/rpm/Triggername

2. Remove stale lock files, if any exist, with the following command:
[root@nsk ~]# rm -f /var/lib/rpm/__db*

3. Now, verify the integrity of the Packages database via the following:
[root@nsk ~]# /usr/lib/rpm/rpmdb_verify /var/lib/rpm/Packages; echo $?
BDB5105 Verification of /var/lib/rpm/Packages succeeded.
0

If the exit code is 0, proceed to the next step.

4. Rename the Packages file (don't delete it, we'll need it!), as follows:
[root@nsk ~]# mv /var/lib/rpm/Packages  /var/lib/rpm/Packages.org

5. Now, dump the Packages db from the original Packages db by executing the following command:
[root@nsk ~]# cd /var/lib/rpm/
[root@nsk rpm]# /usr/lib/rpm/rpmdb_dump Packages.org | /usr/lib/rpm/rpmdb_load Packages
rpmdb_load: BDB1540 configured environment flags incompatible with existing environment

6. Verify the integrity of the newly created Packages database. Run the following:
[root@nsk rpm]# /usr/lib/rpm/rpmdb_verify /var/lib/rpm/Packages; echo $?
BDB5105 Verification of /var/lib/rpm/Packages succeeded.
0

If the exit code is not 0, you will need to restore the database from backup.

7. Rebuild the rpm indexes, as follows:
[root@nsk rpm]# rpm -vv --rebuilddb
D: rebuilding database /var/lib/rpm into /var/lib/rpmrebuilddb.1312
D: opening  db environment /var/lib/rpm private:0x401
D: opening  db index       /var/lib/rpm/Packages 0x400 mode=0x0
D: locked   db index       /var/lib/rpm/Packages
D: opening  db environment /var/lib/rpmrebuilddb.1312 private:0x401
D: opening  db index       /var/lib/rpmrebuilddb.1312/Packages (none) mode=0x42
D: opening  db index       /var/lib/rpmrebuilddb.1312/Packages 0x1 mode=0x42
D: disabling fsync on database
....
...
D: adding "5f7fd424d0773a4202731bff4901d449699b0929" to Sha1header index.
D: closed   db index       /var/lib/rpm/Packages
D: closed   db environment /var/lib/rpm
D: closed   db index       /var/lib/rpmrebuilddb.1312/Sha1header
D: closed   db index       /var/lib/rpmrebuilddb.1312/Sigmd5
D: closed   db index       /var/lib/rpmrebuilddb.1312/Installtid
D: closed   db index       /var/lib/rpmrebuilddb.1312/Dirnames
D: closed   db index       /var/lib/rpmrebuilddb.1312/Triggername
D: closed   db index       /var/lib/rpmrebuilddb.1312/Obsoletename
D: closed   db index       /var/lib/rpmrebuilddb.1312/Conflictname
D: closed   db index       /var/lib/rpmrebuilddb.1312/Providename
D: closed   db index       /var/lib/rpmrebuilddb.1312/Requirename
D: closed   db index       /var/lib/rpmrebuilddb.1312/Group
D: closed   db index       /var/lib/rpmrebuilddb.1312/Basenames
D: closed   db index       /var/lib/rpmrebuilddb.1312/Name
D: closed   db index       /var/lib/rpmrebuilddb.1312/Packages
D: closed   db environment /var/lib/rpmrebuilddb.1312

8. Use the following command to check the rpm db with yum for any other issues (this may take a long time):
[root@nsk rpm]# yum check
Loaded plugins: fastestmirror
....
...

9. Restore the SELinux context of the rpm database through the following command:
[root@nsk rpm]# restorecon -R -v /var/lib/rpm
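
Steps 1 and 2 above can be combined so that the stale files are removed only once a backup has actually been written. The following is a minimal sketch under that assumption; the function name and its two arguments are illustrative and do not come from rpm itself.

```shell
#!/bin/sh
# Sketch only: archive an RPM database directory, then clear stale __db.* files.
# The function name and its two arguments are illustrative, not part of rpm.
backup_and_clear_locks() {
    rpmdb_dir=$1      # e.g. /var/lib/rpm
    backup_file=$2    # e.g. /root/rpm-db.tar.gz (must live outside rpmdb_dir)

    # Refuse to continue unless the Packages file is present.
    if [ ! -f "$rpmdb_dir/Packages" ]; then
        echo "no Packages file in $rpmdb_dir" >&2
        return 1
    fi

    # Step 1: archive the whole directory; stop if the archive cannot be written.
    tar -czf "$backup_file" -C "$rpmdb_dir" . || return 1

    # Step 2: only after a successful backup, remove the stale __db.* files.
    rm -f "$rpmdb_dir"/__db.*
}
```

It could be called as backup_and_clear_locks /var/lib/rpm /root/rpm-db.tar.gz before moving on to the verification in step 3.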

Saturday, November 25, 2017

Virtual machines show warning messages when starting the udev daemon Linux


Virtual machines show warning messages when starting the udev daemon.

After upgrading VMware Tools, Linux virtual machines show warnings when starting the udev daemon.

dmesg shows the following messages:

Starting udev:
udevd[572]: add_to_rules: unknown key 'SUBSYSTEMS'
udevd[572]: add_to_rules: unknown key 'ATTRS{vendor}'
udevd[572]: add_to_rules: unknown key 'ATTRS{model}'
udevd[572]: add_to_rules: unknown key 'SUBSYSTEMS'
udevd[572]: add_to_rules: unknown key 'ATTRS{vendor}'
udevd[572]: add_to_rules: unknown key 'ATTRS{model}'

Pressing Ctrl+C skips the udev daemon startup so that the boot process can finish.

To suppress the warning messages, comment out the unused lines (the entries for Ubuntu and other Unix variants) in the /etc/udev/rules.d/99-vmware-scsi-udev.rules file.

For Linux, change the following line from

ACTION=="add", BUS=="scsi", SYSFS{vendor}=="VMware, " , SYSFS{model}=="VMware Virtual S", RUN+="/bin/sh -c 'echo 180 >/sys$DEVPATH/device/timeout'"

To

ACTION=="add", BUS=="scsi", SYSFS{vendor}=="VMware " , SYSFS{model}=="Virtual disk ", RUN+="/bin/sh -c 'echo 180 >/sys$DEVPATH/device/timeout'"

Save the modification and reboot the virtual machine.
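
The edit above can also be scripted. The following is a minimal sketch, assuming the rule line in your file matches the "from" line shown above exactly; the function name is illustrative, and a .bak copy is kept so the change can be reverted.

```shell
#!/bin/sh
# Sketch only: rewrite the vendor/model match strings in a VMware udev rules file.
# The function name is illustrative; pass the path of the rules file to edit.
fix_vmware_scsi_rule() {
    rules_file=$1

    # Keep a copy of the original so the change can be reverted.
    cp -p "$rules_file" "$rules_file.bak" || return 1

    # Swap the old match strings for the new ones; other lines pass through unchanged.
    sed -e 's/SYSFS{vendor}=="VMware, "/SYSFS{vendor}=="VMware "/' \
        -e 's/SYSFS{model}=="VMware Virtual S"/SYSFS{model}=="Virtual disk "/' \
        "$rules_file.bak" > "$rules_file"
}
```

Review the result with diff against the .bak copy before rebooting.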

Friday, November 17, 2017

Error "system was unable to find a physical volume" SOLVED -Step by Step


If you get the error "system was unable to find a physical volume", the corrupted volume group metadata needs to be restored.


Situation :

If the volume group metadata area of a physical volume is accidentally overwritten or otherwise destroyed, you will get an error message indicating that the metadata area is incorrect, or that the system was unable to find a physical volume with a particular UUID. You may be able to recover the data on the physical volume by writing a new metadata area on the physical volume, specifying the same UUID as the lost metadata.

Solution:

The following example shows the sort of output you may see if the metadata area is missing or corrupted.

[root@test]# lvs -a -o +devices

  Couldn't find device with uuid 'zhtUGH-1N2O-tHdu-b14h-gH34-sB7z-NHhkdf'.
  Couldn't find all physical volumes for volume group VG.
  Couldn't find device with uuid 'zhtUGH-1N2O-tHdu-b14h-gH34-sB7z-NHhkdf'.
  Couldn't find all physical volumes for volume group VG.

  ...

You may be able to find the UUID for the physical volume that was overwritten by looking in the /etc/lvm/archive directory. Look in the file VolumeGroupName_xxxx.vg for the last known valid archived LVM metadata for that volume group.
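
To read the UUIDs out of an archive file without opening it in an editor, something like the following sketch may help. It assumes the usual text layout of LVM metadata archives, where each physical volume block contains an id = "..." line followed by a device = "..." hint line; the function name is illustrative.

```shell
#!/bin/sh
# Sketch only: print "UUID device-hint" for each physical volume in an LVM
# metadata archive file. Assumes each PV block has an `id = "..."` line
# followed by a `device = "..."` hint line.
list_pv_ids() {
    awk '
        /^[ \t]*id = "/ {
            gsub(/"/, "", $3)
            id = $3
        }
        /^[ \t]*device = "/ {
            gsub(/"/, "", $3)
            if (id != "") print id, $3
            id = ""
        }
    ' "$1"
}
```

Running list_pv_ids on the newest .vg file under /etc/lvm/archive prints one line per physical volume, which can be compared with the UUID in the error message.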

Alternatively, you may find that deactivating the volume group and setting the partial (-P) argument lets you find the UUID of the missing or corrupted physical volume.

[root@test]# vgchange -an --partial

  Partial mode. Incomplete volume groups will be activated read-only.
  Couldn't find device with uuid 'zhtUGH-1N2O-tHdu-b14h-gH34-sB7z-NHhkdf'.
  Couldn't find device with uuid 'zhtUGH-1N2O-tHdu-b14h-gH34-sB7z-NHhkdf'.

  ...

Use the --uuid and --restorefile arguments of the pvcreate command to restore the physical volume. The following example labels the /dev/sdh1 device as a physical volume with the UUID indicated above, zhtUGH-1N2O-tHdu-b14h-gH34-sB7z-NHhkdf. This command restores the physical volume label with the metadata information contained in centos_00000-1802035441.vg, the most recent good archived metadata for the volume group.

The restorefile argument instructs the pvcreate command to make the new physical volume compatible with the old one on the volume group, ensuring that the new metadata will not be placed where the old physical volume contained data (which could happen, for example, if the original pvcreate command had used the command line arguments that control metadata placement, or if the physical volume was originally created using a different version of the software that used different defaults).

The pvcreate command overwrites only the LVM metadata areas and does not affect the existing data areas.

[root@test]# pvcreate --uuid "zhtUGH-1N2O-tHdu-b14h-gH34-sB7z-NHhkdf" --restorefile /etc/lvm/archive/centos_00000-1802035441.vg /dev/sdh1
  Physical volume "/dev/sdh1" successfully created

You can then use the vgcfgrestore command to restore the volume group's metadata.

[root@test]# vgcfgrestore VG
  Restored volume group VG 
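
Because pvcreate writes to the disk, it can be worth printing the two commands for review before running them. The following sketch only prints; it runs nothing. The function and argument names are illustrative, and only the flags already shown above are used.

```shell
#!/bin/sh
# Sketch only: print (do not run) the two restore commands for review.
# Function and argument names are illustrative.
print_restore_plan() {
    uuid=$1; archive=$2; device=$3; vg=$4
    printf 'pvcreate --uuid "%s" --restorefile %s %s\n' "$uuid" "$archive" "$device"
    printf 'vgcfgrestore %s\n' "$vg"
}
```

For the example in this post, the arguments would be the UUID from the error message, the archive file under /etc/lvm/archive, /dev/sdh1 and VG.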

You can now display the logical volumes.

[root@test]# lvs -a -o +devices

  LV     VG   Attr   LSize   Origin Snap%  Move Log Copy%  Devices

  stripe VG   -wi--- 300.00G                               /dev/sdh1 (0),/dev/sda1(0)
  stripe VG   -wi--- 300.00G                               /dev/sdh1 (34728),/dev/sdb1(0) 

The following commands activate the volumes and display the active volumes.

[root@test]# lvchange -ay /dev/VG/stripe
[root@test]# lvs -a -o +devices

  LV     VG   Attr   LSize   Origin Snap%  Move Log Copy%  Devices
  stripe VG   -wi-a- 300.00G                               /dev/sdh1 (0),/dev/sda1(0)
  stripe VG   -wi-a- 300.00G                               /dev/sdh1 (34728),/dev/sdb1(0)

If the on-disk LVM metadata takes at least as much space as what overrode it, this command can recover the physical volume. If what overrode the metadata went past the metadata area, the data on the volume may have been affected. You might be able to use the fsck command to recover that data.


Tuesday, November 14, 2017

Server hang at GRUB during boot - SOLVED

If a RHEL server hangs on boot with nothing more than the word GRUB in the upper left-hand corner of the screen, GRUB is unable to read its configuration file. If you actually get a GRUB menu but the server does not boot, you have a different and potentially more complex issue.

The most common reason for GRUB being unable to read its configuration is a discrepancy between how the BIOS enumerated the hard drives and what GRUB expects to be its boot disk.


To correct this issue, boot the server in rescue mode.


Once you have booted into rescue mode and your root disk filesystems have been mounted, check the /boot/grub/device.map file to ensure it has correctly identified the boot disk. hd0 should point to the disk that contains /boot. On an HP ProLiant system you should see the following line:


(hd0) /dev/cciss/c0d0
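
To check the mapping from a script, a small helper like the following could be used. It is a sketch that assumes the two-column layout of device.map shown above; the function name is illustrative.

```shell
#!/bin/sh
# Sketch only: print the device that a GRUB legacy device.map assigns to (hd0).
# The function name is illustrative; pass the path to device.map.
hd0_device() {
    awk '$1 == "(hd0)" { print $2 }' "$1"
}
```

Compare the printed device with the disk that holds /boot. The path to device.map depends on where rescue mode mounted the root filesystem and on whether you have changed root into it.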


If it does not, correct the file and then update GRUB by issuing the following command:


/sbin/grub --batch --device-map=/boot/grub/device.map --config-file=/boot/grub/grub.conf --no-floppy


And then from the GRUB prompt enter the following commands:


grub> root (hd0,0)
grub> setup (hd0)
grub> quit


You can now eject the ISO and reboot the server normally.

Sunday, November 12, 2017

BUG: soft lockup - CPU#0 stuck for 10s!


• Soft lockups are situations in which the kernel's scheduler subsystem has not been given a chance to perform its job for more than 10 seconds; they can be caused by defects in the kernel, by hardware issues or by extremely high workloads.

Run the following command to raise the threshold to 30 seconds, then check whether the "soft lockup" errors still occur on the system:

# sysctl -w kernel.softlockup_thresh=30

To make this parameter persistent across reboots, add the following line to the /etc/sysctl.conf file:

kernel.softlockup_thresh=30


Note: The softlockup_thresh kernel parameter was introduced in Red Hat Enterprise Linux 5.2 (kernel-2.6.18-92.el5), so it cannot be modified on older versions.
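
Appending the line by hand more than once leaves duplicate entries in the file. The following sketch sets a key in a sysctl.conf-style file so that exactly one assignment remains; the function name is illustrative, and it assumes one key=value assignment per line.

```shell
#!/bin/sh
# Sketch only: set KEY=VALUE in a sysctl.conf-style file without duplicating it.
# The function name is illustrative; pass the key, the value and the file path.
set_sysctl_conf() {
    key=$1; value=$2; conf=$3
    tmp_conf="$conf.tmp.$$"

    # Copy every line except existing (uncommented) assignments of this key.
    awk -v k="$key" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)
            split(line, parts, /[ \t]*=/)
        }
        parts[1] != k { print }
    ' "$conf" > "$tmp_conf" || return 1

    # Append the single, current assignment.
    printf '%s=%s\n' "$key" "$value" >> "$tmp_conf"

    # Write back in place so the file keeps its owner and permissions.
    cat "$tmp_conf" > "$conf" && rm -f "$tmp_conf"
}
```

After set_sysctl_conf kernel.softlockup_thresh 30 /etc/sysctl.conf, running sysctl -p loads the file without a reboot.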

SOLVED : Buffer I/O error on boot

Situation:

• After upgrading from Red Hat Enterprise Linux (RHEL) 5.1 to RHEL 5.5 (kernel 2.6.18-53.el5 to 2.6.18-194.8.1.el5), a system started to show I/O errors on boot.

• The boot process took more time than before, but there are otherwise no significant problems occurring.


SCSI device sdc: 419430400 512-byte hdwr sectors (214748 MB)

sdc: Write Protect is off
sdc: Mode Sense: 77 00 10 08
SCSI device sdc: drive cache: write back w/ FUA
SCSI device sdc: 419430400 512-byte hdwr sectors (214748 MB)
sdc: Write Protect is off
sdc: Mode Sense: 77 00 10 08
SCSI device sdc: drive cache: write back w/ FUA
sdc:end_request: I/O error, dev sdc, sector 0
Buffer I/O error on device sdc, logical block 0
end_request: I/O error, dev sdc, sector 0
Buffer I/O error on device sdc, logical block 0
end_request: I/O error, dev sdc, sector 0
Buffer I/O error on device sdc, logical block 0
end_request: I/O error, dev sdc, sector 0
Buffer I/O error on device sdc, logical block 0
end_request: I/O error, dev sdc, sector 0
Buffer I/O error on device sdc, logical block 0
end_request: I/O error, dev sdc, sector 0
Buffer I/O error on device sdc, logical block 0
end_request: I/O error, dev sdc, sector 0
Buffer I/O error on device sdc, logical block 0
Dev sdc: unable to read RDB block 0
end_request: I/O error, dev sdc, sector 0
Buffer I/O error on device sdc, logical block 0
end_request: I/O error, dev sdc, sector 0
Buffer I/O error on device sdc, logical block 0
unable to read partition table
sd 1:0:0:1: Attached scsi disk sdc
 Vendor: SUN       Model: LCSM100_S         Rev: 0735
 Type:   Direct-Access                      ANSI SCSI revision: 05
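
To see at a glance which devices are affected, the error lines can be counted per device. The following is a sketch that assumes the message format shown in the log above; the function name is illustrative.

```shell
#!/bin/sh
# Sketch only: count "end_request: I/O error" kernel messages per device.
# Reads log text on standard input and prints "device count" lines.
count_io_errors() {
    sed -n 's/.*end_request: I\/O error, dev \([^,]*\),.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

For example, dmesg | count_io_errors prints one line per device that logged such errors.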

Solution:


Use one of the following options to remediate the issue.


• Switching the controller to active/active mode allows the devices to be probed through both controller ports and prevents the errors.

• To speed up the boot process instead, rebuild the initrd without the HBA driver kernel modules and then probe the devices after boot, for example:

# mkinitrd -v -f --omit-scsi-modules /boot/initrd-2.6.18-194.8.1.el5.img 2.6.18-194.8.1.el5

Friday, November 3, 2017

SOLVED : pam_ldap: error trying to bind as user

If you get the error below even after entering the correct password:

Nov  2 03:56:42 testserver sshd[30173]: pam_ldap: error trying to bind as user "uid=testuser,ou=People,dc=test,dc=testdomain,dc=com" (Invalid credentials)

Nov  2 03:56:43 testserver sshd[30173]: Failed password for testuser from 10.17.0.3 port 51306 ssh2
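
To find out which accounts are affected, the failing bind DNs can be pulled out of the log. The following sketch assumes the pam_ldap message format shown above; the function name is illustrative.

```shell
#!/bin/sh
# Sketch only: list the distinct DNs that failed to bind, from syslog text on stdin.
failed_bind_dns() {
    sed -n 's/.*pam_ldap: error trying to bind as user "\([^"]*\)".*/\1/p' | sort -u
}
```

It reads the log on standard input, for example failed_bind_dns < /var/log/secure (the log path is an assumption; use whichever file your sshd logs to).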

Reason: The password did not sync properly to all client servers during the scheduled window.

Solution: Restart the slapd service on the LDAP server so that it syncs to all servers.

# /etc/init.d/slapd restart

Hope it helps.

Wednesday, November 1, 2017

How do I exclude Kernel or other packages from getting updated in RHEL while updating system via yum?


Excluding the kernel or other packages from getting updated in RHEL while updating the system via yum

The up2date command in Red Hat Enterprise Linux 4 excludes kernel updates by default. yum in Red Hat Enterprise Linux 5 includes kernel updates by default.

To skip installing or updating the kernel or other packages while using yum update in Red Hat Enterprise Linux 5 and 6, use the following options.

Temporary solution via the command line:

# yum update --exclude=PACKAGENAME

For example, to exclude all kernel-related packages (the pattern is quoted so that the shell does not expand it):

# yum update --exclude='kernel*'

To make the change permanent, edit the /etc/yum.conf file and add the following exclude entry to it:

[main]
cachedir=/var/cache/yum/$basearch/$releasever
keepcache=0
debuglevel=2
logfile=/var/log/yum.log
exclude=kernel* redhat-release*                           <====

NOTE: If multiple packages are to be excluded, separate them with a single space or a comma. Also, do not add multiple exclude= lines to the configuration file, because yum only considers the last exclude entry.

To exclude 32-bit packages, add the following line to the /etc/yum.conf file:
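
If a yum.conf has already collected several exclude= lines, they can be collapsed into one. The following is a sketch that assumes the file contains only a [main] section (exclude= lines in per-repository sections are not handled); the function name is illustrative.

```shell
#!/bin/sh
# Sketch only: collapse several exclude= lines into one line for yum.conf.
# Assumes a file with a single [main] section; repo sections are not handled.
merge_excludes() {
    awk '
        /^exclude=/ {
            sub(/^exclude=/, "")
            merged = (merged == "") ? $0 : (merged " " $0)
        }
        END { if (merged != "") print "exclude=" merged }
    ' "$1"
}
```

The function only prints the merged line; replacing the old lines in the file is left as a manual step.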

exclude=*.i?86 *.i686

Tuesday, October 24, 2017

How to solve the Error "sendmail dead but subsys locked" sm-client (pid 28752) is running?

The error "sendmail dead but subsys locked" together with "sm-client (pid 28752) is running" appears when two MTAs (Mail Transfer Agents) are running at the same time. It can also occur when something is trying to start the postfix service.

[root@testserver ~]# /etc/init.d/sendmail status
sendmail dead but subsys locked
sm-client (pid  28752) is running...

First, check whether postfix is running on the server:

[root@testserver ~]# /etc/init.d/postfix status
-b (pid  1765) is running...
[root@testserver ~]#


Try to stop the postfix service. If the service cannot be brought down, kill the process, then restart the sendmail service.

[root@testserver ~]# /etc/init.d/postfix stop
Shutting down postfix:                                     [FAILED]
[root@testserver ~]#

[root@testserver ~]# ps -ef | grep -i postfix
root      1765     1  0 Jun09 ?        00:02:06 /usr/libexec/postfix/master
postfix   1772  1765  0 Jun09 ?        00:00:03 qmgr -l -t fifo -u
root     25822 24576  0 16:56 pts/7    00:00:00 grep -i postfix

[root@testserver ]# kill -9 1765
[root@testserver ]#

[root@testserver ]# /etc/init.d/sendmail restart
Shutting down sm-client:                                   [  OK  ]
Shutting down sendmail:                                    [  OK  ]
Starting sendmail:                                         [  OK  ]
Starting sm-client:                                        [  OK  ]
[root@testserver ]#

[root@testserver ]# /etc/init.d/sendmail status
sendmail (pid  28421) is running...
sm-client (pid  28429) is running...
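
The PID lookup in the ps step above can be scripted rather than read off the screen. The following sketch assumes the ps -ef column layout shown above, with the command in the eighth column; the function name is illustrative.

```shell
#!/bin/sh
# Sketch only: print the PID of the Postfix master process from `ps -ef` output
# read on standard input. Assumes the command path ends in "postfix/master".
postfix_master_pid() {
    awk '$8 ~ /postfix\/master$/ { print $2 }'
}
```

For example, pid=$(ps -ef | postfix_master_pid) stores the PID. A plain kill "$pid" gives the process a chance to shut down cleanly; kill -9 can follow if it is still running.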

Hope it helps

Thursday, October 19, 2017

Kernel: WARNING calibrate_APIC_clock: the APIC timer calibration may be wrong appear on Guest 5.x Linux VM's

This warning is due to the MAX_DIFFERENCE parameter value (in the APIC calibration loop) of 1000 cycles being too aggressive for virtual guests. APIC (Advanced Programmable Interrupt Controller) and TSC (Time Stamp Counter) reads normally take longer than 1000 cycles when performed from inside a virtual guest, because processors are scheduled away from and then back onto the guest. With this update, the MAX_DIFFERENCE parameter value has been increased to 10,000 for virtual guests.

These messages can be stopped by adding 'apiccalibrationdiff=10000' to the guest's kernel line in /etc/grub.conf.
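
Adding the parameter to every kernel line by hand is error-prone when several kernels are installed. The following sketch prints a copy of a grub.conf with the parameter appended to each kernel line that lacks it. It assumes GRUB legacy syntax, where such lines start with the word kernel, and it does not modify the file; the function name is illustrative.

```shell
#!/bin/sh
# Sketch only: print a grub.conf with PARAM appended to every "kernel" line
# that does not already carry it. Output goes to stdout; the file is not modified.
add_kernel_param() {
    param=$1; grub_conf=$2
    awk -v p="$param" '
        /^[ \t]*kernel[ \t]/ && index(" " $0 " ", " " p " ") == 0 { $0 = $0 " " p }
        { print }
    ' "$grub_conf"
}
```

Redirect the output to a new file, compare it with the original, and copy it into place only once it looks right.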

Friday, October 13, 2017

How to reduce / file system utilization in Linux Server?

It is rare to have to reduce utilization of the / file system on a Linux server. Put another way: the / file system is full, so how do we do the housekeeping?

Situation:

We have separate mounts for the /boot, /usr, /tmp, /home, /var and /opt file systems, but the / file system is still almost full.

Solution:

First, identify which directories under / are not separate mount points, then run the command below on them.

For example, the directories listed below are not mounted separately, so we need to check which one is the largest.

admin lib middleware net  lib64 srv misc  media mnt

[root@testserver]# du -sk /admin /lib /middleware /net  /lib64 /srv /misc  /media /mnt | sort -n

4       /srv
16      /mnt
20      /middleware
31852   /lib64
920388  /lib


So let's see what's under /lib:

[root@testserver ]# du -sk /lib/* | sort -n
109096  /lib/firmware
793332  /lib/modules


The modules directory looks large, so let's check that one as well:

du -sk /lib/modules/* | sort -n

105840  /lib/modules/2.6.32-431.20.3.el6.x86_64
107040  /lib/modules/2.6.39-400.215.3.el6***.x86_64
109152  /lib/modules/2.6.32-504.23.4.el6.x86_64
116676  /lib/modules/2.6.32-696.1.1.el6.x86_64
176888  /lib/modules/3.8.13-68.3.3.el6***.x86_64
177732  /lib/modules/3.8.13-118.17.5.el6***.x86_64


Now check the current kernel running on the server:

[root@testserver firmware]# uname -a
Linux testserver 3.8.13-118.17.5.el6***.x86_64 #2 SMP Wed Apr 12 09:16:08 PDT 2017 x86_64 x86_64 x86_64 GNU/Linux


Check which packages are related to the old 2.6 kernels:

# rpm -qf /lib/modules/2.6.*

kernel-2.6.32-431.20.3.el6.x86_64
kernel-2.6.32-504.23.4.el6.x86_64
kernel-2.6.32-696.1.1.el6.x86_64
kernel-uek-2.6.39-400.215.3.el6***.x86_64


We can see that some old kernels are still installed, so remove one of them:

# yum remove kernel-2.6.32-431.20.3.el6.x86_64

This frees some space; if it is not sufficient, remove the other (unused) older kernels.
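
The comparison between the installed module directories and the running kernel can be scripted. The following sketch lists every release directory under a modules directory except the one that is running; the function name and arguments are illustrative.

```shell
#!/bin/sh
# Sketch only: list module directories that do not belong to the running kernel.
# Arguments are illustrative: the modules directory and the running release.
old_kernel_dirs() {
    modules_dir=$1
    running=$2
    for dir in "$modules_dir"/*/; do
        # Skip the unexpanded pattern when the directory is empty or missing.
        [ -d "$dir" ] || continue
        release=$(basename "$dir")
        [ "$release" = "$running" ] || printf '%s\n' "$release"
    done
    return 0
}
```

For example, old_kernel_dirs /lib/modules "$(uname -r)" prints the candidates; rpm -qf on each directory then shows the owning package, as in the step above. Nothing is removed by the function itself.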

Hope it helps.

Tuesday, June 7, 2016

When up2date/yum fail with "Error Class Code 31" - Solved.

Running up2date or yum update fails with the error below:

Error Message    : Service not enabled for system profile: "system1.example.com"

Error Class Code: 31
Error Class Info   :This system does not have a valid entitlement for Red Hat Network.

    Please visit https://rhn-server/rhn/systems/SystemEntitlements. or 

    login at https://rhn-server, and from the "Overview" tab,
    select "Subscription Management" to enable Redhat Network service for this system.

Situation

  • System registration fails with above error.
  • Redhat Network entitlements missing after Redhat contract renewal.
  • After executing rhn_register, the system appears in host list, but as unentitled.
  • Cannot entitle system.
  • System does not have a valid entitlement for Red Hat Network.
  • When trying to install a package, an error was received that said the system does not have a valid entitlement.
  • No longer able to update system.
  • Satellite certificate activation fails with "Error Class Code 31".

Resolution

If the system is not registered with rhn-server, follow the steps below to assign an entitlement.


  • Log in to Satellite Customer Portal
  • Click on My Subscriptions
  • Under Redhat Network Classic select All Registered Systems
  • Click on system name
  • Click on Edit These Properties beside System Properties
  • Ensure either Update or Management is selected for Base Entitlement.
  • Click the Update Properties button located in the bottom-right corner.

Root Cause

  • Error Class Code: 31 means that a valid entitlement is not assigned to your system profile.
  • When you register a system, the base entitlement is assigned as either Update or Management (depending on the free entitlement in the account) along with the base channel. If the base entitlement is later removed from the system profile, updating the system fails with Error Class Code: 31.