ESXi host losing settings at reboot – checking system partitions of ESX host
On 3 separate occasions I have seen ESX hosts that appeared to lose their config at reboot.
I stumbled across a useful doc once and copied some of it, but cannot remember where I found it.
Anyway, in all instances the problem was caused by the bootbank being corrupted, and I used the following process to resolve the issue.
You see, the way VMware operates is that 3 hypervisor partitions are created and used for normal operation. They are mounted as /bootbank, /altbootbank and /store.
/store is simply used to ‘store’ data (e.g. VMware Tools ISOs and the VI client etc.) as well as information for the vCenter Server agent and the HA agent.
Once you configure the ScratchConfig.ConfiguredScratchLocation parameter for an ESX host (swap file), it will mount a 4th partition for this purpose.
Anyway, the first two partitions mentioned above are used for the ‘running’ config and the ‘saved’ config, very loosely similar to the way in which a Cisco router stores 2 different configs.
With VMware, though, /bootbank and /altbootbank are effectively the running copy of the ESX firmware / config and the last saved version.
VMware backs up its running config every hour (at 1 minute past the hour).
To see whether you have been getting updated backups (or whether there is a failure in the process), you can check for an up-to-date state.tgz file.
At the ESXi host, hit ALT+F1, then type in ‘unsupported’, press Enter and log in with the root password.
Run ls -l /bootbank/
and check the timestamp of state.tgz (it should show 1 minute past the most recent hour).
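The timestamp check above can also be scripted rather than read by eye. This is only a rough sketch: is_stale is a hypothetical helper name, and it assumes that stat -c %Y and date +%s are available in the shell you are using (they are in GNU coreutils and in typical BusyBox builds, but I have not confirmed this for every ESXi release).

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if FILE was last modified more than
# MAX_AGE seconds ago, exit 1 if it is fresher, exit 2 if stat fails.
# Assumes `stat -c %Y` (modification time as epoch seconds) is supported.
is_stale() {
    file=$1
    max_age=$2
    mtime=$(stat -c %Y "$file") || return 2
    now=$(date +%s)
    [ $((now - mtime)) -gt "$max_age" ]
}
```

On the host this might be called as is_stale /bootbank/state.tgz 3900 && echo "backup looks old", where 3900 seconds (an hour plus a few minutes of slack) is an arbitrary threshold chosen for the hourly backup.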
If this shows an old version, you could try to force a backup:
/sbin/backup.sh 0 /bootbank/
Check again whether the timestamp on state.tgz gets updated.
If not, we should try to fix the corruption (if this is indeed the cause).
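The "force a backup, then look at the timestamp again" step can be wrapped up in a small function. This is a sketch under assumptions, not a VMware-supplied tool: mtime_changes_after is a name I made up, and it again relies on stat -c %Y being available.

```shell
#!/bin/sh
# Hypothetical helper: run a command and report whether FILE's
# modification time changed as a result.
# Exit 0 = timestamp changed, 1 = unchanged, 2 = stat failed.
mtime_changes_after() {
    file=$1
    shift
    before=$(stat -c %Y "$file") || return 2
    # Run the command; a non-zero exit is reported but not fatal,
    # because the timestamp comparison is what we care about.
    "$@" || echo "command exited non-zero: $*" >&2
    after=$(stat -c %Y "$file") || return 2
    [ "$after" != "$before" ]
}
```

With the backup command from above, that would be mtime_changes_after /bootbank/state.tgz /sbin/backup.sh 0 /bootbank/ and a non-zero result suggests the backup is not being written.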
In order to identify the different partitions, you can use one of 2 methods:
~ # esxcfg-vmhbadevs -f
~ # ls -l | grep vmfs
What you need to do is simply identify which partitions are the /bootbank and /altbootbank partitions, so that you can run a check.
To run a check on a partition, simply run dosfsck -v /dev/disks/<hba id>
e.g. dosfsck -v /dev/disks/vmhba1:0:0:4
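If you want to check both bank partitions in one go, a small loop around the same dosfsck -v call is enough. The sketch below is illustrative only: check_partitions and the DOSFSCK override variable are hypothetical names (the override just lets you dry-run the loop with a stand-in command), and the device paths you pass in are whatever you identified in the previous step.

```shell
#!/bin/sh
# DOSFSCK is a hypothetical override; by default the real tool is used.
: "${DOSFSCK:=dosfsck}"

# Run a check-only pass (no -a or -r, so nothing is repaired) on each
# device given. Returns non-zero if any device did not check clean.
check_partitions() {
    failed=0
    for dev in "$@"; do
        if "$DOSFSCK" -v "$dev"; then
            echo "OK: $dev"
        else
            echo "PROBLEM: $dev"
            failed=1
        fi
    done
    return "$failed"
}
```

For the example device above this would be check_partitions /dev/disks/vmhba1:0:0:4, with the second bank's device added as a further argument.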
Of course, using this tool, you can perform various other checks, so I have included the contents of the man page below:
NAME
dosfsck - check and repair MS-DOS file systems
SYNOPSIS
dosfsck [-aAflnrtvVwy] [-d path -d ...] [-u path -u ...] device
DESCRIPTION
dosfsck verifies the consistency of MS-DOS file systems and optionally tries to repair them. The following file system problems can be corrected (in this order):
- FAT contains invalid cluster numbers. Cluster is changed to EOF.
- File's cluster chain contains a loop. The loop is broken.
- Bad clusters (read errors). The clusters are marked bad and they are removed from files owning them. This check is optional.
- Directories with a large number of bad entries (probably corrupt). The directory can be dropped.
- Files . and .. are non-directories. They can be dropped or renamed.
- Directories . and .. in root directory. They are dropped.
- Bad file names. They can be renamed.
- Duplicate directory entries. They can be dropped or renamed.
- Directories with non-zero size field. Size is set to zero.
- Directory . does not point to parent directory. The start pointer is adjusted.
- Directory .. does not point to parent of parent directory. The start pointer is adjusted.
- Start cluster number of a file is invalid. The file is truncated.
- File contains bad or free clusters. The file is truncated.
- File's cluster chain is longer than indicated by the size fields. The file is truncated.
- Two or more files share the same cluster(s). All but one of the files are truncated. If the file being truncated is a directory file that has already been read, the file system check is restarted after truncation.
- File's cluster chain is shorter than indicated by the size fields. The file is truncated.
- Clusters are marked as used but are not owned by a file. They are marked as free.
Additionally, the following problems are detected, but not repaired:
- Invalid parameters in boot sector.
- Absence of . and .. entries in non-root directories
When dosfsck checks a file system, it accumulates all changes in memory and performs them only after all checks are complete. This can be disabled with the -w option.
OPTIONS
-a Automatically repair the file system. No user intervention is necessary. Whenever there is more than one method to solve a problem, the least destructive approach is used.
-A Use Atari variation of the MS-DOS filesystem. This is default if dosfsck is run on an Atari, then this option turns off Atari format. There are some minor differences in Atari format: Some boot sector fields are interpreted slightly different, and the special FAT entries for end-of-file and bad cluster can be different. Under MS-DOS 0xfff8 is used for EOF and Atari employs 0xffff by default, but both systems recognize all values from 0xfff8...0xffff as end-of-file. MS-DOS uses only 0xfff7 for bad clusters, where on Atari values 0xfff0...0xfff7 are for this purpose (but the standard value is still 0xfff7).
-d Drop the specified file. If more than one file with that name exists, the first one is dropped.
-f Salvage unused cluster chains to files. By default, unused clusters are added to the free disk space except in auto mode (-a).
-l List path names of files being processed.
-n No-operation mode: non-interactively check for errors, but don't write anything to the filesystem.
-r Interactively repair the file system. The user is asked for advice whenever there is more than one approach to fix an inconsistency.
-t Mark unreadable clusters as bad.
-u Try to undelete the specified file. dosfsck tries to allocate a chain of contiguous unallocated clusters beginning with the start cluster of the undeleted file.
-v Verbose mode. Generates slightly more output.
-V Perform a verification pass. The file system check is repeated after the first run. The second pass should never report any fixable errors. It may take considerably longer than the first pass, because the first pass may have generated long list of modifications that have to be scanned for each disk read.
-w Write changes to disk immediately.
-y Same as -a (automatically repair filesystem) for compatibility with other fsck tools.
If -a and -r are absent, the file system is only checked, but not repaired.
EXIT STATUS
0 No recoverable errors have been detected.
1 Recoverable errors have been detected or dosfsck has discovered an internal inconsistency.
2 Usage error. dosfsck did not access the file system.
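If you script the check, the three exit statuses listed above can be turned into readable messages. This is just a sketch based on the man page text; describe_dosfsck_status is a hypothetical helper name, and any status other than 0, 1 or 2 is reported as unexpected rather than guessed at.

```shell
#!/bin/sh
# Hypothetical helper: translate a dosfsck exit status into a message,
# using the meanings given in the EXIT STATUS section of the man page.
describe_dosfsck_status() {
    case "$1" in
        0) echo "no recoverable errors detected" ;;
        1) echo "recoverable errors detected, or internal inconsistency" ;;
        2) echo "usage error, file system not accessed" ;;
        *) echo "unexpected exit status: $1" ;;
    esac
}
```

After a run such as dosfsck -v /dev/disks/vmhba1:0:0:4, calling describe_dosfsck_status $? prints the matching line.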
BUGS
Does not create . and .. files where necessary. Does not remove entirely empty directories. Should give more diagnostic messages. Undeleting files should use a more sophisticated algorithm.
AUTHORS
Werner Almesberger <email@example.com>
Extensions (FAT32, VFAT) by and current maintainer: Roman Hodek <firstname.lastname@example.org>