ESXi host losing settings at reboot – checking system partitions of ESX host

On three separate occasions I have seen ESX hosts that appeared to lose their config at reboot.
I stumbled across a useful doc once and copied some of it, but cannot remember where I found it.

In all instances, the problem was caused by the bootbank being corrupted, and I used the process below to resolve the issue.

The way VMware operates is that three hypervisor partitions are created and used for normal operation. They are mounted as /bootbank, /altbootbank and /store.

/store is simply used to ‘store’ data (e.g. VMware Tools ISOs, the VI Client, etc.) as well as information for the vCenter Server agent and the HA agent.

Once you configure the ScratchConfig.ConfiguredScratchLocation parameter for an ESX host (swap file), it will mount a 4th partition for this purpose.

The first two partitions mentioned above are used for the ‘running’ config and the ‘saved’ config, very loosely similar to the way in which a Cisco router stores two different configs.
With VMware, though, /bootbank and /altbootbank are effectively the running copy of the ESX firmware / config and the last saved version.

VMware backs up its running config every hour (at 1 minute past the hour).

To see whether you have been getting updated backups (or whether there is a failure in the process), you can check for an up-to-date state.tgz file.

At the ESXi host console, press ALT+F1, then type ‘unsupported’, press Enter and log in with the root password.

 ls -l /bootbank/

and check the timestamp of state.tgz (it should be no older than 1 minute past the most recent hour).
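
If you would rather not eyeball the timestamp, the same check can be sketched as a small shell helper. This is only a rough sketch: the function names are my own (hypothetical), it assumes the backup really does land at minute 1 of every hour, that the host clock's UTC offset is a whole number of hours, and that `stat -c %Y` is available in the shell you are using (true for GNU coreutils and most BusyBox builds, but I have not verified it on every ESXi release).

```shell
#!/bin/sh
# Sketch only, under the assumptions stated above.

# Print the epoch time of the most recent hh:01:00 at or before the given time.
last_backup_slot() {
    now=$1
    echo $(( (now - 60) / 3600 * 3600 + 60 ))
}

# Succeed (0) if the file was modified at or after the most recent backup slot,
# fail (1) if it is older, and return 2 if the file cannot be read.
backup_is_current() {
    file=$1
    now=${2:-$(date +%s)}
    [ -f "$file" ] || return 2
    mtime=$(stat -c %Y "$file") || return 2
    [ "$mtime" -ge "$(last_backup_slot "$now")" ]
}

# Example call on the host (left commented out so nothing runs on paste):
# if backup_is_current /bootbank/state.tgz; then
#     echo "backup is current"
# else
#     echo "backup is stale or missing"
# fi
```

Bear in mind that a check made in the first few seconds after hh:01 may report a stale file simply because the backup has not finished writing yet.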

If this shows an old version, you could try to force a backup:

/sbin/backup.sh 0 /bootbank/

Check again whether the timestamp on local.tgz gets updated.
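
The "run the backup, then look at the timestamp again" step can be wrapped up as follows. Again this is a sketch, not anything VMware ships: `mtime_changed_by` is a hypothetical name of mine, it relies on `stat -c %Y` being available, and because it compares timestamps with one-second resolution it only makes sense when the file was stale to begin with.

```shell
#!/bin/sh
# Hypothetical helper: run a command and report whether it changed the
# modification time of a given file.
#   returns 0 if the mtime changed
#   returns 1 if the mtime is unchanged
#   returns 2 if the command itself failed
mtime_changed_by() {
    file=$1
    shift
    before=$(stat -c %Y "$file" 2>/dev/null || echo 0)
    "$@" || return 2
    after=$(stat -c %Y "$file" 2>/dev/null || echo 0)
    [ "$after" -ne "$before" ]
}

# Example call on the host (commented out; pass whichever archive your host
# actually keeps in /bootbank):
# if mtime_changed_by /bootbank/state.tgz /sbin/backup.sh 0 /bootbank/; then
#     echo "backup refreshed"
# else
#     echo "backup did not update, suspect the bootbank"
# fi
```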

If not, we should try to fix the corruption (if this is indeed the cause).

In order to identify the different partitions, you can use one of two methods:

~ # esxcfg-vmhbadevs -f
~ # ls -l | grep vmfs

What you need to do is identify which partitions are the /bootbank and /altbootbank partitions, so that you can run a check.

To run a check on a partition, run dosfsck -v /dev/disks/<hba id>, for example:

dosfsck -v /dev/disks/vmhba1:0:0:4
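
If you want to script the check, something along these lines might do. It is a sketch under assumptions: the function names are hypothetical, the exit codes (0, 1, 2) and the -n (check only, write nothing) flag are taken from the dosfsck man page quoted in this post, and I have not confirmed that every ESXi build of dosfsck honours -n.

```shell
#!/bin/sh
# Translate a dosfsck exit status into a readable message, using the
# exit codes documented in the dosfsck man page.
explain_fsck_status() {
    case "$1" in
        0) echo "no recoverable errors detected" ;;
        1) echo "recoverable errors detected (or internal inconsistency)" ;;
        2) echo "usage error, file system not accessed" ;;
        *) echo "unexpected exit status: $1" ;;
    esac
}

# Hypothetical wrapper: read-only, verbose check of one device.
# -n asks dosfsck not to write anything to the file system.
check_partition() {
    dev=$1
    dosfsck -n -v "$dev"
    status=$?
    explain_fsck_status "$status"
    return "$status"
}

# Example call on the host (commented out so nothing runs on paste):
# check_partition /dev/disks/vmhba1:0:0:4
```

Keeping the first pass read-only means you can see what dosfsck thinks is wrong before deciding whether to let it repair anything.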

Of course, using this tool, you can perform various other checks, so I have included the contents of the man page below:

NAME
    dosfsck - check and repair MS-DOS file systems

SYNOPSIS
    dosfsck [-aAflnrtvVwy] [-d path -d ...] [-u path -u ...] device

DESCRIPTION
    dosfsck verifies the consistency of MS-DOS file systems and optionally
    tries to repair them. The following file system problems can be
    corrected (in this order):

    - FAT contains invalid cluster numbers. Cluster is changed to EOF.
    - File's cluster chain contains a loop. The loop is broken.
    - Bad clusters (read errors). The clusters are marked bad and they
      are removed from files owning them. This check is optional.
    - Directories with a large number of bad entries (probably corrupt).
      The directory can be dropped.
    - Files . and .. are non-directories. They can be dropped or renamed.
    - Directories . and .. in root directory. They are dropped.
    - Bad file names. They can be renamed.
    - Duplicate directory entries. They can be dropped or renamed.
    - Directories with non-zero size field. Size is set to zero.
    - Directory . does not point to parent directory. The start
      pointer is adjusted.
    - Directory .. does not point to parent of parent directory. The
      start pointer is adjusted.
    - Start cluster number of a file is invalid. The file is truncated.
    - File contains bad or free clusters. The file is truncated.
    - File's cluster chain is longer than indicated by the size
      fields. The file is truncated.
    - Two or more files share the same cluster(s). All but one of the
      files are truncated. If the file being truncated is a directory
      file that has already been read, the file system check is
      restarted after truncation.
    - File's cluster chain is shorter than indicated by the size
      fields. The file is truncated.
    - Clusters are marked as used but are not owned by a file. They
      are marked as free.

    Additionally, the following problems are detected, but not repaired:

    - Invalid parameters in boot sector.
    - Absence of . and .. entries in non-root directories

    When dosfsck checks a file system, it accumulates all changes in memory
    and performs them only after all checks are complete. This can be
    disabled with the -w option.

OPTIONS
    -a  Automatically repair the file system. No user intervention is
        necessary. Whenever there is more than one method to solve a
        problem, the least destructive approach is used.
    -A  Use Atari variation of the MS-DOS filesystem. This is default if
        dosfsck is run on an Atari, then this option turns off Atari
        format. There are some minor differences in Atari format: Some
        boot sector fields are interpreted slightly different, and the
        special FAT entries for end-of-file and bad cluster can be
        different. Under MS-DOS 0xfff8 is used for EOF and Atari employs
        0xffff by default, but both systems recognize all values from
        0xfff8...0xffff as end-of-file. MS-DOS uses only 0xfff7 for bad
        clusters, where on Atari values 0xfff0...0xfff7 are for this
        purpose (but the standard value is still 0xfff7).
    -d  Drop the specified file. If more than one file with that name
        exists, the first one is dropped.
    -f  Salvage unused cluster chains to files. By default, unused
        clusters are added to the free disk space except in auto mode (-a).
    -l  List path names of files being processed.
    -n  No-operation mode: non-interactively check for errors, but don't
        write anything to the filesystem.
    -r  Interactively repair the file system. The user is asked for
        advice whenever there is more than one approach to fix an
        inconsistency.
    -t  Mark unreadable clusters as bad.
    -u  Try to undelete the specified file. dosfsck tries to allocate a
        chain of contiguous unallocated clusters beginning with the
        start cluster of the undeleted file.
    -v  Verbose mode. Generates slightly more output.
    -V  Perform a verification pass. The file system check is repeated
        after the first run. The second pass should never report any
        fixable errors. It may take considerably longer than the first
        pass, because the first pass may have generated a long list of
        modifications that have to be scanned for each disk read.
    -w  Write changes to disk immediately.
    -y  Same as -a (automatically repair filesystem) for compatibility
        with other fsck tools.

    If -a and -r are absent, the file system is only checked, but not
    repaired.

EXIT STATUS
    0   No recoverable errors have been detected.
    1   Recoverable errors have been detected or dosfsck has discovered
        an internal inconsistency.
    2   Usage error. dosfsck did not access the file system.

BUGS
    Does not create . and .. files where necessary. Does not remove
    entirely empty directories. Should give more diagnostic messages.
    Undeleting files should use a more sophisticated algorithm.

AUTHORS
    Werner Almesberger <werner.almesberger@lrc.di.epfl.ch>. Extensions
    (FAT32, VFAT) by and current maintainer: Roman Hodek <roman@hodek.net>