About host events

During the lifespan of a virtual machine (VM) instance or bare metal instance, the host machine that your instance runs on can experience multiple host events. A host event can be the regular maintenance of Compute Engine infrastructure or, in rare cases, a host error. You can choose how your VM and bare metal instances respond during or after a host event by configuring the host maintenance policy.

By default, most instances are set to live migrate during host events. You can override this behavior and explicitly set the instances to terminate and optionally restart. Some machine types don't support live migration, such as Z3 instances with more than 18 TiB of attached Titanium SSD, bare metal instances, or instances with attached GPUs. These instances terminate during host events. For more information, see Maintenance and restart behaviors.

Types of host events

There are two types of host events, maintenance events and host errors, which are described in more detail in the following sections.

If your instance becomes unresponsive, then this can also trigger a restart or termination of the instance.

Maintenance events

A maintenance event occurs when Compute Engine must perform a maintenance or repair activity that requires moving VMs off the host server. If you enable the live migration host maintenance policy for a supported instance type, then Compute Engine moves the instance to a new host, and there is minimal disruption to your application.

Compute Engine also applies some lightweight hypervisor and network upgrades in the background nondisruptively by retaining the instance on the same host.

Instance behavior during a maintenance event can vary depending on the tenancy of the instance and the machine type. You can find information about the maintenance behavior for each machine type on the respective machine family page, as follows:

For information about the maintenance policies for specific machine series, see the machine series comparison.

For sole-tenant VMs, the approximate frequency of planned host maintenance events is every 4 to 6 weeks. Whether or not live migration is supported depends on the host maintenance policy for the sole-tenant VM.

Host errors

A host error (compute.instances.hostError) means that there was a hardware or software issue on the physical machine or the data center infrastructure hosting your compute instance that caused your instance to crash. A host error involving a total hardware failure or other hardware issues might prevent the live migration of your instance. If your instance is set to automatically restart, which is the default setting, Compute Engine restarts your instance, typically within three minutes from the time the error was detected. Depending on the issue, the restart might take up to 5.5 minutes.

Occasionally, a compute instance might become unresponsive before a host error is signaled. You can reduce the amount of time Compute Engine waits to restart or terminate the instance by setting the host error recovery timeout. For more information, see Set availability policies.

Physical hardware and software failures can happen occasionally but are rare occurrences. To protect your applications and services from these potentially disruptive system events, review the following resources:

Host maintenance policy overview

An instance's host maintenance policy determines how it behaves during the following host events:

  • Maintenance event
  • Host error event or instance not responding

You can configure instances to continue running during host maintenance while Compute Engine live migrates them to another host, or you can choose to stop your instances instead.

You can change an instance's host maintenance policy by configuring the following settings:

  • Maintenance behavior: whether the instance is live migrated or stopped when there is a maintenance event.
  • Restart behavior: whether Compute Engine restarts or terminates the instance if the instance crashes, experiences a host error, or becomes unresponsive.
  • Host error detection time: the maximum amount of time that Compute Engine waits to restart or terminate an instance after detecting that the instance is unresponsive.

You can update an instance's host maintenance policy at any time to control how you want your instances to behave.
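As a rough illustration of how the three settings above fit together, the following Python sketch builds a scheduling fragment for an instance resource. Only automaticRestart is named on this page; the field names onHostMaintenance and hostErrorTimeoutSeconds, the MIGRATE and TERMINATE values, and the 90 to 330 second range in 30 second steps are assumptions taken from the public Compute Engine API reference, and build_scheduling is a hypothetical helper, not part of any client library. Verify the names and limits for your environment before relying on them.

```python
from typing import Optional

# Assumed limits for the host error timeout (seconds); not stated on this page.
_TIMEOUT_MIN, _TIMEOUT_MAX, _TIMEOUT_STEP = 90, 330, 30


def build_scheduling(live_migrate: bool,
                     automatic_restart: bool = True,
                     host_error_timeout_seconds: Optional[int] = None) -> dict:
    """Return a hypothetical `scheduling` fragment for an instance resource."""
    scheduling = {
        # Maintenance behavior: live migrate or stop on a maintenance event.
        "onHostMaintenance": "MIGRATE" if live_migrate else "TERMINATE",
        # Restart behavior: restart after a crash or host error.
        "automaticRestart": automatic_restart,
    }
    if host_error_timeout_seconds is not None:
        in_range = _TIMEOUT_MIN <= host_error_timeout_seconds <= _TIMEOUT_MAX
        on_step = host_error_timeout_seconds % _TIMEOUT_STEP == 0
        if not (in_range and on_step):
            raise ValueError(
                "host_error_timeout_seconds must be between "
                f"{_TIMEOUT_MIN} and {_TIMEOUT_MAX} in steps of {_TIMEOUT_STEP}"
            )
        # Host error detection time: how long to wait on an unresponsive instance.
        scheduling["hostErrorTimeoutSeconds"] = host_error_timeout_seconds
    return scheduling


if __name__ == "__main__":
    print(build_scheduling(live_migrate=False, host_error_timeout_seconds=120))
```

The helper validates the timeout locally so that an out-of-range value fails early rather than at request time; the resulting dictionary would still need to be sent through whichever client or tool you use to update the instance.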

Maintenance and restart behaviors

When a host event occurs, Compute Engine either live migrates the compute instance or terminates it. If an instance is terminated, then you can choose to restart the instance yourself or have Compute Engine automatically restart it.

The following machine series might not support live migration, and instead require termination during host events:

Live migrate

By default, most instance types are set to live migrate, excluding the instance types mentioned in the previous section.

During live migration, Compute Engine automatically migrates your instance away from an infrastructure maintenance event, and your instance remains running during the migration. Your instance might experience a short period of decreased performance, but in general, most instances shouldn't perform noticeably differently. This is ideal for instances that require constant uptime and can tolerate a short period of decreased performance.

When Compute Engine migrates your instance, it reports a system event that is published to the list of zone operations and to the System Events logs. You can review this event by viewing the Compute Engine operations for a specific zone. Live migration events have the following operation type:

compute.instances.migrateOnHostMaintenance

Terminate and restart

If you don't want your instance to live migrate, or if your instance type doesn't support live migration, then you can instead choose to allow Trusted Cloud by S3NS to stop the instance when a host event occurs. With this configuration, if a host event occurs, then Compute Engine sends a soft power-off signal to shut down the instance. It then waits 60 seconds for the instance to shut down cleanly, and sets the instance status to TERMINATED. If the instance doesn't shut down cleanly in 60 seconds, then it is forcibly terminated.
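An application can use the 60-second window described above to finish in-flight work. The following Python sketch assumes that the guest OS turns the soft power-off into a SIGTERM for your process, which is typical on Linux but is not stated on this page. ShutdownWatcher and its safety margin are hypothetical names chosen for illustration.

```python
import signal
import threading
import time

SHUTDOWN_BUDGET_SECONDS = 60.0  # grace period before forced termination (from the text)


class ShutdownWatcher:
    """Tracks whether a stop was requested and how much time remains."""

    def __init__(self, budget_seconds: float = SHUTDOWN_BUDGET_SECONDS,
                 safety_margin: float = 10.0) -> None:
        self._stop = threading.Event()
        self._deadline = None
        # Keep a margin so cleanup finishes before the instance is forcibly stopped.
        self._budget = max(0.0, budget_seconds - safety_margin)

    def handle_signal(self, signum, frame) -> None:
        """Signal handler: record the deadline and flag the stop request."""
        self._deadline = time.monotonic() + self._budget
        self._stop.set()

    def install(self) -> None:
        """Register the handler for SIGTERM; call this from the main thread."""
        signal.signal(signal.SIGTERM, self.handle_signal)

    @property
    def stopping(self) -> bool:
        return self._stop.is_set()

    def seconds_left(self):
        """Seconds remaining for cleanup, or None if no stop was requested."""
        if self._deadline is None:
            return None
        return max(0.0, self._deadline - time.monotonic())
```

A worker loop would call install() once at startup, check stopping between units of work, and use seconds_left() to decide whether to start another unit or flush state and exit.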

This option is ideal if your instances demand constant, maximum performance, and if your overall application is built to handle instance failures or reboots.

When Compute Engine stops an instance because of a host event, it reports a system event that is published to the list of zone operations and to the System Events logs. You can review this event by viewing the Compute Engine operations for a specific zone. Instance termination events have the following operation type:

compute.instances.terminateOnHostMaintenance

Automatic restart

If your instance is configured to stop when there is a maintenance event, or if your instance crashes because of an underlying hardware issue, then Compute Engine can automatically restart the instance. The instance is either restarted on the same host server, or moved to another server in the same zone that isn't participating in the maintenance event.

By default, Compute Engine tries to recover instances with attached Local SSD disks for one hour. If the time limit is reached, then Compute Engine attempts to restart the instance on a different host server in the same zone.

To configure automatic restart, set the host maintenance policy field automaticRestart to true. This setting does not apply if the instance is taken offline because of a zonal outage or through a manual operation, such as calling sudo shutdown within the guest OS.

When Compute Engine automatically restarts your instance, it reports a system event that is published to the list of zone operations. You can review this event by viewing the Compute Engine operations for a specific zone. Automatic restart events have the following operation type:

compute.instances.automaticRestart
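The operation types named on this page can be used to pick host events out of a list of zone operations. The following Python sketch assumes each operation is a dictionary with operationType and targetLink keys, as in the public Compute Engine operation resource; summarize_host_events is a hypothetical helper, and the sample records are invented for illustration only.

```python
# Operation types taken from this page, mapped to short labels.
HOST_EVENT_TYPES = {
    "compute.instances.migrateOnHostMaintenance": "live migration",
    "compute.instances.terminateOnHostMaintenance": "termination",
    "compute.instances.automaticRestart": "automatic restart",
    "compute.instances.hostError": "host error",
}


def summarize_host_events(operations):
    """Return (instance name, label) pairs for operations that are host events."""
    events = []
    for op in operations:
        label = HOST_EVENT_TYPES.get(op.get("operationType"))
        if label is None:
            continue  # not a host event, for example a user-initiated insert
        # The instance name is assumed to be the last path segment of targetLink.
        instance = op.get("targetLink", "").rsplit("/", 1)[-1]
        events.append((instance, label))
    return events


if __name__ == "__main__":
    # Invented sample data; real records come from the zone operations list.
    sample = [
        {"operationType": "insert", "targetLink": "zones/z/instances/vm-a"},
        {"operationType": "compute.instances.hostError",
         "targetLink": "zones/z/instances/vm-a"},
        {"operationType": "compute.instances.automaticRestart",
         "targetLink": "zones/z/instances/vm-a"},
    ]
    for name, label in summarize_host_events(sample):
        print(name, label)
```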

Disk persistence following instance termination

Because Hyperdisk volumes are network-attached storage, when your instance restarts, Compute Engine reattaches the boot disk and any secondary disks to the instance. The data on those disks persists through live migration and instance restarts.

Maintenance scheduling

Trusted Cloud by S3NS provides features that allow tighter control around maintenance. By using certain machine families, you can specify maintenance preferences and get notifications of upcoming maintenance events through Cloud Logging, the instance's metadata server, the gcloud CLI compute instances describe command, or the REST instances.describe method. After you receive a notification, you have a period of time in which you can start the scheduled maintenance at a time you choose. If you don't trigger the scheduled maintenance, then the maintenance event occurs at the end of the notification time period, which is the scheduled time listed in the notification.
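As one possible way to read a maintenance notice from inside an instance, the following Python sketch queries the metadata server. The host name metadata.google.internal, the instance/maintenance-event key, the Metadata-Flavor header, and the three values handled below are assumptions based on the public Compute Engine metadata server documentation. They are not confirmed by this page and might differ in your environment, and this key might not cover every notification type described above. The function names are hypothetical.

```python
import urllib.request

# Assumed endpoint and key; confirm both for your environment.
METADATA_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                "instance/maintenance-event")


def build_request(wait_for_change: bool = False) -> urllib.request.Request:
    """Build the metadata request; wait_for_change asks the server to block until the value changes."""
    url = METADATA_URL + ("?wait_for_change=true" if wait_for_change else "")
    return urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})


def interpret(value: str) -> str:
    """Map the assumed maintenance-event values to a short description."""
    value = value.strip()
    if value == "NONE":
        return "no maintenance pending"
    if value == "MIGRATE_ON_HOST_MAINTENANCE":
        return "live migration pending"
    if value == "TERMINATE_ON_HOST_MAINTENANCE":
        return "termination pending"
    return f"unrecognized value: {value}"


def fetch_maintenance_event(timeout: float = 5.0) -> str:
    """Fetch the current value; only works from inside an instance."""
    with urllib.request.urlopen(build_request(), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling interpret(fetch_maintenance_event()) from a periodic job would let a workload checkpoint its state before the event starts; the network call is kept separate from the parsing so that the parsing can be exercised without a metadata server.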

You can use these features in combination with your host maintenance policy to customize a maintenance schedule that fits your workload.

What's next