Patching and Upgrades

The Need for Patching and Upgrades

It is essential that we keep our platforms and software current, in order to:

  • Keep our software patched against any known security vulnerabilities.
  • Avoid falling so far behind that upgrading becomes costly and infeasible.
  • Be able to leverage vendor SLAs, and to meet our own internal recovery time objectives (RTOs).
  • Be able to benefit from new features.
  • Limit technical debt.

Note: there is some overlap between the terms patching and upgrading. As a rule of thumb:

  • Patching applies a software update that addresses one or more specific vulnerabilities, or a specific defect. It may increment the patch-level (third) component of the software's version number (e.g. 4.1.2 -> 4.1.3).
  • Upgrading is a more significant software update. It may remediate many vulnerabilities or defects, introduce significant changes in functionality, or roll up several previous patches. It typically increments the software's minor or major version number (e.g. 4.1 -> 4.2).

Regardless of whether an update is considered patching or upgrading, the underlying approach and mechanisms are typically similar. Consequently, the rest of this page treats them as the same thing.
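
As an illustration of the distinction above, the following minimal Python sketch classifies a version change using the two examples given (4.1.2 to 4.1.3, and 4.1 to 4.2). It assumes dotted numeric version strings, which not every vendor uses; the function names are our own and purely illustrative.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Split a dotted version string such as '4.1.2' into integers."""
    return tuple(int(part) for part in version.split("."))


def classify_update(old: str, new: str) -> str:
    """Return 'patch' if only the third component changes, else 'upgrade'."""
    # Pad to three components so that '4.1' compares like '4.1.0'.
    old_v = (parse_version(old) + (0, 0, 0))[:3]
    new_v = (parse_version(new) + (0, 0, 0))[:3]
    if new_v <= old_v:
        raise ValueError(f"{new} is not newer than {old}")
    # Same first two components: only the last component moved.
    if old_v[:2] == new_v[:2]:
        return "patch"
    return "upgrade"


print(classify_update("4.1.2", "4.1.3"))  # patch
print(classify_update("4.1", "4.2"))      # upgrade
```

Either result leads to the same process in practice, which is why the distinction is not carried any further.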

Google Managed Services

Resources that are fully managed by Google (such as Cloud Run, App Engine, Cloud Storage, Cloud SQL, BigQuery, Cloud Pub/Sub, Cloud Load Balancing, Cloud KMS and Container Registry) are patched and upgraded by Google. This is typically transparent to the tenant, although a maintenance window sometimes needs to be specified, for example for Cloud SQL instances.

Google Kubernetes Engine (GKE)

  • The GKE Control Plane is patched and upgraded by Google. This is completely transparent to any tenant.
  • GKE nodes (i.e. the servers where workloads are hosted) are also automatically upgraded by Google.
    • This includes upgrading the operating system, and Kubernetes itself.
    • This is typically non-disruptive to tenants, since we limit the number of nodes that can be upgraded in parallel. Standard K8s workload scheduling ensures that workloads remain available even when worker nodes are unavailable.
  • We use Google Release Channels to determine when our GKE clusters are upgraded:
    • Non-Production (Npd and Sandbox) clusters are enrolled in the Regular release channel. This means that these clusters will typically be upgraded at least 3 months after the Kubernetes release has reached general availability.
    • Production clusters are enrolled in the Stable release channel. This means that these clusters will typically be upgraded 2-3 months later than the Non-Production clusters.
    • Upgrades can be rolled back, if required.
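
The channel enrolment and the limit on parallel node upgrades can be sketched in Python as follows. This is a simplified model only: GKE performs the real scheduling, and the environment keys, node names and `max_parallel` parameter are assumptions made for illustration.

```python
# Environment-to-channel mapping taken from the bullets above.
# The dictionary keys are hypothetical environment identifiers.
RELEASE_CHANNELS = {
    "sandbox": "REGULAR",
    "npd": "REGULAR",
    "prod": "STABLE",
}


def upgrade_batches(nodes: list[str], max_parallel: int) -> list[list[str]]:
    """Group nodes into batches so at most max_parallel upgrade at once."""
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    return [nodes[i:i + max_parallel] for i in range(0, len(nodes), max_parallel)]


nodes = [f"node-{n}" for n in range(1, 6)]  # hypothetical node names
for batch in upgrade_batches(nodes, max_parallel=2):
    # While one batch is being upgraded, the remaining nodes keep serving workloads.
    print(batch)
```

A smaller `max_parallel` lengthens the upgrade but leaves more capacity available for rescheduled workloads.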

Unmanaged Services

This typically refers to IaaS-type products. For example, Google Compute Engine (GCE) provides virtual machine instances on Google Cloud; but the patching and upgrading of all software on those instances - including operating systems, application servers, self-managed databases - is our responsibility.

Image Baking

To ensure that we always use secure, CIS-compliant, patched operating system images, we start by taking a standard CIS-compliant hardened Shielded VM image from Google.

We then apply some additional organisational configuration, using HashiCorp Packer. The result is a CIS-compliant organisational gold image for a given OS.

graph LR
  GImg["Google CIS-Compliant Shielded VM Image"] -- Packer --> OrgImg["Org CIS-Compliant Gold Image"]
  OrgImg -- Push --> GImgSt["Google Image Storage"]
  classDef default fill:#2874A6,stroke:#555,color:white;
  linkStyle default fill:none,color:black;

This process is automated by a CI/CD pipeline, which refreshes our images on a weekly basis.

Deploying Instances

We are now able to build GCE instances (VMs) from our image. On LZiaB, our security policy only permits the use of Shielded VM images, i.e. Google images that are already in the Shielded category, or organisational CIS-compliant gold images.
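
The image rule above can be expressed as a simple check. The Python sketch below is a model of the rule only, not the actual policy mechanism; the `Image` type, its fields and the `example-org-images` project name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    name: str
    project: str       # project that hosts the image
    shielded_vm: bool  # True if the image is in the Shielded VM category


# Hypothetical project that holds the organisational gold images.
ORG_IMAGE_PROJECT = "example-org-images"


def is_permitted(image: Image) -> bool:
    """Permit Shielded VM images and organisational gold images only."""
    return image.shielded_vm or image.project == ORG_IMAGE_PROJECT


gold = Image("org-gold-os", ORG_IMAGE_PROJECT, shielded_vm=True)
legacy = Image("legacy-os", "some-other-project", shielded_vm=False)
print(is_permitted(gold))    # True
print(is_permitted(legacy))  # False
```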

graph LR
  GImg["Org/Google CIS-Compliant Shielded VM Image"] -- "Deploy instance" --> GCE["GCE Instance"]
  GCE -- "Apply Startup Script" --> GCE_Ans["GCE Instance registered with Ansible"]
  GCE_Ans -- "Apply agents and patches" --> GCE_Go["Instance Ready for Use"]
  classDef default fill:#2874A6,stroke:#555,color:white;
  linkStyle default fill:none,color:black;

Patching

  • Ansible will push any updates to built instances on a routine schedule.
  • Non-prod instances are always patched before prod instances.
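
The ordering rule can be sketched in Python as below. The instance names and the environment labels are hypothetical, and the real scheduling is done by Ansible rather than by code like this.

```python
def patch_order(instances: list[tuple[str, str]]) -> list[str]:
    """Return instance names with every non-prod instance before any prod one.

    Each item is a (name, environment) pair. sorted() is stable, so the
    original relative order is preserved within each group.
    """
    ranked = sorted(instances, key=lambda item: item[1] == "prod")
    return [name for name, _env in ranked]


fleet = [
    ("web-prod", "prod"),
    ("web-npd", "npd"),
    ("db-prod", "prod"),
    ("lab", "sandbox"),
]
print(patch_order(fleet))  # ['web-npd', 'lab', 'web-prod', 'db-prod']
```

Patching non-prod first means that a faulty update is likely to surface before it reaches production.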

Image Refresh

Twice yearly, any deployed instances will be destroyed and recreated, using the latest version of the OS image.
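
A minimal Python sketch of how instances due for a rebuild might be identified is shown below. It assumes that "twice yearly" means roughly every 182 days and that each instance's creation date is known; both the interval and the instance names are assumptions.

```python
from datetime import date, timedelta

# "Twice yearly" approximated as every 182 days; this value is an assumption.
REBUILD_INTERVAL = timedelta(days=182)


def due_for_rebuild(created: dict[str, date], today: date) -> list[str]:
    """Return the names of instances created at least REBUILD_INTERVAL ago."""
    return sorted(
        name for name, made in created.items()
        if today - made >= REBUILD_INTERVAL
    )


created = {
    "app-01": date(2024, 1, 1),  # hypothetical instances and dates
    "app-02": date(2024, 6, 1),
}
print(due_for_rebuild(created, today=date(2024, 7, 15)))  # ['app-01']
```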