Cloud Ops Dashboard infrastructure operations

This document describes operational guidance for Cloud Ops Dashboard infrastructure. This service is operated on the Managed Services Platform (MSP).

If you need assistance with MSP infrastructure, reach out to the Core Services team in #discuss-core-services.

Service overview

PROPERTY | DETAILS
Service ID | cloud-ops (specification)
Owners | cloud
Service kind | Cloud Run service
Environments | prod, dev
Docker image | us-central1-docker.pkg.dev/control-plane-5e9ee072/docker/apiserver
Source code | github.com/sourcegraph/controller - cmd/apiserver

Environments

prod

PROPERTY | DETAILS
Project ID | cloud-ops-prod-dd32
Category | internal
Deployment type | subscription
Resources | prod Redis
Slack notifications | #alerts-cloud-ops-prod
Alert policies | GCP Monitoring alert policies list, Dashboard
Errors | Sentry cloud-ops-prod
Domain | cloud-ops.sgdev.org, Cloudflare WAF

MSP infrastructure access needs to be requested using Entitle for time-bound privileges.

For Terraform Cloud access, see prod Terraform Cloud.

prod Cloud Run

The Cloud Ops Dashboard prod service implementation is deployed on Google Cloud Run.

PROPERTY | DETAILS
Console | Cloud Run service
Service logs | GCP logging
Service traces | Cloud Trace
Service errors | Sentry cloud-ops-prod

You can also use sg msp to quickly open a link to your service logs:

sg msp logs cloud-ops prod
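
If you prefer the terminal over the console link, the following is a minimal sketch of reading recent log entries with the gcloud CLI. It assumes gcloud is installed and authenticated and that the Entitle access described above has been granted. The helper names build_log_filter and read_recent_logs are hypothetical and are not part of sg msp, and the filter is only one reasonable choice for Cloud Run services.

```shell
# Hypothetical helper: build a Cloud Logging filter that matches Cloud Run
# revision logs at or above a minimum severity (default: ERROR).
build_log_filter() {
  local min_severity="${1:-ERROR}"
  printf 'resource.type="cloud_run_revision" AND severity>=%s' "$min_severity"
}

# Hypothetical helper: read up to 20 matching entries from the last hour
# in the given GCP project.
read_recent_logs() {
  local project="$1" min_severity="${2:-ERROR}"
  gcloud logging read "$(build_log_filter "$min_severity")" \
    --project "$project" --freshness 1h --limit 20
}
```

For example, read_recent_logs cloud-ops-prod-dd32 WARNING would query the prod project listed above for entries at WARNING or higher.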

prod Redis

PROPERTY | DETAILS
Console | Memorystore Redis instances

prod Architecture Diagram

[Figure: Architecture Diagram]

prod Terraform Cloud

This service’s configuration is defined in sourcegraph/managed-services/services/cloud-ops/service.yaml, and sg msp generate cloud-ops prod generates the required infrastructure configuration for this environment in Terraform. Terraform Cloud (TFC) workspaces specific to each service then provision the required infrastructure from this configuration. You may want to check your service environment’s TFC workspaces if a Terraform apply fails (reported via GitHub commit status checks in the sourcegraph/managed-services repository, or in #alerts-msp-tfc).

To access this environment’s Terraform Cloud workspaces, you will need to log in to Terraform Cloud and then request Entitle access to membership in the “Managed Services Platform Operator” TFC team. The “Managed Services Platform Operator” team has access to all MSP TFC workspaces.

The Terraform Cloud workspaces for this service environment are grouped under the msp-cloud-ops-prod tag, or you can use:

sg msp tfc view cloud-ops prod
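
The two sg msp commands mentioned in this section can be chained when you have changed service.yaml and want to inspect the resulting workspaces. The sketch below assumes sg is installed and that you run it from a checkout of the sourcegraph/managed-services repository; the function name msp_regenerate_and_view is hypothetical.

```shell
# Hypothetical wrapper: regenerate the Terraform configuration for one
# environment, then open its Terraform Cloud workspaces. Stops early if
# generation fails so that stale workspaces are not opened by mistake.
msp_regenerate_and_view() {
  local service="$1" env="$2"
  sg msp generate "$service" "$env" || return 1
  sg msp tfc view "$service" "$env"
}
```

For example, msp_regenerate_and_view cloud-ops prod runs both commands for this environment.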

dev

PROPERTY | DETAILS
Project ID | cloud-ops-dev-caff
Category | internal
Deployment type | manual
Resources | dev Redis
Slack notifications | #alerts-cloud-ops-dev
Alert policies | GCP Monitoring alert policies list, Dashboard
Errors | Sentry cloud-ops-dev
Domain | cloud-ops-dev.sgdev.org, Cloudflare WAF

MSP infrastructure access needs to be requested using Entitle for time-bound privileges.

For Terraform Cloud access, see dev Terraform Cloud.

dev Cloud Run

The Cloud Ops Dashboard dev service implementation is deployed on Google Cloud Run.

PROPERTY | DETAILS
Console | Cloud Run service
Service logs | GCP logging
Service traces | Cloud Trace
Service errors | Sentry cloud-ops-dev

You can also use sg msp to quickly open a link to your service logs:

sg msp logs cloud-ops dev

dev Redis

PROPERTY | DETAILS
Console | Memorystore Redis instances

dev Architecture Diagram

[Figure: Architecture Diagram]

dev Terraform Cloud

This service’s configuration is defined in sourcegraph/managed-services/services/cloud-ops/service.yaml, and sg msp generate cloud-ops dev generates the required infrastructure configuration for this environment in Terraform. Terraform Cloud (TFC) workspaces specific to each service then provision the required infrastructure from this configuration. You may want to check your service environment’s TFC workspaces if a Terraform apply fails (reported via GitHub commit status checks in the sourcegraph/managed-services repository, or in #alerts-msp-tfc).

To access this environment’s Terraform Cloud workspaces, you will need to log in to Terraform Cloud and then request Entitle access to membership in the “Managed Services Platform Operator” TFC team. The “Managed Services Platform Operator” team has access to all MSP TFC workspaces.

The Terraform Cloud workspaces for this service environment are grouped under the msp-cloud-ops-dev tag, or you can use:

sg msp tfc view cloud-ops dev

Alert Policies

The following alert policies are defined for each of this service’s environments.

High Container CPU Utilization

High CPU Usage - it may be necessary to reduce load or increase CPU allocation

Severity: WARNING

High Container Memory Utilization

High Memory Usage - it may be necessary to reduce load or increase memory allocation

Severity: WARNING

Container Startup Latency

Service containers are taking longer than configured timeouts to start up.

Severity: WARNING

Cloud Redis - System CPU Utilization

Redis Engine CPU Utilization goes above the set threshold. The utilization is measured on a scale of 0 to 1.

Severity: WARNING

Cloud Redis - Standard Instance Failover

Instance failover occurred for a standard tier Redis instance.

Severity: WARNING

Cloud Redis - System Memory Utilization

Redis System memory utilization is above the set threshold. The utilization is measured on a scale of 0 to 1.

Severity: WARNING
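
Because the Redis utilization alerts report values on a scale of 0 to 1 rather than as percentages, it can help to convert a reading before comparing it with a dashboard. The sketch below shows that arithmetic; the helper names are hypothetical, and the 0.8 threshold in the usage note is an illustrative value, not the threshold configured for this service.

```shell
# Hypothetical helper: render a 0-to-1 utilization ratio as a percentage.
ratio_to_percent() {
  awk -v value="$1" 'BEGIN { printf "%.0f%%\n", value * 100 }'
}

# Hypothetical helper: exit 0 when the ratio is strictly above the
# threshold (both on the 0-to-1 scale), exit 1 otherwise.
utilization_exceeds() {
  awk -v value="$1" -v threshold="$2" 'BEGIN { exit !(value > threshold) }'
}
```

For example, ratio_to_percent 0.85 prints 85%, and utilization_exceeds 0.85 0.8 succeeds because 0.85 is above the illustrative 0.8 threshold.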

Cloud Run Pending Requests

There are requests pending - we may need to increase Cloud Run instance count, request concurrency, or investigate further.

Severity: WARNING

Cloud Run Instance Precondition Failed

Cloud Run instance failed to start due to a precondition failure.
This is unlikely to cause immediate downtime, and may auto-resolve if no new instances are created and/or we return to a healthy state, but you should follow up to ensure the latest Cloud Run revision is healthy.

Severity: WARNING
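
To follow up on revision health from the terminal, one option is to list recent revisions with gcloud. This is a sketch only: it assumes gcloud is installed and authenticated, and the Cloud Run region and service name are not stated in this document, so they are passed in as parameters. The function name list_recent_revisions is hypothetical.

```shell
# Hypothetical helper: list up to five revisions of a Cloud Run service so
# their status can be inspected. Region and service name must be supplied
# by the caller because this document does not state them.
list_recent_revisions() {
  local project="$1" region="$2" service="$3"
  gcloud run revisions list \
    --project "$project" --region "$region" --service "$service" --limit 5
}
```

The project argument would be cloud-ops-prod-dd32 or cloud-ops-dev-caff, as listed in the environment tables above.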

External Uptime Check

Service is failing to respond on https://cloud-ops-dev.sgdev.org - this may be expected if the service was recently provisioned or if its external domain has changed.

Severity: CRITICAL
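
When this alert fires, a quick manual probe can confirm whether the domain responds at all. The sketch below uses curl and assumes that any 2xx status counts as healthy, which may differ from what the configured uptime check accepts. The helper names are hypothetical.

```shell
# Hypothetical helper: treat any 2xx HTTP status as "responding".
is_responding_status() {
  case "$1" in
    2??) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: probe a URL once with a 10 second timeout and report
# the result. curl reports status 000 when no HTTP response was received.
probe_url() {
  local url="$1" status
  status="$(curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "$url")" || status="000"
  if is_responding_status "$status"; then
    echo "OK $status $url"
  else
    echo "FAIL $status $url"
    return 1
  fi
}
```

For example, probe_url https://cloud-ops-dev.sgdev.org checks the domain named in the alert above.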