Managed Services incident response
This page includes incident response playbooks the Core Services team can use when operating the Managed Services Platform fleet.
For more MSP user/operator-oriented guidance, refer to the Managed Services infrastructure pages instead.
Basics
Declaring an incident
If an MSP service outage occurs, declare an incident using the /incident command.
Assess the impact of the outage and configure the incident as appropriate.
Use the owners field in the service specification to infer which channels and stakeholders need to be notified, if any.
Infrastructure access
Quick links and brief summary below - for more details refer to the more generalized guidance.
- Terraform Cloud
  - Core Services team members should be part of the Core Services team in Terraform Cloud, which should have access to all MSP TFC workspaces by default.
  - Entitle: Managed Services Platform Operators can be used in case a non-Core-Services teammate needs access, or if there is some other issue accessing the workspace.
  - Entitle: owners can be used for escalated access to Sourcegraph’s entire Terraform Cloud account. Use with care!
- Google Cloud Platform
  - For individual MSP services, request mspServiceEditor or mspServiceReader via Entitle on the specific service environment’s project.
  - For groups of MSP services, you can request mspServiceEditor or mspServiceReader on the service’s folder:
    - category: prod services: Entitle: mspServiceEditor on the Managed Services folder
    - category: internal services: Entitle: mspServiceEditor on the Internal Services folder
    - category: test services: All engineers should have access by default (test services are placed in the Engineering Projects folder)
  - mspServiceEditor and mspServiceReader are available for convenience, and are configured in gcp/org/customer-roles/msp.tf in the infrastructure repo. Additional roles can be requested directly via Entitle.
Service-specific guidance is generated in Managed Services infrastructure pages.
Changing infrastructure
CLI-apply mode
In peacetime, all service workspaces are left in “VCS mode”, where the remote managed-services repository is used when running Terraform plan and apply in Terraform Cloud.
Changes to the repository automatically trigger a plan as part of repository CI, and merging to main automatically deploys the workspaces.
“CLI mode” makes a workspace behave like an “old-school” Terraform workspace, where terraform plan and terraform apply only use your locally generated Terraform, and the remote repository is ignored (though run execution still occurs in Terraform Cloud).
You can place a service environment in “CLI mode” using sg msp tfc sync:
sg msp tfc sync --workspace-run-mode=cli $SERVICE $ENVIRONMENT
This is useful for quickly making changes without committing/pushing to the remote managed-services repository first, and prevents other automated mechanisms from changing your infrastructure.
You can manually hand-edit the generated Terraform as well.
Use with care. When you are done, make sure the default generated Terraform matches any changes you made. Once the main branch in managed-services aligns with your changes, the service must be placed in “VCS mode” again:
sg msp tfc sync $SERVICE $ENVIRONMENT
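The round trip above can be sketched as a pair of shell helpers. The function names msp_enter_cli_mode and msp_return_to_vcs_mode are hypothetical and exist only for illustration; the sg msp tfc sync invocations inside them are the two shown on this page.

```shell
#!/bin/sh
# Sketch of the CLI-mode round trip. The helper names are hypothetical;
# the `sg msp tfc sync` invocations are the ones documented on this page.

# Put a service environment into CLI mode so that only locally generated
# Terraform is used for plan and apply.
msp_enter_cli_mode() {
  if [ "$#" -ne 2 ]; then
    echo "usage: msp_enter_cli_mode SERVICE ENVIRONMENT" >&2
    return 2
  fi
  sg msp tfc sync --workspace-run-mode=cli "$1" "$2"
}

# Return a service environment to the default VCS mode. Only do this once
# `main` in managed-services matches what was applied by hand.
msp_return_to_vcs_mode() {
  if [ "$#" -ne 2 ]; then
    echo "usage: msp_return_to_vcs_mode SERVICE ENVIRONMENT" >&2
    return 2
  fi
  sg msp tfc sync "$1" "$2"
}
```

A typical session would call msp_enter_cli_mode "$SERVICE" "$ENVIRONMENT", make and apply the manual changes, land the matching change on main, and then call msp_return_to_vcs_mode "$SERVICE" "$ENVIRONMENT". Requiring both arguments guards against accidentally running the sync with an empty service or environment.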
Scenario playbooks
GCP resource recovery
Unintended GCP resource deletions can happen when an unintended Terraform change (for example, an automated change that conflicts with a manually rolled out change) is merged and deployed in Terraform Cloud.
In general, the first order of business is to revert the unintended Terraform change. The revert is unlikely to apply successfully in Terraform: many GCP products support “soft deletion”, so attempting to re-create a deleted resource conflicts with the soft-deleted original (typically on the ID, name, or similar). The revert is still worthwhile, because it indicates how far we are from the last known good state, and it will destroy the resources that were created to replace the original resources that you wanted.
At this point, you may want to place the affected workspaces in CLI-apply mode to prevent further unexpected changes.
Next, assess which resources can be recovered: not every GCP product supports recovering deleted resources.
Get yourself access to the GCP resources and use the GCP Console to restore the relevant resources.
Then, based on the Terraform plan, start to “reconstruct” the desired state by importing the resources you have restored into the Terraform state using the terraform import command.
A more detailed example is available in “Example: recovering from project deletion”.
As you recover all the missing resources for each workspace, run a terraform apply for each to ensure that the workspace is back to a healthy state, and repeat for the next stack in the sequence.
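Before each apply, a read-only check can show how far a stack still is from the desired state. The following is a minimal sketch, assuming each stack's generated Terraform lives in its own directory and that the directories are passed in the order the stacks should be applied. The function name check_stacks is hypothetical. It relies on terraform plan -detailed-exitcode, which exits 0 when there are no changes, 2 when changes are pending, and 1 on error.

```shell
#!/bin/sh
# Sketch: walk the stacks in order and stop at the first one that is not
# yet in sync, so that it can be fixed (import, apply) before moving on.
check_stacks() {
  for stack_dir in "$@"; do
    status=0
    # Plan output goes to stderr so that stdout carries only the summary.
    (cd "$stack_dir" && terraform plan -detailed-exitcode -input=false >&2) || status=$?
    case "$status" in
      0) echo "$stack_dir: in sync" ;;
      2) echo "$stack_dir: changes pending, stopping here"; return 2 ;;
      *) echo "$stack_dir: plan failed, stopping here"; return 1 ;;
    esac
  done
}
```

Stopping at the first stack that is not in sync mirrors the guidance above: get one workspace healthy, then move to the next stack in the sequence.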
Example: recovering from project deletion
This section covers project deletion as an example, as it occurred in INC-263. Additional mitigations for this specific scenario exist now, but it could still be a useful reference.
In the service’s project workspace, an attempt to restore the project by reverting the Terraform configuration is likely to run into an error like the following, as the soft-deleted project ID conflicts with the project Terraform is trying to create:
Error: error creating project $PROJECT_ID: googleapi: Error 409: Requested entity already exists, alreadyExists.
with google_project.pings-prod-project
In the event that only the project is deleted, pretty much all resources can be recovered by restoring the project within 30 days. Deleted projects can be viewed in Cloud Resource Manager’s “Pending Deletion” page. Project recovery requires the “Owner” permission on the deleted project. If many projects need recovery, the “Owner” role can be requested at the GCP organization level, with additional approval, so that all projects can be recovered quickly. This is especially important if Entitle has not yet synced a project, as it often lags significantly in picking up recently created projects.
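When many projects need restoring, the gcloud CLI may be quicker than the Console. The sketch below assumes gcloud is installed and authenticated with an account that holds the permissions described above. The function names are hypothetical; gcloud projects list and gcloud projects undelete are existing gcloud commands.

```shell
#!/bin/sh
# Sketch: find and restore soft-deleted projects with gcloud instead of
# the Console. Helper names are hypothetical.

# Print the IDs of projects that are pending deletion and visible to you.
list_pending_deletion() {
  gcloud projects list \
    --filter='lifecycleState:DELETE_REQUESTED' \
    --format='value(projectId)'
}

# Restore each project ID given as an argument, stopping at the first failure.
restore_projects() {
  for project_id in "$@"; do
    echo "restoring $project_id" >&2
    gcloud projects undelete "$project_id" || return 1
  done
}
```

Review the output of list_pending_deletion before passing any IDs to restore_projects, since the list may include projects that were deleted intentionally.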
Once restored, you want to import the project into the Terraform state so that it can be managed by Terraform again.
In the original error above, google_project.pings-prod-project is the ID of the resource in the Terraform configuration, and we are being told that this particular resource cannot be created because it already exists.
Note that this ID may look different for each service, so inspect the error message and the generated MSP Terraform carefully to double-check if you are unsure.
terraform import google_project.pings-prod-project $PROJECT_ID
Once imported, the service’s project workspace should now be able to apply as before, as Terraform will no longer attempt a creation, but instead manage the existing imported project.
Conflicts with existing resources can generally be addressed with this strategy using terraform import.
Note that different resources use different ID formats for import. Consult the provider resource documentation, for example the “Import” section of the google_service_account resource.
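During a recovery the same imports are often retried several times, so it can help to skip resources that are already in the state. The following is a minimal sketch with a hypothetical helper name, import_if_missing. It assumes that terraform state show exits with a non-zero status when the address is not in the state; any other failure of that command also falls through to the import attempt, which will then report the underlying error.

```shell
#!/bin/sh
# Sketch: import a resource only when Terraform is not already tracking it.
# Arguments: the Terraform resource address, then the provider import ID.
import_if_missing() {
  address="$1"
  import_id="$2"
  if terraform state show "$address" >/dev/null 2>&1; then
    echo "$address already in state, skipping"
  else
    terraform import "$address" "$import_id"
  fi
}
```

For the project example above, this would be called as import_if_missing google_project.pings-prod-project "$PROJECT_ID" from the workspace's generated Terraform directory.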
Error messages will also indicate the naming scheme, for example:
Error: Error creating service account: googleapi: Error 409: Service account operatoraccess-980434 already exists within project projects/$PROJECT_ID. Details: [ { "@type": "type.googleapis.com/google.rpc.ResourceInfo", "resourceName": "projects/$PROJECT_ID/serviceAccounts/operatoraccess-980434@$PROJECT_ID.iam.gserviceaccount.com" } ] , alreadyExists
with google_service_account.iam-operatoraccess-account
The corresponding import command would use resourceName from the error message:
terraform import google_service_account.iam-operatoraccess-account projects/$PROJECT_ID/serviceAccounts/operatoraccess-980434@$PROJECT_ID.iam.gserviceaccount.com
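When many resources conflict, deriving each import command by hand gets tedious. The sketch below reads an error like the one above on standard input and prints a suggested command, using the address on the “with” line and the resourceName field. The helper name suggest_import is hypothetical, and the parsing assumes the error layout shown on this page. It does not handle module-prefixed or indexed addresses, nor errors without a resourceName (such as the project error earlier). The command is only printed, never run, so it can be reviewed first.

```shell
#!/bin/sh
# Sketch: turn a Terraform "already exists" error into a suggested
# `terraform import` command. Reads the error text from stdin.
suggest_import() {
  error_text="$(cat)"
  # The resource address follows "with" on its own line.
  address="$(printf '%s\n' "$error_text" \
    | sed -n 's/^[^a-z]*with \([A-Za-z0-9_.-]*\).*/\1/p' | head -n 1)"
  # The import ID is the quoted value of "resourceName" in the details.
  resource_name="$(printf '%s\n' "$error_text" \
    | sed -n 's/.*"resourceName": *"\([^"]*\)".*/\1/p' | head -n 1)"
  if [ -z "$address" ] || [ -z "$resource_name" ]; then
    echo "could not find a resource address and resourceName in the error" >&2
    return 1
  fi
  printf 'terraform import %s %s\n' "$address" "$resource_name"
}
```

For example, saving the error to a file and running suggest_import < error.txt prints a command of the same shape as the one above. Always check the result against the provider documentation's import format before running it.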