Handbook On-Call
Introduction
GitLab recognizes that the Handbook is a critical part of empowering team members to do their jobs effectively. As such, we have implemented a basic on-call process (refer to First-response Service Level Objective below) to ensure that someone is available to assist team members if something is broken in the handbook or if they are having trouble making updates to it.
Reporting an issue
Any issues should be reported in the #handbook-escalation channel in Slack.
If you do not get a response within the indicated first-response SLO, feel free to DM the Editor team Engineering Manager or Product Manager (refer to team page).
When to escalate an issue
Issues should only be escalated to the Handbook On-Call team if they relate to:
Master being broken
Security incidents
Significant broken pages in production (e.g. the values page being unreachable)
Broken infrastructure
Bugs that prevent team members from accessing important information
Time-sensitive updates to the Handbook where there is an issue making the update
On-call schedule
Until recently, members of the Static Site Editor team were part of the on-call process and members of the #handbook-escalation channel. Additionally, any GitLab team member can volunteer to join the #handbook-escalation channel and help out.
We are looking into formulating alternatives for the future.
Expectations for being on-call
Make sure you are set to receive notifications for the #handbook-escalation channel
When an issue is reported:
Acknowledge the team member and let them know you are looking into it
You can check #production, #incident-management, and #is-this-known to see if it's a known issue with infrastructure or other problems.
Provide an update as soon as you are able to confirm their problem.
You can also post updates in #website and/or #handbook as appropriate.
Resolve the problem, or provide feedback to the team member on how they can resolve it.
Offer to have a Zoom call to help replicate or resolve the issue if it is not straightforward.
When to hand over to Reliability Engineering
The Handbook On-Call deals specifically with matters relating to the www-gitlab-com repo source code and configuration. If a reported issue relates to the GitLab product or the infrastructure running the https://about.gitlab.com website, then it should be escalated to the Reliability Engineering team. To report an incident, follow the instructions on the Incident Management page: /handbook/engineering/infrastructure/incident-management/#reporting-an-incident
First-response Service Level Objective
All incidents reported in the #handbook-escalation channel during weekdays (Mon - Fri, 08:00 UTC+0 - 18:00 UTC-7) should receive an initial acknowledgement within 1 hour of being reported.
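The coverage window mixes two UTC offsets, so it can help to work the arithmetic out once. The following is a minimal sketch, not an official tool: it assumes each weekday's window opens at 08:00 UTC and closes at 18:00 UTC-7 on the same calendar day (01:00 UTC the next day, so 17 hours long), which is one possible reading of the SLO. The function names `in_coverage` and `response_due` are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: each weekday window runs from 08:00 UTC to 18:00 UTC-7,
# which is 01:00 UTC on the following day, i.e. 17 hours long.
WINDOW_START_HOUR_UTC = 8
WINDOW_LENGTH = timedelta(hours=17)
RESPONSE_TARGET = timedelta(hours=1)


def in_coverage(reported_at: datetime) -> bool:
    """Return True if a timezone-aware timestamp falls inside a weekday window."""
    ts = reported_at.astimezone(timezone.utc)
    # The window that opened today or the one that opened yesterday
    # (it may spill past midnight UTC) could contain the timestamp.
    for days_back in (0, 1):
        day = (ts - timedelta(days=days_back)).date()
        start = datetime(day.year, day.month, day.day,
                         WINDOW_START_HOUR_UTC, tzinfo=timezone.utc)
        # weekday(): Monday is 0, Friday is 4.
        if start.weekday() < 5 and start <= ts < start + WINDOW_LENGTH:
            return True
    return False


def response_due(reported_at: datetime) -> Optional[datetime]:
    """Return the acknowledgement deadline, or None outside the covered hours."""
    if not in_coverage(reported_at):
        return None
    return reported_at + RESPONSE_TARGET
```

Under this reading, a report made on Friday at 17:30 UTC-7 is still covered, while one made on Monday at 00:30 UTC is not, because no window opened on Sunday.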
Common Incidents and Tips
Runbook for about.gitlab.com
There is also a runbook for about.gitlab.com incident handling.
Managing broken master alerts in #handbook-escalation
All broken CI pipelines for the master branch of the www-gitlab-com repo are automatically posted in the Slack channel. These reports should be investigated and addressed where needed.
Once a report has been looked at, please leave a comment stating the nature of the problem and the action taken, and add a ✅ reaction to the message to show that it has been handled.
If a large number of failures is spamming the channel, the error reporting can be turned off in the repo settings: https://gitlab.com/gitlab-com/www-gitlab-com/-/services/slack/edit
Merging urgent MRs
See the description of this issue for details on the current workarounds required for this bug related to the Merge Train.
Stuck Merge Train
To see the status of the merge train (useful when team members are reporting that their MRs seem 'stuck' on the train), see this issue to check the status and perform a workaround, if necessary.
TL;DR for workaround: If the first/oldest MR iid in the FIFO list (sort=asc by ID) is actively running a pipeline and eventually gets merged, then things are moving along, just slowly. If the first one in the list isn't currently running any pipeline, remove it from the train and re-add it (it should go to the end).
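The decision in the TL;DR can be sketched as a small function. This is an illustration only: `TrainEntry`, its fields, and `merge_train_action` are hypothetical names, and the entries are assumed to have been collected by hand or by some other means. The sketch does not call any GitLab API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TrainEntry:
    """Hypothetical record for one MR on the merge train."""
    id: int                  # train entry ID; lower means added earlier
    mr_iid: int              # iid of the merge request
    pipeline_running: bool   # True if the entry has an active pipeline


def merge_train_action(entries: List[TrainEntry]) -> Tuple[str, Optional[int]]:
    """Return (action, mr_iid) for the oldest entry on the train."""
    if not entries:
        return ("empty", None)
    # FIFO order: sort ascending by ID and look at the first (oldest) entry.
    head = sorted(entries, key=lambda e: e.id)[0]
    if head.pipeline_running:
        # The train is moving, possibly slowly; no action is needed.
        return ("wait", head.mr_iid)
    # The head is idle: remove it from the train and re-add it.
    return ("remove_and_readd", head.mr_iid)
```

For example, given entries with IDs 10 (pipeline running) and 12 (idle), the function returns `("wait", <iid of entry 10>)`, because only the oldest entry determines whether the train is stuck.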