Alerting overview
Alerting with Prometheus is separated into two parts. Alerting rules in the Prometheus server send alerts to an Alertmanager. The Alertmanager then manages those alerts, including silencing, inhibition, aggregation, and sending out notifications via methods such as email, on-call notification systems, and chat platforms.
The main steps for setting up alerting and notifications are as follows (a configuration sketch follows the list):
- Set up and configure the Alertmanager
- Configure Prometheus to talk to the Alertmanager
- Create alerting rules in Prometheus
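For the second and third steps, a minimal sketch of the Prometheus side might look like the following; the Alertmanager address and the rule file name are assumptions for illustration:

```yaml
# prometheus.yml (sketch)
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']  # assumed local Alertmanager address

rule_files:
  - 'alert.rules.yml'  # assumed file containing alerting rules
```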
Alertmanager
The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver, such as email or a web page. It also takes care of silencing and inhibition of alerts.
The following describes the core concepts the Alertmanager implements. Consult the configuration file reference below for more details on how to use them.
Grouping of alerts
Grouping categorizes alerts of a similar nature into a single notification. This is especially useful during larger outages, when many systems fail at once and a large number of alerts fire simultaneously.
**Example:** your cluster runs hundreds of instances of a service. When a network partition occurs, half of the service instances can no longer reach the database. The Prometheus alerting rule is configured to send an alert for each service instance, so hundreds of alerts are sent to the Alertmanager.
As a user, you want to receive only a single notification that lists exactly which service instances are affected. To achieve this, you can configure the Alertmanager to group alerts by cluster name and alert name, so that it sends a single compact notification.
Grouping of alerts, the timing of the notifications for each group, and the receiver of those notifications are configured in the routing tree of the configuration file.
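For example, the grouping described above could be configured in the top-level route roughly like this (a sketch; the receiver name is illustrative):

```yaml
route:
  group_by: ['cluster', 'alertname']
  receiver: 'team-pager'  # illustrative receiver name
```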
Inhibition of alerts
Inhibition suppresses notifications for certain alerts while certain other alerts are already firing.
**Example:** an alert is firing that informs you an entire cluster is unreachable. You can configure the Alertmanager to mute all other alerts concerning this cluster while that alert fires. This prevents hundreds of notifications for alerts that are unrelated to the actual problem.
Inhibition rules are configured in the Alertmanager's configuration file.
Silencing of alerts
A silence simply mutes alerts for a given period of time. Silences are configured based on matchers, just like the routing tree: incoming alerts are checked against the equality or regular-expression matchers of active silences, and if they match, no notifications are sent out for those alerts.
Silences are configured in the web interface of the Alertmanager.
Client behavior
The Alertmanager has special requirements for the behavior of its clients. These are only relevant for advanced use cases in which something other than Prometheus sends the alerts.
High availability
The Alertmanager supports running as a highly available cluster, which can be configured with the --cluster.* command-line flags.
Do not load balance traffic between Prometheus and its Alertmanagers; instead, point Prometheus at a list of all Alertmanagers.
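For example, with two Alertmanager instances (the addresses below are assumptions), the Prometheus configuration would list both of them directly:

```yaml
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 'alertmanager-1:9093'  # assumed addresses; list every instance
      - 'alertmanager-2:9093'  # rather than a load balancer
```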
Configuration
The Alertmanager is configured via command-line flags and a configuration file. While the command-line flags configure immutable system parameters, the configuration file defines inhibition rules, notification routing, and notification receivers.
A visual editor can assist in building a routing tree.
Run alertmanager -h to see all available command-line flags.
The Alertmanager can reload its configuration file at runtime. If the new configuration is not well-formed, the changes are not applied and the error is logged. A reload is triggered by sending a SIGHUP to the process or by sending an HTTP POST request to the /-/reload endpoint.
Configuration file
Specify the configuration file to load with the --config.file flag.

```
./alertmanager --config.file=alertmanager.yml
```

The file is written in YAML format, defined by the scheme described below. Brackets indicate that a parameter is optional. Parameters that are not listed are set to their default values.
Generic placeholders are defined as follows:
- <duration>: a duration matching the regular expression [0-9]+(ms|[smhdwy])
- <labelname>: a string matching the regular expression [a-zA-Z_][a-zA-Z0-9_]*
- <labelvalue>: a string of Unicode characters
- <filepath>: a valid path in the current working directory
- <boolean>: a boolean that can take the values true or false
- <string>: a regular string
- <secret>: a regular string that is a secret, such as a password
- <tmpl_string>: a string that is template-expanded before use
- <tmpl_secret>: a string that is template-expanded before use and is a secret
Other placeholders are described separately.
The global configuration specifies parameters that are valid in all other configuration contexts. They also serve as defaults for other configuration sections.
```yaml
global:
  # Mail receiver defaults.
  # The default SMTP From header field.
  [ smtp_from: <tmpl_string> ]
  # The default SMTP smarthost used for sending emails, including port number.
  # The port number is usually 25, or 587 for SMTP over TLS (sometimes referred to as STARTTLS).
  # Example: smtp.example.org:587
  [ smtp_smarthost: <string> ]
  # The default hostname to identify to the SMTP server.
  [ smtp_hello: <string> | default = "localhost" ]
  # SMTP Auth using CRAM-MD5, LOGIN and PLAIN. If empty, the Alertmanager doesn't authenticate to the SMTP server.
  [ smtp_auth_username: <string> ]
  # SMTP Auth using LOGIN and PLAIN.
  [ smtp_auth_password: <secret> ]
  # SMTP Auth using PLAIN.
  [ smtp_auth_identity: <string> ]
  # SMTP Auth using CRAM-MD5.
  [ smtp_auth_secret: <secret> ]
  # The default SMTP TLS requirement.
  # Note that Go does not support unencrypted connections to remote SMTP endpoints.
  [ smtp_require_tls: <bool> | default = true ]

  # Defaults for API-based receivers such as Slack or WeChat.
  # The API URL to use for Slack notifications.
  [ slack_api_url: <secret> ]
  [ victorops_api_key: <secret> ]
  [ victorops_api_url: <string> | default = "https://alert.victorops.com/integrations/generic/20131114/alert/" ]
  [ pagerduty_url: <string> | default = "https://events.pagerduty.com/v2/enqueue" ]
  [ opsgenie_api_key: <secret> ]
  [ opsgenie_api_url: <string> | default = "https://api.opsgenie.com/" ]
  [ hipchat_api_url: <string> | default = "https://api.hipchat.com/" ]
  [ hipchat_auth_token: <secret> ]
  [ wechat_api_url: <string> | default = "https://qyapi.weixin.qq.com/cgi-bin/" ]
  [ wechat_api_secret: <secret> ]
  [ wechat_api_corp_id: <string> ]

  # The default HTTP client configuration.
  [ http_config: <http_config> ]

  # If an alert does not include an end time, resolve_timeout is the default value
  # after which the Alertmanager declares the alert resolved if it has not been updated.
  # This has no impact on alerts from Prometheus, as they always include an end time.
  [ resolve_timeout: <duration> | default = 5m ]

# Files from which custom notification template definitions are read.
# The last component may use a wildcard matcher, e.g. 'templates/*.tmpl'.
templates:
  [ - <filepath> ... ]

# The root node of the routing tree.
route: <route>

# The list of notification receivers.
receivers:
  - <receiver> ...

# The list of inhibition rules.
inhibit_rules:
  [ - <inhibit_rule> ... ]
```
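For instance, a minimal global section for email delivery (the hostname and credentials are placeholders) might look like:

```yaml
global:
  smtp_smarthost: 'smtp.example.org:587'  # placeholder SMTP host
  smtp_from: 'alertmanager@example.org'   # placeholder sender address
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
  resolve_timeout: 5m
```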
<route>
A route block defines a node in the routing tree and its children. Optional configuration parameters that are not set are inherited from the parent node.
Every alert enters the routing tree at the configured top-level route, which must match all alerts. The alert then traverses the child nodes. If continue is false, the traversal stops at the first matching child. If continue is true on a matching node, the alert continues matching against subsequent sibling nodes. If an alert does not match any children of a node, the alert is handled based on the configuration parameters of the current node.
```yaml
[ receiver: <string> ]

# The labels by which incoming alerts are grouped together. For example,
# multiple alerts for cluster=A and alertname=LatencyHigh would be
# batched into a single group.
# To aggregate by all possible labels, use '...' as the only label name, for example:
# group_by: ['...']
# This effectively disables aggregation entirely.
[ group_by: '[' <labelname>, ... ']' ]

# Whether an alert should continue matching subsequent sibling nodes.
[ continue: <boolean> | default = false ]

# A set of equality matchers an alert has to fulfill to match this node.
match:
  [ <labelname>: <labelvalue>, ... ]

# A set of regex matchers an alert has to fulfill to match this node.
match_re:
  [ <labelname>: <regex>, ... ]

# How long a group of alerts waits before the notification is sent. This allows
# time for inhibiting alerts to arrive or to collect more alerts of the same
# group. (Usually 0s to a few minutes.)
[ group_wait: <duration> | default = 30s ]

# After a notification has been sent, how long to wait before sending a
# notification about new alerts added to the same group. (Usually 5m or more.)
[ group_interval: <duration> | default = 5m ]

# How long to wait before re-sending a notification for an alert.
# (Usually 3 hours or more.)
[ repeat_interval: <duration> | default = 4h ]

# Zero or more child routes.
routes:
  [ - <route> ... ]
```
Example
```yaml
# The root route with all parameters. Child routes that do not set
# a parameter inherit it from here.
route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  # All alerts that do not match the following child routes
  # remain at the root node and are sent to 'default-receiver'.
  routes:
  # All alerts with service=mysql or service=cassandra
  # are sent to the database pager.
  - receiver: 'database-pager'
    group_wait: 10s
    match_re:
      service: mysql|cassandra
  # All alerts with the team=frontend label match this sub-route.
  # They are grouped by product and environment rather than by cluster
  # and alertname.
  - receiver: 'frontend-pager'
    group_by: [product, environment]
    match:
      team: frontend
```
<inhibit_rule>
An inhibition rule mutes target alerts that match a set of matchers while source alerts matching another set of matchers are firing. The source and target alerts must have the same values for the labels listed in the equal list.
Semantically, a missing label and a label with an empty value are the same thing. Therefore, if all of the labels listed in equal are missing from both the source and target alerts, the inhibition rule still applies.
To prevent an alert from inhibiting itself, an alert that matches both a rule's source and target sides cannot be inhibited by alerts for which the same is true (including itself). Nevertheless, it is recommended to choose source and target matchers such that an alert can never match both sides at once.
```yaml
# Matchers that the target alerts (the alerts to be muted) have to fulfill.
target_match:
  [ <labelname>: <labelvalue>, ... ]
target_match_re:
  [ <labelname>: <regex>, ... ]

# Matchers for which one or more source alerts have to exist for
# the inhibition to take effect.
source_match:
  [ <labelname>: <labelvalue>, ... ]
source_match_re:
  [ <labelname>: <regex>, ... ]

# Labels that must have equal values in the source and target alerts
# for the inhibition to take effect.
[ equal: '[' <labelname>, ... ']' ]
```
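A common pattern (a sketch; the severity values and label names depend on your own alerting rules) is to let a critical alert mute the warning-level alert for the same problem:

```yaml
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  # Only inhibit when the alert name and cluster also match.
  equal: ['alertname', 'cluster']
```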
<http_config>
An http_config block configures the HTTP client that a receiver uses to communicate with HTTP-based API services.
```yaml
# Note that the `basic_auth`, `bearer_token` and `bearer_token_file` options
# are mutually exclusive.

# Sets the `Authorization` header with the configured username and password.
# password and password_file are mutually exclusive.
basic_auth:
  [ username: <string> ]
  [ password: <secret> ]
  [ password_file: <string> ]

# Sets the `Authorization` header with the configured bearer token.
[ bearer_token: <secret> ]

# Sets the `Authorization` header with the bearer token read from the configured file.
[ bearer_token_file: <filepath> ]

# Configures the TLS settings.
tls_config:
  [ <tls_config> ]

# Optional proxy URL.
[ proxy_url: <string> ]
```
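For example, a receiver behind basic auth and an internal proxy could use an http_config like this (a sketch; the credentials file and proxy address are hypothetical):

```yaml
http_config:
  basic_auth:
    username: 'alertmanager'
    password_file: '/etc/alertmanager/webhook.pass'  # hypothetical path
  proxy_url: 'http://proxy.internal:3128'            # hypothetical proxy
```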
<tls_config>
A tls_config configures TLS connections.
```yaml
# CA certificate to validate the server certificate with.
[ ca_file: <filepath> ]

# Certificate and key files for client cert authentication to the server.
[ cert_file: <filepath> ]
[ key_file: <filepath> ]

# ServerName extension to indicate the name of the server.
# http://tools.ietf.org/html/rfc4366#section-3.1
[ server_name: <string> ]

# Disable validation of the server certificate.
[ insecure_skip_verify: <boolean> | default = false ]
```
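A sketch of a client-certificate setup (all paths and the server name are hypothetical):

```yaml
tls_config:
  ca_file: '/etc/alertmanager/ca.pem'        # hypothetical paths
  cert_file: '/etc/alertmanager/client.pem'
  key_file: '/etc/alertmanager/client-key.pem'
  server_name: 'webhook.internal'
```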
<receiver>
A receiver is a named configuration of one or more notification integrations.
New receivers are not actively added; for custom notification integrations, the webhook receiver is recommended.
```yaml
# The globally unique name of the receiver.
name: <string>

# Configurations for the various notification integrations.
email_configs:
  [ - <email_config>, ... ]
hipchat_configs:
  [ - <hipchat_config>, ... ]
pagerduty_configs:
  [ - <pagerduty_config>, ... ]
pushover_configs:
  [ - <pushover_config>, ... ]
slack_configs:
  [ - <slack_config>, ... ]
opsgenie_configs:
  [ - <opsgenie_config>, ... ]
webhook_configs:
  [ - <webhook_config>, ... ]
victorops_configs:
  [ - <victorops_config>, ... ]
wechat_configs:
  [ - <wechat_config>, ... ]
```
<email_config>
```yaml
# Whether to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]

# The email address to send notifications to.
to: <tmpl_string>

# The sender address.
[ from: <tmpl_string> | default = global.smtp_from ]

# The SMTP host through which emails are sent.
[ smarthost: <string> | default = global.smtp_smarthost ]

# The hostname to identify to the SMTP server.
[ hello: <string> | default = global.smtp_hello ]

# SMTP authentication information.
[ auth_username: <string> | default = global.smtp_auth_username ]
[ auth_password: <secret> | default = global.smtp_auth_password ]
[ auth_secret: <secret> | default = global.smtp_auth_secret ]
[ auth_identity: <string> | default = global.smtp_auth_identity ]

# The SMTP TLS requirement.
# Note that Go does not support unencrypted connections to remote SMTP endpoints.
[ require_tls: <bool> | default = global.smtp_require_tls ]

# TLS configuration.
tls_config:
  [ <tls_config> ]

# The HTML body of the email notification.
[ html: <tmpl_string> | default = '{{ template "email.default.html" . }}' ]

# The text body of the email notification.
[ text: <tmpl_string> ]

# Further email header key/value pairs.
[ headers: { <string>: <tmpl_string>, ... } ]
```
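A minimal email receiver might look like this (the recipient address is a placeholder):

```yaml
receivers:
- name: 'team-mail'
  email_configs:
  - to: 'oncall@example.org'  # placeholder recipient
    send_resolved: true
```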
<webhook_config>
The webhook receiver allows configuring a generic receiver.
```yaml
# Whether to notify about resolved alerts.
[ send_resolved: <boolean> | default = true ]

# The endpoint to send HTTP POST requests to.
url: <string>

# The HTTP client's configuration.
[ http_config: <http_config> | default = global.http_config ]
```
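A sketch of a webhook receiver pointing at a hypothetical internal endpoint:

```yaml
receivers:
- name: 'ops-webhook'
  webhook_configs:
  - url: 'http://example.org:8080/alert'  # hypothetical endpoint
    send_resolved: true
```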
The Alertmanager sends HTTP POST requests to the configured endpoint in the following JSON format:
{ "version": "4", "groupKey": <string>, // key identifying the group of alerts (e.g. to deduplicate) "status": "<resolved|firing>", "receiver": <string>, "groupLabels": <object>, "commonLabels": <object>, "commonAnnotations": <object>, "externalURL": <string>, // backlink to the Alertmanager. "alerts": [ { "status": "<resolved|firing>", "labels": <object>, "annotations": <object>, "startsAt": "<rfc3339>", "endsAt": "<rfc3339>", "generatorURL": <string> // identifies the entity that caused the alert }, ... ] }
<wechat_config>
Notifications are sent via the WeChat API.
```yaml
# Whether to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]

# The API key to use when talking to the WeChat API.
[ api_secret: <secret> | default = global.wechat_api_secret ]

# The WeChat API URL.
[ api_url: <string> | default = global.wechat_api_url ]

# The corp id used for authentication.
[ corp_id: <string> | default = global.wechat_api_corp_id ]

# API request data as defined by the WeChat API.
[ message: <tmpl_string> | default = '{{ template "wechat.default.message" . }}' ]
[ agent_id: <string> | default = '{{ template "wechat.default.agent_id" . }}' ]
[ to_user: <string> | default = '{{ template "wechat.default.to_user" . }}' ]
[ to_party: <string> | default = '{{ template "wechat.default.to_party" . }}' ]
[ to_tag: <string> | default = '{{ template "wechat.default.to_tag" . }}' ]
```
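A sketch of a WeChat receiver (the corp id, agent id, and secret are placeholders):

```yaml
receivers:
- name: 'wechat-ops'
  wechat_configs:
  - corp_id: 'wwxxxxxxxxxxxxxxxx'   # placeholder corp id
    agent_id: '1000002'             # placeholder agent id
    api_secret: '<wechat_api_secret>'
    to_user: '@all'
```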
Sending alerts
Disclaimer: Prometheus automatically takes care of sending alerts generated by its configured alerting rules. It is strongly recommended to configure alerting rules in Prometheus based on time series data rather than implementing a direct client.
The Alertmanager listens for alerts on two APIs, v1 and v2. The alert format for v1 is described in the code snippet below. v2 is specified as an OpenAPI specification, which can be found in the Alertmanager code repository. Clients are expected to re-send alerts repeatedly as long as they are still active (usually at an interval of 30 seconds to 3 minutes). Clients can push a list of alerts via a POST request.
The labels of each alert are used to identify identical instances of the alert for deduplication. The annotations are always set to those received most recently and do not identify an alert.
Both the startsAt and endsAt timestamps are optional. If startsAt is omitted, it is set to the current time by the Alertmanager. endsAt is only set if the end time of the alert is known. Otherwise it is set to a configurable timeout period from the time the alert was last received.
The generatorURL field is a unique back-link which identifies the source of this alert in the client.
[ { "labels": { "alertname": "<requiredAlertName>", "<labelname>": "<labelvalue>", ... }, "annotations": { "<labelname>": "<labelvalue>", }, "startsAt": "<rfc3339>", "endsAt": "<rfc3339>", "generatorURL": "<generator_url>" }, ... ]
Notification template reference
Prometheus creates and sends alerts to the Alertmanager, which then sends notifications to different receivers based on the alerts' labels. A receiver can be one of many integrations, including Slack, PagerDuty, email, or a custom integration via the generic webhook interface.
Notifications sent to receivers are constructed via templates. The Alertmanager comes with default templates, which can also be customized. To avoid confusion, note that Alertmanager templating differs from templating in Prometheus, although Prometheus templating also covers the labels/annotations of alerting rules.
The Alertmanager's notification templates are based on the Go templating system. Note that some fields are evaluated as text and others as HTML, which affects escaping.
Data structures
Data
Data is the structure passed to notification templates and webhook pushes.
Name | Type | Description |
---|---|---|
Receiver | string | The name of the receiver the notification is sent to (slack, email, etc.). |
Status | string | firing if at least one alert is firing, otherwise resolved. |
Alerts | Alert | List of all alert objects in this group (see below). |
GroupLabels | KV | The labels these alerts were grouped by. |
CommonLabels | KV | The labels common to all of the alerts. |
CommonAnnotations | KV | Set of annotations common to all of the alerts. Used for longer strings of additional information about the alerts. |
ExternalURL | string | Backlink to the Alertmanager that sent the notification. |
The Alerts type exposes functions for filtering alerts (see the sketch after this list):
- Alerts.Firing returns the list of currently firing alert objects in this group
- Alerts.Resolved returns the list of resolved alert objects in this group
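For instance, a receiver text (a sketch; the channel name is illustrative) can treat firing and resolved alerts separately:

```yaml
slack_configs:
- channel: '#alerts'  # illustrative channel
  text: >-
    {{ range .Alerts.Firing }}FIRING: {{ .Labels.alertname }}
    {{ end }}{{ range .Alerts.Resolved }}RESOLVED: {{ .Labels.alertname }}
    {{ end }}
```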
Alert
Alert holds one alert for the notification templates.
Name | Type | Description |
---|---|---|
Status | string | Defines whether the alert's current status is firing or resolved. |
Labels | KV | The set of labels attached to the alert. |
Annotations | KV | The set of annotations attached to the alert. |
StartsAt | time.Time | The time the alert started firing. If omitted, the Alertmanager assigns the current time. |
EndsAt | time.Time | Only set if the end time of the alert is known. Otherwise set to a configurable timeout period from the time the alert was last received. |
GeneratorURL | string | A backlink which identifies the source of the alert. |
Key/value pairs (KV)
KV is a set of key/value string pairs used to represent labels and annotations.
```go
type KV map[string]string
```
Example of an annotations set containing two annotations:

```
{
  summary: "alert summary",
  description: "alert description",
}
```
In addition to direct access to the data (labels and annotations) stored as KV, there are also methods for sorting, removing, and viewing the label sets:
KV methods
Name | Arguments | Returns | Description |
---|---|---|---|
SortedPairs | - | List of key/value string pairs | Returns a sorted list of key/value pairs. |
Remove | []string | KV | Returns a copy of the key/value map without the given keys. |
Names | - | []string | Returns the names of the label set. |
Values | - | []string | Returns the values of the label set. |
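For example (a sketch), these methods can be combined in a template to print every common label except the alert name:

```yaml
slack_configs:
- channel: '#alerts'  # illustrative channel
  # Remove the 'alertname' key, then range over the sorted remainder.
  text: >-
    {{ range (.CommonLabels.Remove (stringSlice "alertname")).SortedPairs }}
    {{ .Name }}={{ .Value }}{{ end }}
```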
Functions
Note that the default functions of Go templating are also available.
Strings
Name | Arguments | Returns | Description |
---|---|---|---|
title | string | string | strings.Title, capitalises the first character of each word. |
toUpper | string | string | strings.ToUpper, converts all characters to upper case. |
toLower | string | string | strings.ToLower, converts all characters to lower case. |
match | pattern, string | bool | Regexp.MatchString, tests for an unanchored regexp match. |
reReplaceAll | pattern, replacement, text | string | Regexp.ReplaceAllString, regexp substitution, unanchored. |
join | sep string, s []string | string | strings.Join, concatenates the elements of s to create a single string. The separator string sep is placed between elements in the resulting string. (Note: the argument order is inverted for easier pipelining in templates.) |
safeHtml | text string | html/template.HTML | Marks the string as HTML that does not require auto-escaping. |
stringSlice | ...string | []string | Returns the passed strings as a slice of strings. |
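As an illustration (a sketch; the service label is hypothetical), these functions can be chained with pipes inside a notification template:

```yaml
slack_configs:
- channel: '#alerts'
  # Upper-case the alert name and prettify a hypothetical 'service' label.
  title: '{{ .GroupLabels.alertname | toUpper }}'
  text: '{{ reReplaceAll "_" " " .CommonLabels.service | title }}'
```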
Notification template example
Below are some examples of different alerts and the corresponding Alertmanager configuration file settings (alertmanager.yml). Each uses the Go templating system.
Customize Slack notifications
In this example we customize our Slack notification to send a URL to our organization's wiki on how to deal with the particular alert that has fired.
```yaml
global:
  slack_api_url: '<slack_webhook_url>'

route:
  receiver: 'slack-notifications'
  group_by: [alertname, datacenter, app]

receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#alerts'
    text: 'https://internal.myorg.net/wiki/alerts/{{ .GroupLabels.app }}/{{ .GroupLabels.alertname }}'
```
Accessing annotations in CommonAnnotations
In this example we again customize the text sent to our Slack receiver, accessing the summary and description stored in the CommonAnnotations of the data sent by the Alertmanager.
Alert
```yaml
groups:
- name: Instances
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: page
    # Prometheus templating applies here in the annotation and label fields of the alert.
    annotations:
      description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'
      summary: 'Instance {{ $labels.instance }} down'
```
Receiver
```yaml
- name: 'team-x'
  slack_configs:
  - channel: '#alerts'
    # Alertmanager templating applies here.
    text: "<!channel> \nsummary: {{ .CommonAnnotations.summary }}\ndescription: {{ .CommonAnnotations.description }}"
```
Ranging over all received alerts
Finally, assuming the alert is the same as in the previous example, we customize our receiver to range over all of the alerts received from the Alertmanager, printing their respective annotation summaries and descriptions on new lines.
Receiver
```yaml
- name: 'default-receiver'
  slack_configs:
  - channel: '#alerts'
    title: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"
    text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
```
Define reusable templates
Going back to the first example, we can also provide a file containing named templates, which is then loaded by the Alertmanager, to avoid complex templates that span many lines. Create the following file (referenced as /etc/alertmanager/templates/myorg.tmpl in the configuration below), defining a template named "slack.myorg.text":
{{ define "slack.myorg.text" }}https://internal.myorg.net/wiki/alerts/{{ .GroupLabels.app }}/{{ .GroupLabels.alertname }}{{ end}}
The configuration now loads the template with the given name for the text field, and we provide the path to our custom template file:
```yaml
global:
  slack_api_url: '<slack_webhook_url>'

route:
  receiver: 'slack-notifications'
  group_by: [alertname, datacenter, app]

receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#alerts'
    text: '{{ template "slack.myorg.text" . }}'

templates:
- '/etc/alertmanager/templates/myorg.tmpl'
```
Management API
The Alertmanager provides a set of management API endpoints to ease automation and integration.
Health check

```
GET /-/healthy
```

This endpoint is used to perform health checks on the Alertmanager. It normally returns 200.
Readiness check

```
GET /-/ready
```

This endpoint checks whether the Alertmanager is ready to serve traffic (i.e. respond to requests). It normally returns 200.
Reload

```
POST /-/reload
```

This endpoint triggers the Alertmanager to reload its configuration file.
An alternative way to trigger a configuration reload is to send a SIGHUP signal to the Alertmanager process.