Monitoring Tools

Author: Jackson Chen

Prometheus

https://prometheus.io/docs/prometheus/latest/getting_started/

Shows how to install, configure, and use a simple Prometheus instance. You will download and run Prometheus locally, configure it to scrape itself and an example application, then work with queries, rules, and graphs to use collected time series data.

Prometheus can be installed on Linux and Windows. The latest version at the time of writing is 2.46 (released 2023-07-25).

Download and run Prometheus

https://prometheus.io/download/
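For example, to fetch the 2.46.0 Linux release mentioned above (the exact asset name depends on your platform and the current version; check the download page):

    wget https://github.com/prometheus/prometheus/releases/download/v2.46.0/prometheus-2.46.0.linux-amd64.tar.gz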

# Extract the downloaded archive and change into its directory
tar xvfz prometheus-*.tar.gz
cd prometheus-*

Configuring Prometheus to monitor itself

Prometheus collects metrics from targets by scraping metrics HTTP endpoints. Since Prometheus exposes data in the same manner about itself, it can also scrape and monitor its own health.
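A scrape simply fetches a plain-text page of samples. Prometheus's own /metrics endpoint returns lines like these (an illustrative excerpt; the value is whatever the counter currently holds):

    # HELP prometheus_http_requests_total Counter of HTTP requests.
    # TYPE prometheus_http_requests_total counter
    prometheus_http_requests_total{code="200",handler="/metrics"} 42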

While a Prometheus server that collects only data about itself is not very useful, it is a good starting example. Save the following basic Prometheus configuration as a file named prometheus.yml:
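A minimal configuration in the spirit of the getting-started guide (the 15-second intervals and the self-scrape job mirror the guide's example):

    global:
      scrape_interval: 15s     # Scrape targets every 15 seconds
      evaluation_interval: 15s # Evaluate rules every 15 seconds

    scrape_configs:
      # Scrape Prometheus itself on its default port
      - job_name: "prometheus"
        static_configs:
          - targets: ["localhost:9090"]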

The configuration file is written in YAML. For full details of the configuration file format, see https://prometheus.io/docs/prometheus/latest/configuration/configuration/

Prometheus is configured via command-line flags and a configuration file. While the command-line flags configure immutable system parameters (such as storage locations, amount of data to keep on disk and in memory, etc.), the configuration file defines everything related to scraping jobs and their instances, as well as which rule files to load.

# To view all available command-line flags, run 
    ./prometheus -h
Prometheus can reload its configuration at runtime. If the new configuration is not well-formed, the changes will not be applied. A configuration reload is triggered by sending a SIGHUP to the Prometheus process, or by sending an HTTP POST request to the /-/reload endpoint (when the --web.enable-lifecycle flag is enabled). This will also reload any configured rule files.
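For example, assuming Prometheus runs on its default port and was started with --web.enable-lifecycle:

    curl -X POST http://localhost:9090/-/reload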

To specify which configuration file to load, use the --config.file flag.
Generic placeholders are defined as follows:

<boolean>: a boolean that can take the values true or false
<duration>: a duration matching the regular expression 
    ((([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?|0)
        e.g. 1d, 1h30m, 5m, 10s
<filename>: a valid path in the current working directory
<float>: a floating-point number
<host>: a valid string consisting of a hostname or IP followed by an optional port number
<int>: an integer value
<labelname>: a string matching the regular expression [a-zA-Z_][a-zA-Z0-9_]*
<labelvalue>: a string of unicode characters
<path>: a valid URL path
<scheme>: a string that can take the values http or https
<secret>: a regular string that is a secret, such as a password
<string>: a regular string
<size>: a size in bytes, e.g. 512MB. A unit is required. 
    Supported units: B, KB, MB, GB, TB, PB, EB.
<tmpl_string>: a string which is template-expanded before usage
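These placeholders slot directly into configuration settings; for instance, a field documented as <duration> can be written as (illustrative):

    scrape_timeout: 10s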
scrape_config

A scrape_config section specifies a set of targets and parameters describing how to scrape them. In the general case, one scrape configuration specifies a single job. In advanced configurations, this may change.

Targets may be statically configured via the static_configs parameter or dynamically discovered using one of the supported service-discovery mechanisms.

Additionally, relabel_configs allow advanced modifications to any target and its labels before scraping.
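A sketch combining both: a statically configured target plus a relabeling rule (the job name, target, and custom label are illustrative):

    scrape_configs:
      - job_name: "example-app"
        static_configs:
          - targets: ["localhost:8080"]
        relabel_configs:
          # Copy the scrape address into a custom label before scraping
          - source_labels: [__address__]
            target_label: scrape_address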

  1. tls_config: allows configuring TLS connections.

  2. oauth2: OAuth 2.0 authentication using the client credentials grant type. Prometheus fetches an access token from the specified endpoint with the given client access and secret keys.

  3. azure_sd_config: Azure SD configurations allow retrieving scrape targets from Azure VMs.

The following meta labels are available on targets during relabeling:
__meta_azure_machine_id: the machine ID
__meta_azure_machine_location: the location the machine runs in
__meta_azure_machine_name: the machine name
__meta_azure_machine_computer_name: the machine computer name
__meta_azure_machine_os_type: the machine operating system
__meta_azure_machine_private_ip: the machine's private IP
__meta_azure_machine_public_ip: the machine's public IP if it exists
__meta_azure_machine_resource_group: the machine's resource group
__meta_azure_machine_tag_<tagname>: each tag value of the machine
__meta_azure_machine_scale_set: the name of the scale set which the vm is part of 
    (this value is only set if you are using a scale set)
__meta_azure_machine_size: the machine size
__meta_azure_subscription_id: the subscription ID
__meta_azure_tenant_id: the tenant ID
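For example, a relabeling rule might copy one of these meta labels onto a persistent target label (a sketch; the target label name is illustrative):

    relabel_configs:
      - source_labels: [__meta_azure_machine_resource_group]
        target_label: resource_group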

There are many more configuration options; see the configuration reference linked above.

Starting Prometheus

To start Prometheus with your newly created configuration file, change to the directory containing the Prometheus binary and run:

# Start Prometheus.
# By default, Prometheus stores its database in ./data (flag --storage.tsdb.path).
./prometheus --config.file=prometheus.yml

Explore data that Prometheus has collected about itself

Let us explore data that Prometheus has collected about itself.

To use Prometheus's built-in expression browser, navigate to 
    http://localhost:9090/graph 

and choose the "Table" view within the "Graph" tab.
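For example, entering the expression below (used in the getting-started guide) shows the amount of time between target scrapes that Prometheus has recorded about itself:

    prometheus_target_interval_length_seconds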

Reloading configuration

A Prometheus instance can have its configuration reloaded without restarting the process by sending it the SIGHUP signal.

If you're running on Linux, this can be performed by using
    kill -s SIGHUP <PID>

    replacing <PID> with your Prometheus process ID

Shutting down your instance gracefully

While Prometheus does have recovery mechanisms in the case of an abrupt process failure, it is recommended to use the SIGTERM signal to cleanly shut down a Prometheus instance.

If you're running on Linux, this can be performed by using
    kill -s SIGTERM <PID>
    
    replacing <PID> with your Prometheus process ID

Grafana Monitoring

https://grafana.com/docs/grafana/latest/setup-grafana/set-up-grafana-monitoring/

Grafana supports tracing.

Grafana can emit Jaeger or OpenTelemetry Protocol (OTLP) traces for its HTTP API endpoints and propagate Jaeger and W3C Trace Context trace information to compatible data sources. All HTTP endpoints are logged evenly (annotations, dashboard, tags, and so on).

When a trace ID is propagated, it is reported with operation ‘HTTP /datasources/proxy/:id/*’

Configure Grafana

https://grafana.com/docs/grafana/latest/setup-grafana/configure-grafana/#tracingopentelemetry
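Following the linked reference, OTLP tracing is switched on in grafana.ini; a minimal sketch, assuming an OTLP collector listening locally on the default gRPC port:

    [tracing.opentelemetry.otlp]
    address = localhost:4317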

View Grafana internal metrics

Grafana collects some metrics about itself internally. Grafana supports pushing metrics to Graphite or exposing them to be scraped by Prometheus.

Available metrics

When enabled, Grafana exposes a number of metrics, including:

Active Grafana instances
Number of dashboards, users, and playlists
HTTP status codes
Requests by routing group
Grafana active alerts
Grafana performance

Pull metrics from Grafana into Prometheus

These instructions assume you have already added Prometheus as a data source in Grafana.

  1. Enable Prometheus to scrape metrics from Grafana. In your configuration file (grafana.ini or custom.ini depending on your operating system) remove the semicolon to enable the following configuration options:
# Metrics available at HTTP URL /metrics and /metrics/plugins/:pluginId
[metrics]
# Disable / Enable internal metrics
enabled           = true

# Disable total stats (stat_totals_*) metrics to be generated
disable_total_stats = false
  2. (optional) If you want to require authorization to view the metrics endpoints, then uncomment and set the following options:
basic_auth_username =
basic_auth_password =
  3. Restart Grafana. Grafana now exposes metrics at http://localhost:3000/metrics

  4. Add the job to your prometheus.yml file. Example:

  - job_name: 'grafana_metrics'
    scrape_interval: 15s
    scrape_timeout: 5s
    static_configs:
      - targets: ['localhost:3000']
  5. Restart Prometheus. Your new job should appear on the Targets tab.
  6. In Grafana, hover your mouse over the Configuration (gear) icon on the left sidebar and then click Data Sources.
  7. Select the Prometheus data source.
  8. On the Dashboards tab, import the Grafana metrics dashboard. All scraped Grafana metrics are available in the dashboard.

View Grafana metrics in Graphite

These instructions assume you have already added Graphite as a data source in Grafana.

  1. Enable sending metrics to Graphite. In your configuration file (grafana.ini or custom.ini depending on your operating system) remove the semicolon to enable the following configuration options:
# Metrics available at HTTP API Url /metrics
[metrics]
# Disable / Enable internal metrics
enabled           = true

# Disable total stats (stat_totals_*) metrics to be generated
disable_total_stats = false
  2. Enable the metrics.graphite options:
# Send internal metrics to Graphite
[metrics.graphite]
# Enable by setting the address setting (ex localhost:2003)
address = <hostname or ip>:<port#>
prefix = prod.grafana.%(instance_name)s.
  3. Restart Grafana. Grafana now exposes metrics at http://localhost:3000/metrics and sends them to the Graphite location you specified.

Pull metrics from Grafana backend plugin into Prometheus

Any installed backend plugin exposes a metrics endpoint through Grafana that you can configure Prometheus to scrape.

These instructions assume you have already added Prometheus as a data source in Grafana.

  1. Enable Prometheus to scrape backend plugin metrics from Grafana. In your configuration file (grafana.ini or custom.ini depending on your operating system) remove the semicolon to enable the following configuration options:
# Metrics available at HTTP URL /metrics and /metrics/plugins/:pluginId
[metrics]
# Disable / Enable internal metrics
enabled           = true

# Disable total stats (stat_totals_*) metrics to be generated
disable_total_stats = false
  2. (optional) If you want to require authorization to view the metrics endpoints, then uncomment and set the following options:
basic_auth_username =
basic_auth_password =
  3. Restart Grafana. Grafana now exposes metrics at
http://localhost:3000/metrics/plugins/<plugin id>
    e.g. http://localhost:3000/metrics/plugins/grafana-github-datasource if you have the Grafana GitHub datasource installed.
  4. Add the job to your prometheus.yml file:
  - job_name: 'grafana_github_datasource'
    scrape_interval: 15s
    scrape_timeout: 5s
    metrics_path: /metrics/plugins/grafana-test-datasource
    static_configs:
      - targets: ['localhost:3000']
  5. Restart Prometheus. Your new job should appear on the Targets tab.

  6. In Grafana, hover your mouse over the Configuration (gear) icon on the left sidebar and then click Data Sources.

  7. Select the Prometheus data source.

  8. Import a Golang application metrics dashboard.

Zabbix Server Monitoring

https://www.zabbix.com/server_monitoring

Zabbix has a rich set of features to enable users to monitor more than just hosts, offering great flexibility to administrators when it comes to choosing the most suitable option for each situation.

Zabbix is software that monitors numerous parameters of a network and the health and integrity of servers, virtual machines, applications, services, databases, websites, the cloud and more. Zabbix uses a flexible notification mechanism that allows users to configure email-based alerts for virtually any event.

Installation

https://www.zabbix.com/documentation/1.8/en/manual/installation/installation_from_source

Configuration

https://www.zabbix.com/documentation/1.8/en/manual/distributed_monitoring/configuration

Central node, child node
Zabbix server, Zabbix proxy
Zabbix agents - Unix, Windows

Everbridge

https://www.everbridge.com/

How Everbridge works

Everbridge empowers organizations to anticipate, mitigate, respond to, and ultimately emerge stronger from critical events with the industry’s only end-to-end critical event management platform. Everbridge delivers reliability, security, and compliance, creating a measurable business advantage for our customers.

Everbridge digitizes organizational resilience at scale and enables customers to protect their people and assets with an integrated suite of critical event management solutions.

Everbridge digital operations platform

Monitor risk and performance across all systems:

  1. Integrate development, project management, security, and customer success tools across the organization
  2. Combine signals from multiple monitoring tools into relevant, contextual actions
  3. Quickly find the root cause of performance issues before they impact customers

Automate IT incident response management:

  1. Initiate proactive incident management workflows and alerts
  2. Orchestrate rapid response to critical events
  3. Alert the right people to threats
  4. Solve active issues with AI-powered incident matching from historical fixes

Low-code, intelligent workflows for operations teams:

  1. Use templates or build workflows from an intuitive UI
  2. Automate on-call scheduling
  3. Eliminate complex refactoring projects and unwanted dependencies

Everbridge Zabbix integration

https://www.everbridge.com/products/it-alerting/integrations/zabbix-integration-guide/

PagerDuty

https://www.pagerduty.com/

The essential platform for critical operations work. Transform operations and move business forward faster with the PagerDuty Operations Cloud™.

Solutions

# DevOps
    Incident response
    Runbook automation
    On-call management
    Automation Actions

# Process Automation
    Process Automation
    Runbook Automation
    Automation Actions

Configurable service settings

https://support.pagerduty.com/docs/configurable-service-settings

Service settings allow you to customize what actions you would like to be performed when an incident is triggered. Service settings address incident assignment and notifications, noise reduction, coordinating with stakeholders, event rules and remediation resources. By configuring these settings, you can optimize each incident to address your team’s specific needs.

Services and Integrations

Create a service and add integrations to begin receiving incident notifications.

A technical service reflects a discrete piece of functionality that is wholly owned by one team. One or more technical services combine to deliver customer-facing or business capabilities. You can add one or more integrations to a technical service in order to receive events from those tools.
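As an illustration, an Events API v2 integration added to a technical service receives events posted like this (a sketch; the routing key comes from the integration you create on the service):

    curl -X POST https://events.pagerduty.com/v2/enqueue \
      -H "Content-Type: application/json" \
      -d '{
            "routing_key": "<integration routing key>",
            "event_action": "trigger",
            "payload": {
              "summary": "Disk usage above 90% on host-01",
              "source": "monitoring-tool",
              "severity": "critical"
            }
          }'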

When creating new services, we highly recommend reading our Full Service Ownership guide. This guide provides best practices that software delivery operations teams can use to ensure they are equipped to fully own their services.

PagerDuty Service Standards

https://www.pagerduty.com/blog/introducing-service-standards/

Configure PagerDuty to receive alerts

https://cloudone.trendmicro.com/docs/application-security/pagerduty-alerts/

Single Sign-On - SSO

https://support.pagerduty.com/docs/sso

PagerDuty can be configured with Single Sign-On (SSO) to external identity providers (IdPs) such as Microsoft Active Directory (using ADFS), Bitium, OneLogin, Okta, Ping Identity, SecureAuth and others using the SAML 2.0 protocol. Alternatively, your account can also be configured to support Google authentication using the OAuth 2.0 protocol (with user consent). SSO comes with the following benefits:

  1. One-Click Corporate Login: This eliminates the need for a separate PagerDuty username and password, which means one less thing to remember.
  2. On-Demand User Provisioning: PagerDuty user accounts are created on-demand once access is granted via the SSO provider.
  3. Revoke User Access: When an employee leaves the company, administrators can remove PagerDuty access within the SSO provider rather than having to log directly into PagerDuty.

Observium

Network monitoring with intuition.

Observium is a network monitoring and management platform that provides real-time insight into network health and performance. It can automatically discover network devices and services, collect performance metrics, and generate alerts when problems are detected.

Observium includes a web-based interface that allows users to view network status and performance metrics in real time, as well as historical data. It is designed to be easy to use and maintain, with a focus on providing the information that network administrators need to quickly identify and resolve issues.

Observium supports a wide range of device types, platforms and operating systems including Cisco, Windows, Linux, HP, Juniper, Dell, FreeBSD, Brocade, Netscaler, NetApp and many more.

Professionally developed and maintained by a team of experienced network engineers and systems administrators, Observium is a platform designed and built by its users.

Editions

Observium Professional and Enterprise are distributed via an SVN-based release mechanism, providing rapid access to daily security and bug fixes as well as new features. A summary of fixes and improvements can be found in the Changelog.

Observium Community is distributed via 6-monthly .tar.gz releases under the QPL Open Source license.

Installation

https://docs.observium.org/install_rhel/

  1. Install ssh on the Observium server if it has not been installed:
yum install openssh
  2. Start the ssh server and add it to system startup:
systemctl enable sshd && systemctl start sshd
  3. Add repositories and packages. Add the EPEL, OpenNMS and REMI repositories, and switch to REMI's PHP packages (PHP 7.4 on RHEL/CentOS 8, PHP 8.2 on RHEL/CentOS 9):
# RHEL / Centos 8
yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
yum install http://yum.opennms.org/repofiles/opennms-repo-stable-rhel8.noarch.rpm
yum install http://rpms.remirepo.net/enterprise/remi-release-8.rpm
yum install yum-utils
dnf module enable php:remi-7.4

# RHEL / Centos 9
dnf config-manager --set-enabled crb
dnf install \
   https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm \
   https://dl.fedoraproject.org/pub/epel/epel-next-release-latest-9.noarch.rpm -y
dnf install dnf-utils http://rpms.remirepo.net/enterprise/remi-release-9.rpm -y
dnf module enable php:remi-8.2 -y

Install the packages required for Observium

yum install wget httpd php php-opcache php-mysqlnd php-gd php-posix php-pear cronie net-snmp \
            net-snmp-utils fping mariadb-server mariadb rrdtool subversion whois ipmitool graphviz \
            ImageMagick php-sodium python3 python3-mysql python3-PyMySQL

Set Python3 to be the default Python version

alternatives --set python /usr/bin/python3

# If you want to monitor libvirt virtual machines, install libvirt
yum install libvirt
  4. Download Observium. First, create a directory for Observium to live in:
mkdir -p /opt/observium && cd /opt
# Observium Community Edition - Download the latest .tar.gz of Observium and unpack:

wget http://www.observium.org/observium-community-latest.tar.gz
tar zxvf observium-community-latest.tar.gz

# For the subscription stable train - Professional / Enterprise 

svn co https://svn.observium.org/svn/observium/branches/stable observium
  5. MySQL database
# Start MySQL/MariaDB and configure it to be run at startup.
    systemctl enable mariadb
    systemctl start mariadb

# Set the MySQL root password:
    /usr/bin/mysqladmin -u root password '<mysql root password>'

# Create the MySQL database:
    mysql -u root -p
    <mysql root password>
    mysql> CREATE DATABASE observium DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
    mysql> GRANT ALL PRIVILEGES ON observium.* TO 'observium'@'localhost' IDENTIFIED BY '<observium db password>';
    mysql> exit;
  6. Observium configuration
# Change into the new install directory:
    cd observium

# Copy the default configuration file and edit it for your system:
    cp config.php.default config.php

# Edit config.php
    Change the options to reflect your installation.
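At a minimum, the database settings must match the database created in the previous step; a sketch with illustrative values:

    // Database settings in config.php
    $config['db_host'] = 'localhost';
    $config['db_user'] = 'observium';
    $config['db_pass'] = '<observium db password>';
    $config['db_name'] = 'observium';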
  7. Insert the MySQL schema. Run the discovery.php script with the upgrade switch -u in order to insert the initial MySQL schema:
    ./discovery.php -u

It is OK to have some errors in the SQL revisions.

  8. Fping. Since fping is installed in a different location on RHEL/CentOS, add a line to config.php to tell Observium where it is:
[root@observium-centos observium]# which fping
/sbin/fping

# Add the following
$config['fping'] = "/sbin/fping";
  9. SELinux. If you are competent enough to maintain SELinux, keeping it enabled is possible too, but that is an even more unsupported configuration than RHEL/CentOS themselves.

Firstly, disable SELinux. You can do this temporarily with the following command:

# Temporarily disable SELinux
    setenforce 0

To make the change permanent, you also need to edit /etc/selinux/config so that the SELINUX option is set to permissive:
    SELINUX=permissive
  10. System. Create the rrd directory to store RRDs in:
# Create rrd directory
    mkdir rrd
    chown apache:apache rrd

# If the server will be running only Observium, create /etc/httpd/conf.d/observium.conf with these contents:
    <VirtualHost *>
    DocumentRoot /opt/observium/html/
    ServerName  observium.domain.com
    CustomLog /opt/observium/logs/access_log combined
    ErrorLog /opt/observium/logs/error_log
    <Directory "/opt/observium/html/">
        AllowOverride All
        Options FollowSymLinks MultiViews
        Require all granted
    </Directory>
    </VirtualHost>

# Create logs directory for apache
    mkdir /opt/observium/logs
    chown apache:apache /opt/observium/logs

# Add a first user, use level of 10 for admin:
    cd /opt/observium
    ./adduser.php <username> <password> <level>

# Add a first device to monitor:
    ./add_device.php <hostname> <community> v2c

# Do an initial discovery and polling run to populate the data for the new device:
    ./discovery.php -h all
    ./poller.php -h all
  11. Cron. Add cron jobs: create a new file /etc/cron.d/observium with the following contents:
# Run a complete discovery of all devices once every 6 hours
33  */6   * * *   root    /opt/observium/observium-wrapper discovery >> /dev/null 2>&1

# Run automated discovery of newly added devices every 5 minutes
*/5 *     * * *   root    /opt/observium/observium-wrapper discovery --host new >> /dev/null 2>&1

# Run multithreaded poller wrapper every 5 minutes
*/5 *     * * *   root    /opt/observium/observium-wrapper poller >> /dev/null 2>&1

# Run housekeeping script daily for syslog, eventlog and alert log
13 5      * * * root /opt/observium/housekeeping.php -ysel >> /dev/null 2>&1

# Run housekeeping script daily for rrds, ports, orphaned entries in the database and performance data
47 4      * * * root /opt/observium/housekeeping.php -yrptb >> /dev/null 2>&1

And reload the cron process:

systemctl reload crond
  12. Final points. Let's set httpd to start up when we reboot the server:
    systemctl enable httpd
    systemctl start httpd
    Permit HTTP through the server's default firewall

    firewall-cmd --permanent --zone=public --add-service=http
    firewall-cmd --reload
  13. Problems. When running e.g. poller.php or discovery.php, you may see a lot of notices regarding undefined indexes, variables and offsets. To hide these notices, you can do the following:
# Edit php.ini
vi /etc/php.ini

# Find the line containing:
    error_reporting = E_ALL & ~E_DEPRECATED

# Change this to:
    error_reporting = E_ALL & ~E_NOTICE

Datadog

https://www.datadoghq.com/

Monitoring cloud and on-prem systems, apps and services

Infrastructure monitoring

https://www.datadoghq.com/product/infrastructure-monitoring/