想必各个公司都有部署zabbix之类的监控系统来监控服务器的资源使用情况、各服务的运行状态,是否这种监控就足够了呢?有没有遇到监控系统一切正常确发现项目无法正常对外提供服务的情况呢?本篇文章聊聊我们如何简单的使用Nagios监控业务的状态
文中的业务指用户访问的网站页面,对外提供的API接口,移动端的APP等产品
通常我们会在项目所在的机房部署一套监控系统来监控我们服务器和MySQL之类的公共服务,制定报警策略,在出现异常情况的时候邮件或短信提醒我们及时处理。
此类监控主要的关注点有两个:
同时也会存在以下两个主要问题:
那么如何解决这两个问题呢?
业务状态监控,就是要最直观的的反映业务当前是正常还是故障,该怎么监控呢?以web项目为例,首先就是要确定具体URL的返回状态,是200正常还是404未找到等,其次要考虑页面里边的内容是不是正常,我们知道最终反馈给用户内容的是由一些静态资源和后端接口数据共同组成的HTML页面,想知道内容究竟对不对这个比较困难,退而求其次我们默认所有静态资源和后端接口都返回正常状态则表示正常,这个监控就比较容易实现了。
静态资源可以直接由nginx服务器处理,nginx的并发能力很强,一般不会成为性能的瓶颈,针对静态资源的监控我们可以结合ELK一起来看。后端接口的处理性能就要差很多了,对业务状态的监控也主要是对后端接口状态的监控,那我们是否需要监控所有的接口呢?这个实施起来比较麻烦,我觉得没太大必要,只需要监控几个有代表性的接口就可以了,例如我们所有的项目中都让开发单独加了一个health check的接口,这个接口的作用是连接项目所有用到的服务进行操作,如接口连接mysql进行数据查询以确定mysql能给正常提供服务,连接redis进行get、set操作以确定redis服务正常,对于这个接口的监控就能覆盖到整个链路的服务情况。
对于监控服务器和业务服务器在同一个机房内所导致的问题(上边讲到的第二点问题),我们可以通过在不同的网络环境内部署独立的状态监控来解决,例如办公区部署Nagios,不同网络监控也更接近用户的网络情况,这套状态监控就区别于机房部署的资源占用监控了,主要用来监控业务的状态,也就是我们上边提到的URL和接口状态。
我们能不能直接将监控部署在机房外的环境来节省一套监控呢?例如公司或者其他的机房部署监控。这样不是个好方案,跨网络的监控性能太差了,首先网络之间的延迟都比同机房内要大的多,其次大量监控项频繁的数据传输对带宽也是不小的压力
我们业务状态监控采用了Nagios,Nagios部署简单配置灵活,这种场景下非常适合。
1.安装基础环境
# apt-get update
# apt-get install -y build-essential libgd2-xpm-dev autoconf gcc libc6 make wget
# apt-get install -y nginx php5-fpm spawn-fcgi fcgiwrap
2.下载并解压nagios
# wget https://assets.nagios.com/downloads/nagioscore/releases/nagios-4.0.8.tar.gz
# tar -zxvf nagios-4.0.8.tar.gz
# cd nagios-4.0.8
# ./configure && make all
# make install-groups-users
# usermod -a -G nagios www-data
# make install
# make install-init
# make install-config
# make install-commandmode
# cd ..
(No output on stdout) stderr: execvp(/usr/local/nagios/libexec/check_ping, ...) failed. errno is 2: No such file or directory
,这是因为我们只安装了nagios的core,没有安装nagios的插件,需要安装插件来支持core工作3.安装nagios-plugins
# wget https://nagios-plugins.org/download/nagios-plugins-2.2.1.tar.gz
# tar -zxvf nagios-plugins-2.2.1.tar.gz
# cd nagios-plugins-2.2.1
# ./configure
# make
# make install
# cd ..
/usr/local/nagios/libexec/
下,可以借助这些插件来监控我们的HTTP接口或主机、服务状态4.创建nagios web访问的账号密码
# vi /usr/local/bin/htpasswd.pl
#!/usr/bin/perl
use strict;
if ( @ARGV != 2 ){
print "usage: /usr/local/bin/htpasswd.pl <username> <password>\n";
}
else {
print $ARGV[0].":".crypt($ARGV[1],$ARGV[1])."\n";
}
# chmod +x /usr/local/bin/htpasswd.pl
#利用perl脚本生成账号密码到htpasswd.users文件中
# /usr/local/bin/htpasswd.pl nagiosadmin nagios@ops-coffee > /usr/local/nagios/htpasswd.users
/usr/local/nagios/etc/cgi.cfg
5.nginx添加server配置,让浏览器可以访问
server {
listen 80;
server_name ngs.domain.com;
access_log /var/log/nginx/nagios.access.log;
error_log /var/log/nginx/nagios.error.log;
auth_basic "Private";
auth_basic_user_file /usr/local/nagios/htpasswd.users;
root /usr/local/nagios/share;
index index.php index.html;
location / {
try_files $uri $uri/ index.php /nagios;
}
location /nagios {
alias /usr/local/nagios/share;
}
location ~ \.php$ {
include /etc/nginx/fastcgi_params;
fastcgi_pass unix:/var/run/php5-fpm.sock;
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
}
location ~ ^/nagios/(.*\.php)$ {
alias /usr/local/nagios/share/$1;
include /etc/nginx/fastcgi_params;
fastcgi_pass unix:/var/run/php5-fpm.sock;
}
location ~ \.cgi$ {
root /usr/local/nagios/sbin/;
rewrite ^/nagios/cgi-bin/(.*)\.cgi /$1.cgi break;
fastcgi_param AUTH_USER $remote_user;
fastcgi_param REMOTE_USER $remote_user;
include /etc/nginx/fastcgi_params;
fastcgi_pass unix:/var/run/fcgiwrap.socket;
}
}
6.检查配置文件并启动
#检查配置文件是否有语法错误
# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
#启动nagios服务
# /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
#启动fcgiwrap和php5-fpm服务
# service fcgiwrap restart
# service php5-fpm restart
7.浏览器访问服务器IP或域名就可以看到nagios的页面了,默认有本机的监控数据,不需要的话可以在配置文件localhost.cfg
中删除
Nagios的主配置文件路径为/usr/local/nagios/etc/nagios.cfg
,里边默认已经配置了一些配置文件的路径,cfg_file=后边配置的都是配置文件,nagios程序会来这里读取配置,我们可以新添加一个专门用来监控HTTP API的配置文件
cfg_file=/usr/local/nagios/etc/objects/check_api.cfg
define service{
use generic-service
host_name localhost
service_description web_project_01
check_command check_http!ops-coffee.cn -S
}
define service{
use generic-service
host_name localhost
service_description web_project_02
check_command check_http!ops-coffee.cn -S -u / -e 200
}
define service{
use generic-service
host_name localhost
service_description web_project_03
check_command check_http!ops-coffee.cn -S -u /action/health -k "sign:e5dhn"
}
/usr/local/nagios/etc/objects/templates.cfg
/usr/local/nagios/etc/objects/commands.cfg
SSL is not available
,那么你需要先安装libssl-dev包,然后重新编译(./configure --with-openssl=/usr/bin/openssl
)部署nagios-plugin插件添加对ssl的支持define command {
command_name check_http
command_line $USER1$/check_http -H $ARG1$
}
check_http
就是我们通过安装nagios-plugin生成的,位于/usr/local/nagios/libexec/
下,check_http
的详细用法可以通过check_http -h查看
,支持比较广泛define service {
name generic-service ; The 'name' of this service template
active_checks_enabled 1 ; Active service checks are enabled
passive_checks_enabled 1 ; Passive service checks are enabled/accepted
parallelize_check 1 ; Active service checks should be parallelized (disabling this can lead to major performance problems)
obsess_over_service 1 ; We should obsess over this service (if necessary)
check_freshness 0 ; Default is to NOT check service 'freshness'
notifications_enabled 1 ; Service notifications are enabled
event_handler_enabled 1 ; Service event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
is_volatile 0 ; The service is not volatile
check_period 24x7 ; The service can be checked at any time of the day
max_check_attempts 2 ; Re-check the service up to 3 times in order to determine its final (hard) state
check_interval 1 ; Check the service every 10 minutes under normal conditions
retry_interval 1 ; Re-check the service every two minutes until a hard state can be determined
contact_groups admins ; Notifications get sent out to everyone in the 'admins' group
notification_options w,u,c,r ; Send notifications about warning, unknown, critical, and recovery events
notification_interval 60 ; Re-notify about service problems every hour
notification_period 24x7 ; Notifications can be sent out at any time
register 0 ; DON'T REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}
配置太多就不一一解释了,配合后边的英文注释应该看得懂,说几个重要的
/usr/local/nagios/etc/objects/contacts.cfg
define contact{
contact_name sa ; Short name of user
use generic-contact ; Inherit default values from generic-contact template (defined above)
alias Nagios Admin ; Full name of user
service_notification_period 24x7
host_notification_period 24x7
service_notification_options w,u,c,r
host_notification_options d,u,r
host_notification_commands notify-host-by-email,notify-host-by-sms
service_notification_commands notify-service-by-email,notify-service-by-sms
email [email protected]
pager 15821212121,15822222222
}
define contactgroup{
contactgroup_name admins
alias Nagios Administrators
members sa
}
/usr/local/nagios/etc/objects/commands.cfg
文件内全部配置完成后重启nagios服务,会看到监控已经正常
介绍一款配合nagios用起来非常棒的插件Nagstamon,Nagstamon是一款nagios的桌面小工具(实际上现在不仅仅能配合nagios使用,还能配合zabbix等使用),启动后常驻系统托盘,当nagios监控状态发生变化时会及时的跳出来并发出声音警告,能够更加及时的获取业务状态。
配置如下: