Audience: IT Support Staff

***** Note: This process was decommissioned during the transition to PostgreSQL. This page is kept only for historical documentation purposes.

When used: This document describes the group alerting process at a “high level” and explains how to operate it: setup, starting, and stopping the process.

The group alerting process is a periodic process that runs continuously once started. It spends most of its time in a “sleep mode” and therefore consumes very few CPU cycles. Its purpose is to group tested targets together so that selected individuals are notified about them as a group when problems arise. The group alerting process runs on our production server IPREV2, not on our nodes distributed across the world.
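The sleep-and-wake behavior described above can be pictured with a minimal Python sketch. This is an illustration only, not the actual groupOutageAlert code: the function name run_periodically, the check_groups callback, and the cycles and sleep parameters are all hypothetical, and the interval is assumed to come from the PERIOD parameter described under Configuration and Setup.

```python
import time


def run_periodically(check_groups, period_seconds, cycles=None, sleep=time.sleep):
    """Call check_groups, then sleep for period_seconds, over and over.

    cycles=None means run forever (the behavior described in the text);
    a number limits the loop, which is only useful for trying the sketch out.
    All names here are hypothetical.
    """
    done = 0
    while cycles is None or done < cycles:
        check_groups()          # brief burst of work: test every group
        done += 1
        sleep(period_seconds)   # idle for the rest of the period
    return done
```

Passing a small cycles value and a stand-in sleep function makes it possible to try the loop without actually waiting for a full period.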

The group alerting process will send out at most two outage alerts in a row for a particular tested target if it is the only target failing within a group. The outage alert is sent via email and pager. Please note that the alert can be sent to a list of email addresses and also to a list of pager numbers. No more outage alerts will be sent for a target that has already failed two tests in a row if it is the only target failing within a group of tested targets. The only exception is when other targets start failing: those newly failing targets will generate an outage alert, and this new outage alert includes all failing targets regardless of how many tests in a row they have failed. Please note that if a group contains, for example, 10 tested targets and all of them are failing, no more outage alerts will be sent after the last failing target has generated 2 outage alerts.
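One way to read the rule above is sketched below in Python. It is an interpretation of the paragraph, not the original implementation: the function name, the dictionary of consecutive failure counts, and the constant name are assumptions made for illustration.

```python
MAX_ALERTS_IN_A_ROW = 2  # from the text: at most two outage alerts in a row


def outage_alert_targets(consecutive_failures):
    """Return the targets to list in this cycle's outage alert, or [] for no alert.

    consecutive_failures maps a target name to the number of tests in a row
    it has failed (0 means the target is currently passing). Hypothetical
    data shape, chosen only for this sketch.
    """
    failing = [t for t, n in consecutive_failures.items() if n > 0]
    # A target only triggers an alert on its first or second failure in a row.
    triggers = [t for t in failing if consecutive_failures[t] <= MAX_ALERTS_IN_A_ROW]
    # When anything triggers, the alert lists every failing target.
    return sorted(failing) if triggers else []
```

For example, a group where one target has failed five tests in a row and another has just failed its first test would produce an alert listing both targets, while a group where every failing target is past its second failure would produce none.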

The group alerting process will also send notifications when failing targets return to a normal (non-failing) state. To summarize, the group alerting process can send three possible alerts: Group Outage Alert, Group Back to Normal Alert, and Group Combined Alert – Outage / Back to Normal Alert.
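The choice between the three alerts can be sketched as a small Python function. This is a simplified illustration under assumptions: the function name and its two arguments (the targets triggering an outage alert in this cycle, and the targets that have just recovered) are hypothetical and say nothing about how the real process tracked state.

```python
OUTAGE = "Group Outage Alert"
BACK_TO_NORMAL = "Group Back to Normal Alert"
COMBINED = "Group Combined Alert – Outage / Back to Normal Alert"


def choose_alert(alerting, recovered):
    """Pick the alert name for one group in one cycle, or None for no alert.

    alerting:  targets that trigger an outage alert this cycle
    recovered: targets that were failing and are now back to normal
    """
    if alerting and recovered:
        return COMBINED
    if alerting:
        return OUTAGE
    if recovered:
        return BACK_TO_NORMAL
    return None
```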

A Group Outage Alert is a notification about a target or multiple targets that are failing. The following is an example Group Outage Alert email that is sent:

Location        Account       Status    Alias
DRDC Internal   drdcweb_int   Failing   01 - DRDC DADOTP01 esuds.sprint.com
DRDC Internal   drdcweb_int   Failing   04 - DRDC DADOTP01 napi.sprint.com
DRDC Internal   drdcweb_int   Failing   06 - DRDC DADOTP01 vanity.sprint.com
DRDC Internal   drdcweb_int   Failing   08 - DRDC DADOTP01 csb.sprint.com
DRDC Internal   drdcweb_int   Failing   10 - DRDC DADOTP01 www.sprintesolutions.com

Total Error Count: 5

A Group Back to Normal Alert is a notification about a target or multiple targets that were failing but are now back in a normal (non-failing) state. The following is an example Group Back to Normal Alert email that is sent:

Location          Account         Status    Alias
DRDC Internal     drdcweb_int     Failing   01 - DRDC DADOTP01 esuds.sprint.com
DRDC Internal     drdcweb_int     Failing   04 - DRDC DADOTP01 napi.sprint.com
DRDC Internal     drdcweb_int     Failing   06 - DRDC DADOTP01 vanity.sprint.com
DRDC Internal     drdcweb_int     Failing   08 - DRDC DADOTP01 csb.sprint.com
RESIDC Internal   residcweb_int   OK        RESIDC PRES0870 WL-csg.sprint.com
RESIDC Internal   residcweb_int   OK        RESIDC PRES0871 WL-csg.sprint.com

Total Error Count: 4

A Group Combined Alert – Outage / Back to Normal Alert is a notification about a target or multiple targets that are failing together with a target or multiple targets that were failing but are now back in a normal (non-failing) state. The following is an example Group Combined Alert – Outage / Back to Normal Alert email that is sent:

Location        Account       Status         Alias
DRDC Internal   drdcweb_int   Failing        PDASDOT2 WL7 - id2.sprint.com
DRDC Internal   drdcweb_int   Second Alert   PDASDOT3 WL7 - sprintbiz.com
DRDC Internal   drdcweb_int   OK             PDASDOT3 WL7 - sprintworldwide.com
DRDC Internal   drdcweb_int   OK             PDASDOT3 WL7 - id.sprint.com

Total Error Count: 2

Configuration and Setup

To configure or set up the Group Alerting Process for targets that are currently being tested, one needs to add entries to a single “configuration” file. This is a flat text file named “groupOutageAlert.config”, located in the same directory as the executable “groupOutageAlert”. On the production server IPREV2 the path to these files is:

cd /home/vwpoint/viewpoint/bin or type "gnwbin".

Use an editor to create or make changes to this file. The following is an example “configuration” file:

# GROUP OUTAGE ALERT CONFIGURATION FILE
#
# Notes: No blank lines are allowed in this file.  A comment can be an entire line where the first
# character is a # sign - everything on the line is ignored.  A comment can also end a line denoted by the
# # sign followed by the comment.  The PERIOD parameter must be the very first item in this file other
# than comments.  The Group field identifies the group where any combination of Grouptypes can be
# used.  The Grouptype field (tag) must be one of the following: SUBID, SUBID-SERVICECNT,
# SERVICEID.  The main thing to remember for a Grouptype is the number of parameters needed for
# each tag can be different.  Please note that parameters can be separated by either spaces or tabs.  The
# following illustrates the three tag's parameters:
#
#  tag:
#
# SUBID              subID        emailAddress  pagerAddress    comment-(optional)
# SUBID-SERVICECNT   subID        serviceCnt   emailAddress  pagerAddress     comment-(optional)
# SERVICEID          serviceID                 emailAddress  pagerAddress     comment-(optional)
#
##########################################################################################
PERIOD 900 #MONITORS ALL ENTRIES IN THIS FILE EVERY (PERIOD) NUMBER OF SEC.
#
#Group   Grouptype     subID      emailAddress            pagerAddress           comment-(optional)
1        SUBID         7028       groupemail1@GNW.com     grouppager1@GNW.com    #Company1 Account
1        SUBID         7029       groupemail1@GNW.com     grouppager1@GNW.com    #Company1 Account
1        SUBID         7030       groupemail1@GNW.com     grouppager1@GNW.com    #Company2 Account
1        SUBID         7031       groupemail1@GNW.com     grouppager1@GNW.com    #Company2 Account
1        SUBID         7032       groupemail1@GNW.com     grouppager1@GNW.com    #Company3 Account
1        SUBID         7033       groupemail1@GNW.com     grouppager1@GNW.com    #Company3 Account
#
2        SUBID         9277       groupemail2@GNW.com     grouppager2@GNW.com    #Company4 Account
2        SUBID         9285       groupemail2@GNW.com     grouppager2@GNW.com    #Company4 Account
#
#Group  Grouptype          subID   ServiceCnt      emailAddress             pagerAddress
3       SUBID-SERVICECNT   9290    1               groupemail3@GNW.com      grouppager3@GNW.com
3       SUBID-SERVICECNT   9290    2               groupemail3@GNW.com      grouppager3@GNW.com
3       SUBID-SERVICECNT   9290    3               groupemail3@GNW.com      grouppager3@GNW.com
3       SUBID-SERVICECNT   9290    4               groupemail3@GNW.com      grouppager3@GNW.com
3       SUBID-SERVICECNT   9290    7               groupemail3@GNW.com      grouppager3@GNW.com
#
4       SUBID-SERVICECNT   9300    9               groupemail4@GNW.com      grouppager4@GNW.com
4       SUBID-SERVICECNT   9300    3               groupemail4@GNW.com      grouppager4@GNW.com
4       SUBID-SERVICECNT   9300    4               groupemail4@GNW.com      grouppager4@GNW.com
4       SUBID-SERVICECNT   9300    7               groupemail4@GNW.com      grouppager4@GNW.com
4       SUBID-SERVICECNT   9300    13              groupemail4@GNW.com      grouppager4@GNW.com
4       SUBID-SERVICECNT   9300    1               groupemail4@GNW.com      grouppager4@GNW.com
#
#Group   Grouptype    serviceID   emailAddress          pagerAddress         comment-(optional)
5        SERVICEID    22520       groupemail5@GNW.com   grouppager5@GNW.com  #Company8 Account
5        SERVICEID    22521       groupemail5@GNW.com   grouppager5@GNW.com  #Company8 Account
5        SERVICEID    22522       groupemail5@GNW.com   grouppager5@GNW.com  #Company8 Account
#Group   Grouptype    subID       emailAddress          pagerAddress         comment-(optional)
5       SUBID         8000       groupemail5@GNW.com     grouppager5@GNW.com  #Company9 Account
#Group  Grouptype          subID   ServiceCnt     emailAddress             pagerAddress
5       SUBID-SERVICECNT   9000    1              groupemail5@GNW.com      grouppager5@GNW.com
5       SUBID-SERVICECNT   9000    2              groupemail5@GNW.com      grouppager5@GNW.com
5       SUBID-SERVICECNT   9000    3              groupemail5@GNW.com      grouppager5@GNW.com
 

Please read the notes at the beginning of this file, because a valid “configuration” file must follow its strict rules. The very first line other than comments must be the PERIOD parameter. This value tells the groupOutageAlert process how long to “sleep” between tests (in seconds). A group is denoted by a group number; in other words, all targets that have the same group number are part of the same group. There is no limit to how many targets can be part of the same group, and there is no limit to how many groups can exist. A very important point to remember is that the same email address and pager address must be used for every target that is part of the same group. You may be wondering why, if that is the case, this information is duplicated for each target in the group. The answer is that I was enforcing strict rules for every target that was meant to be part of the same group. Please note that this will change in the future. In a future release of this process, one will be able to enter or change this information via a secure web page, and that version will use an XML-formatted “configuration” file.

There are three different group types: SUBID, SUBID-SERVICECNT, and SERVICEID. One can enter any combination of the three group types for the same group. In the example “configuration” file, group number 5 contains all three group types. A SUBID group type refers to a target that is identified by its sub ID. A SERVICEID group type refers to a target that is identified by its service ID. A SUBID-SERVICECNT group type refers to a target that is identified by the combination of its sub ID and service count.

The strict syntax of this file must be followed in order for the groupOutageAlert process to start up. When the groupOutageAlert process is started, the first thing it does is read this file, which tells the process what it needs to do. The file is read once during startup, and the groupOutageAlert process places all of this information into memory. If there are any syntax problems with this “configuration” file, the groupOutageAlert process will not start.
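The syntax rules described in the file's notes and in the paragraphs above can be illustrated with a short Python sketch of a parser. It is a reconstruction from this page only, not the original groupOutageAlert code: the function name parse_config, the FIELDS table, the returned data shapes, and the error messages are all assumptions.

```python
# Parameters expected after the Group and Grouptype columns, per tag.
FIELDS = {
    "SUBID": ("subID", "emailAddress", "pagerAddress"),
    "SUBID-SERVICECNT": ("subID", "serviceCnt", "emailAddress", "pagerAddress"),
    "SERVICEID": ("serviceID", "emailAddress", "pagerAddress"),
}


def parse_config(text):
    """Return (period_seconds, groups) or raise ValueError on a syntax problem."""
    period = None
    groups = {}    # group number -> list of entries (dicts)
    contacts = {}  # group number -> (emailAddress, pagerAddress)
    for lineno, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            raise ValueError(f"line {lineno}: blank lines are not allowed")
        line = raw.split("#", 1)[0].strip()  # drop a trailing or whole-line comment
        if not line:
            continue
        parts = line.split()  # parameters are separated by spaces or tabs
        if period is None:
            if len(parts) != 2 or parts[0] != "PERIOD" or not parts[1].isdigit():
                raise ValueError(f"line {lineno}: PERIOD must be the first item")
            period = int(parts[1])
            continue
        if len(parts) < 2 or parts[1] not in FIELDS:
            raise ValueError(f"line {lineno}: unknown Grouptype")
        group, tag, values = parts[0], parts[1], parts[2:]
        if len(values) != len(FIELDS[tag]):
            raise ValueError(f"line {lineno}: wrong number of parameters for {tag}")
        entry = dict(zip(FIELDS[tag], values), Grouptype=tag)
        contact = (entry["emailAddress"], entry["pagerAddress"])
        # Every target in a group must use the same email and pager addresses.
        if contacts.setdefault(group, contact) != contact:
            raise ValueError(f"line {lineno}: addresses differ within group {group}")
        groups.setdefault(group, []).append(entry)
    if period is None:
        raise ValueError("PERIOD parameter is missing")
    return period, groups
```

Rejecting the whole file on the first problem mirrors the behavior described above, where any syntax problem prevents the process from starting.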

Note: Some of the old email groups and pager groups listed in the config file are defined in /etc/aliases on iprev2. This was the old way of defining these mailing groups. Any new groups should be set up by creating an actual email forwarding group under the globalnetwatch.com domain name. This can be done easily using the swishmail user interface at http://mail.globalnetwatch.com

Starting the Process

To start the groupOutageAlert process, log in to the production server IPREV2 as vwpoint (very important). First make sure there is not another groupOutageAlert process already running before you try to start it (see the section Stopping the Process). Then go to the directory where the executable is located, which is the following path:

1. cd /home/vwpoint/viewPoint/bin or type "gnwbin"

2. Use the following unix command to start the groupOutageAlert process:

3. Verify that the process started successfully. Use the following unix command to do the verification:

4. If this unix command returns the PID of this process, then it was a normal startup.

Stopping the Process

To stop the groupOutageAlert process, use the following Unix command to get the process’s ID number:

1. ps -e | grep groupOutageAlert

2. Now that you have retrieved the PID use the following command to stop the process:

3. Verify the process died:

4. If the server returns with just the grep process, then the process has been successfully terminated.
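The PID lookup in step 1 can also be sketched in Python. This is only an illustration: the function names are hypothetical, and the parsing assumes that `ps -e` prints four columns (PID, TTY, TIME, CMD) with the command name shown untruncated, which may not hold on every system.

```python
import subprocess


def find_pids(ps_output, name):
    """Return the PIDs of processes whose command name equals name.

    ps_output is the text printed by `ps -e`, assumed to have the columns
    PID, TTY, TIME, CMD. The header line is skipped because its first
    field is not a number.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        if len(fields) == 4 and fields[0].isdigit() and fields[3].strip() == name:
            pids.append(int(fields[0]))
    return pids


def running_pids(name="groupOutageAlert"):
    """Run `ps -e` and return the matching PIDs (empty list if none)."""
    result = subprocess.run(["ps", "-e"], capture_output=True, text=True, check=True)
    return find_pids(result.stdout, name)
```

An empty list from running_pids would correspond to the case where no groupOutageAlert process is running.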

Group Alerting Process Issue Resolution

If any problems are ever suspected with the groupOutageAlert process, a catch-all file should capture the problem. The name of this file is groupOutageAlert.debug, and it resides in the same directory as the executable. Please note that this file is recreated every time the groupOutageAlert process is started. On the production server the path to these files is:

One can use the following unix command to view the file:

Group Alerting Services (last edited 2011-08-11 19:28:08 by Bryce Camp)