Recover From Late Result Data
When a significant event affects a monitoring node, the network between the node and the server, or the server itself, so that the node can no longer send its results back to the server, the node queues the result data. During the outage, no alerts can be raised from this node's results. Even after service is restored, it can take several hours for the node to transmit its queued result data to the production server in full. During this catch-up period, customers still receive no alerts based on this node's results.
The following high-level steps must be performed to remedy this situation.
- Ensure that the server is capable of receiving data
- Verify that the node is up and can reach the server
- Ensure that the node is monitoring and transmitting data (even if only in catch-up state)
- Invoke special procedures to take the catch-up process offline
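The first checks in this list can be run by hand, but a small helper makes the outcome of each one explicit. The following is a minimal sketch, not part of the original procedure: check_step is a hypothetical helper name, and the ping example in the comment assumes that ICMP is allowed between node and server and that SERVER holds the server's hostname.

```shell
#!/bin/sh
# check_step <label> <command...>
# Runs the command quietly and reports OK or FAIL for the given label.
# Returns non-zero when the command fails so callers can stop early.
check_step() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "OK   $label"
    else
        echo "FAIL $label"
        return 1
    fi
}

# Example (assumed, run on the node; SERVER is a placeholder variable):
#   check_step "node can reach server" ping -c 1 "$SERVER"
```

Any command that exits with a zero status on success can be passed in, so the same helper can wrap whatever reachability or process check is appropriate for the site.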
Offline Catch-Up Procedures
The following procedures take the catch-up process offline so that the monitoring node can rejoin the current alerting process. Remember that while a node is catching up on old monitor results it cannot participate in alerting, so the alert threshold must be met by the other nodes alone.
Capture Old Monitor Results
Log in as vwpoint on the late node and change directory to /home/vwpoint/viewPoint/var. Run the following steps between 5-minute check cycles so that no check cycle is missed. Once they complete, the node's result data has effectively been reset and the node begins transmitting only current data.
cd /home/vwpoint/viewPoint/var
gnwstop
mv xx.yy.zz.qq xx.yy.zz.qq.OLD
mv data.yymmdd data.yymmdd.OLD
touch data.yymmdd
echo 0000000000 > xx.yy.zz.qq
gnwstart
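Because the capture steps have to fit between check cycles, it can help to run them as one unit instead of typing them one at a time. The sketch below wraps the same commands in a function and adds a timing guard. It rests on assumptions that are not stated in the procedure: that check cycles start on wall-clock 5-minute boundaries, and that a 60-second margin on either side is sufficient. The names between_cycles, capture_old_results, CYCLE and MARGIN are hypothetical.

```shell
#!/bin/sh
CYCLE=300   # seconds per check cycle (5 minutes, from the procedure)
MARGIN=60   # assumed safety margin around each cycle boundary

# between_cycles <epoch-seconds>
# Succeeds when the time is at least MARGIN seconds away from both the
# previous and the next assumed cycle boundary.
between_cycles() {
    offset=$(( $1 % CYCLE ))
    [ "$offset" -ge "$MARGIN" ] && [ "$offset" -le $(( CYCLE - MARGIN )) ]
}

# capture_old_results <var-dir> <xx.yy.zz.qq file> <data.yymmdd file>
# Same command sequence as the manual steps. The renames are chained so
# that a failed mv never lets the reset overwrite the original file;
# gnwstart runs in every case so the node is not left stopped.
capture_old_results() {
    node_file=$2
    data_file=$3
    cd "$1" || return 1
    gnwstop
    mv "$node_file" "$node_file.OLD" &&
        mv "$data_file" "$data_file.OLD" &&
        touch "$data_file" &&
        echo 0000000000 > "$node_file"
    status=$?
    gnwstart
    return $status
}
```

A possible invocation, with the placeholder file names replaced by the real ones, is: between_cycles "$(date +%s)" && capture_old_results /home/vwpoint/viewPoint/var xx.yy.zz.qq data.yymmdd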
Import Old Monitor Results
Next, import the OLD files into the server offline using the following steps.
Log in as vwpoint on the production server. Create a temporary recovery folder with a <hostname>/var subfolder, copy the data.yymmdd.OLD and xx.yy.zz.qq.OLD files from the node in question into it, and rename them back to their original names. The example below uses a node named aplus.
mkdir -p /home/vwpoint/recovery/aplus/var
cd /home/vwpoint/recovery/aplus/var
scp 'vwpoint@aplus:/home/vwpoint/viewPoint/var/*.OLD' .
mv xx.yy.zz.qq.OLD xx.yy.zz.qq
mv data.yymmdd.OLD data.yymmdd
Now import the data using sndMonRes with debug output enabled (-d 1) and observe the output of this command. You can also follow the import on the www.globalnetwatch.com site by watching the node statistics as they come in.
export VIEWPOINT=/home/vwpoint/recovery/aplus
sndMonRes -i xx.yy.zz.qq -d 1
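When several nodes need recovery, the import steps can be parameterized by hostname. The following is a sketch under assumptions, not an established tool: import_old_results is a hypothetical function name, and RECOVERY_ROOT is a hypothetical override that defaults to the /home/vwpoint/recovery path used above. The scp, mv and sndMonRes invocations are the same ones shown in the manual steps.

```shell
#!/bin/sh
# import_old_results <node-host> <xx.yy.zz.qq file> <data.yymmdd file>
# The body runs in a subshell, so the cd and the exported VIEWPOINT
# do not leak into the calling shell.
import_old_results() (
    node=$1
    node_file=$2
    data_file=$3
    recovery=${RECOVERY_ROOT:-/home/vwpoint/recovery}/$node

    mkdir -p "$recovery/var" && cd "$recovery/var" || exit 1

    # The glob is quoted so that it is expanded on the node, not locally.
    scp "vwpoint@$node:/home/vwpoint/viewPoint/var/*.OLD" . || exit 1

    # Restore the original file names before importing.
    mv "$node_file.OLD" "$node_file" || exit 1
    mv "$data_file.OLD" "$data_file" || exit 1

    # Point sndMonRes at the recovery tree and import with debug output.
    export VIEWPOINT=$recovery
    sndMonRes -i "$node_file" -d 1
)
```

For the node in the example above, the call would be: import_old_results aplus xx.yy.zz.qq data.yymmdd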