Group Role:
- Monitor system health
- Monitor network connections
- Monitor batch job completion, job scheduling (rerun, pause, execute)
- Handle Tier 1 responses, which are often documented in a 'run book'
- Escalate priority alerts to the appropriate support team
Note: Working as a NOC agent can be a good way to get started in IT
Start of day:
- Get a lot of coffee.
- The NOC runs 24x7, and staring at a screen waiting for an alert can be a challenge.
- Some NOCs run 12-hour shifts, with agents working 3-4 days per week.
- Status transition
- At each shift change, the status of open items needs to be communicated to the incoming agents, such as:
- Current outages and actions taken so far in the run book
- Systems under maintenance and when to resume monitoring
- Open requests (e.g., start job #2456 at 2:15 AM)
- Changes to on-call coverage
- Confirm the information being handed off is correct (check the status board for open alarms)
- Monitor
- Most NOCs have several dashboards that show the current state of the environment.
- Alarms are usually very easy to recognize: alarms show as big red indicators, and warnings are yellow.
- A mailbox is usually used as well, for alert messages, batch job failure or success notices, and requests for batch jobs to be run or held.
- Alert!
- An alert comes up. Follow the response checklist, which may look like this:
- Confirm whether the alert is a false positive
- Enter a ticket to record the event and assign to the NOC
- Perform any remediation steps documented for such an alert (if present)
- If the alert doesn't clear, follow the escalation procedures
- A P1 or Critical priority response is typically a voice hand-off to the on-call engineer responsible for the system or application that raised the alarm
- Medium to low priority alarms, such as a QA system being offline, typically result in an email notice to the support team and a ticket assigned to them
- Update the ticket and reassign it to the appropriate support team/person.
- Batch Request
- Users frequently request that a batch job be run or held. If a job fails, re-running it is the normal action.
- Multiple failures: submit a ticket to the support team for remediation.
- Notices
- Engineers will often notify the NOC to ignore alarms for a system because maintenance will be performed. Depending on the monitoring software, a NOC agent will typically put the system into 'maintenance mode', which suppresses alerts during a specified time window.
- On-call coverage changes occur frequently. These changes to the default rotation should be noted and also communicated to the next shift.
- End of shift
- Hand off the status of systems, maintenance windows, and other details to the next shift.
End of Day.
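The alert checklist and the maintenance-mode notice described above can be sketched as a small decision function. This is only an illustration, not how any particular monitoring product works: the names `Alert`, `MaintenanceWindow`, `Action`, and `next_action` are hypothetical, and closing a false positive without escalation is an assumption, since the checklist does not say what happens to one.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Priority(Enum):
    P1 = "P1"          # critical
    MEDIUM = "medium"
    LOW = "low"


class Action(Enum):
    SUPPRESS = "suppressed by maintenance mode"
    CLOSE_FALSE_POSITIVE = "close as false positive"   # assumed outcome
    CLOSE_REMEDIATED = "close ticket, run book remediation cleared the alert"
    CALL_ON_CALL = "voice hand-off to on-call engineer, reassign ticket"
    EMAIL_SUPPORT = "email support team, reassign ticket"


@dataclass
class Alert:
    system: str
    priority: Priority
    raised_at: datetime


@dataclass
class MaintenanceWindow:
    system: str
    start: datetime
    end: datetime

    def covers(self, alert: Alert) -> bool:
        # An alert is covered if it is for the same system and was
        # raised inside the window (both ends inclusive).
        return (self.system == alert.system
                and self.start <= alert.raised_at <= self.end)


def next_action(alert, windows, is_false_positive, cleared_by_run_book):
    """Return the next step for an alert, following the checklist order."""
    # Maintenance mode suppresses alerts during the specified window.
    if any(w.covers(alert) for w in windows):
        return Action.SUPPRESS
    # Step 1: confirm whether the alert is a false positive.
    if is_false_positive:
        return Action.CLOSE_FALSE_POSITIVE
    # Steps 2-3: a ticket is opened and documented remediation is tried.
    if cleared_by_run_book:
        return Action.CLOSE_REMEDIATED
    # Step 4: the alert did not clear, so escalate by priority.
    if alert.priority is Priority.P1:
        return Action.CALL_ON_CALL
    return Action.EMAIL_SUPPORT
```

For example, a P1 alert on a system that is not under maintenance, is not a false positive, and does not clear after the run book steps maps to `Action.CALL_ON_CALL`, while the same situation for a low-priority QA system maps to `Action.EMAIL_SUPPORT`.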
Supplemental:
There is a lot of quiet time. NOC agents can make the best use of this time with training, independent study, and certifications. A natural advancement is into one of the infrastructure support teams.