Sunday, March 3, 2013

A day in the life of: NOC agent

NOC: Network Operations Center

Group Role:
  • Monitor system health
  • Monitor network connections
  • Monitor batch job completion, job scheduling (rerun, pause, execute)
  • Tier 1 responses often documented in a 'run book'
  • Escalation of priority alerts to the appropriate support team
Note: Working as a NOC agent can be a good place to get started in IT.

Start of day:
  1. Get a lot of coffee.
    • The NOC runs 24x7, and staring at a screen waiting for an alert can be a challenge.
    • Some NOCs run 12 hour shifts with agents working 3-4 days per week.
  2. Status transition
    • At each shift change, the status of open items needs to be communicated to the incoming agents, such as:
      • Current outages and actions taken so far in the run book
      • Systems under maintenance and when to resume monitoring
      • Open requests (e.g., start job #2456 at 2:15 AM)
      • Changes to on-call coverage
    • Confirm the information being handed off is correct (check the status board for open alarms)
  3. Monitor
    • Most NOCs have several dashboards that show the current state of the environment.
    • Alarms are usually very easy to recognize: big red indicators for alarms, yellow for warnings.
    • A mailbox is usually monitored as well, for alert messages, batch job failure or success notices, and requests for batch jobs to be run or held.
  4. Alert!
    • An alert comes up. Follow the response checklist, which may look like this:
      1. Confirm if the alert is a false positive
      2. Enter a ticket to record the event and assign to the NOC
      3. Perform any remediation steps documented for such an alert (if present)
      4. If the alert doesn't clear, follow the escalation procedures
        • P1 or Critical priority: the response is typically a voice hand-off to the on-call engineer responsible for the system or application that raised the alarm
        • Medium to low priority (such as a QA system being offline): the response is typically an email notice to the support team, with the ticket assigned to them
      5. Update the ticket and reassign it to the appropriate support team/person.
  5. Batch Request
    • Users frequently request that a batch job be run or held. If a job fails, re-running it is the normal action taken.
    • Multiple failures: submit a ticket to the support team for remediation.
  6. Notices
    • Engineers will often notify the NOC to ignore alarms for a system while maintenance is performed. Depending on the monitoring software, a NOC agent will typically put the system into 'maintenance mode', which suppresses alerts during a specified time window.
    • On-call coverage changes occur frequently. These changes to the default rotation should be noted and also communicated to the next shift.
  7. End of shift
    • Transition status of systems, maintenance windows and other details to the next shift.
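
To make the alert checklist in step 4 and the 'maintenance mode' idea in step 6 a little more concrete, here is a rough Python sketch of the decision an agent makes when an alarm fires. It is only an illustration: the names (Alert, MaintenanceWindow, next_action), the priority levels and the returned action strings are all made up for this example and do not come from any particular monitoring product.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Priority(Enum):
    # Hypothetical priority levels; real NOCs define their own scale.
    P1 = 1
    MEDIUM = 2
    LOW = 3


@dataclass
class Alert:
    system: str
    priority: Priority
    raised_at: datetime


@dataclass
class MaintenanceWindow:
    system: str
    start: datetime
    end: datetime

    def covers(self, alert: Alert) -> bool:
        # An alert is suppressed only for the same system, inside the window.
        return self.system == alert.system and self.start <= alert.raised_at < self.end


def next_action(alert: Alert, windows: list, cleared_by_runbook: bool) -> str:
    """Return the next step for an alert, loosely following the checklist above."""
    if any(w.covers(alert) for w in windows):
        return "suppress"  # system is in maintenance mode
    if cleared_by_runbook:
        return "update and close ticket"  # run book remediation worked
    if alert.priority is Priority.P1:
        return "call on-call engineer"  # voice hand-off
    return "email support team and reassign ticket"
```

For example, a low priority alert on a system that is inside its maintenance window comes back as "suppress", while a P1 alert that the run book steps did not clear comes back as "call on-call engineer".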

End of Day.
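
The batch job rule from step 5 (re-run after a failure, open a ticket after multiple failures) can be sketched the same way. The function name and the max_reruns default of 1 are assumptions made for this example; the actual limit would come from the run book.

```python
def handle_batch_failure(job_id: str, failure_count: int, max_reruns: int = 1) -> str:
    """Decide what to do after a batch job has failed failure_count times."""
    if failure_count < 1:
        raise ValueError("failure_count must be at least 1")
    if failure_count <= max_reruns:
        # Still within the allowed number of re-runs.
        return f"rerun job {job_id}"
    # Multiple failures: hand the job to the support team.
    return f"open ticket for job {job_id}"
```

With the default, the first failure of job 2456 returns "rerun job 2456" and the second returns "open ticket for job 2456".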


Supplemental:
There is a lot of quiet time. NOC agents can make the best use of it for training, independent study and certifications. A natural advancement is into one of the infrastructure support teams.


