
SRE jobs in USA
Title: SRE Engineer
Location: Dallas, TX (Onsite) (Locals Only)
Job Type: Contract
Must Have Skills: SRE, Triage, AWS
The Site Reliability Engineering (SRE) Triage Specialist is responsible for quickly assessing and prioritizing incidents and issues raised by the Monitoring team and determining the best course of action. This role is vital for minimizing downtime and ensuring that the right teams are engaged as efficiently as possible during a service disruption.
Roles and Responsibilities
Incident triage and initial response: Act as the first responder for alerts, incidents, and field-reported production issues, and escalate high-priority problems to the appropriate Engineering or Application Support team.
Initial investigation and diagnosis: Perform initial analysis to understand the scope and potential cause of an incident, gathering key metrics, logs, and traces.
Runbook execution: Follow predefined runbooks and standard operating procedures to mitigate and resolve common issues swiftly.
Technical collaboration: Work with Implementation, Engineering, and Application SMEs and other stakeholders during an incident to provide real-time updates and coordinate resolution efforts.
Communication: Provide clear and timely updates to stakeholders about the status of an incident, including warm shift handoffs.
Documentation and knowledge management: Summarize incident details, update knowledge bases, and contribute to the Problem Resolution Database.
Process improvement: Identify opportunities to improve observability and incident response procedures based on daily triage activities.
Key Skills
Systems knowledge: Strong understanding of Linux/Unix systems and AWS (EKS, EC2, ELB, NLB).
Monitoring and logging: Experience with observability tools such as ELK, Grafana, and New Relic.
Scripting: SQL.
Containerization: Familiarity with Docker and Kubernetes.
To apply for this job, email your details to hiringjack1926@gmail.com