Hi,
Site Reliability Engineer
Atlanta, GA
Looking for a highly skilled and experienced Site Reliability Engineer (SRE) with a strong background in monitoring and alerting systems, particularly Splunk. The primary focus of this role will be to reduce onscreen monitoring, ensure only actionable alerts are in place, and implement proactive alerting mechanisms.
Responsibilities:
• Develop and refine alerting strategies to minimize onscreen monitoring and focus on actionable alerts.
• Implement proactive alerting mechanisms to identify and address potential issues before they impact the system.
• Collaborate with other IT teams to ensure reliability and performance of applications.
• Automate repetitive tasks to improve efficiency and reduce manual intervention.
• Conduct root cause analysis and post-mortem reviews to prevent recurrence of issues.
• Continuously improve monitoring and alerting processes to enhance system reliability.
Requirements:
• 5+ years of core experience in Site Reliability Engineering principles.
• Develop and implement automation tools and processes to improve efficiency, reduce downtime, and enhance system reliability.
• Monitor and troubleshoot system issues, identifying root causes and implementing fixes.
• Build, modify, and monitor real-time dashboards.
• Design and implement defined SLIs and SLOs.
• Knowledge of tools such as Splunk, Datadog, and New Relic.
• Assist in the triage and resolution process as needed.
• Identify use cases for toil reduction through detection and resolution.
• Propose changes to improve observability and assist engineers in implementing them.
• Strong scripting and automation skills.
• Ability to work collaboratively in a team environment.
• Strong communication skills to effectively convey technical concepts to non-technical stakeholders.
• Knowledge of containerization and orchestration technologies (e.g., Docker, Kubernetes).
Agile Project Management Workforce Management | Saif Syed, Manager/Sr Recruiter | 7520 Standish Place, St 260, Rockville, MD 20855