Site Reliability Engineer
Job Description
The Site Reliability Engineer supports users of AZZ's Digital Galvanizing System (DGS) software. Users are located at corporate headquarters and at 45 AZZ Metal Coatings plants.
Essential Functions
- Serve as a solutions provider.
- Provide leadership in problem-solving, utilizing technical skills and support tools to diagnose user and system issues.
- Serve as the point of escalation for reporting and monitoring concerns about quality, timelines, or other critical issues that need immediate attention.
- Collaborate on, develop, and document policies and procedures relating to site reliability.
- Recommend new technologies to ensure quality and productivity.
- Develop and drive real-time monitoring solutions that provide visibility into site health and key performance indicators.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
- Manage communication of product upgrades, change management, and release management.
- Maintain a deep understanding of the product in accordance with specific business needs.
- Build positive working relationships with internal customers.
- Collaborate with the team to track status and drive product fixes and feature requests, helping develop software in an agile environment.
- Provide subject matter expertise in the areas of monitoring and site reliability.
- Report and communicate high-value metrics to leadership.
Technical Qualifications / Experience
- Knowledge of architecture, design and business processes.
- Experience in multiple infrastructure disciplines and functions.
- Proficiency with databases, application servers, cloud service platforms, and REST APIs.
Current Site Infrastructure
GitHub, Bitrise, Xcode, Meraki MDM, REST APIs, AWS, Swift, Firebase, BigQuery, Google Data Studio, Tomcat, Splunk, Twilio, Oracle (19c DB, ORDS, AOP, EBS, APEX), Jira, Microsoft 365