Primary Function of the Position
Reporting to the Site Reliability Engineer Team Lead, the Site Reliability Engineer will be responsible for ensuring the reliability, scalability and performance of our systems.
The responsibilities include:
- Develop the Site Reliability Engineering culture across the team by applying best practices, approaches and code.
- Apply automation, and propose and implement software, for any tasks or parts of the system where doing so would deliver benefit.
- Monitor application performance, identifying and implementing improvements to performance and stability.
- Collaborate on the design and implementation of the pipelines and processes for deployment to the production environment.
- Work closely with the Platform and Software domains to ensure continuous improvement of performance and stability whilst adhering to standards.
- Undertake ad-hoc projects and other activities as required.
Key Accountabilities and Activities
1. Contribute to the SRE function, including:
- Driving evolution of the DevOps / GitOps toolchain, promoting improvements to streamline the software delivery process and demonstrating those improvements through metrics.
- Being accountable for halting a project or product if the solution is not technically acceptable.
- Producing and maintaining documentation relating to application design, integration processes, testing procedures, and deployment approach, and collaborating with teams to create operational runbooks and playbooks.
2. Integrate with Domains, including:
- Collaborating with Domains to plan, design, test and maintain the application.
- Design patterns for any component or structure under SRE responsibility.
- Implementation of components such as Monitoring and Logging.
- Managing runbook preparation for Domains.
3. Liaise with and support other teams on work items, including:
- Developing, refining, and tuning integrations between application elements.
- Collaborating with stakeholders in the Enterprise, Solution and Development teams to produce and maintain standards and guidelines.
- Sharing knowledge with, and educating, team members across the organisation.
- Acting as first point of contact for the Problem management and Process Outcomes team.
4. Build and guide successful SRE efforts, including:
- Analysing and resolving technical and application issues.
- Researching and evaluating software products.
- Evaluating risks and defects, analysing specifications, and customising applications for specific customer needs.
- Identifying complex and manual processes and working to simplify and automate them.
- Continuously reviewing capabilities and roles critical to evolving DevOps and quality assurance practices, and being responsible for their acquisition, development, and maturity.
- Minimising outages through continuous improvement.
5. Undertake ad-hoc projects and other activities as required.
Experience and Skills
Essential
- Experience and demonstrable knowledge of SRE best practices
- Expert in logging and monitoring solutions (Prometheus, Grafana, etc.)
- Demonstrable knowledge of Cloud
- Expert knowledge of Kubernetes
- Proficiency in written and verbal English
- Understanding of non-functional testing
- Significant DevOps experience
Desirable
- Proven ability to work independently and collaboratively in a fast-paced technical environment.
- Demonstrable knowledge of the telecommunications industry and technologies.
- Proven experience and ability to provide support to direct reports.
- Golang skills and experience.
Last updated on Sep 27, 2024