Must Have Skills:
7+ years of overall experience, including 5+ years of proven experience as an Observability
Engineer, Site Reliability Engineer (SRE), or in a similar role.
Strong proficiency in implementing and maintaining observability tools such as New Relic,
Datadog, Prometheus, Grafana, the ELK Stack (Elasticsearch, Logstash, Kibana), and the TICK Stack.
Solid experience with instrumentation practices, including metrics, logging, and distributed
tracing.
Familiarity with cloud platforms and containerization technologies (Docker, Kubernetes).
Proficiency in scripting and automation using languages such as Python, Bash, or PowerShell.
Excellent problem-solving skills with the ability to analyze complex issues and provide
efficient solutions.
Strong communication skills and the ability to collaborate effectively across teams.
Understanding of Agile and DevOps principles and their application in observability and
monitoring contexts.
Relevant certifications in observability tools and practices (e.g., Certified Prometheus
Practitioner) are a plus.
Roles and Responsibilities:
Develop and implement observability solutions to gain insights into application and
infrastructure performance, availability, and reliability.
Collaborate with development, operations, and other teams to instrument applications and
services for metrics, logs, traces, and other relevant data.
Design and implement monitoring solutions using industry-standard tools and practices to
detect, analyze, and mitigate incidents and anomalies.
Create and manage dashboards, alerts, and visualization tools to provide real-time visibility
into system behavior and performance.
Perform in-depth analysis of system behavior and trends to identify areas for improvement,
optimization, and increased efficiency.
Troubleshoot complex issues by analyzing data from various sources to quickly diagnose and
resolve incidents, minimizing downtime.
Continuously evaluate and recommend improvements to observability processes, tools, and
practices to align with industry best practices.
Contribute to the development of automation scripts and tools to enhance observability and
incident response.
Collaborate with development teams to improve application design for better observability,
including implementing distributed tracing and structured logging.
Stay updated with emerging trends, technologies, and methodologies in observability,
monitoring, and performance analysis.
Last updated on Nov 20, 2023