Senior Site Reliability Engineer (Observability)
Richmond Hill, Ontario, Canada - Permanent
Job Description
About the Team
Our client’s platform engineering group operates with a Site Reliability Engineering (SRE) mindset, committed to delivering highly reliable, scalable, and performant systems across a public cloud infrastructure. The team specializes in enhancing system transparency, enabling deep diagnostics, and ensuring seamless collaboration between development and operations. Shared ownership, proactive problem-solving, and continuous improvement are at the core of everything they do.
⸻
The Opportunity
As a Site Reliability Engineer, you will be responsible for the design, development, deployment, management, and support of public cloud infrastructure. You should have experience designing highly available, fault-tolerant, cloud-native enterprise solutions, along with some background in development and familiarity with Kubernetes. The role requires experience working with development teams throughout the full development lifecycle to produce reliable and secure production infrastructure, and operating across the multiple environments of the SDLC.
What You’ll Be Doing
● Lead incident response and perform Root Cause Analysis (RCA) to prevent recurrence and improve system resilience.
● Define, build, and maintain robust observability solutions (monitoring, metrics, logging, and alerting) for infrastructure and applications.
● Design and develop operational tooling to automate repetitive tasks and improve system efficiency.
● Develop and maintain Infrastructure as Code (IaC) for Kubernetes cluster management and AWS resource provisioning.
● Maintain and evolve existing infrastructure and automation codebases (IaC).
● Interface with and support development teams to migrate on-premises solutions to the public cloud.
What You’ll Need to Succeed
Must-Haves
● Bachelor’s degree in Computer Science or related field.
● 5+ years of SRE, DevOps, or Cloud Engineering experience.
● Strong proficiency in Python for scripting and tooling, with additional experience in either Node.js or Java.
● Expert troubleshooting and analytical skills with a proven ability to conduct Root Cause Analysis (RCA).
● Hands-on experience with containerization (Docker) and orchestration (Kubernetes).
● Deep knowledge of Linux fundamentals, networking (TCP/IP), and core OS concepts.
● Experience with Infrastructure as Code (IaC) tools such as Terraform, SaltStack, or Ansible.
● Experience with AWS services (Compute, Storage, Networking).
● Proven experience with metric-based monitoring tools (e.g., Prometheus) and alerting systems.
● Proficiency with web servers such as Nginx, including a solid understanding of how web servers work.
● Ability to read, write, and debug production-level code to trace complex application flows.
Nice-to-Haves
● Experience with Elasticsearch and Application Performance Monitoring (APM) tools.
● Experience with ArgoCD and advanced CI/CD pipelines.
● Experience with large-scale, multi-region cloud projects.
● AWS Associate Certification or higher.
● Detailed knowledge of AWS services: EC2, S3, VPC, ELB/NLB/ALB, Lambda, and CloudWatch.
● Experience with Cloudflare or equivalent Content Delivery Network (CDN) solutions.
This is a full-time position. Days and hours of work are Monday through Friday, during normal business hours. The role also participates in an on-call rotation of 2 weeks as primary and 2 weeks as secondary, providing 24/7 support for the platform during those rotations. Typically, this amounts to 4 out of every 8 weeks on call.