Remote Site Reliability Engineer (AWS and Kubernetes Focused)
Job Summary
As a Site Reliability Engineer, you are primarily responsible for the reliability, availability, and scalability of our production and non-production environments. You work alongside development and infrastructure teams to make sure new and existing code releases are stable and available to our internal customers. You develop ongoing observability through tools such as Datadog. You provide front-line support to all environments and respond to incidents as needed.
Responsibilities
· Work with engineering teams to gain deep knowledge of how our applications work so you can provide support and observability.
· Develop and maintain monitoring, alerting, and logging through observability tools such as Datadog.
· Automate tasks to ensure systems can self-heal where possible.
· Evaluate currently running systems and recommend improvements for performance and reliability.
· Take part in capacity planning and scaling initiatives for new environments.
· Develop runbooks for the SRE team and other incident response teams.
· Participate in on-call rotation and respond to incidents for production environments.
· Work toward the long-term goal of developing SLIs and SLOs with engineering teams.
Qualifications
· 8–10 years of experience in software and infrastructure operations and support.
· Expertise in cloud platforms such as AWS, GCP, or Azure.
· Expertise in troubleshooting complex application and infrastructure issues with a focus on networking and messaging between services.
· Strong experience in modern software application and infrastructure performance monitoring and tuning.
· Strong experience with monitoring solutions such as Datadog or New Relic.
· Experience with application performance monitoring tools such as Datadog APM or similar.
· Experience working with containerized applications using Docker and running on Kubernetes or similar.
· Experience in automation scripting using Bash and/or Python.
· Experience with source control systems such as Git and hosted solutions like Bitbucket, GitHub, and/or GitLab for CI/CD.
· Experience in release management and making sure applications are stable post-release.
· Experience in Linux system administration.
· Experience with messaging systems such as AWS SQS, RabbitMQ, Pulsar, or Kafka.
· Familiarity with relational databases such as PostgreSQL and Microsoft SQL Server.
· Familiarity with transaction testing tools such as Datadog Synthetic Tests and RUM.
· Familiarity with SRE concepts such as SLIs, SLOs, SLAs, and error budgets.
· Excellent communication and collaboration skills in working with cross-functional teams.
· Ability to take ownership of a project or system and complete it or make improvements while providing extensive communication and documentation.
· Ability to work independently and handle multiple priorities.
· Proactive mindset and comfort with being the point person on critical tasks.