
VoPay International Inc. Job Board

Systems Reliability Engineer (SRE)

Description

The purpose of the position in relation to the company as a whole:
The Systems Reliability Engineer plays a critical role in ensuring the performance, availability, and security of our production environment. You will proactively identify and resolve complex issues, optimize our monitoring and alerting systems, contribute to managing our AWS infrastructure, and make changes that improve the security and reliability of our platform. This hands-on role is ideal for a technical problem-solver who is passionate about platform stability.

Critical qualifications:
  • Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent work experience.
  • 3+ years of experience in a production support or Site Reliability Engineering (SRE) role.
  • Strong expertise in cloud infrastructure, particularly AWS.
  • Demonstrated proficiency in a server-side programming language (Python, Java, PHP, Go, etc.).
  • Knowledge of Linux systems and networking concepts.
  • Awareness of and commitment to adhering to security best practices.
  • Excellent analytical and problem-solving skills.
Additional desirable qualifications:
  • Experience with Datadog or similar monitoring and alerting tools.
  • Familiarity with Infrastructure-as-Code tools (e.g., Terraform, CloudFormation).
  • Familiarity with compliance frameworks (e.g., SOC 2, PCI DSS).
Absolute minimum years of relevant experience required: 3 years
Does the role include supervisory responsibilities? No

Duties and responsibilities of the role:
Platform Stability:
  • Lead troubleshooting and resolution of complex production issues, often under tight time constraints.
  • Collaborate with cross-functional teams (development, security, product) to diagnose and address root causes of platform incidents.
  • Proactively manage capacity, tuning, and scaling initiatives to ensure optimal platform performance.
Monitoring & Alerting:
  • Maintain and enhance our Datadog monitoring and alerting systems to ensure critical issues are flagged promptly.
  • Develop and refine KPIs to track platform health, reliability, and user experience.
Security:
  • Assist in implementing and maintaining security controls and best practices across the platform, including access management, data encryption, and network segmentation.
  • Collaborate with the security team to implement incident response plans and participate in security incident investigations and resolution.
Infrastructure Management:
  • Actively contribute to the design, implementation, and maintenance of our AWS infrastructure.
  • Champion automation and infrastructure-as-code principles to improve efficiency and reliability.
Development:
  • Participate in development tasks relating to the security and stability of our platform, including bug fixes, feature improvements, and tool building.

Work environment:
Physical demands of the position:
  • While performing the duties of this job, the employee may be regularly required to stand, sit, talk, hear, reach, stoop, kneel, and use hands and fingers to operate a computer, telephone, and keyboard.

Compensation

$85,000.00 - $100,000.00 per year
