Manager Site Reliability & Infrastructure

Job Summary: 

The Internet Archive has over 30PB of unique digital information... all running across an integrated cluster of over 700 VMs on over 600 "bare-metal" hosts in 2 data centers. We are looking for a "hands-on" operations manager with proven experience managing and participating in a high-performance team of system administrators and technical operations staff. The ideal candidate will be looking to take on a "player-coach" role and will have demonstrated experience improving and maintaining the reliability, performance, and security of both internal and publicly facing infrastructure.

Key Responsibilities: 

  • Manage, contribute to, and mentor the technical team responsible for monitoring, maintaining, and restoring the health of all Internet Archive networks and online services. This includes all publicly facing services, the storage and compute cluster, and key internal services related to crawling, indexing, and access to archived web content.
  • Ensure service continuity and performance. Maintain and expand monitoring and reporting systems to communicate current and historical activity for multiple publicly facing services.
  • Analyze, implement, and manage effective improvements in the maintenance and operations processes and infrastructure.
  • Plan and coordinate the transition of new software systems and service applications from development into production. This includes establishing procedures and policies that ensure sustainable deployment, monitoring, upgrade, and expansion of services.
  • Assign, support, recruit, hire, schedule, and fire staff as needed to sustain operational objectives and efficiency.
  • Recommend the purchase of equipment needed to sustain responsive services and cost-effective operations.

Minimum Qualifications:

  • Experience managing large server cluster infrastructure
  • Experience as lead manager and mentor of a technical operations team
  • Ability to document and share critical knowledge with others
  • "Customer service" mentality: an advocate for the end-user experience of web-delivered services
  • Passion for automation, data-driven decision making, and information reporting
  • Fluency in Linux system administration
  • Creative problem solver
  • Passion for staying current with industry trends

Preferred Qualifications:

  • Experience with Kubernetes
  • Experience with Elastic stack
  • Experience deploying and maintaining Hadoop
  • Excellent oral/written communication and documentation skills
  • Flexibility and a sense of humor

Reporting Structure: The Manager of Site Reliability and Infrastructure reports to the Director of Engineering and works closely with the Head Librarian and Founder.

Location: Inner Richmond, San Francisco, CA and City of Richmond, CA; ON-SITE PRESENCE IN SF/RICHMOND IS REQUIRED! Remote employment is not available for this position.

Job Classification: Full-time, Exempt

Internet Archive is an Equal Opportunity Employer M/F/D/V and will consider for employment qualified applicants with criminal histories in a manner consistent with the requirements of the Fair Chance Ordinance.
