
Web Crawl Engineer

The Web Crawl Engineer works with our web crawl engineering team and is responsible for capturing and managing the highest quality content from the web. The ideal candidate is an independent, self-starting problem solver who works well autonomously and is technologically savvy. They are also open to being trained in, and helping advance, best practices and standards for large-scale web harvests and web data processing and engineering, and to contributing to the development of new harvesting, access, and analysis tools.

The position sits in the Web Archiving Group and supports web harvesting services and programs for partners ranging from national libraries and archives to collaborative international initiatives dedicated to the collection, preservation, and accessibility of web content. The role helps design the strategy and implementation of web archiving services using open-source technologies and platforms, and develops harvest techniques and tools that enable archival capture and re-rendering of rich media, streaming content, and social media, as well as traditional web page content. The position also creates tools, services, and workflows to improve crawl analysis, reporting, and data management and derivation, and identifies technical and operational requirements. This role contributes to defining deployment architectures and workflows, managing data at scale, and monitoring production systems.

Responsibilities and Duties

  • Run large-scale web harvests at global and national domain levels, as well as focused and specialized crawls, using Heritrix, our open-source crawler, and other open-source technologies developed internally, including Umbra, Brozzler, and warcprox.
  • Configure, monitor, and improve large-scale web crawls to ensure their quality and timely completion.
  • Process, analyze, and quality-assure archived web content to ensure it is complete and of the highest quality.
  • Contribute to the development of tools for automated analysis and reporting of crawl material, and to development projects focused on crawling, processing, and access.
  • Manage large ingests and exports of web data, derivatives, logs, and reports.
  • Deliver on commitments within deadlines and project timelines, working in a collaborative team of engineers and project/product managers.

Skills & Requirements

  • Experience in Unix shell scripting and Python coding required
  • Experience with web crawlers or scrapers, especially Heritrix
  • Solid experience with Internet protocols (HTTP is a must); strong knowledge of HTML, JavaScript, and web technologies in general
  • Ability to work in, and enjoy, a loosely structured work environment

Preferred Qualifications

  • Knowledge of building and deploying web applications, databases, web-host services, and knowledge of basic Linux system administration
  • Familiarity with Java build tools, including Maven, strongly preferred
  • Cluster computing experience, especially familiarity with Hadoop and related technologies and tools, particularly HDFS
  • Experience with system monitoring/administration tools
  • Experience with version control, open source practices, and code review
  • Experience with Atlassian tool sets (Jira and Confluence)
  • Experience with applications designed to display archived web content, especially server-side apps and Wayback
  • Flexibility and a sense of humor are a plus
  • Bachelor's Degree in Computer Science or a related field, or equivalent demonstrated experience in software development

Reporting Structure: The Web Crawl Engineer reports to the Web Archiving Engineering Manager and works closely with other departments. The position works alongside other web archiving engineers as well as program staff in the Web Archiving and Data Services Group, and with the broader Internet Archive infrastructure and engineering teams.

To Apply: Please send your resume and cover letter to jobs+crawlengineer@archive.org with the subject line "Web Crawl Engineer."

Internet Archive reserves the right to revise job descriptions or work hours as required.

Benefits & Perks

The Internet Archive provides a comprehensive benefits package including: PTO, paid holidays, medical, dental, vision, FSA, commuter benefits, STD, LTD, 403(b)/Roth accounts, and Friday lunches at IA HQ.

Internet Archive is an Equal Opportunity Employer M/F/D/V/L/G/B/T and will consider for employment qualified applicants with criminal histories in a manner consistent with the requirements of the Fair Chance Ordinance.

 
