Launching smolcrawl - Simple Web Scraping for LLMs and Knowledge Bases

I'm excited to introduce SmolCrawl, a lightweight Python tool for web scraping and knowledge base creation. It extracts, organizes, and indexes web content for search, which makes it a good fit for developers, researchers, and anyone building a personal knowledge collection.

SmolCrawl turns web content into searchable, local knowledge. It crawls websites, extracts the relevant content, converts it to clean Markdown, and builds a fast search index. Key features include:

Simple Web Crawling: Crawl entire websites with a single command.
Intelligent Content Extraction: Extract meaningful content, filtering out irrelevant elements.
Clean Markdown Conversion: Convert HTML to readable Markdown format.
Fast Search Indexing: Search your content with Tantivy-based full-text search.
Flexible Output Options: Export to search indexes, Markdown files, or XML.
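
To make the extract, convert, and index steps above concrete, here is a minimal sketch using only the Python standard library. It is not SmolCrawl's code or API: the names (MarkdownExtractor, html_to_markdown, build_index, search), the list of skipped tags, and the sample pages are all hypothetical, and a plain in-memory inverted index stands in for Tantivy.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Assumption: these tags are treated as page chrome and dropped.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
HEADING_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### "}
BLOCK_TAGS = set(HEADING_PREFIX) | {"p", "li"}


class MarkdownExtractor(HTMLParser):
    """Collect headings, paragraphs, and list items as Markdown blocks."""

    def __init__(self):
        super().__init__()
        self.lines = []        # finished Markdown blocks
        self._buf = []         # text pieces of the block being read
        self._prefix = None    # Markdown prefix of the open block, or None
        self._skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS and self._skip_depth == 0:
            self._flush()
            self._prefix = HEADING_PREFIX.get(tag, "- " if tag == "li" else "")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        # Keep text only when it sits inside a kept block element.
        if self._skip_depth == 0 and self._prefix is not None:
            self._buf.append(data)

    def _flush(self):
        text = " ".join("".join(self._buf).split())  # collapse whitespace
        if text and self._prefix is not None:
            self.lines.append(self._prefix + text)
        self._buf, self._prefix = [], None


def html_to_markdown(html):
    """Convert an HTML string to a small subset of Markdown."""
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n\n".join(parser.lines)


def build_index(pages):
    """Map each lowercase token to the set of URLs whose Markdown contains it."""
    index = defaultdict(set)
    for url, markdown in pages.items():
        for token in re.findall(r"[a-z0-9]+", markdown.lower()):
            index[token].add(url)
    return index


def search(index, query):
    """Return the sorted URLs that contain every token of the query."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    if not tokens:
        return []
    return sorted(set.intersection(*(index.get(t, set()) for t in tokens)))


# Hypothetical pages standing in for crawled content.
SAMPLE_PAGES = {
    "https://example.com/a": (
        "<html><body><nav>Home | About</nav><h1>Crawling basics</h1>"
        "<p>Fetch pages and follow <a href='/b'>links</a>.</p>"
        "<script>track()</script></body></html>"
    ),
    "https://example.com/b": (
        "<html><body><h2>Search</h2><ul><li>Index the pages</li>"
        "<li>Query them</li></ul><footer>Copyright</footer></body></html>"
    ),
}

if __name__ == "__main__":
    markdown_pages = {u: html_to_markdown(h) for u, h in SAMPLE_PAGES.items()}
    index = build_index(markdown_pages)
    print(markdown_pages["https://example.com/a"])
    print(search(index, "index pages"))
```

A real crawler would also fetch pages over HTTP, follow links, and rank results; the sketch only shows how stripped-down content becomes Markdown and then becomes searchable.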

SmolCrawl empowers you to build your own mini search engines and knowledge bases with minimal setup.

To learn more, check out the launch blog post: SmolCrawl Launch Post

The project is also available on GitHub: SmolCrawl on GitHub