web_crawl

baf.utils.web_crawl.crawl_website(initial_url, max_depth=2, max_pages=20, format='markdown', base_url_prefix=None)

Breadth-first (BFS) crawler that starts at initial_url and collects pages. If base_url_prefix is provided, only URLs starting with that prefix are collected.

Parameters:
  • initial_url – str, starting point of crawl

  • max_depth – int, maximum link depth

  • max_pages – int, maximum number of pages to crawl

  • format – str, output format of the page content: 'html' or 'markdown' (default 'markdown')

  • base_url_prefix – str, optional, only URLs starting with this prefix are included

Returns:

A dictionary mapping each crawled URL to its content in the requested format (html or markdown).
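The breadth-first traversal described above can be sketched as follows. This is an illustrative approximation, not the library's implementation: `crawl_sketch` and its `get_page` callback are hypothetical names, and the callback stands in for the real HTTP fetching and HTML-to-markdown conversion so that the example runs offline. How the real function counts depth and pages at the limits is an assumption here.

```python
from collections import deque


def crawl_sketch(initial_url, get_page, max_depth=2, max_pages=20,
                 base_url_prefix=None):
    """Hypothetical BFS crawl; get_page(url) returns (content, links)."""
    results = {}                       # crawled URL -> content
    seen = {initial_url}               # URLs already queued, to avoid revisits
    queue = deque([(initial_url, 0)])  # FIFO queue gives breadth-first order

    while queue and len(results) < max_pages:
        url, depth = queue.popleft()
        content, links = get_page(url)
        results[url] = content

        # Assumption: links found on a page at max_depth are not followed.
        if depth >= max_depth:
            continue

        for link in links:
            if link in seen:
                continue
            # Skip links outside the prefix, if a prefix was given.
            if base_url_prefix and not link.startswith(base_url_prefix):
                continue
            seen.add(link)
            queue.append((link, depth + 1))

    return results


# A tiny in-memory "site" used instead of real network requests.
SITE = {
    "https://example.com/docs/": (
        "index",
        ["https://example.com/docs/a", "https://example.com/blog/x"],
    ),
    "https://example.com/docs/a": (
        "page a",
        ["https://example.com/docs/b", "https://example.com/docs/"],
    ),
    "https://example.com/docs/b": ("page b", []),
    "https://example.com/blog/x": ("blog", []),
}

pages = crawl_sketch(
    "https://example.com/docs/",
    lambda url: SITE[url],
    max_depth=1,
    base_url_prefix="https://example.com/docs/",
)
```

With the documented signature, the equivalent call to the real function would presumably be crawl_website("https://example.com/docs/", max_depth=1, base_url_prefix="https://example.com/docs/"), returning a dictionary of the same shape.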

baf.utils.web_crawl.normalize_url(url)

Normalize a URL by removing fragments and normalizing trailing slashes.

Parameters:

  • url – URL to normalize.

Returns:

The normalized URL string.
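A minimal sketch of this kind of normalization, using only the standard library, might look like the following. The function name `normalize_url_sketch` is hypothetical, and the trailing-slash rule is an assumption: the documentation does not say whether slashes are stripped or added, so this version strips them from non-root paths and keeps a single "/" for the root.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url_sketch(url):
    """Hypothetical stand-in for normalize_url; not the library's code."""
    parts = urlsplit(url)
    # Assumption: strip trailing slashes, but keep "/" for an empty path.
    path = parts.path.rstrip("/") or "/"
    # Rebuild the URL with an empty fragment, which drops any "#..." part.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


a = normalize_url_sketch("https://example.com/docs/#intro")
b = normalize_url_sketch("https://example.com/docs")
```

Under these assumptions, `a` and `b` are the same string, so a crawler that tracks visited pages by normalized URL would treat both forms as one page.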