web_crawl

baf.utils.web_crawl.crawl_website(initial_url, max_depth=2, max_pages=20, format='markdown', base_url_prefix=None)

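Breadth-first (BFS) crawler that starts at initial_url and follows links up to max_depth, crawling at most max_pages pages. If base_url_prefix is provided, only URLs starting with that prefix are collected.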

Parameters:

initial_url – str, URL where the crawl starts
max_depth – int, maximum link depth to follow (default: 2)
max_pages – int, maximum number of pages to crawl (default: 20)
format – str, output format, either 'html' or 'markdown' (default: 'markdown')
base_url_prefix – str, optional; if given, only URLs starting with this prefix are included (default: None)

Returns:

A dictionary mapping each crawled URL to its content in the requested format (html or markdown).
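
The interaction between max_depth, max_pages, and base_url_prefix is easiest to see on a toy example. The following is a minimal sketch of a breadth-first crawl, not the library's actual implementation: the function name bfs_crawl_sketch and the in-memory SITE mapping are hypothetical stand-ins for real HTTP fetching and link extraction, and the exact depth and page-limit semantics of crawl_website may differ.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (content, outgoing links).
# It stands in for real HTTP requests and HTML link extraction.
SITE = {
    "https://example.com/docs/": ("Docs home", ["https://example.com/docs/a", "https://example.com/blog/"]),
    "https://example.com/docs/a": ("Page A", ["https://example.com/docs/b", "https://example.com/docs/"]),
    "https://example.com/docs/b": ("Page B", ["https://example.com/docs/c"]),
    "https://example.com/docs/c": ("Page C", []),
    "https://example.com/blog/": ("Blog", []),
}


def bfs_crawl_sketch(initial_url, max_depth=2, max_pages=20, base_url_prefix=None, site=SITE):
    """Illustrative BFS crawl returning {url: content} (assumed semantics)."""
    results = {}
    seen = {initial_url}
    queue = deque([(initial_url, 0)])  # (url, depth) pairs, processed FIFO

    while queue and len(results) < max_pages:
        url, depth = queue.popleft()
        if url not in site:
            continue  # treat unknown URLs as failed fetches
        content, links = site[url]
        results[url] = content

        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit

        for link in links:
            if link in seen:
                continue
            # Assumption: the prefix filter applies to discovered links.
            if base_url_prefix and not link.startswith(base_url_prefix):
                continue
            seen.add(link)
            queue.append((link, depth + 1))

    return results


pages = bfs_crawl_sketch(
    "https://example.com/docs/",
    max_depth=2,
    base_url_prefix="https://example.com/docs/",
)
for url in pages:
    print(url)
```

In this sketch the blog page is skipped because of the prefix filter, and Page C is never reached because it sits three links away from the start. A call to the real function with the same arguments would take the form crawl_website(initial_url, max_depth=2, base_url_prefix=...), returning the dictionary described above.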

baf.utils.web_crawl.normalize_url(url)

Normalize a URL by removing fragments and normalizing trailing slashes.
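
As an illustration of this kind of normalization, here is a minimal sketch built on the standard library's urllib.parse. The function name normalize_url_sketch is hypothetical, and the trailing-slash rule shown (strip trailing slashes, but keep a single "/" for the root path) is an assumption; the library's own rule may differ.

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize_url_sketch(url):
    """Illustrative normalization: drop the fragment, unify trailing slashes."""
    # Remove the "#fragment" part, if any.
    url_no_fragment, _fragment = urldefrag(url)
    parts = urlsplit(url_no_fragment)

    # ASSUMPTION: trailing slashes are stripped, except that an empty
    # or root path is represented as "/".
    path = parts.path.rstrip("/") or "/"

    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


print(normalize_url_sketch("https://example.com/docs/#intro"))
print(normalize_url_sketch("https://example.com/docs"))
```

Under these assumptions both calls yield the same string, which is what lets a crawler treat the two spellings as a single page when tracking visited URLs.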