
preprocess

HTML pre-processing.

Functions:

  • autoclean

    Auto-clean the soup by removing elements.

  • preprocess

    Pre-process HTML with user-defined functions.

autoclean

autoclean(soup: BeautifulSoup) -> None

Auto-clean the soup by removing elements.

Parameters:

  • soup (BeautifulSoup) –

    The soup to modify.

Source code in src/mkdocs_llmstxt/preprocess.py
def autoclean(soup: Soup) -> None:
    """Auto-clean the soup by removing elements.

    Parameters:
        soup: The soup to modify.
    """
    # Remove unwanted elements.
    for element in soup.find_all(_to_remove):
        element.decompose()

    # Unwrap autoref elements.
    for element in soup.find_all("autoref"):
        element.replace_with(NavigableString(element.get_text()))

    # Unwrap mkdocstrings div.doc-md-description.
    for element in soup.find_all("div", attrs={"class": "doc-md-description"}):
        element.replace_with(NavigableString(element.get_text().strip()))

    # Remove mkdocstrings labels.
    for element in soup.find_all("span", attrs={"class": "doc-labels"}):
        element.decompose()

    # Remove line numbers from code blocks.
    for element in soup.find_all("table", attrs={"class": "highlighttable"}):
        element.replace_with(Soup(f"<pre>{element.find('code').get_text()}</pre>", "html.parser"))
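
The unwrapping and line-number removal steps above can be tried on a small document. The following is a minimal sketch, not the plugin's own code: the sample HTML is invented for illustration, and the replacement `<pre>` is built with `soup.new_tag` rather than by re-parsing a string, which is a choice of this sketch.

```python
from bs4 import BeautifulSoup, NavigableString

# Invented sample input: one autoref element and one highlighted code table.
html = """<p>See <autoref identifier="pkg.func">pkg.func</autoref>.</p>
<table class="highlighttable"><tr>
<td class="linenos"><pre>1
2</pre></td>
<td class="code"><pre><code>print("a")
print("b")</code></pre></td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")

# Unwrap autoref elements: keep only their text.
for element in soup.find_all("autoref"):
    element.replace_with(NavigableString(element.get_text()))

# Replace each line-numbered table with a plain <pre> holding the code text.
for element in soup.find_all("table", attrs={"class": "highlighttable"}):
    pre = soup.new_tag("pre")
    pre.string = element.find("code").get_text()
    element.replace_with(pre)

print(soup)
```

After both loops, the soup no longer contains any `autoref` or `table` element, and the line-number column is gone along with the table.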

preprocess

preprocess(
    soup: BeautifulSoup, module_path: str, output: str
) -> None

Pre-process HTML with user-defined functions.

Parameters:

  • soup (BeautifulSoup) –

    The HTML (soup) to process before conversion to Markdown.

  • module_path (str) –

    The path of a Python module containing a preprocess function. The function is called with two positional arguments, soup and output. The soup argument is an instance of bs4.BeautifulSoup, and output is the output path of the relevant Markdown file.

  • output (str) –

    The output path of the relevant Markdown file.

Returns:

  • None

    Nothing is returned. The soup is passed to the user-defined function, which modifies it in place.
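
As an illustration of what a user-defined module might contain, here is a minimal sketch of a hook. The source code on this page calls `module.preprocess(soup, output)`, so the sketch takes both arguments. The choice to remove `<nav>` elements and the sample HTML are hypothetical, picked only to show an in-place modification.

```python
from bs4 import BeautifulSoup


def preprocess(soup: BeautifulSoup, output: str) -> None:
    """Hypothetical hook: drop every <nav> element from the page."""
    # `output` is available for per-file decisions; this sketch ignores it.
    for element in soup.find_all("nav"):
        element.decompose()


# Quick check outside of MkDocs, using invented HTML.
soup = BeautifulSoup("<nav>menu</nav><p>Body</p>", "html.parser")
preprocess(soup, "index.md")
print(soup)
```

The hook returns nothing. Any change it makes to `soup` is what the later conversion to Markdown sees.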

Source code in src/mkdocs_llmstxt/preprocess.py
def preprocess(soup: Soup, module_path: str, output: str) -> None:
    """Pre-process HTML with user-defined functions.

    Parameters:
        soup: The HTML (soup) to process before conversion to Markdown.
        module_path: The path of a Python module containing a `preprocess` function.
            The function is called with two positional arguments, `soup` and `output`.
            The `soup` argument is an instance of [`bs4.BeautifulSoup`][].
        output: The output path of the relevant Markdown file.

    Returns:
        None. The soup is modified in place by the user-defined function.
    """
    try:
        module = _load_module(module_path)
    except Exception as error:
        raise PluginError(f"Could not load module: {error}") from error
    try:
        module.preprocess(soup, output)
    except Exception as error:
        raise PluginError(f"Could not pre-process HTML: {error}") from error
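
The load-then-call flow above can be sketched with the standard library alone. The implementation of `_load_module` is not shown on this page, so `load_module_from_path` below is an assumption built on `importlib.util`, not the plugin's loader. The hook source and the dictionary standing in for the soup are likewise hypothetical, used only to keep the sketch free of third-party dependencies.

```python
import importlib.util
import tempfile
from pathlib import Path

# Hypothetical user module: records which output file it was called for.
HOOK_SOURCE = '''
def preprocess(soup, output):
    soup["processed_for"] = output
'''


def load_module_from_path(path: str):
    """Assumed loader: import a Python file from an arbitrary path."""
    spec = importlib.util.spec_from_file_location("user_preprocess", path)
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load module from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    hook_path = Path(tmp) / "hooks.py"
    hook_path.write_text(HOOK_SOURCE)

    module = load_module_from_path(str(hook_path))
    fake_soup = {}  # stand-in for a BeautifulSoup object in this sketch
    module.preprocess(fake_soup, "docs/index.md")

print(fake_soup)
```

In the function shown above, loading and calling are wrapped in separate `try` blocks, so a failure to import the module and a failure inside the user's function raise `PluginError` with different messages.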