fix: strip stray <p></p> HTML tags from imported Jupyter markdown cells #8674

Open
giulio-leone wants to merge 5 commits into marimo-team:main from giulio-leone:fix/ipynb-strip-html-paragraph-tags

Conversation

@giulio-leone
Contributor

Summary

When importing Jupyter notebooks (e.g. SM_sphere_S2.ipynb), markdown cells containing <p>…</p> HTML paragraph wrappers are kept verbatim inside mo.md(). This breaks LaTeX rendering because the markdown/math renderer does not process LaTeX delimiters that appear inside HTML <p> elements.

Example

Jupyter markdown cell:

<p>We declare that $\mathbb{S}^2 = U \cup V$:</p>

Before (broken in marimo): The <p> tags prevent the LaTeX formula from rendering.

After (this fix): Converted to plain text:

We declare that $\mathbb{S}^2 = U \cup V$:

Fix

Added a _strip_paragraph_tags() helper in marimo/_convert/ipynb/to_ir.py that removes bare <p>/</p> tags (including those with attributes like <p class="...">) from markdown cell source before passing it to markdown_to_marimo().

Other HTML tags (<div>, <span>, <em>, <strong>, etc.) are preserved — only paragraph wrappers are stripped since they are redundant in plain markdown.
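As a rough illustration of the behavior described above, the sketch below reuses the regex from the first revision of this PR (shown in the review diff further down). The function name `strip_paragraph_tags_v1` is a stand-in chosen for this example, not a name from the codebase.

```python
import re

# Regex from the PR's first revision: matches <p>, <p ...attrs>, and </p>.
PARAGRAPH_TAG_RE = re.compile(r"<p(?:\s[^>]*)?>|</p>", re.IGNORECASE)


def strip_paragraph_tags_v1(source: str) -> str:
    """Remove every <p ...> and </p> tag; other HTML is left untouched."""
    return PARAGRAPH_TAG_RE.sub("", source)


cell = r"<p>We declare that $\mathbb{S}^2 = U \cup V$:</p>"
print(strip_paragraph_tags_v1(cell))
# → We declare that $\mathbb{S}^2 = U \cup V$:

# Tags that merely start with "p" (such as <pre>) do not match the pattern.
print(strip_paragraph_tags_v1("<pre>x</pre> <em>y</em>"))
# → <pre>x</pre> <em>y</em>
```

Note that with this version `<p>a</p><p>b</p>` collapses to `ab`, and tags inside fenced code blocks are stripped as well. Both points come up in the review that follows.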

Closes #8651

Copilot AI review requested due to automatic review settings March 13, 2026 05:36
@vercel
vercel bot commented Mar 13, 2026

Project Deployment Actions Updated (UTC)
marimo-docs Ready Ready Preview, Comment Mar 16, 2026 5:55pm

@github-actions

github-actions bot commented Mar 13, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

Contributor

Copilot AI left a comment

Pull request overview

This PR updates the Jupyter (.ipynb) importer to normalize markdown cell sources by removing HTML paragraph wrappers (<p ...> / </p>) before converting them into mo.md(...), improving LaTeX rendering for notebooks that embed math inside <p> tags.

Changes:

  • Add a _strip_paragraph_tags() helper (and compiled regex) to remove <p> wrappers from markdown cell source.
  • Apply the stripping step during convert_from_ipynb_to_notebook_ir() before calling markdown_to_marimo().
Comments suppressed due to low confidence (1)

marimo/_convert/ipynb/to_ir.py:1431

  • This new import-time normalization changes markdown cell semantics, but there’s no targeted test asserting the intended behavior (strip <p> wrappers, preserve other HTML, and avoid breaking code fences). Since the repo has an established ipynb importer test suite (e.g. tests/_convert/ipynb/test_ipynb_to_ir.py), please add a unit test covering representative <p>-wrapped markdown cells.
        if is_markdown:
            source = _strip_paragraph_tags(source)
            cell_meta = cell.get("metadata", {})
            md_prefix = cell_meta.get("marimo", {}).get(
                "md_prefix", DEFAULT_MARKDOWN_PREFIX
            )
            source = markdown_to_marimo(source, prefix=md_prefix)


Comment on lines +45 to +61
_PARAGRAPH_TAG_RE = re.compile(
    r"<p(?:\s[^>]*)?>|</p>",
    re.IGNORECASE,
)


def _strip_paragraph_tags(source: str) -> str:
    """Remove bare ``<p>`` / ``</p>`` HTML tags from markdown source.

    Jupyter markdown cells often wrap content in ``<p>…</p>`` tags which are
    redundant in plain markdown and can break LaTeX rendering inside
    ``mo.md()``. This helper removes them while preserving all other HTML
    tags and the text content.
    """
    return _PARAGRAPH_TAG_RE.sub("", source)


Comment on lines +56 to +61
    ``mo.md()``. This helper removes them while preserving all other HTML
    tags and the text content.
    """
    return _PARAGRAPH_TAG_RE.sub("", source)


Contributor

these are good suggestions and we can add test cases for each one

Contributor Author

Thanks @mscolnick! I'll add test cases for these edge cases — specifically:

  1. Preserving <p> tags inside fenced code blocks
  2. Handling adjacent paragraph tags (<p>a</p><p>b</p>) with proper separation

Will push an update shortly.

Contributor

@mscolnick mscolnick left a comment

thanks for the contribution! could we add some tests for this? could we also check that it does not strip <p style="color: red"> since that is a real change to output.

@giulio-leone
Contributor Author

Added tests and addressed reviewer feedback:

Improvements to _strip_paragraph_tags:

  • Now skips <p>/</p> tags inside fenced code blocks (both backtick and tilde fences), so HTML examples in documentation/code are preserved
  • Replaces closing </p> tags with a newline instead of removing them, preventing adjacent paragraphs from collapsing (e.g. <p>a</p><p>b</p> no longer becomes ab)
  • Cleans up excessive blank lines introduced by replacements
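The three improvements above can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's actual code: the pattern names, the function name `strip_paragraph_tags_v2`, and the line-by-line fence tracking are all assumptions made for this example.

```python
import re

# Hypothetical patterns for this sketch; the PR's actual regexes may differ.
FENCE_RE = re.compile(r"^\s*(`{3,}|~{3,})")
OPEN_P_RE = re.compile(r"<p(?:\s[^>]*)?>", re.IGNORECASE)
CLOSE_P_RE = re.compile(r"</p>", re.IGNORECASE)


def strip_paragraph_tags_v2(source: str) -> str:
    out = []
    fence = None  # marker (e.g. "```") of the fenced block we are inside, if any
    for line in source.split("\n"):
        m = FENCE_RE.match(line)
        if fence is None:
            if m:
                fence = m.group(1)  # opening fence: keep the line untouched
            else:
                line = OPEN_P_RE.sub("", line)  # drop opening tags
                line = CLOSE_P_RE.sub("\n", line)  # closing tag -> newline
        elif (
            m
            and m.group(1)[0] == fence[0]
            and len(m.group(1)) >= len(fence)
            and line.strip() == m.group(1)
        ):
            fence = None  # closing fence: same character, at least as long
        out.append(line)
    # Simplification: this cleanup also collapses blank lines inside fences.
    text = re.sub(r"\n{3,}", "\n\n", "\n".join(out))
    return text.strip()
```

With this sketch, `<p>a</p><p>b</p>` becomes `a` and `b` on separate lines, while a `<p>` inside a backtick or tilde fence is left as written.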

12 test cases added in tests/_convert/ipynb/test_strip_paragraph_tags.py:

  • Basic tag removal
  • Tags with attributes (<p class="...">)
  • Case-insensitive matching
  • Adjacent paragraph separation
  • Fenced code block preservation (backtick and tilde)
  • Multiline paragraphs
  • No-tags passthrough
  • Empty string
  • LaTeX content preservation
  • Nested HTML preservation

All tests verified locally.

Jupyter notebooks often wrap paragraph text in <p>...</p> tags which
Jupyter renders natively. When converting to marimo, these tags are
kept verbatim inside mo.md() where they interfere with LaTeX rendering
— LaTeX delimiters inside <p> elements are not processed by the
markdown/math renderer.

Add a _strip_paragraph_tags() helper that removes bare <p>/</p> tags
(including those with attributes like <p class='...'>) before passing
the markdown source to markdown_to_marimo(). Other HTML tags (<div>,
<span>, <em>, <strong>, etc.) are preserved.

Closes marimo-team#8651

Address reviewer suggestions from @mscolnick and @Copilot:
- Skip <p>/</p> tags inside fenced code blocks (backtick and tilde)
- Replace </p> with newline to preserve paragraph separation
- Add 12 test cases covering: basic removal, attributes, case sensitivity,
  adjacent paragraph separation, fenced code block preservation (backtick
  and tilde), multiline, passthrough, empty string, LaTeX, nested HTML

Address review feedback from mscolnick:
- Change regex to only match bare <p> (no attributes)
- Styled tags like <p style="color: red"> are preserved
- Pair-matching ensures </p> is only stripped when closing a bare <p>
- Add tests for styled tag preservation, mixed bare+styled, adjacent styled
@giulio-leone
Contributor Author

Updated the implementation based on the review feedback:

Changes

  • Styled <p> tags are now preserved — only bare <p> (no attributes) are stripped. Tags like <p style="color: red"> or <p class="lead"> remain intact since they carry semantic meaning.
  • Stack-based pair matching — each </p> is matched to its corresponding <p> opener. Only pairs where the opener is bare get stripped.
  • Fenced code blocks still protected: <p> tags inside ``` or ~~~ blocks are never touched.
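A minimal sketch of the stack-based pair matching described above, assuming a single regex that finds every `<p ...>` opener and `</p>` closer. The names `P_TAG_RE` and `strip_bare_paragraph_pairs` are hypothetical, and fenced-code-block handling is left out to keep the example short.

```python
import re

# Any <p ...> opener or </p> closer; "attrs" is set only when the opener
# has something after the tag name.
P_TAG_RE = re.compile(r"<p(?P<attrs>\s[^>]*)?>|</p>", re.IGNORECASE)


def strip_bare_paragraph_pairs(text: str) -> str:
    """Strip <p>...</p> pairs whose opener carries no attributes."""
    out = []
    stack = []  # one entry per open <p>: True if bare, False if styled
    pos = 0
    for m in P_TAG_RE.finditer(text):
        out.append(text[pos:m.start()])  # copy text between tags verbatim
        pos = m.end()
        tag = m.group(0)
        if tag.startswith("</"):
            # An unmatched </p> (empty stack) is kept as-is.
            opener_was_bare = stack.pop() if stack else False
            out.append("\n" if opener_was_bare else tag)
        else:
            bare = not (m.group("attrs") or "").strip()
            stack.append(bare)
            if not bare:
                out.append(tag)  # styled opener is preserved
    out.append(text[pos:])
    return "".join(out)


print(strip_bare_paragraph_pairs('<p>a</p><p style="color: red">b</p>'))
```

One limitation of this regex-based sketch: an attribute value containing `>` would end the opener match early, which is part of the motivation for the parser-based alternative raised later in the review.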

Tests added

  • test_preserves_styled_p_tag: <p style="color: red"> stays
  • test_preserves_p_with_class: <p class="lead"> stays
  • test_preserves_p_with_id: <p id="intro"> stays
  • test_mixed_bare_and_styled: bare stripped, styled preserved in same cell
  • test_adjacent_styled_paragraphs_fully_preserved: adjacent styled pairs intact
  • All existing tests updated and passing

Contributor

Copilot AI left a comment

Pull request overview

This PR improves the Jupyter notebook (.ipynb) importer by preprocessing markdown cell content to remove problematic HTML paragraph wrappers that interfere with downstream markdown/LaTeX rendering in mo.md().

Changes:

  • Add _strip_paragraph_tags() to remove bare <p>...</p> wrappers from imported markdown source (with fenced code block protection).
  • Apply paragraph-tag stripping during markdown cell conversion in convert_from_ipynb_to_notebook_ir().
  • Add unit tests covering paragraph stripping behavior and fenced-code-block preservation.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| marimo/_convert/ipynb/to_ir.py | Introduces _strip_paragraph_tags() and applies it to markdown cells during ipynb → IR conversion. |
| tests/_convert/ipynb/test_strip_paragraph_tags.py | Adds focused unit tests for paragraph tag stripping behavior. |


Comment on lines +117 to +119
    # Clean up excessive blank lines introduced by replacements
    source = re.sub(r"\n{3,}", "\n\n", source)
    return source.strip()
Comment on lines +61 to +64
    Only bare ``<p>`` tags (without attributes) are removed. Styled tags such
    as ``<p style="color: red">`` are preserved because they carry semantic
    meaning. The matching ``</p>`` is only removed when it closes a bare
    ``<p>``.
Comment on lines +21 to +36
    def test_preserves_styled_p_tag(self) -> None:
        """Styled <p> tags carry semantic meaning and must be preserved."""
        source = '<p style="color: red">Red text</p>'
        result = _strip_paragraph_tags(source)
        assert '<p style="color: red">' in result
        assert "</p>" in result

    def test_preserves_p_with_class(self) -> None:
        result = _strip_paragraph_tags('<p class="lead">Styled text</p>')
        assert '<p class="lead">' in result
        assert "</p>" in result

    def test_preserves_p_with_id(self) -> None:
        result = _strip_paragraph_tags('<p id="intro">Intro text</p>')
        assert '<p id="intro">' in result

Collaborator

@dmadisetti dmadisetti left a comment

Preference to not use regex to strip HTML. I think we can use a markdown preprocessor to handle code blocks and the builtin HTMLParser accordingly.

But maybe a larger change, and this is fine
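For comparison, here is one way the HTMLParser idea might look, using Python's standard `html.parser.HTMLParser`. This is only a sketch of the suggestion, not code from the PR: the class and function names are hypothetical, fenced code blocks are not handled (the comment above proposes a markdown preprocessor for that), and declarations and processing instructions are dropped for brevity.

```python
from html.parser import HTMLParser


class BarePStripper(HTMLParser):
    """Re-emit the input, omitting <p>...</p> pairs whose opener is bare."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=False)  # keep entities as written
        self.out: list[str] = []
        self.stack: list[bool] = []  # True for each open bare <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            bare = not attrs
            self.stack.append(bare)
            if bare:
                return  # drop the bare opener
        self.out.append(self.get_starttag_text())  # raw tag text

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag == "p" and self.stack and self.stack.pop():
            self.out.append("\n")  # closer of a bare <p> becomes a newline
        else:
            self.out.append(f"</{tag}>")  # note: tag name is lowercased

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def strip_bare_p_with_parser(source: str) -> str:
    parser = BarePStripper()
    parser.feed(source)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)
```

The parser takes care of tag and attribute tokenizing, so the bare-versus-styled decision reduces to checking whether `attrs` is empty.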

Replace hand-rolled _FENCED_BLOCK_RE regex with RE_NESTED_FENCE_START
from pymdownx.superfences, which is already used elsewhere in the
codebase (marimo/_convert/markdown/to_ir.py) and is a battle-tested
dependency of marimo.

As suggested by @dmadisetti in review.

Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Labels

bug Something isn't working

Development

Successfully merging this pull request may close these issues.

Jupyter notebook importer leaves stray <p> </p> HTML tags

4 participants