fix: strip stray <p></p> HTML tags from imported Jupyter markdown cells#8674
fix: strip stray <p></p> HTML tags from imported Jupyter markdown cells#8674giulio-leone wants to merge 5 commits intomarimo-team:mainfrom
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
All contributors have signed the CLA ✍️ ✅ |
There was a problem hiding this comment.
Pull request overview
This PR updates the Jupyter (.ipynb) importer to normalize markdown cell sources by removing HTML paragraph wrappers (<p ...> / </p>) before converting them into mo.md(...), improving LaTeX rendering for notebooks that embed math inside <p> tags.
Changes:
- Add a
_strip_paragraph_tags()helper (and compiled regex) to remove<p>wrappers from markdown cell source. - Apply the stripping step during
convert_from_ipynb_to_notebook_ir()before callingmarkdown_to_marimo().
Comments suppressed due to low confidence (1)
marimo/_convert/ipynb/to_ir.py:1431
- This new import-time normalization changes markdown cell semantics, but there’s no targeted test asserting the intended behavior (strip
<p>wrappers, preserve other HTML, and avoid breaking code fences). Since the repo has an established ipynb importer test suite (e.g. tests/_convert/ipynb/test_ipynb_to_ir.py), please add a unit test covering representative<p>-wrapped markdown cells.
if is_markdown:
source = _strip_paragraph_tags(source)
cell_meta = cell.get("metadata", {})
md_prefix = cell_meta.get("marimo", {}).get(
"md_prefix", DEFAULT_MARKDOWN_PREFIX
)
source = markdown_to_marimo(source, prefix=md_prefix)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
marimo/_convert/ipynb/to_ir.py
Outdated
| _PARAGRAPH_TAG_RE = re.compile( | ||
| r"<p(?:\s[^>]*)?>|</p>", | ||
| re.IGNORECASE, | ||
| ) | ||
|
|
||
|
|
||
| def _strip_paragraph_tags(source: str) -> str: | ||
| """Remove bare ``<p>`` / ``</p>`` HTML tags from markdown source. | ||
|
|
||
| Jupyter markdown cells often wrap content in ``<p>…</p>`` tags which are | ||
| redundant in plain markdown and can break LaTeX rendering inside | ||
| ``mo.md()``. This helper removes them while preserving all other HTML | ||
| tags and the text content. | ||
| """ | ||
| return _PARAGRAPH_TAG_RE.sub("", source) | ||
|
|
||
|
|
marimo/_convert/ipynb/to_ir.py
Outdated
| ``mo.md()``. This helper removes them while preserving all other HTML | ||
| tags and the text content. | ||
| """ | ||
| return _PARAGRAPH_TAG_RE.sub("", source) | ||
|
|
||
|
|
There was a problem hiding this comment.
these are good suggestions and we can add test cases for each one
There was a problem hiding this comment.
Thanks @mscolnick! I'll add test cases for these edge cases — specifically:
- Preserving
<p>tags inside fenced code blocks - Handling adjacent paragraph tags (
<p>a</p><p>b</p>) with proper separation
Will push an update shortly.
mscolnick
left a comment
There was a problem hiding this comment.
thanks for the contribution! could we add some tests for this? could we also check that it does not strip <p style="color: red"> since that is a real change to output.
|
Added tests and addressed reviewer feedback: Improvements to
12 test cases added in
All tests verified locally. |
Jupyter notebooks often wrap paragraph text in <p>...</p> tags which Jupyter renders natively. When converting to marimo, these tags are kept verbatim inside mo.md() where they interfere with LaTeX rendering — LaTeX delimiters inside <p> elements are not processed by the markdown/math renderer. Add a _strip_paragraph_tags() helper that removes bare <p>/<p> tags (including those with attributes like <p class='...'>) before passing the markdown source to markdown_to_marimo(). Other HTML tags (<div>, <span>, <em>, <strong>, etc.) are preserved. Closes marimo-team#8651
Address reviewer suggestions from @mscolnick and @Copilot: - Skip <p>/<\/p> tags inside fenced code blocks (backtick and tilde) - Replace </p> with newline to preserve paragraph separation - Add 12 test cases covering: basic removal, attributes, case sensitivity, adjacent paragraph separation, fenced code block preservation (backtick and tilde), multiline, passthrough, empty string, LaTeX, nested HTML
for more information, see https://pre-commit.ci
a9f69ea to
e8185f0
Compare
Address review feedback from mscolnick: - Change regex to only match bare <p> (no attributes) - Styled tags like <p style="color: red"> are preserved - Pair-matching ensures </p> is only stripped when closing a bare <p> - Add tests for styled tag preservation, mixed bare+styled, adjacent styled
|
Updated the implementation based on the review feedback: Changes
Tests added
|
cf23997 to
f8f3e26
Compare
There was a problem hiding this comment.
Pull request overview
This PR improves the Jupyter notebook (.ipynb) importer by preprocessing markdown cell content to remove problematic HTML paragraph wrappers that interfere with downstream markdown/LaTeX rendering in mo.md().
Changes:
- Add
_strip_paragraph_tags()to remove bare<p>...</p>wrappers from imported markdown source (with fenced code block protection). - Apply paragraph-tag stripping during markdown cell conversion in
convert_from_ipynb_to_notebook_ir(). - Add unit tests covering paragraph stripping behavior and fenced-code-block preservation.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| marimo/_convert/ipynb/to_ir.py | Introduces _strip_paragraph_tags() and applies it to markdown cells during ipynb → IR conversion. |
| tests/_convert/ipynb/test_strip_paragraph_tags.py | Adds focused unit tests for paragraph tag stripping behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Clean up excessive blank lines introduced by replacements | ||
| source = re.sub(r"\n{3,}", "\n\n", source) | ||
| return source.strip() |
| Only bare ``<p>`` tags (without attributes) are removed. Styled tags such | ||
| as ``<p style="color: red">`` are preserved because they carry semantic | ||
| meaning. The matching ``</p>`` is only removed when it closes a bare | ||
| ``<p>``. |
| def test_preserves_styled_p_tag(self) -> None: | ||
| """Styled <p> tags carry semantic meaning and must be preserved.""" | ||
| source = '<p style="color: red">Red text</p>' | ||
| result = _strip_paragraph_tags(source) | ||
| assert '<p style="color: red">' in result | ||
| assert "</p>" in result | ||
|
|
||
| def test_preserves_p_with_class(self) -> None: | ||
| result = _strip_paragraph_tags('<p class="lead">Styled text</p>') | ||
| assert '<p class="lead">' in result | ||
| assert "</p>" in result | ||
|
|
||
| def test_preserves_p_with_id(self) -> None: | ||
| result = _strip_paragraph_tags('<p id="intro">Intro text</p>') | ||
| assert '<p id="intro">' in result | ||
|
|
dmadisetti
left a comment
There was a problem hiding this comment.
Preference to not use regex to strip HTML. I think we can use a markdown preprocessor to handle code blocks and the builtin HTMLParser accordingly.
But maybe a larger change, and this is fine
Replace hand-rolled _FENCED_BLOCK_RE regex with RE_NESTED_FENCE_START from pymdownx.superfences, which is already used elsewhere in the codebase (marimo/_convert/markdown/to_ir.py) and is a battle-tested dependency of marimo. As suggested by @dmadisetti in review. Signed-off-by: giulio-leone <giulio97.leone@gmail.com>
Summary
When importing Jupyter notebooks (e.g. SM_sphere_S2.ipynb), markdown cells containing
<p>…</p>HTML paragraph wrappers are kept verbatim insidemo.md(). This breaks LaTeX rendering because the markdown/math renderer does not process LaTeX delimiters that appear inside HTML<p>elements.Example
Jupyter markdown cell:
Before (broken in marimo): The
<p>tags prevent the LaTeX formula from rendering.After (this fix): Converted to plain text:
Fix
Added a
_strip_paragraph_tags()helper inmarimo/_convert/ipynb/to_ir.pythat removes bare<p>/</p>tags (including those with attributes like<p class="...">) from markdown cell source before passing it tomarkdown_to_marimo().Other HTML tags (
<div>,<span>,<em>,<strong>, etc.) are preserved — only paragraph wrappers are stripped since they are redundant in plain markdown.Closes #8651