What advantage does this have over normal algorithmic ways of turning HTML to Markdown?

#5
by MohamedRashad - opened

I don't understand why I would use this instead of going directly to a simple tool that converts my HTML to Markdown. What advantages will I see here?

Jina AI org

I hope this post answers your question: https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-html-to-markdown-and-json

TL;DR: the structure of the HTML is preserved well, and the model excels at generating complex elements like code fences, nested lists, tables, and LaTeX equations.
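For a quick local test, a rough sketch like the following should work with transformers (the prompt wording and generation settings here are simplified assumptions, not the official snippet; see the model card for the exact recommended usage):

```python
# Rough sketch: HTML -> Markdown with ReaderLM-v2 via Hugging Face transformers.
# The prompt wording is a simplified assumption; check the model card for details.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jinaai/ReaderLM-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

html = "<html><body><h1>Hello</h1><ul><li>a</li><li>b</li></ul></body></html>"
messages = [{"role": "user",
             "content": f"Convert the following HTML to Markdown:\n\n{html}"}]

# Build the chat prompt and generate deterministically.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```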

I think it's a great model to use in the future. I understand that for now the algorithmic way of extracting HTML wins, but I think they are demonstrating what an LLM could do without the algorithm.

I liked the model. Do you plan to release the HTML-to-Markdown and HTML-to-JSON dataset?

Thank you very much.

I also do not see the benefit of such a model over a simple hand-coded algorithm. Most HTML data sources require navigating and clicking on boxes, forms, and buttons to generate useful content, which this model does not help with in any way. Also, the license is bad.

A good model would have the intelligence to navigate a website to get the information it was asked for, then call a tool to convert it to markdown/JSON, or generate code in a targeted language (typically not Python) to execute the extraction end-to-end.

Jina AI org

Totally agree, @hrstoyanov, thanks for your feedback. This model doesn't address those concerns. But keep in mind that this work is aimed at being a better tool for converting static HTML to markdown/JSON, and a good conversion tool is always needed.

I kicked the tires on this model using the demo code provided. It works pretty well for a site like Hacker News, as the API demo shows. However, as a user of local models, I decided not to use it in my agent pipeline for a few reasons:

  1. It takes ~12GB of VRAM depending on how much context. Not a big deal, but I would have to unload my llama.cpp model while running it. Not horrible because I have plenty of disk cache, but a pain to orchestrate.
  2. It seems a bit slow? I noticed it only uses ~250W out of my 3090 Ti FE's 450W power limit. It pegs the GPU to 100% but seems inefficient compared to llama.cpp, which uses all 450W at 100%. It can take a few minutes to convert a typical website's raw HTML, even after stripping a lot of junk tags (like <script>, etc.).
  3. It can still hallucinate inaccurate information. Not a problem specific to this model; it's true of all LLMs.

So you could either use the paid API option if you want something working fast, or do what I do for now:

  1. Use a good general-purpose CoT model like DeepSeek-R1-Distill-Qwen-32B-Q4_K_M with a 16k cache and keep it loaded in ~24GB VRAM.
  2. Use good old Beautiful Soup to strip most of the tags while preserving links as markdown, then flatten with soup.get_text(separator=" ", strip=True) (see the sketch after this list).
  3. Pass that text to the LLM and ask it to build a markdown representation of the info.
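Here is a rough sketch of steps 2 and 3; the local endpoint (llama.cpp's llama-server OpenAI-compatible API on port 8080) and the prompt wording are just my setup, so adjust to taste:

```python
# Sketch of the BeautifulSoup pre-scrub plus a call to a local llama.cpp server.
# Endpoint URL and prompt are assumptions from my own setup.
import requests
from bs4 import BeautifulSoup

def html_to_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop junk tags that only burn context window.
    for tag in soup(["script", "style", "noscript", "iframe", "svg"]):
        tag.decompose()
    # Preserve links by rewriting <a> tags as inline markdown before flattening.
    for a in soup.find_all("a", href=True):
        a.replace_with(f"[{a.get_text(strip=True)}]({a['href']})")
    return soup.get_text(separator=" ", strip=True)

def text_to_markdown(text: str) -> str:
    prompt = ("Build a clean markdown representation of the information below.\n\n"
              + text)
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",  # llama.cpp llama-server
        json={"messages": [{"role": "user", "content": prompt}],
              "temperature": 0.2},
        timeout=600,
    )
    return resp.json()["choices"][0]["message"]["content"]
```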

Here is an example output from my quick test: https://gist.github.com/ubergarm/054dd60924ed5f86649ca8603ff8e49b

Perhaps a bigger model, some optimizations, and more pre-scrubbing would work better?

Anyway, cheers and thanks for the models!
