MarkItDown utility and LLMs are great match

Microsoft released in the mid of December 2024 an open-source utility called MarkItDown. This utility converts files like PDF, JPG, JSON, XML, HTML, ZIP, and Office documents (PowerPoint, Word, Excel) into Markdown format. You can find a full list of supported file formats from here.

Markdown itself is a lightweight markup language for creating formatted plain-text documents. It's a standardized, widely used, simple, and clean format. Markdown has been used for years, especially in various note-taking tools such as Obsidian, Microsoft Loop, etc.

Why is this relevant right now?

As you know, Generative AI and LLM/RAG applications have been top of everyone's mind during the year. In the previous blog post, we learned that RAG is an effective method for enhancing a language model's output by combining it with relevant data sources.

Markdown format-based files are excellent data sources, particularly for LLM (Large Language Model) and RAG (Retrieval-Augmented Generation) applications. This is because LLMs can easily understand and utilize Markdown structures, such as headings and lists, during tasks like summarization. Additionally, formatting elements like bold and italics provide semantic cues that indicate the importance and type of the text.

Markdown is a suitable option also from a processing perspective because the format is simple and lightweight.

Dhaval Nagar has written a good article about why LLMs love structure:

LLMs Love Structure: Using Markdown for Better PDF Analysis - APPGAMBiT
There’s a strong case for using Markdown text over plain text when working with LLMs and structured data extracted from PDFs. Data formatted with Markdown provides structure and context which is generally lost with plain text formatting. For Retrieval-Augmented Generation (RAG) based applications, data available in markdown text format holds significant importance.

How to use this locally?

I'll show how to use this tool in VS Code using Jupyter notebooks.

Prerequisites

Jupyter Notebook

  • Create a new Code cell and include the following command which installs the MarkItDown library.
pip install markitdown

Test with Office files

  • Create a new Code cell and include the following Python function
from markitdown import MarkItDown

def ConvertToMarkdown(fileName):
  md = MarkItDown()
  result = md.convert(fileName)
  print(result.text_content)
  • Create a new Code cell and call the function with a filename parameter.
ConvertToMarkdown("file_example_PPT_250kB.pptx")
💡
You can download various test files from here.

I used this file in this test. After execution all content and notes are converted to Markdown.

<!-- Slide number: 1 -->
# Lorem ipsum
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio. Vestibulum neque massa, scelerisque sit amet ligula eu, congue molestie mi. Praesent ut varius sem. Nullam at porttitor arcu, nec lacinia nisi. Ut ac dolor vitae odio interdum condimentum. Vivamus dapibus sodales ex, vitae malesuada ipsum cursus convallis. Maecenas sed egestas nulla, ac condimentum orci. Mauris diam felis, vulputate ac suscipit et, iaculis non est. Curabitur semper arcu ac ligula semper, nec luctus nisl blandit. Integer lacinia ante ac libero lobortis imperdiet. Nullam mollis convallis ipsum, ac accumsan nunc vehicula vitae. Nulla eget justo in felis tristique fringilla. Morbi sit amet tortor quis risus auctor condimentum. Morbi in ullamcorper elit. Nulla iaculis tellus sit amet mauris tempus fringilla.

Maecenas mauris lectus, lobortis et purus mattis, blandit dictum tellus. Maecenas non lorem quis tellus placerat varius. Nulla facilisi. Aenean congue fringilla justo ut aliquam. Mauris id ex erat. Nunc vulputate neque vitae justo facilisis, non condimentum ante sagittis. Morbi viverra semper lorem nec molestie. Maecenas tincidunt est efficitur ligula euismod, sit amet ornare est vulputate.

### Notes:

<!-- Slide number: 2 -->
# Chart

### Notes:

Test with Image files

MarkItDown can extract information from image files, but it needs LLM to generate the descriptions. Let's include the next Azure OpenAI client to the previous function.

  • Create a new Code cell and install the OpenAI Python client library
pip install openai
  • Create a new Code cell and set the following environment variables.
%env AZURE_OPENAI_API_KEY=[INSERT AZURE OPENAI API KEY]
%env AZURE_OPENAI_ENDPOINT=[INSERT AZURE OPENAI ENDPOINT]
%env AZURE_OPENAI_MODEL=gpt-4o
  • Create a new Code cell and modify the existing Python function.
import os
from openai import AzureOpenAI
from markitdown import MarkItDown

def ConvertToMarkdown(fileName):

  client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),  
    api_version="2024-02-01",
    azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
    )
  md = MarkItDown(llm_client=client, llm_model=os.getenv("AZURE_OPENAI_MODEL"))
  result = md.convert(fileName)
  print(result.text_content)
  • Create a new Code cell and call the function with a filename parameter.
ConvertToMarkdown("receipt.jpg")

I used this receipt image file:

End result as a Markdown:

# Description:
"Digital receipt for two bestselling IT and DevOps-focused novels purchased on November 17, 2023. The order includes 'The Unicorn Project' (€11,97) and 'The Phoenix Project' (€10,37), both written by Gene Kim, with contributions from Kevin Behr and George Spafford in the latter. These Kindle editions, sold by Amazon EU S.à r.l., explore themes of digital transformation and business agility. The total cost, including €2,03 in sales tax, comes to €22,34. Perfect reads for tech enthusiasts and professionals aiming to understand IT revolution and modern DevOps practices."

How to utilize this in Azure Cloud?

Azure and Python make for an excellent combination! 👌 Azure has several services available that support Python. MarkItDown can be easily integrated in pre-processing phases, where files are converted to Markdown for text indexing and analysis.

Check these Azure PaaS services which support Python:

Comments