Large Language Models (LLMs) are revolutionizing various fields, but their success hinges on the quality and consistency of their training data. Uniformity matters when it comes to text data – especially when dealing with the wild jungle of document formats like PDFs, presentations (PPTs), and HTML. This blog post explores strategies for extracting content from these diverse formats and converting them into a unified JSON structure for streamlined LLM training.

The Challenge of Mixed Document Formats:
Imagine feeding an LLM a menagerie of text sources: a bulleted list from a presentation, a research paper in PDF format, and a news article riddled with HTML tags. The LLM might struggle to understand the structure, meaning, and relationships within this information. Inconsistencies in formatting can lead to biases in the training data, impacting the model’s performance and potentially introducing unfair advantages based on document type.
Normalization and Content Extraction: Building the Foundation
Before diving into JSON conversion, it’s crucial to address content normalization. Normalization techniques ensure consistency in the extracted text, reducing noise and inconsistencies that might confuse the LLM.
This can involve the following (a short normalization sketch follows the list):
- Handling Case Sensitivity: Converting everything to lowercase or uppercase eliminates inconsistencies arising from capitalization styles.
- Stemming and Lemmatization: Reducing words to their root form (stemming) or base form (lemmatization) can help the LLM focus on core meaning rather than grammatical variations.
- Punctuation and Special Character Management: Removing unnecessary symbols or converting them to standardized forms can improve processing efficiency.
- Handling Outliers and Noise: Techniques like spell checking, removing stop words (common words like “the” or “a”), or filtering out irrelevant content can help mitigate the impact of typos, grammatical errors, or irrelevant data.
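As a concrete illustration, here is a minimal normalization sketch in plain Python. The small STOP_WORDS set and the normalize_text helper are illustrative assumptions rather than a prescribed pipeline, and stemming or lemmatization would typically come from a library such as NLTK or spaCy.

import re
import string

# A tiny illustrative stop-word list; in practice you would use a fuller
# list (e.g. from NLTK or spaCy), which can also provide lemmatization.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def normalize_text(text):
    # Handle case sensitivity: lowercase everything
    text = text.lower()
    # Punctuation and special character management: strip punctuation,
    # then collapse runs of whitespace into single spaces
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()
    # Handle noise: drop common stop words
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

print(normalize_text("The LLM's training data, in ALL its formats!"))
# -> "llms training data all its formats"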
Once the content is normalized, we can proceed with document-specific extraction methods.
Extracting Content from Diverse Formats:
Let's break down content extraction techniques for common document formats, leveraging open-source libraries. We'll start with an HTML document and extract its title from the metadata so it can be used as text input to the LLM.
HTML extraction example using the BeautifulSoup library:
from bs4 import BeautifulSoup

def extract_html(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    # Pull the visible text, one block per line
    text = soup.get_text(separator='\n')
    metadata = {"source_format": "HTML"}
    # The <title> tag doubles as the document title when present
    return {"title": soup.title.string if soup.title else "", "text": text, "metadata": metadata}
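A quick usage example, with a made-up HTML snippet, to show the unified shape of the result:

html_sample = "<html><head><title>Quarterly Report</title></head><body><p>Revenue grew.</p></body></html>"
print(extract_html(html_sample))
# -> {'title': 'Quarterly Report', 'text': 'Quarterly Report\nRevenue grew.', 'metadata': {'source_format': 'HTML'}}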
PPTX extraction example using the python-pptx library:
from pptx import Presentation

def extract_pptx(pptx_file):
    presentation = Presentation(pptx_file)
    text = ""
    for slide in presentation.slides:
        for shape in slide.shapes:
            # Only shapes with a text frame carry text
            if shape.has_text_frame:
                text += shape.text + "\n"
    metadata = {"source_format": "PPTX"}
    # The document title lives in the core (Dublin Core) properties
    return {"title": presentation.core_properties.title or "", "text": text, "metadata": metadata}
PDF extraction example using the PyPDF2 library:
import PyPDF2

def extract_pdf(pdf_file):
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    text = ""
    for page in pdf_reader.pages:
        # extract_text() can return None for pages with no extractable text
        text += (page.extract_text() or "") + "\n"
    metadata = {"source_format": "PDF"}
    # Document-level metadata (title, author, ...) lives in reader.metadata
    title = pdf_reader.metadata.title if pdf_reader.metadata else ""
    return {"title": title or "", "text": text, "metadata": metadata}
Now, putting it all together, let's build the document_json that will serve as input to the LLM.
def process_document(document_path, format):
    # Dispatch to the right extractor based on the declared format
    if format == "HTML":
        with open(document_path, encoding="utf-8") as f:
            return extract_html(f.read())
    elif format == "PPTX":
        return extract_pptx(document_path)
    elif format == "PDF":
        return extract_pdf(document_path)
    else:
        raise ValueError(f"Unsupported document format: {format}")
import json

# Example usage
document_json = process_document("my_document.pdf", "PDF")
print(json.dumps(document_json, indent=2))
This script calls the appropriate extraction function based on the document format and builds a unified JSON structure for every document, so document_json can be passed to the LLM along with the required prompt metadata.
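As a rough sketch of that last step, here is one way document_json might be enriched with extra metadata (such as a timestamp) and paired with an instruction before serialization. The build_training_record helper, the instruction text, and the JSON Lines output file are illustrative assumptions, not part of any specific training framework.

import json
from datetime import datetime, timezone

def build_training_record(document_json, instruction):
    # Enrich the metadata with a processing timestamp for later analysis
    document_json["metadata"]["processed_at"] = datetime.now(timezone.utc).isoformat()
    # Pair the document with the prompt/instruction it should be trained against
    return {"instruction": instruction, "document": document_json}

record = build_training_record(document_json, "Summarize the following document.")

# Append one JSON object per line (JSON Lines), a common layout for training corpora
with open("training_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")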
By converting the extracted text to JSON, we gain several advantages when it becomes the LLM training input. First, JSON offers a clear, standardized format, ensuring consistency across different document types, which makes it easier for the LLM to understand and process the data. Additionally, JSON is flexible, allowing us to include extra information beyond the text itself: for further analysis, we could store details like the original document format (PDF, PPTX, HTML) or timestamps. This flexibility adds context and richness to the data, potentially improving LLM performance. Are you interested in more pipeline implementation posts like this? Please let me know in the comments below.