Narrate Any PDF

Human-like narration of your PDFs, omitting content not meant to be read aloud.

Listen To Our Demos And Hear The Difference

Experience a new class of Text-to-Speech that understands what it is saying.

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

Abstract
Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis.

The Surprising Effectiveness of Test-Time Training for Abstract Reasoning

Abstract
Language models have shown impressive performance on tasks within their training distribution, but often struggle with novel problems requiring complex reasoning. We investigate the effectiveness of test-time training (TTT), updating model parameters temporarily during inference using a loss derived from input data, as a mechanism for improving models' reasoning capabilities, using the Abstraction and Reasoning Corpus (ARC) as a benchmark. Through systematic experimentation, we identify three crucial components for successful TTT: (1) initial finetuning on similar tasks, (2) auxiliary task format and augmentations, and (3) per-instance training. TTT significantly improves performance on ARC tasks, achieving up to 6x improvement in accuracy compared to base fine-tuned models; applying TTT to an 8B-parameter language model, we achieve 53% accuracy on the ARC's public validation set, improving the state of the art by nearly 25% for public and purely neural approaches. By ensembling our method with recent program generation approaches, we achieve a state-of-the-art public validation accuracy of 61.9%, matching the average human score. Our findings suggest that explicit symbolic search is not the only path to improved abstract reasoning in neural language models; additional test-time compute applied to continued training on few-shot examples can also be extremely effective.
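The abstract's core idea, temporarily adapting a copy of the model's parameters on an instance's few-shot examples and discarding the adaptation afterwards, can be illustrated with a toy sketch. The paper fine-tunes a large language model; the linear model, loss, and data below are purely illustrative stand-ins, not the paper's method.

```python
import numpy as np

def ttt_predict(base_w, demos, query, lr=0.1, steps=200):
    """Test-time training sketch: take gradient steps on a *copy* of the
    weights using the instance's few-shot demonstrations, predict with the
    adapted copy, and leave the base weights untouched."""
    w = base_w.copy()  # per-instance adaptation; discarded after prediction
    X = np.array([x for x, _ in demos], dtype=float)
    y = np.array([t for _, t in demos], dtype=float)
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # mean-squared-error gradient
        w -= lr * grad
    return query @ w

# Toy instance: the base weights are wrong for this task, but a few
# demonstrations at test time are enough to recover the rule y = 2a + 3b.
base_w = np.zeros(2)
demos = [((1.0, 0.0), 2.0), ((0.0, 1.0), 3.0), ((1.0, 1.0), 5.0)]
pred = ttt_predict(base_w, demos, np.array([2.0, 1.0]))
```

After adaptation the prediction for the query (2, 1) lands near 2·2 + 3·1 = 7, even though the base weights alone would predict 0.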

Adding Conditional Control to Text-to-Image Diffusion Models

Abstract
We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, e.g., edges, depth, segmentation, human pose, etc., with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1M) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.
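The "zero convolution" trick the abstract mentions has a simple mechanics: because the added layer's weights start at exactly zero, the control branch contributes nothing at the first finetuning step, so the locked backbone's output is initially unchanged. A minimal NumPy sketch (a 1x1 convolution reduced to a matrix multiply, purely illustrative, not the ControlNet implementation):

```python
import numpy as np

def zero_conv(x, w, b):
    # A 1x1 convolution on a feature vector is just an affine map.
    return w @ x + b

d = 4
# Zero-initialized weights and bias: training gradually grows them from zero.
w0, b0 = np.zeros((d, d)), np.zeros(d)

backbone_out = np.random.randn(d)   # output of the locked, pretrained layers
control_feat = np.random.randn(d)   # features from the trainable control branch

# At initialization the control branch adds exactly nothing,
# so no noise perturbs the pretrained model at the start of finetuning.
combined = backbone_out + zero_conv(control_feat, w0, b0)
```

Once training updates `w0` and `b0` away from zero, the conditioning signal flows in gradually rather than as a sudden perturbation.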

How It Works

Our AI uses vision models to intelligently filter each page, then generates natural-sounding narration with state-of-the-art speech models.

Upload Your PDF

Vision AI Generates Transcript

Speech AI Generates Narration

Don't Just Take Our Word For It

"This is incredible for my research papers! It skips all the citations and footnotes automatically."

- David R.

"As someone with ADHD, this has been a game changer. The natural-sounding voice keeps me engaged and I love that it focuses on the main content."

- Rachel M.

"I use this for all my textbook readings now. The voice is incredibly natural and it's so smart about what to read vs skip."

- Sarah B.

"Finally! An AI narrator that doesn't sound robotic. This makes getting through dense papers so much easier."

- Thomas W.

"The audio quality is outstanding. Way better than any other PDF reader I've tried."

- Michael K.

"That's a brilliant use of AI to make information more accesible!"

- Jennifer P.

"I'm a grad student and this has revolutionized how I get through my reading list. The natural voice makes dense material much more digestible."

- Emma L.

"Perfect for my commute - I can finally make productive use of my drive time with actually enjoyable audio."

- Daniel S.

"The smart content filtering is brilliant. No more listening to citation numbers or footnotes!"

- James T.

"Been using this for my law school readings. Game changing for getting through dense legal documents."

- Nicole C.

"As someone with dyslexia, this tool has been life-changing. The natural voice makes it so much easier to absorb information."

- Alex P.

"I'm amazed at how well it handles complex academic papers. It knows exactly what to read and what to skip."

- Marcus H.

Frequently Asked Questions

Have another question? Send an email to hi@narratemypdf.com