Chapter 7

Python + Word — Batch Document Generation and Processing

Chapter 7: Python Word Automation — Batch Contracts, Reports & Notices

HR needs to send performance notices to hundreds of employees at year-end. Legal needs to generate batches of uniformly formatted contracts. Sales needs personalized quotes in Word format for every client. All of this repetitive work takes Python just minutes. This chapter covers the document structure of python-docx, the docxtpl template engine for batch generation, and advanced topics including image insertion, table manipulation, and Word-to-PDF conversion.

python-docx Basics: Document Structure

Installation

Terminal

pip install python-docx docxtpl

The Four-Level Hierarchy

Document: The .docx file itself — the root container for everything
Paragraph: Documents are made of paragraphs — body text, headings, list items
Run: A contiguous stretch of text within a paragraph that shares the same formatting. One paragraph can have many runs — each run can have independent font, color, and bold settings
Table: Rows and cells; each cell contains paragraphs

Key concept: The Run is the smallest text unit in a Word document. "Hello World" might be one Run or two separate Runs ("Hello " and "World") depending on whether they share identical formatting. This matters enormously when doing search-and-replace.
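To make the Run idea concrete, the sketch below uses only the standard library to parse a hand-written, simplified WordprocessingML paragraph and list the text of each run. The XML fragment and the helper name `run_texts` are illustrative assumptions, not output captured from Word; a real document.xml carries many more attributes.

```python
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:p, w:r and w:t
W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

# Simplified, hand-written paragraph: "Hello " is plain and "World" is bold,
# so the text has to live in two separate runs (w:r elements)
PARAGRAPH_XML = f'''<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">Hello </w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:t>World</w:t></w:r>
</w:p>'''

def run_texts(paragraph_xml):
    """Return the text of each run in a paragraph, in document order."""
    paragraph = ET.fromstring(paragraph_xml)
    return [
        ''.join(t.text or '' for t in run.iter(f'{{{W}}}t'))
        for run in paragraph.findall(f'{{{W}}}r')
    ]

runs = run_texts(PARAGRAPH_XML)
print(runs)            # one entry per run
print(''.join(runs))   # the text a reader sees in Word
```

The reader sees one phrase, but any code that inspects runs one at a time sees two fragments.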

Creating a Document with Formatting

create_basic_doc.py

from docx import Document
from docx.shared import Pt, RGBColor, Inches
from docx.enum.text import WD_ALIGN_PARAGRAPH

doc = Document()

# Heading
heading = doc.add_heading('Performance Notice', level=1)
heading.alignment = WD_ALIGN_PARAGRAPH.CENTER

# Paragraph with styled run
p = doc.add_paragraph()
run = p.add_run('Dear John Smith,')
run.font.size = Pt(12)
run.font.bold = True
run.font.color.rgb = RGBColor(0x1F, 0x2D, 0x3D)

# Body paragraph
p2 = doc.add_paragraph(
    'Following the annual performance review, your rating has been '
    'assessed as Grade A. Congratulations.'
)
p2.alignment = WD_ALIGN_PARAGRAPH.JUSTIFY
p2.paragraph_format.space_before = Pt(6)
p2.paragraph_format.space_after = Pt(12)
p2.paragraph_format.line_spacing = Pt(20)

doc.save('notice.docx')

Common Formatting Quick Reference

Operation	Code
Bold	`run.font.bold = True`
Italic	`run.font.italic = True`
Underline	`run.font.underline = True`
Font size (pt)	`run.font.size = Pt(14)`
Font color	`run.font.color.rgb = RGBColor(255, 0, 0)`
Font name	`run.font.name = 'Arial'`
Center align	`p.alignment = WD_ALIGN_PARAGRAPH.CENTER`
Right align	`p.alignment = WD_ALIGN_PARAGRAPH.RIGHT`

Template + Data Fill: Batch Generation with docxtpl

python-docx is great for building documents from scratch, but controlling every style detail via code is tedious. The professional approach: design your document in Word (with company branding, signatures, seals), add Jinja2 template tags for variable content, then use docxtpl to fill in data and generate hundreds of documents automatically.

Template Syntax

{{name}} — output a variable
{%p if condition %}...{%p endif %} — conditional paragraphs
{%tr for item in list %}...{%tr endfor %} — loop over table rows

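Word may split your tags across multiple Runs. If you type {{name}} and Word's autocorrect or spell-checker fires mid-tag, the braces may end up in separate Runs internally, which can break template rendering. docxtpl strips most of this stray markup from inside tags before rendering, but do not rely on it: disable autocorrect while editing the template, retype any failing tag in one go, and call the template's get_undeclared_template_variables() method to confirm that docxtpl sees every variable you expect.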
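One way to spot split tags before rendering is to look at the raw XML inside the .docx archive. The sketch below is a homemade heuristic, not a docxtpl feature: `find_split_tags` is a hypothetical helper that reports any {{ ... }} or {% ... %} tag with XML markup between its delimiters. It cannot see a tag whose opening braces are themselves split into two runs.

```python
import re
import zipfile

# Matches {{ ... }} and {% ... %} tags, including any XML caught inside them
TAG_RE = re.compile(r'\{\{.*?\}\}|\{%.*?%\}', re.DOTALL)
XML_TAG_RE = re.compile(r'<[^>]+>')

def find_split_tags(docx_path):
    """List template tags whose raw XML contains markup (split across runs)."""
    with zipfile.ZipFile(docx_path) as archive:
        xml = archive.read('word/document.xml').decode('utf-8')
    return [
        XML_TAG_RE.sub('', match.group(0))   # show the tag as the author typed it
        for match in TAG_RE.finditer(xml)
        if '<' in match.group(0)
    ]
```

Calling `find_split_tags('notice_template.docx')` returns an empty list when every tag sits inside a single run.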

Case: Batch Generate 100 Employee Notices from Excel

batch_notice.py

import pandas as pd
from docxtpl import DocxTemplate
from pathlib import Path
from datetime import date

df = pd.read_excel('employees.xlsx')
# Expected columns: Name, Department, Grade

output_dir = Path('output/notices')
output_dir.mkdir(parents=True, exist_ok=True)

today = date.today().strftime('%B %d, %Y')
success_count = 0

for _, row in df.iterrows():
    context = {
        'name':       row['Name'],
        'department': row['Department'],
        'grade':      row['Grade'],
        'date':       today,
    }
    # Load a fresh template for every notice so each render starts from
    # the untouched file, whichever docxtpl version is installed
    tpl = DocxTemplate('notice_template.docx')
    tpl.render(context)
    filename = output_dir / f"{row['Name']}_notice.docx"
    tpl.save(filename)
    success_count += 1
    print(f'  Generated: {filename.name}')

print(f'\nDone! {success_count} notices saved to {output_dir}')

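Performance: expect roughly 3–8 seconds for 100 Word documents, depending on template size and hardware. The same task done manually by a skilled employee takes at least 2–3 hours, and the saving repeats every time the batch is rerun. This is the compounding value of automation.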
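The loop above uses each employee's name as the file name, so two employees with the same name would overwrite each other, and characters such as / or : would break the path. Here is a minimal sketch of one way to guard against both, assuming a hypothetical helper `safe_stem` that is not part of pandas or docxtpl:

```python
import re

def safe_stem(name, used):
    """Return a file-name-safe, unique stem; `used` tracks stems already taken."""
    # Replace characters that are invalid in Windows or POSIX file names
    stem = re.sub(r'[\\/:*?"<>|]+', '_', str(name)).strip(' .') or 'unnamed'
    candidate, n = stem, 1
    while candidate.lower() in used:   # case-insensitive, as on Windows and macOS
        n += 1
        candidate = f'{stem}_{n}'
    used.add(candidate.lower())
    return candidate

used_names = set()
stems = [safe_stem(n, used_names) for n in ['Ana Diaz', 'Ana Diaz', 'R&D/Ops: Lee']]
print(stems)
```

Inside the batch loop, the file name could then be built as output_dir / f"{safe_stem(row['Name'], used_names)}_notice.docx".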

Advanced Document Operations

Inserting Images with Size Control

insert_image.py

from docx import Document
from docx.shared import Inches, Cm
from docx.enum.text import WD_ALIGN_PARAGRAPH

doc = Document()

# Insert image with width (height auto-scales to preserve aspect ratio)
doc.add_picture('logo.png', width=Inches(2))

# Center the image (it lives inside a paragraph)
last_paragraph = doc.paragraphs[-1]
last_paragraph.alignment = WD_ALIGN_PARAGRAPH.CENTER

# Chart with explicit dimensions
doc.add_picture('chart.png', width=Cm(12), height=Cm(8))

doc.save('with_images.docx')
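When only a width is given, the height follows from the image's aspect ratio. The arithmetic can be sketched without python-docx. The helper `scaled_height_cm` and the pixel sizes are made-up illustrations, not values read from logo.png or chart.png.

```python
def scaled_height_cm(native_width_px, native_height_px, target_width_cm):
    """Height that keeps the aspect ratio when the width is set to target_width_cm."""
    return target_width_cm * native_height_px / native_width_px

# Hypothetical 1200 x 800 px chart placed 12 cm wide
height = scaled_height_cm(1200, 800, 12)
print(f'{height:.1f} cm')

# Forcing a height that does not match this value stretches the image
forced_height = 9.0
distorted = abs(forced_height - height) > 0.01
print('distorted' if distorted else 'proportional')
```

Passing both width and height is only safe when the pair matches the image's native ratio, so computing one from the other avoids accidental stretching.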

Table Operations

table_operations.py

from docx import Document
from docx.shared import Pt

doc = Document()
doc.add_heading('Sales Report', level=1)

# Create table
table = doc.add_table(rows=1, cols=4)
table.style = 'Table Grid'

# Header row
headers = ['Product', 'Units Sold', 'Unit Price', 'Revenue']
for i, text in enumerate(headers):
    cell = table.rows[0].cells[i]
    cell.text = text
    for run in cell.paragraphs[0].runs:
        run.font.bold = True
        run.font.size = Pt(11)

# Data rows
data = [
    ('Laptop',        120, 999,  119880),
    ('Wireless Mouse',350, 29,   10150),
    ('Keyboard',      200, 79,   15800),
]
for row_data in data:
    row = table.add_row()
    for i, value in enumerate(row_data):
        row.cells[i].text = str(value)

doc.save('sales_report.docx')
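The data rows above hard-code the revenue figures and write raw integers into the cells. As a sketch, the revenue could instead be derived from units and price and formatted before it reaches the table. `build_rows` and the "Total" row are illustrative additions, not part of the original listing.

```python
sales = [
    ('Laptop', 120, 999),
    ('Wireless Mouse', 350, 29),
    ('Keyboard', 200, 79),
]

def build_rows(sales):
    """Return display strings for each table row plus a final total row."""
    rows = []
    total = 0
    for product, units, unit_price in sales:
        revenue = units * unit_price          # derive instead of typing it in
        total += revenue
        rows.append((product, f'{units:,}', f'${unit_price:,.2f}', f'${revenue:,.2f}'))
    rows.append(('Total', '', '', f'${total:,.2f}'))
    return rows

for row in build_rows(sales):
    print(row)
```

Each tuple can be written with the same row.cells[i].text assignment used in the listing, since every value is already a string.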

Headers and Footers with Page Numbers

header_footer.py

from docx import Document
from docx.shared import Pt
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml import OxmlElement
from docx.oxml.ns import qn

doc = Document()
section = doc.sections[0]

# Header
header_para = section.header.paragraphs[0]
header_para.text = 'CONFIDENTIAL — Internal Use Only'
header_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
header_para.runs[0].font.size = Pt(9)

# Footer with page number field
footer_para = section.footer.paragraphs[0]
footer_para.alignment = WD_ALIGN_PARAGRAPH.CENTER

def add_page_number(paragraph):
    run = paragraph.add_run()
    for tag, ftype in [('begin', None), ('instrText', 'PAGE'), ('end', None)]:
        if tag == 'instrText':
            el = OxmlElement('w:instrText')
            el.text = ftype
        else:
            el = OxmlElement('w:fldChar')
            el.set(qn('w:fldCharType'), tag)
        run._r.append(el)

footer_para.add_run('Page ')
add_page_number(footer_para)

doc.add_paragraph('Document body content goes here.')
doc.save('with_header_footer.docx')

Merging Multiple Documents

merge_docs.py

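from copy import deepcopy
from docx import Document
from docx.enum.text import WD_BREAK
from docx.oxml.ns import qn
from pathlib import Path

def merge_documents(file_list, output_path):
    # Note: only body XML is copied; images and styles that exist only in
    # the source files are not carried over
    file_list = list(file_list)
    merged = Document()
    body = merged.element.body
    sect_pr = body.find(qn('w:sectPr'))  # page setup of the merged file

    # Empty the default body but keep its section properties
    for element in list(body):
        if element.tag != qn('w:sectPr'):
            body.remove(element)

    for i, filepath in enumerate(file_list):
        sub_doc = Document(str(filepath))
        # Iterate over a snapshot and copy each element: appending the
        # original would move it out of sub_doc mid-iteration and skip content
        for element in list(sub_doc.element.body):
            if element.tag == qn('w:sectPr'):
                continue  # keep a single sectPr at the end of the body
            sect_pr.addprevious(deepcopy(element))
        # Page break between documents (not after last)
        if i < len(file_list) - 1:
            merged.add_paragraph().add_run().add_break(WD_BREAK.PAGE)

    merged.save(output_path)
    print(f'Merged {len(file_list)} files -> {output_path}')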

doc_files = sorted(Path('output/notices').glob('*.docx'))
merge_documents(doc_files, 'all_notices_merged.docx')

Search & Replace

The Run-Split Problem

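The most common pitfall in Word automation: when you type {{name}} into a Word document, Word may internally store it across three separate Runs — {{, name, }}. A naive per-run replacement such as run.text = run.text.replace(old, new) sees only one fragment at a time, never the joined string, so it silently changes nothing. Assigning to paragraph.text does operate on the joined string, but it collapses the paragraph into a single run and discards character formatting.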
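The failure is easy to reproduce with plain strings standing in for runs. The list `runs` is a hypothetical stand-in for the .text of three python-docx runs; no document is opened.

```python
# Text of three runs as Word might store a tag typed as {{name}}
runs = ['{{', 'name', '}}']

# Per-run replacement: no single run contains the whole tag, so nothing changes
per_run = [text.replace('{{name}}', 'John Smith') for text in runs]

# Joined replacement: the tag is visible once the fragments are concatenated
joined = ''.join(runs).replace('{{name}}', 'John Smith')

print(per_run)
print(joined)
```

The per-run result is identical to the input, which is why this kind of bug produces no error message.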

safe_replace.py — Cross-run safe replacement

from docx import Document

def replace_in_paragraph(paragraph, old_text, new_text):
    """Replace text across runs safely: join, replace, rewrite."""
    full_text = ''.join(run.text for run in paragraph.runs)
    if old_text not in full_text:
        return
    new_full = full_text.replace(old_text, new_text)
    if paragraph.runs:
        paragraph.runs[0].text = new_full
        for run in paragraph.runs[1:]:
            run.text = ''

def replace_in_document(doc, replacements: dict):
    # Body paragraphs
    for paragraph in doc.paragraphs:
        for old, new in replacements.items():
            replace_in_paragraph(paragraph, old, new)
    # Table cells
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    for old, new in replacements.items():
                        replace_in_paragraph(paragraph, old, new)
    # Headers and footers
    for section in doc.sections:
        for part in (section.header, section.footer):
            for paragraph in part.paragraphs:
                for old, new in replacements.items():
                    replace_in_paragraph(paragraph, old, new)

doc = Document('contract_template.docx')
replace_in_document(doc, {
    '{{company}}':    'Acme Corporation',
    '{{client}}':     'GlobalTrade Ltd',
    '{{amount}}':     '$128,000',
    '{{start_date}}': 'January 1, 2025',
})
doc.save('contract_filled.docx')

Prefer docxtpl for new projects. The replace_in_paragraph function above misses text inside text boxes and nested tables, and it applies the first run's formatting to the whole paragraph. docxtpl renders tags wherever they appear in the document XML, which covers most of these cases. Write your own replacement logic only when you cannot modify the template file's format.

Word to PDF Conversion

Windows: win32com (Best Fidelity)

word_to_pdf_windows.py

import win32com.client
from pathlib import Path

def word_to_pdf_windows(docx_path, pdf_path=None):
    docx_path = Path(docx_path).resolve()
    pdf_path = Path(pdf_path or docx_path.with_suffix('.pdf')).resolve()

    word = win32com.client.Dispatch('Word.Application')
    word.Visible = False
    try:
        doc = word.Documents.Open(str(docx_path))
        doc.SaveAs(str(pdf_path), FileFormat=17)  # 17 = wdFormatPDF
        doc.Close()
        print(f'Converted: {pdf_path.name}')
    finally:
        word.Quit()

for docx_file in Path('output/notices').glob('*.docx'):
    word_to_pdf_windows(docx_file)

Mac / Linux: LibreOffice CLI

word_to_pdf_libreoffice.py

import subprocess
from pathlib import Path

def word_to_pdf_libreoffice(docx_path, output_dir=None):
    """
    Requires LibreOffice installed.
    Mac:    brew install --cask libreoffice
    Ubuntu: sudo apt install libreoffice
    """
    docx_path = Path(docx_path).resolve()
    output_dir = Path(output_dir or docx_path.parent)
    cmd = ['libreoffice', '--headless',
           '--convert-to', 'pdf',
           '--outdir', str(output_dir),
           str(docx_path)]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        print(f'Converted: {docx_path.stem}.pdf')
    else:
        print(f'Failed: {result.stderr}')
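The listing above assumes an executable named libreoffice is on PATH. Some installs expose it as soffice instead, and the macOS app bundle usually keeps it inside the application folder. A hedged sketch of a lookup step follows. `find_soffice` and `build_convert_command` are hypothetical helpers, and the macOS path is the usual default location, assumed rather than detected.

```python
import shutil
from pathlib import Path

# Usual install location inside the macOS app bundle (assumed default path)
MAC_SOFFICE = '/Applications/LibreOffice.app/Contents/MacOS/soffice'

def find_soffice():
    """Return a path to the LibreOffice executable, or None if not found."""
    for name in ('libreoffice', 'soffice'):
        found = shutil.which(name)
        if found:
            return found
    return MAC_SOFFICE if Path(MAC_SOFFICE).exists() else None

def build_convert_command(executable, docx_path, output_dir):
    """Build the same headless conversion command with an explicit executable."""
    return [executable, '--headless', '--convert-to', 'pdf',
            '--outdir', str(output_dir), str(docx_path)]

executable = find_soffice()
if executable is None:
    print('LibreOffice not found; install it or add it to PATH')
else:
    print(build_convert_command(executable, 'notice.docx', 'output'))
```

When running the command in a batch, passing timeout= to subprocess.run keeps one stuck conversion from blocking the rest.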

Windows and macOS: docx2pdf

word_to_pdf_cross.py

# pip install docx2pdf
from docx2pdf import convert

# Single file
convert('contract_filled.docx', 'contract_filled.pdf')

# Entire directory
convert('output/notices/')  # converts all .docx files to .pdf

docx2pdf dependencies: On Windows it uses Word COM (requires Microsoft Word). On Mac it drives Word for Mac. Linux is not supported, because the library needs Microsoft Word; use the LibreOffice CLI there instead. Without Word installed, the conversion raises an error.

Project: Contract Batch Generation System

Generate customized contracts for 50 clients from an Excel list, archive by client name, and produce PDF copies alongside each Word file.

contract_batch_system.py — Complete system (~80 lines)

"""
Contract Batch Generation System
Requirements: pip install python-docx docxtpl pandas openpyxl docx2pdf
"""
import pandas as pd
from docxtpl import DocxTemplate
from pathlib import Path
from datetime import date
import logging

logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
logger = logging.getLogger(__name__)

TEMPLATE_PATH = 'templates/contract_template.docx'
DATA_PATH     = 'data/clients.xlsx'
OUTPUT_ROOT   = Path('output/contracts')
CONVERT_PDF   = True

def load_client_data(path: str) -> pd.DataFrame:
    df = pd.read_excel(path)
    required = ['Client Name', 'Contact', 'Amount', 'Term', 'Sign Date']
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise ValueError(f'Excel missing columns: {missing}')
    logger.info(f'Loaded {len(df)} client records')
    return df

def generate_contract(tpl: DocxTemplate, row: pd.Series, output_dir: Path):
    client_name = row['Client Name']
    client_dir = output_dir / client_name
    client_dir.mkdir(parents=True, exist_ok=True)

    context = {
        'client_name':  client_name,
        'contact':      row['Contact'],
        'amount':       f"${row['Amount']:,.2f}",
        'term':         row['Term'],
        'sign_date':    str(row['Sign Date'])[:10],
        'today':        date.today().strftime('%B %d, %Y'),
        'contract_no':  f"CT-{date.today().year}-{row.name+1:04d}",
    }

    tpl.render(context)
    docx_path = client_dir / f"{client_name}_Contract.docx"
    tpl.save(docx_path)

    if CONVERT_PDF:
        try:
            from docx2pdf import convert
            convert(str(docx_path), str(client_dir / f"{client_name}_Contract.pdf"))
            logger.info(f'  [OK] {client_name} — Word + PDF')
        except Exception as e:
            logger.warning(f'  [PDF failed] {client_name}: {e}')
    else:
        logger.info(f'  [OK] {client_name} — Word')

def main():
    OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)
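    df = load_client_data(DATA_PATH)
    errors = []

    for _, row in df.iterrows():
        try:
            # Load a fresh template per contract so each render starts from
            # the untouched file, whichever docxtpl version is installed
            tpl = DocxTemplate(TEMPLATE_PATH)
            generate_contract(tpl, row, OUTPUT_ROOT)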
        except Exception as e:
            errors.append((row.get('Client Name', 'Unknown'), str(e)))
            logger.error(f'  [Error] {row.get("Client Name")}: {e}')

    success = len(df) - len(errors)
    logger.info(f'\nDone: {success} succeeded, {len(errors)} failed')
    logger.info(f'Output: {OUTPUT_ROOT.resolve()}')

if __name__ == '__main__':
    main()

System highlights:
Template and data fully decoupled — non-technical staff can update the Word template independently
Per-client directory structure makes archiving and retrieval easy
Auto-generated contract numbers (year + sequence)
Robust error handling — one client failure does not stop the rest
PDF conversion failure is logged as a warning, not a crash
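The contract number in generate_contract is built from row.name, which is the DataFrame index label, so it is a clean 1-based sequence only when the index is the default 0, 1, 2, and so on. Below is a sketch of a numbering helper that does not depend on the index. `contract_number` is a hypothetical name, and the CT-year-sequence format is copied from the listing.

```python
from datetime import date

def contract_number(sequence, year=None):
    """Format a contract number such as CT-2025-0007 from a 1-based sequence."""
    if sequence < 1:
        raise ValueError('sequence must start at 1')
    year = year or date.today().year
    return f'CT-{year}-{sequence:04d}'

# enumerate(..., start=1) gives a gap-free sequence even after filtering rows
clients = ['Acme Corporation', 'GlobalTrade Ltd']
numbers = [contract_number(i, year=2025) for i, _ in enumerate(clients, start=1)]
print(numbers)
```

In the batch loop this would mean iterating with enumerate(df.iterrows(), start=1) and passing the counter into the context instead of row.name + 1.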
