Python + Word — Batch Document Generation and Processing
Chapter 7: Python Word Automation — Batch Contracts, Reports & Notices
HR needs to send performance notices to hundreds of employees at year-end. Legal needs to generate batches of uniformly formatted contracts. Sales needs personalized quotes in Word format for every client. All of this repetitive work takes Python just minutes. This chapter covers the document structure of python-docx, the docxtpl template engine for batch generation, and advanced topics including image insertion, table manipulation, and Word-to-PDF conversion.
python-docx Basics: Document Structure
Installation
Terminal
pip install python-docx docxtpl
The Four-Level Hierarchy
- Document: The .docx file itself — the root container for everything
- Paragraph: Documents are made of paragraphs — body text, headings, list items
- Run: A contiguous stretch of text within a paragraph that shares the same formatting. One paragraph can have many runs — each run can have independent font, color, and bold settings
- Table: A grid of rows and cells that sits alongside paragraphs in the document body; each cell contains its own paragraphs
Key concept: The Run is the smallest text unit in a Word document. "Hello World" might be one Run or two separate Runs ("Hello " and "World") depending on whether they share identical formatting. This matters enormously when doing search-and-replace.
Creating a Document with Formatting
create_basic_doc.py
from docx import Document
from docx.shared import Pt, RGBColor, Inches
from docx.enum.text import WD_ALIGN_PARAGRAPH
doc = Document()
# Heading
heading = doc.add_heading('Performance Notice', level=1)
heading.alignment = WD_ALIGN_PARAGRAPH.CENTER
# Paragraph with styled run
p = doc.add_paragraph()
run = p.add_run('Dear John Smith,')
run.font.size = Pt(12)
run.font.bold = True
run.font.color.rgb = RGBColor(0x1F, 0x2D, 0x3D)
# Body paragraph
p2 = doc.add_paragraph(
    'Following the annual performance review, your rating has been '
    'assessed as Grade A. Congratulations.'
)
p2.alignment = WD_ALIGN_PARAGRAPH.JUSTIFY
p2.paragraph_format.space_before = Pt(6)
p2.paragraph_format.space_after = Pt(12)
p2.paragraph_format.line_spacing = Pt(20)
doc.save('notice.docx')
Common Formatting Quick Reference
| Operation | Code |
|---|---|
| Bold | run.font.bold = True |
| Italic | run.font.italic = True |
| Underline | run.font.underline = True |
| Font size (pt) | run.font.size = Pt(14) |
| Font color | run.font.color.rgb = RGBColor(255, 0, 0) |
| Font name | run.font.name = 'Arial' |
| Center align | p.alignment = WD_ALIGN_PARAGRAPH.CENTER |
| Right align | p.alignment = WD_ALIGN_PARAGRAPH.RIGHT |
Template + Data Fill: Batch Generation with docxtpl
python-docx is great for building documents from scratch, but controlling every style detail via code is tedious. The professional approach: design your document in Word (with company branding, signatures, seals), add Jinja2 template tags for variable content, then use docxtpl to fill in data and generate hundreds of documents automatically.
Template Syntax
- {{ name }}: output a variable
- {%p if condition %}...{%p endif %}: conditional paragraphs
- {%tr for item in list %}...{%tr endfor %}: loop over table rows
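Word may split your tags across multiple Runs. If you type {{ name }} and Word's autocorrect or spell-checker fires, the braces may end up in separate Runs internally, which can break template rendering. Disable autocorrect while editing the template and type each tag in one go. To check which variables docxtpl actually detects, call the template's get_undeclared_template_variables() method and compare the result with the keys of your context.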
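As a rough illustration of the data side, each of the three tag types above expects a different kind of value in the context dictionary passed to render(): a scalar for a variable tag, a truthy or falsy value for a conditional paragraph, and a list for a table-row loop. The key names below (name, is_promoted, goals) are hypothetical and would have to match the tags in your own template.

```python
# Hypothetical context for a template containing:
#   {{ name }}
#   {%p if is_promoted %} ... {%p endif %}
#   {%tr for goal in goals %} {{ goal.title }} {{ goal.score }} {%tr endfor %}
context = {
    'name': 'John Smith',      # scalar, fills {{ name }}
    'is_promoted': True,       # boolean, controls the conditional paragraph
    'goals': [                 # list of dicts, one table row per entry
        {'title': 'Revenue target', 'score': 95},
        {'title': 'Customer satisfaction', 'score': 88},
    ],
}

# Sanity check before rendering: every row dict should expose the same keys
row_keys = {frozenset(goal) for goal in context['goals']}
print(len(row_keys) == 1)
```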
Case: Batch Generate 100 Employee Notices from Excel
batch_notice.py
import pandas as pd
from docxtpl import DocxTemplate
from pathlib import Path
from datetime import date
# Reading .xlsx with pandas also requires: pip install pandas openpyxl
df = pd.read_excel('employees.xlsx')
# Expected columns: Name, Department, Grade
output_dir = Path('output/notices')
output_dir.mkdir(parents=True, exist_ok=True)
today = date.today().strftime('%B %d, %Y')
success_count = 0
for _, row in df.iterrows():
    # Load a fresh template for each notice; some docxtpl versions do not
    # support rendering the same DocxTemplate object more than once
    tpl = DocxTemplate('notice_template.docx')
    context = {
        'name': row['Name'],
        'department': row['Department'],
        'grade': row['Grade'],
        'date': today,
    }
    tpl.render(context)
    # Note: two employees with the same name would overwrite each other's file
    filename = output_dir / f"{row['Name']}_notice.docx"
    tpl.save(filename)
    success_count += 1
    print(f' Generated: {filename.name}')
print(f'\nDone! {success_count} notices saved to {output_dir}')
Performance: 100 Word documents in 3–8 seconds. The same task done manually by a skilled employee takes at least 2–3 hours. This is the compounding value of automation.
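The batch loop above uses each employee's name directly as a file name, which can fail when a name contains characters the filesystem rejects, or silently overwrite a file when two employees share a name. The sketch below shows one possible guard using only the standard library. The helper safe_filename is hypothetical and not part of docxtpl or python-docx.

```python
import re

def safe_filename(name, used):
    """Return a filesystem-friendly, unique stem for `name`.

    `used` is a set of stems already handed out; it is updated in place.
    """
    # Replace characters that Windows and most other filesystems reject
    stem = re.sub(r'[\\/:*?"<>|]', '_', str(name)).strip() or 'unnamed'
    candidate = stem
    counter = 2
    # Append _2, _3, ... when two employees share a name
    while candidate in used:
        candidate = f'{stem}_{counter}'
        counter += 1
    used.add(candidate)
    return candidate

used_stems = set()
print(safe_filename('Ann Lee', used_stems))
print(safe_filename('Ann Lee', used_stems))
print(safe_filename('R&D / Ops: Kim', used_stems))
```

In the loop, the output path would then be built from safe_filename(row['Name'], used_stems) instead of the raw name.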
Advanced Document Operations
Inserting Images with Size Control
insert_image.py
from docx import Document
from docx.shared import Inches, Cm
from docx.enum.text import WD_ALIGN_PARAGRAPH
doc = Document()
# Insert image with width (height auto-scales to preserve aspect ratio)
doc.add_picture('logo.png', width=Inches(2))
# Center the image (it lives inside a paragraph)
last_paragraph = doc.paragraphs[-1]
last_paragraph.alignment = WD_ALIGN_PARAGRAPH.CENTER
# Chart with explicit dimensions
doc.add_picture('chart.png', width=Cm(12), height=Cm(8))
doc.save('with_images.docx')
Table Operations
table_operations.py
from docx import Document
from docx.shared import Pt
doc = Document()
doc.add_heading('Sales Report', level=1)
# Create table
table = doc.add_table(rows=1, cols=4)
table.style = 'Table Grid'
# Header row
headers = ['Product', 'Units Sold', 'Unit Price', 'Revenue']
for i, text in enumerate(headers):
    cell = table.rows[0].cells[i]
    cell.text = text
    for run in cell.paragraphs[0].runs:
        run.font.bold = True
        run.font.size = Pt(11)
# Data rows
data = [
    ('Laptop', 120, 999, 119880),
    ('Wireless Mouse', 350, 29, 10150),
    ('Keyboard', 200, 79, 15800),
]
for row_data in data:
    row = table.add_row()
    for i, value in enumerate(row_data):
        row.cells[i].text = str(value)
doc.save('sales_report.docx')
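The Revenue column in the table above is typed in by hand, which invites mismatches when units or prices change. A small sketch of deriving it instead, using the same sample figures, might look like this. The helper build_rows and the dollar formatting are illustrative choices, not something the original script prescribes.

```python
# (product, units sold, unit price in dollars): sample figures from the table above
sales = [
    ('Laptop', 120, 999),
    ('Wireless Mouse', 350, 29),
    ('Keyboard', 200, 79),
]

def build_rows(records):
    """Return display-ready string rows with a computed Revenue column."""
    rows = []
    for product, units, price in records:
        revenue = units * price
        # Thousands separators make large figures easier to read in a report
        rows.append((product, f'{units:,}', f'${price:,}', f'${revenue:,}'))
    return rows

rows = build_rows(sales)
for row in rows:
    print(row)
```

Each tuple can be written into the table with the same add_row() loop as before; the values are already strings, so no str() conversion is needed.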
Headers and Footers with Page Numbers
header_footer.py
from docx import Document
from docx.shared import Pt
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
doc = Document()
section = doc.sections[0]
# Header
header_para = section.header.paragraphs[0]
header_para.text = 'CONFIDENTIAL — Internal Use Only'
header_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
header_para.runs[0].font.size = Pt(9)
# Footer with page number field
footer_para = section.footer.paragraphs[0]
footer_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
def add_page_number(paragraph):
    run = paragraph.add_run()
    for tag, ftype in [('begin', None), ('instrText', 'PAGE'), ('end', None)]:
        if tag == 'instrText':
            el = OxmlElement('w:instrText')
            el.text = ftype
        else:
            el = OxmlElement('w:fldChar')
            el.set(qn('w:fldCharType'), tag)
        run._r.append(el)
footer_para.add_run('Page ')
add_page_number(footer_para)
doc.add_paragraph('Document body content goes here.')
doc.save('with_header_footer.docx')
Merging Multiple Documents
merge_docs.py
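from copy import deepcopy
from docx import Document
from docx.oxml.ns import qn
from pathlib import Path
def merge_documents(file_list, output_path):
    """Append the body content of each file to one document.
    Limitation: only body XML is copied, so images, numbering, and styles
    defined only in the source files may not carry over.
    """
    file_list = list(file_list)
    merged = Document()
    body = merged.element.body
    # Clear the default body but keep its section properties (page setup)
    for element in list(body):
        if element.tag != qn('w:sectPr'):
            body.remove(element)
    sect_pr = body.find(qn('w:sectPr'))
    for i, filepath in enumerate(file_list):
        sub_doc = Document(str(filepath))
        for element in sub_doc.element.body:
            # Skip each source's sectPr; it belongs only at the end of a body
            if element.tag == qn('w:sectPr'):
                continue
            # Copy instead of moving, so the source tree is not changed mid-iteration
            new_el = deepcopy(element)
            if sect_pr is not None:
                sect_pr.addprevious(new_el)
            else:
                body.append(new_el)
        # Page break between documents (not after last)
        if i < len(file_list) - 1:
            merged.add_page_break()
    merged.save(output_path)
    print(f'Merged {len(file_list)} files -> {output_path}')
doc_files = sorted(Path('output/notices').glob('*.docx'))
merge_documents(doc_files, 'all_notices_merged.docx')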
Search & Replace
The Run-Split Problem
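The most common pitfall in Word automation: when you type {{name}} into a Word document, Word may internally store it across three separate Runs: {{, name, and }}. A naive loop that calls run.text.replace() on each Run never sees the complete placeholder, so the replacement silently does nothing. The joined string is available as paragraph.text, but calling .replace() on it only returns a new Python string, and assigning the result back to paragraph.text collapses the paragraph into a single run and discards run-level formatting.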
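The failure mode can be reproduced without Word at all. In this sketch, plain strings stand in for the run.text values of one paragraph; the split shown is an assumed example of how Word might store the placeholder.

```python
# Stand-ins for the run.text values of one paragraph, as Word might store them
run_texts = ['Dear ', '{{', 'name', '}}', ', welcome.']
placeholder = '{{name}}'

# Per-run replacement: no single run contains the whole placeholder
per_run = [text.replace(placeholder, 'John Smith') for text in run_texts]
print(''.join(per_run))

# Joined replacement: the placeholder is visible once the runs are concatenated
joined = ''.join(run_texts).replace(placeholder, 'John Smith')
print(joined)
```

The first result still contains the placeholder; only the joined version is filled in, which is why the replacement has to work on the concatenated text.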
safe_replace.py — Cross-run safe replacement
from docx import Document
def replace_in_paragraph(paragraph, old_text, new_text):
    """Replace text across runs safely: join, replace, rewrite."""
    full_text = ''.join(run.text for run in paragraph.runs)
    if old_text not in full_text:
        return
    new_full = full_text.replace(old_text, new_text)
    if paragraph.runs:
        # The whole paragraph text moves into the first run and takes its formatting
        paragraph.runs[0].text = new_full
        for run in paragraph.runs[1:]:
            run.text = ''
def replace_in_document(doc, replacements: dict):
    # Body paragraphs
    for paragraph in doc.paragraphs:
        for old, new in replacements.items():
            replace_in_paragraph(paragraph, old, new)
    # Table cells
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    for old, new in replacements.items():
                        replace_in_paragraph(paragraph, old, new)
    # Headers and footers
    for section in doc.sections:
        for part in (section.header, section.footer):
            for paragraph in part.paragraphs:
                for old, new in replacements.items():
                    replace_in_paragraph(paragraph, old, new)
doc = Document('contract_template.docx')
replace_in_document(doc, {
    '{{company}}': 'Acme Corporation',
    '{{client}}': 'GlobalTrade Ltd',
    '{{amount}}': '$128,000',
    '{{start_date}}': 'January 1, 2025',
})
doc.save('contract_filled.docx')
Prefer docxtpl for new projects. The replace_in_paragraph function above misses text inside text boxes, applies the first run's formatting to the whole paragraph, and has other edge cases. docxtpl handles these cases internally. Write your own replacement logic only when you cannot modify the template file's format.
Word to PDF Conversion
Windows: win32com (Best Fidelity)
word_to_pdf_windows.py
import win32com.client
from pathlib import Path
def word_to_pdf_windows(docx_path, pdf_path=None):
    docx_path = Path(docx_path).resolve()
    pdf_path = Path(pdf_path or docx_path.with_suffix('.pdf')).resolve()
    word = win32com.client.Dispatch('Word.Application')
    word.Visible = False
    try:
        doc = word.Documents.Open(str(docx_path))
        doc.SaveAs(str(pdf_path), FileFormat=17)  # 17 = wdFormatPDF
        doc.Close()
        print(f'Converted: {pdf_path.name}')
    finally:
        word.Quit()
for docx_file in Path('output/notices').glob('*.docx'):
    word_to_pdf_windows(docx_file)
Mac / Linux: LibreOffice CLI
word_to_pdf_libreoffice.py
import shutil
import subprocess
from pathlib import Path
def word_to_pdf_libreoffice(docx_path, output_dir=None):
    """
    Requires LibreOffice installed.
    Mac: brew install --cask libreoffice
    Ubuntu: sudo apt install libreoffice
    """
    docx_path = Path(docx_path).resolve()
    output_dir = Path(output_dir or docx_path.parent)
    # The launcher is often named 'libreoffice' on Linux and 'soffice' elsewhere
    binary = shutil.which('libreoffice') or shutil.which('soffice')
    if binary is None:
        raise FileNotFoundError('LibreOffice executable not found on PATH')
    cmd = [binary, '--headless',
           '--convert-to', 'pdf',
           '--outdir', str(output_dir),
           str(docx_path)]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        print(f'Converted: {docx_path.stem}.pdf')
    else:
        print(f'Failed: {result.stderr}')
Cross-Platform: docx2pdf
word_to_pdf_cross.py
# pip install docx2pdf
from docx2pdf import convert
# Single file
convert('contract_filled.docx', 'contract_filled.pdf')
# Entire directory
convert('output/notices/') # converts all .docx files to .pdf
docx2pdf dependencies: On Windows it uses Word COM (requires Microsoft Word). On macOS it drives Word for Mac. Linux is not supported, because the library depends on Microsoft Word; use the LibreOffice CLI approach from the previous section there. Without Word installed, conversion raises an error.
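One way to tie the three approaches together is a small dispatcher that picks a conversion route from the current platform. This is only a sketch: pick_converter is a hypothetical helper, it returns a label and does not perform the conversion, and it assumes Microsoft Word is installed on Windows and macOS.

```python
import sys

def pick_converter(platform=None):
    """Return the name of the conversion approach suited to the platform."""
    platform = platform or sys.platform
    if platform == 'win32':
        return 'win32com'      # drives Microsoft Word through COM
    if platform == 'darwin':
        return 'docx2pdf'      # drives Word for Mac
    return 'libreoffice'       # headless LibreOffice CLI

print(pick_converter())
```

A real script would map each label to the matching function from this section, such as word_to_pdf_windows or word_to_pdf_libreoffice.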
Project: Contract Batch Generation System
Generate customized contracts for 50 clients from an Excel list, archive by client name, and produce PDF copies alongside each Word file.
contract_batch_system.py — Complete system (~80 lines)
"""
Contract Batch Generation System
Requirements: pip install python-docx docxtpl pandas openpyxl docx2pdf
"""
import pandas as pd
from docxtpl import DocxTemplate
from pathlib import Path
from datetime import date
import logging
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
logger = logging.getLogger(__name__)
TEMPLATE_PATH = 'templates/contract_template.docx'
DATA_PATH = 'data/clients.xlsx'
OUTPUT_ROOT = Path('output/contracts')
CONVERT_PDF = True
def load_client_data(path: str) -> pd.DataFrame:
    df = pd.read_excel(path)
    required = ['Client Name', 'Contact', 'Amount', 'Term', 'Sign Date']
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise ValueError(f'Excel missing columns: {missing}')
    logger.info(f'Loaded {len(df)} client records')
    return df
def generate_contract(tpl: DocxTemplate, row: pd.Series, output_dir: Path):
    client_name = row['Client Name']
    client_dir = output_dir / client_name
    client_dir.mkdir(parents=True, exist_ok=True)
    context = {
        'client_name': client_name,
        'contact': row['Contact'],
        'amount': f"${row['Amount']:,.2f}",
        'term': row['Term'],
        'sign_date': str(row['Sign Date'])[:10],
        'today': date.today().strftime('%B %d, %Y'),
        # row.name is the DataFrame index label; this assumes the default 0-based index
        'contract_no': f"CT-{date.today().year}-{row.name+1:04d}",
    }
    tpl.render(context)
    docx_path = client_dir / f"{client_name}_Contract.docx"
    tpl.save(docx_path)
    if CONVERT_PDF:
        try:
            from docx2pdf import convert
            convert(str(docx_path), str(client_dir / f"{client_name}_Contract.pdf"))
            logger.info(f' [OK] {client_name}: Word + PDF')
        except Exception as e:
            logger.warning(f' [PDF failed] {client_name}: {e}')
    else:
        logger.info(f' [OK] {client_name}: Word')
def main():
    OUTPUT_ROOT.mkdir(parents=True, exist_ok=True)
    df = load_client_data(DATA_PATH)
    errors = []
    for _, row in df.iterrows():
        try:
            # Load a fresh template per contract; some docxtpl versions do not
            # support rendering the same DocxTemplate object more than once
            tpl = DocxTemplate(TEMPLATE_PATH)
            generate_contract(tpl, row, OUTPUT_ROOT)
        except Exception as e:
            errors.append((row.get('Client Name', 'Unknown'), str(e)))
            logger.error(f' [Error] {row.get("Client Name")}: {e}')
    success = len(df) - len(errors)
    logger.info(f'\nDone: {success} succeeded, {len(errors)} failed')
    logger.info(f'Output: {OUTPUT_ROOT.resolve()}')
if __name__ == '__main__':
    main()
System highlights:
- Template and data are fully decoupled, so non-technical staff can update the Word template independently
- Per-client directory structure makes archiving and retrieval easy
- Auto-generated contract numbers (year + sequence)
- Robust error handling: one client failure does not stop the rest
- PDF conversion failure is logged as a warning, not a crash
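The contract number in the system above is derived from row.name + 1, which only yields 1, 2, 3, ... while the DataFrame keeps its default 0-based index; filtering or sorting the data first would change the numbers. A sketch of an index-independent alternative follows. The helper make_contract_no is hypothetical, and the client names are sample values only.

```python
from datetime import date

def make_contract_no(sequence, year=None):
    """Build a contract number such as CT-2025-0001 from a 1-based sequence."""
    if sequence < 1:
        raise ValueError('sequence must be 1 or greater')
    year = year or date.today().year
    # Zero-pad the sequence to four digits, matching the CT-YYYY-NNNN pattern
    return f'CT-{year}-{sequence:04d}'

# enumerate(..., start=1) gives a sequence that does not depend on the DataFrame index
clients = ['Acme Corporation', 'GlobalTrade Ltd']
numbers = {name: make_contract_no(i, year=2025)
           for i, name in enumerate(clients, start=1)}
print(numbers)
```

In main(), the same idea would mean looping with enumerate(df.iterrows(), start=1) and passing the counter into generate_contract.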