PDFInfo vs. Other Tools: Which Extracts PDF Data Best?

How to Use PDFInfo to Read PDF Metadata Fast

PDFs often hide useful metadata (title, author, creation date, page count, PDF version, and more) that can speed up document management, search, and automation tasks. PDFInfo is a lightweight command-line utility (part of the Xpdf and Poppler toolsets) that quickly extracts this metadata without opening the full file in a graphical reader. This article shows what PDFInfo returns, how to install and use it, and practical tips for reading metadata quickly across many files.


What PDFInfo shows

PDFInfo prints common PDF metadata and document-level attributes. Typical output fields:

  • Title — document title embedded by the authoring tool
  • Author — creator name
  • Subject — document subject or summary
  • Keywords — indexed keywords (if present)
  • Creator — application that created the PDF (e.g., “LibreOffice”)
  • Producer — library or app that produced the PDF (e.g., “GPL Ghostscript”)
  • CreationDate — when the PDF was created
  • ModDate — last modification date
  • Tagged — whether the PDF contains accessibility tags
  • Pages — number of pages
  • Encrypted — whether the PDF is encrypted/has restrictions
  • Page size — dimensions of the first page (width x height)
  • File size — bytes on disk
  • PDF version — major.minor PDF spec version (like 1.4, 1.7)

Not all PDFs contain every field; some metadata is optional or stripped by processing.
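
Because any field may be absent, code that consumes these values should treat each one as optional. The following is a minimal Python sketch, assuming the output has already been parsed into a dictionary of strings and that the page size looks like "612 x 792 pts (letter)". The helper names parse_pages and parse_page_size are illustrative and not part of pdfinfo.

```python
import re
from typing import Optional, Tuple


def parse_pages(info: dict) -> Optional[int]:
    """Return the page count as an int, or None if the field is absent."""
    value = info.get("Pages")
    return int(value) if value and value.isdigit() else None


def parse_page_size(info: dict) -> Optional[Tuple[float, float]]:
    """Return (width, height) in points from a value like '612 x 792 pts (letter)'."""
    match = re.match(r"([\d.]+) x ([\d.]+) pts", info.get("Page size", ""))
    return (float(match.group(1)), float(match.group(2))) if match else None


# Hypothetical parsed output; Title and Author are deliberately missing.
info = {"Pages": "12", "Page size": "612 x 792 pts (letter)"}
print(parse_pages(info), parse_page_size(info), info.get("Title"))
```

Returning None for a missing field keeps "not present" distinct from an empty string or a zero, which matters when the values feed a search index.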


Installing PDFInfo

PDFInfo is included in Xpdf and Poppler packages. Install using your platform package manager.

  • On Debian/Ubuntu:

    sudo apt update
    sudo apt install poppler-utils
  • On Fedora:

    sudo dnf install poppler-utils 
  • On macOS (Homebrew):

    brew install poppler 
  • On Windows: download Poppler or Xpdf binaries and add the folder to PATH, or use a package manager like Chocolatey:

    choco install poppler 

After installation, confirm availability:

pdfinfo -v 

Basic usage

Run pdfinfo followed by the PDF filename:

pdfinfo example.pdf 

Sample output:

Title:          Sample Document
Author:         Jane Doe
Producer:       LibreOffice 7.3
CreationDate:   Tue Jan 16 10:24:00 2024
ModDate:        Tue Jan 16 10:30:00 2024
Tagged:         no
Pages:          12
Encrypted:      no
File size:      24896 bytes
PDF version:    1.4

Fast extraction for specific fields

To read only a few fields without parsing full output, use command-line text tools. Examples assume a Unix-like shell.

  • Extract the page count:

    pdfinfo file.pdf | grep ^Pages: | awk '{print $2}' 
  • Get the title:

    pdfinfo file.pdf | sed -n 's/^Title:[[:space:]]*//p' 
  • Get creation date:

    pdfinfo file.pdf | grep ^CreationDate: | cut -d: -f2- 

For more concise output, you can use awk to map fields:

pdfinfo file.pdf | awk -F': ' '/^(Title|Author|Pages|CreationDate|ModDate)/{print $1": "$2}' 

Batch processing many PDFs

Process directories quickly by looping or using parallel tools.

  • Simple loop (Bash):

    for f in /path/to/pdfs/*.pdf; do
      echo "File: $(basename "$f")"
      pdfinfo "$f" | grep -E '^(Title|Author|Pages):'
    done
  • Produce CSV with filename, pages, title:

    echo "filename,pages,title" > pdf_metadata.csv
    for f in *.pdf; do
      # Default awk field splitting drops the padding after "Pages:"
      pages=$(pdfinfo "$f" | awk '/^Pages:/ {print $2}')
      title=$(pdfinfo "$f" | sed -n 's/^Title:[[:space:]]*//p' | tr ',' ' ')
      echo "$(basename "$f"),$pages,\"$title\"" >> pdf_metadata.csv
    done
  • Faster parallel processing with GNU parallel:

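    pdf_row() {
      pdfinfo "$1" | awk -v f="$1" '/^Pages:/ {p=$2} /^Title:/ {sub(/^Title:[[:space:]]*/, ""); t=$0} END {printf "%s,%s,\"%s\"\n", f, p, t}'
    }
    export -f pdf_row   # Bash only: makes the function visible to parallel
    parallel -j+0 pdf_row ::: *.pdf >> pdf_metadata.csv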
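
The same CSV index can also be built from Python, where the csv module takes care of quoting titles that contain commas or quotation marks. This is a sketch rather than a drop-in tool: run_pdfinfo, to_row, and write_index are hypothetical names, and the runner parameter exists only so the pdfinfo call can be swapped out when testing.

```python
import csv
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_pdfinfo(path):
    """Return pdfinfo's text output for one file (empty string on failure)."""
    result = subprocess.run(["pdfinfo", str(path)], capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else ""


def to_row(path, text):
    """Build a [filename, pages, title] row from pdfinfo output."""
    fields = dict(
        (key.strip(), value.strip())
        for key, _, value in (line.partition(":") for line in text.splitlines())
    )
    return [Path(path).name, fields.get("Pages", ""), fields.get("Title", "")]


def write_index(paths, csv_path, runner=run_pdfinfo, workers=4):
    """Run pdfinfo on each path concurrently and write one CSV row per file."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = list(pool.map(runner, paths))  # result order matches paths
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["filename", "pages", "title"])
        writer.writerows(to_row(p, t) for p, t in zip(paths, outputs))


# Example call (assumes pdfinfo is on PATH and the directory holds PDFs):
# write_index(sorted(Path(".").glob("*.pdf")), "pdf_metadata.csv")
```

Threads are enough here because each worker spends its time waiting on an external pdfinfo process rather than running Python code.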

Handling encrypted or corrupted PDFs

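  • If a PDF is encrypted but opens without a password, pdfinfo reports “Encrypted: yes” along with the permission flags. If a password is required to open the file, pdfinfo exits with an error instead of prompting. Supply the user password with -upw (or the owner password with -opw):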
    
    pdfinfo -upw PASSWORD file.pdf 
  • Corrupted or malformed PDFs might cause pdfinfo to fail; run qpdf or other repair tools first:
    
    qpdf --linearize broken.pdf fixed.pdf 
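
In a batch job it helps to sort files into buckets before deciding what to repair or review. The sketch below classifies one pdfinfo run from its exit code and output. The function names classify and check_pdf are hypothetical, and the check for the word "password" in the error text is an assumption about pdfinfo's message wording that may differ between versions.

```python
import subprocess


def classify(returncode, stdout, stderr):
    """Sort one pdfinfo run into 'ok', 'encrypted', 'needs_password', or 'failed'."""
    if returncode != 0:
        # Assumption: the error text mentions "password" when one is required.
        return "needs_password" if "password" in stderr.lower() else "failed"
    for line in stdout.splitlines():
        if line.startswith("Encrypted:"):
            value = line.split(":", 1)[1].strip()
            return "encrypted" if value.startswith("yes") else "ok"
    return "ok"


def check_pdf(path):
    """Run pdfinfo on path and classify the result."""
    try:
        result = subprocess.run(["pdfinfo", path], capture_output=True, text=True)
    except FileNotFoundError:
        return "failed"  # pdfinfo itself is not installed or not on PATH
    return classify(result.returncode, result.stdout, result.stderr)
```

Files that land in "failed" are the candidates for a qpdf repair pass, while "needs_password" files need a password before any tool can read them.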

Using PDFInfo programmatically

From scripts or applications, call pdfinfo and parse stdout. In Python:

import subprocess

def get_pdfinfo(path):
    out = subprocess.check_output(['pdfinfo', path], text=True)
    info = {}
    for line in out.splitlines():
        if ':' in line:
            k, v = line.split(':', 1)
            info[k.strip()] = v.strip()
    return info

meta = get_pdfinfo('example.pdf')
print(meta.get('Pages'), meta.get('Title'))

For large-scale systems, consider using a native PDF library (PyPDF2, pdfminer.six, pikepdf) if you need deeper or more structured metadata access.
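
The parsed values are plain strings, so dates need one more step before they can be compared or sorted. Here is a minimal sketch that assumes the date layout shown in the sample output earlier ("Tue Jan 16 10:24:00 2024") and English day and month names. Some builds append a time-zone name, which this version simply ignores. parse_pdf_date is a hypothetical helper name.

```python
from datetime import datetime
from typing import Optional

DATE_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Tue Jan 16 10:24:00 2024"


def parse_pdf_date(value: Optional[str]) -> Optional[datetime]:
    """Parse a pdfinfo date string; return None if it is missing or unrecognized."""
    if not value:
        return None
    # Keep the first five tokens so a trailing time-zone name (if any) is ignored.
    candidate = " ".join(value.split()[:5])
    try:
        return datetime.strptime(candidate, DATE_FORMAT)
    except ValueError:
        return None


# Hypothetical metadata, shaped like the dictionary a pdfinfo parser returns.
meta = {"CreationDate": "Tue Jan 16 10:24:00 2024", "ModDate": "Tue Jan 16 10:30:00 2024"}
created = parse_pdf_date(meta.get("CreationDate"))
modified = parse_pdf_date(meta.get("ModDate"))
print(modified - created)  # time between creation and last modification
```

The result is a naive datetime with no time-zone information, so comparisons are only reliable between files processed on the same machine.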


Common pitfalls and tips

  • PDFInfo reads metadata embedded in the PDF; it won’t infer metadata from content. Use OCR or text-extraction tools if metadata is missing.
  • Titles and authors may contain non-UTF-8 encodings; handle byte decoding carefully in scripts.
  • Some PDFs have multiple metadata sources (Info dictionary, XMP). pdfinfo reports the Info dictionary fields by default; XMP may hold richer info, and the -meta option prints the raw XMP packet when one is present.
  • Always quote filenames in shell scripts ("$f", not $f) so that names containing spaces or wildcard characters reach pdfinfo as a single argument.
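
Two of these pitfalls, byte decoding and filename quoting, can be handled together when calling pdfinfo from Python. The sketch below assumes that UTF-8 is the expected encoding and that Latin-1 is an acceptable fallback for stray bytes. Both are choices made for illustration, and decode_output and read_pdfinfo are hypothetical names.

```python
import subprocess


def decode_output(raw: bytes) -> str:
    """Decode pdfinfo output as UTF-8, falling back to Latin-1 for stray bytes."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never raises.
        return raw.decode("latin-1")


def read_pdfinfo(path: str) -> str:
    """Run pdfinfo without text mode so decoding stays under our control."""
    # Passing the arguments as a list bypasses the shell, so no quoting is needed.
    raw = subprocess.check_output(["pdfinfo", path])
    return decode_output(raw)
```

The fallback can produce the wrong characters for text in other legacy encodings, but it never crashes a batch run on a single odd title.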

Alternatives and when to use them

  • GUI tools: Adobe Reader, Okular — convenient for individual files.
  • Libraries: PyPDF2, pikepdf, pdfminer.six — better for programmatic manipulation and advanced metadata (XMP).
  • Other CLI: exiftool can read and write PDF metadata and XMP fields; it’s useful when pdfinfo misses XMP data.

Comparison table:

Tool               Best for                  Notes
pdfinfo            Quick CLI metadata read   Very fast, part of poppler/xpdf
exiftool           Deep metadata/XMP         Reads/writes many metadata fields
PyPDF2 / pikepdf   Programmatic access       Modify metadata, extract text/pages
GUI readers        Manual inspection         User-friendly, one-off checks

Example workflows

  • Quick check before archiving:

    1. Run pdfinfo to confirm page count and creation date.
    2. Add missing Title/Author using exiftool or a library.
    3. Store CSV index for search.
  • Bulk inventory:

    1. Use parallel + pdfinfo to extract pages, titles, authors.
    2. Deduplicate by file size + page count.
    3. Flag encrypted or missing-title files for manual review.
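
The deduplication step in the bulk inventory can be sketched as a simple grouping. This example assumes each inventory row is a dictionary with filename, size, and pages keys, a hypothetical shape chosen for illustration. Matching size and page count only suggests a duplicate, so a content hash would be needed to confirm it.

```python
from collections import defaultdict


def find_duplicate_candidates(records):
    """Group records that share the same (file size, page count) key.

    Each record is a dict with 'filename', 'size', and 'pages' keys
    (a hypothetical shape for rows of the inventory).
    """
    groups = defaultdict(list)
    for record in records:
        groups[(record["size"], record["pages"])].append(record["filename"])
    # Only keys shared by two or more files are possible duplicates.
    return [names for names in groups.values() if len(names) > 1]


inventory = [
    {"filename": "a.pdf", "size": 24896, "pages": 12},
    {"filename": "b.pdf", "size": 24896, "pages": 12},
    {"filename": "c.pdf", "size": 51200, "pages": 12},
]
print(find_duplicate_candidates(inventory))
```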

Conclusion

PDFInfo is a fast, dependable command-line tool for extracting basic PDF metadata. Use it directly for quick checks or integrate it into scripts for batch processing. For richer metadata or modification needs, complement pdfinfo with exiftool or a PDF library.
