Workflow template
Extract PDF Tables and Figures Into a Spreadsheet
Pull numbers from invoices, statements, and reports in your folder into a clean, column-mapped spreadsheet you can actually work with.
PDFs are designed for reading, not for working with the data inside them. A bank statement in PDF form looks perfectly presentable and is useless for analysis until you get the numbers out. Doing that by hand, column by column, is slow and error-prone. Cowork can read through a folder of PDFs and pull the tables into a spreadsheet in a few minutes. You still need to check that the numbers came through correctly.
What Belongs in the Folder
Gather the PDFs you want to extract from. Common cases:
- Vendor invoices (line items, quantities, prices, totals)
- Bank or credit card statements (transaction lists, running balances)
- Supplier price lists
- Annual reports or financial statements with tables of figures
- Any internal report that was exported to PDF and now needs the numbers back out
One folder per extraction job works well. If you are doing monthly vendor invoices, one folder per month keeps the resulting spreadsheets organized. If you are doing a one-off extraction from several different report types, keep them together and rely on the per-sheet naming to stay organized.
The Prompt
With Cowork pointed at the folder:
Read all the PDF files in this folder. For each PDF, extract every table or structured set of numbers you find. Map the data into a spreadsheet called extracted-data.xlsx. Create one sheet per PDF, named after the source file. Within each sheet, preserve the original column headers as closely as possible. If a PDF has multiple distinct tables, stack them with a blank row and a label (e.g., 'Table 2: Operating Expenses') between them. In a separate sheet called Summary, list each source file, how many tables were found, the date on the document if present, and any values or rows you flagged as unclear or potentially misread. Do not round or modify numbers. If a number is ambiguous, put it in the Summary flags column rather than guessing.
“Do not round or modify numbers” is a load-bearing instruction. Without it, Cowork may normalize number formats in ways that change what the source says: for instance, turning $1,234.50 into 1234.5 (dropping the currency symbol and the trailing zero) or rounding a percentage from 12.37% to 12.4% (losing precision). The instruction keeps the extracted values faithful to the source.
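If you later process the extracted values in a script, the same concern applies on your side. The sketch below is one possible way to parse amounts without losing digits, assuming the cells arrive as text; the helper name `parse_amount` is hypothetical and not part of Cowork.

```python
from decimal import Decimal


def parse_amount(raw: str) -> Decimal:
    """Parse a currency or percent string without going through float.

    Hypothetical helper: strips "$", thousands separators, and a trailing "%".
    """
    cleaned = raw.strip().replace("$", "").replace(",", "").rstrip("%")
    return Decimal(cleaned)


amount = parse_amount("$1,234.50")
print(str(amount))             # 1234.50 (Decimal keeps the trailing zero)
print(str(float("1234.50")))   # 1234.5  (float drops it)
```

Keeping the raw string next to the parsed value is a cheap way to make later comparisons against the PDF easier.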
How Cowork Handles Multi-Page PDFs
Cowork reads a PDF as a whole document, not page by page. This means a table that starts on page 4 and continues on page 5 will typically be read as one continuous table. The same applies to headers that repeat at the top of each page in a long statement. Cowork usually recognizes repeated headers as formatting rather than separate data rows, but this is worth checking. If you see a set of column headers appearing as a data row in the middle of a sheet, that is where a page boundary tripped up the extraction.
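The repeated-header check described above can be automated once a sheet is loaded as a list of rows. This is a minimal sketch under that assumption; `find_repeated_headers` is a hypothetical name and the sample rows are made up for illustration.

```python
def find_repeated_headers(rows):
    """Return 1-based row numbers of data rows that repeat the header row."""
    if not rows:
        return []

    def norm(row):
        # Compare case-insensitively and ignore surrounding whitespace.
        return [str(cell).strip().lower() for cell in row]

    header = norm(rows[0])
    return [i for i, row in enumerate(rows[1:], start=2) if norm(row) == header]


# Made-up sample data: the header reappears where a page boundary fell.
sheet_rows = [
    ["Date", "Description", "Amount"],
    ["2024-01-03", "Wire transfer", "1,200.00"],
    ["Date", "Description", "Amount"],
    ["2024-01-04", "Card payment", "-45.10"],
]
print(find_repeated_headers(sheet_rows))  # [3]
```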
For very long PDFs (reports that run forty or fifty pages), the extraction is generally reliable but slow. If you are running several of these at once, give Cowork a few extra minutes before checking the output.
The Summary Sheet
The Summary sheet is the first thing to look at after the extraction finishes. It lists every source file, the number of tables found, the document date, and any flags. A flag might say:
- “Row 14, column ‘Amount’: value appeared as ‘$1O,234’ (possible OCR confusion between 0 and O)”
- “Page 7 table header not clearly identifiable, used generic column names Col1, Col2, Col3”
- “Document date not found”
Any flagged row needs manual verification. Open the original PDF and find the flagged value. If the PDF is text-based (you can select text in it), the extraction is probably right and the flag is just Cowork being cautious. If the PDF is a scan, the flag may indicate a genuine misread.
Spot-Checking the Numbers
The Summary sheet tells you where Cowork was uncertain. Beyond that, pick a sample of rows to check even when there are no flags.
A practical approach: for each source PDF, verify the column totals. If the original statement has a total at the bottom, compare it to a SUM formula applied to the extracted column. If they match to the cent, the extraction for that column is almost certainly clean. If they do not match, work through the column looking for discrepancies. Common causes: a row was skipped, a negative number was extracted as positive, or a subtotal row was included in a column that should only contain line items.
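The same total check can be done outside the spreadsheet. The following is a rough sketch, assuming the column has been read as text cells and that parenthesized values are accounting-style negatives; the function names and sample figures are illustrative only.

```python
from decimal import Decimal


def to_decimal(raw):
    """Convert a text cell such as "$1,200.00" or "(45.10)" to Decimal."""
    cleaned = raw.strip().replace("$", "").replace(",", "")
    # Assumption: "(45.10)" means -45.10, as on many statements.
    if cleaned.startswith("(") and cleaned.endswith(")"):
        cleaned = "-" + cleaned[1:-1]
    return Decimal(cleaned)


def check_total(cells, stated_total):
    """Return (computed sum, computed sum minus the total stated in the PDF)."""
    computed = sum((to_decimal(c) for c in cells), Decimal("0"))
    return computed, computed - to_decimal(stated_total)


# Made-up column where the extraction is correct.
computed, difference = check_total(["1,200.00", "(45.10)", "310.25"], "1,465.15")
print(computed, difference)  # 1465.15 0.00

# Same column with the negative extracted as positive.
_, bad_difference = check_total(["1,200.00", "45.10", "310.25"], "1,465.15")
print(bad_difference)  # 90.20
```

The size of the difference is a useful clue: a difference equal to one row's value suggests a skipped or extra row, while a difference of exactly twice a row's value (90.20 for the 45.10 row here) suggests a flipped sign.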
For invoices specifically, also check the line-item count. Count the rows in the extracted sheet and compare that to the number of line items on the original PDF. A mismatch usually means a row was dropped or duplicated, which the total check might not catch if the dropped row had a zero amount.
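A small sketch of the line-item count check, assuming you have counted the line items on the PDF yourself and loaded the extracted rows (without the header) as lists. The function name and sample rows are hypothetical.

```python
from collections import Counter


def check_line_items(rows, expected_count):
    """Return (row count minus expected count, list of rows that appear more than once)."""
    counts = Counter(tuple(row) for row in rows)
    duplicates = [list(row) for row, n in counts.items() if n > 1]
    return len(rows) - expected_count, duplicates


# Made-up invoice rows: the PDF shows 2 line items, the sheet has 3.
rows = [
    ["Widget A", "2", "10.00"],
    ["Widget B", "1", "0.00"],
    ["Widget A", "2", "10.00"],
]
print(check_line_items(rows, expected_count=2))
# (1, [['Widget A', '2', '10.00']])
```

An invoice can legitimately list the same item twice, so treat a reported duplicate as a prompt to look at the PDF rather than as proof of an error.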
Column Mapping Decisions
The prompt asks Cowork to preserve original column headers. In practice this means the headers in your spreadsheet will reflect whatever the PDF used. If you are combining data across multiple PDFs and the headers differ (one invoice says “Unit Price,” another says “Price Each”), the sheets will have inconsistent column names.
To normalize this, add a line to the prompt: “Use these standard column names where applicable: [your list]. Map each PDF’s columns to the nearest equivalent from this list.” This produces more consistent output at the cost of a judgment call by Cowork about which columns map to which. Review the Summary sheet to see how it handled ambiguous mappings.
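If you would rather audit the mapping than rely entirely on Cowork's judgment, a lookup table makes the decisions explicit. This is only a sketch: the `STANDARD_NAMES` entries are examples to replace with your own list, and `map_headers` is a hypothetical helper.

```python
# Example mapping from lowercased source headers to your standard names.
STANDARD_NAMES = {
    "unit price": "Unit Price",
    "price each": "Unit Price",
    "qty": "Quantity",
    "quantity": "Quantity",
}


def map_headers(headers):
    """Return (mapped headers, headers that had no standard equivalent)."""
    mapped, unmapped = [], []
    for header in headers:
        # Lowercase and collapse runs of whitespace before the lookup.
        key = " ".join(header.lower().split())
        if key in STANDARD_NAMES:
            mapped.append(STANDARD_NAMES[key])
        else:
            mapped.append(header)   # keep the original name
            unmapped.append(header)
    return mapped, unmapped


print(map_headers(["Item", "Qty", "Price  Each"]))
# (['Item', 'Quantity', 'Unit Price'], ['Item'])
```

The second list shows which headers fell outside your standard names, which is where ambiguous mappings tend to hide.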
When the PDF Is a Scan
A scanned PDF is an image of a document rather than text you can select. Cowork reads it visually, which is impressive but imperfect. The main failure modes are:
- Characters that look alike in print: 0 and O, 1 and l, 5 and S. Cowork flags these when it is uncertain, but not always.
- Skewed or dark scans where characters bleed together. A cell that reads “12,345” in a clean PDF might read “12,3$5” in a poor scan.
- Tables with thin or missing gridlines, where Cowork has to infer column structure from spacing rather than visible borders.
For scanned PDFs, the spot-check step is not optional. Verify totals column by column, and be especially careful about any number large enough that a misread digit would matter.
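Some scan misreads leave visible traces, such as a letter or stray symbol inside a number. The sketch below is one possible screen for an amount column, assuming the cells are text; the pattern and the "mostly digits" threshold are arbitrary choices, and the function names are hypothetical. It cannot catch a digit misread as another digit, so it supplements the total check rather than replacing it.

```python
import re

# A well-formed amount: optional parenthesis, sign, and currency symbol,
# then digits with optional thousands separators, decimals, and percent sign.
CLEAN_NUMBER = re.compile(r"^\(?-?\$?\d[\d,]*(\.\d+)?%?\)?$")


def looks_numeric(cell):
    """True if at least two characters, and at least half of them, are digits."""
    digits = sum(ch.isdigit() for ch in cell)
    significant = len(cell.replace(",", "").replace(".", ""))
    return digits >= 2 and digits >= significant / 2


def suspicious_cells(cells):
    """Return cells that are mostly digits but not well-formed numbers."""
    stripped = (c.strip() for c in cells)
    return [c for c in stripped if looks_numeric(c) and not CLEAN_NUMBER.match(c)]


cells = ["12,345", "12,3$5", "$1O,234", "Office supplies", "5S.20"]
print(suspicious_cells(cells))  # ['12,3$5', '$1O,234', '5S.20']
```

Run it only on columns that should contain amounts; dates such as 2024-01-03 are mostly digits too and would be reported.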
Frequently asked questions
What kinds of PDFs work best?
Text-based PDFs (where you can select and copy text) extract cleanly. Scanned PDFs (images of paper documents) also work but are more error-prone, especially for small print or low-resolution scans. Always spot-check scanned output more carefully.
What if the PDF has no clear table structure, just paragraphs with numbers in them?
Add to the prompt: 'If there is no table structure, extract any numeric data mentioned in the text and list it in two columns: Description and Value.' Cowork will do its best, but the output will need more cleanup than a structured table.
Can Cowork handle tables that span multiple pages?
Usually yes. It reads the full PDF, not page by page, so a table that continues across two or three pages typically comes through intact. Very long tables (twenty-plus pages) occasionally have alignment issues worth checking.
What if two PDFs have the same table structure? Can they go in one sheet instead of two?
Change the prompt to say so: 'If multiple PDFs have the same column structure, combine them into one sheet with a column indicating the source file.' This is useful for monthly statements from the same vendor.
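If the per-PDF sheets already exist and you want to merge them afterwards, the same idea can be sketched in a few lines, assuming each sheet is loaded as a list of rows with the header first. `combine_sheets`, the "Source File" column name, and the sample file names are illustrative, not something Cowork produces.

```python
def combine_sheets(sheets):
    """Merge sheets with identical headers, adding a leading source-file column.

    sheets: dict mapping source file name -> list of rows (first row is the header).
    """
    combined = []
    shared_header = None
    for source, rows in sheets.items():
        if not rows:
            continue
        header, data = rows[0], rows[1:]
        if shared_header is None:
            shared_header = header
            combined.append(["Source File"] + header)
        elif header != shared_header:
            # Refuse to merge silently when the columns differ.
            raise ValueError(f"{source} has different columns: {header}")
        combined.extend([source] + row for row in data)
    return combined


# Made-up monthly statements from the same vendor.
sheets = {
    "jan.pdf": [["Date", "Amount"], ["2024-01-03", "1,200.00"]],
    "feb.pdf": [["Date", "Amount"], ["2024-02-05", "310.25"]],
}
for row in combine_sheets(sheets):
    print(row)
```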