Automated Email Invoice Processing

→ Introduction

A client of Maxedv, an IT consulting company, received large volumes of invoices by email and needed to automate the capture, extraction, and organization of these documents to improve operational efficiency.

The core idea behind the project was to:

Automatically download invoice attachments received via email using an IMAP connection.
Extract key invoice data reliably using AWS Textract OCR.
Organize files locally, renaming them according to a strict predefined convention: [Owner]_[DocumentType]_[Provider]_[InvoiceNumber].[Extension].

From the beginning, the focus was on reducing manual time expenditure, minimizing data entry errors, and improving overall operational efficiency and cost-effectiveness.

✕ Challenges

The main challenge was to eliminate a highly inefficient and error-prone manual email and invoice processing workflow, while navigating complex technical hurdles in data extraction and system reliability.

Process Inefficiencies:

Administrative employees manually opened emails, downloaded attachments, and entered data.
The manual process had a high error susceptibility ("Fehleranfälligkeit") when sorting and renaming files.
High time expenditure ("Zeitaufwand") directly affected costs and efficiency ("Kosten und Effizienz").

Technical Hurdles:

IMAP State Management: Tracking the exact inbox state so the script wouldn't process the same emails twice, even after a system reboot or crash. Tracking total email counts proved unreliable.
Data Extraction Consistency (The Regex Limitation): Initially, extracting the invoice number and vendor name relied on complex Regex and local Python libraries (like pdfplumber, EasyOCR, and Pytesseract). However, this approach struggled with varying invoice table formats, scanned images, and high CPU usage. Relying on the email sender's name for the vendor was also discarded as it required constant manual maintenance of a vendor list.
Fuzzy Name Matching: Finding the exact internal owner ("Eigentümer") name on the invoice was difficult due to slight typos or variations in the document text compared to the internal database.
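To make the last hurdle concrete: an exact string comparison rejects a name that differs by a single character, while a similarity score with a cutoff tolerates it. The sketch below is only an illustration. It uses the standard library's difflib as a stand-in for a dedicated fuzzy-matching library, and the owner names, the 85-point cutoff, and the function names are hypothetical, not taken from the project.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence

# Hypothetical owner records; the real internal database is not shown in the source.
OWNERS = ["Müller Immobilien GmbH", "Schmidt & Söhne KG", "Hansen Verwaltung"]


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score, ignoring case and surrounding whitespace."""
    return 100 * SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_owner(text: str, owners: Sequence[str], cutoff: float = 85.0) -> Optional[str]:
    """Return the best-scoring owner, or None if no candidate reaches the cutoff."""
    best = max(owners, key=lambda owner: similarity(text, owner), default=None)
    if best is None or similarity(text, best) < cutoff:
        return None
    return best


ocr_text = "Mueller Immobilien GmbH"         # spelling variation as it might appear on an invoice
print(ocr_text in OWNERS)                    # exact lookup fails on the variation
print(match_owner(ocr_text, OWNERS))         # thresholded lookup still finds the owner
print(match_owner("Acme Corp", OWNERS))      # unrelated text stays unmatched
```

The cutoff is the part that needs calibration: too low and unrelated names are accepted, too high and ordinary OCR noise is rejected.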

✓ Solution

To address these challenges, we designed and implemented a standalone Python application, compiled with PyInstaller, that fully automates the workflow and runs in 5-minute cycles.
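A cycle of this kind, together with the bounded retry described under the IMAP component, can be sketched as follows. This is a minimal illustration under assumptions: the function names, the 10-second pause between attempts, and the choice of which exceptions count as "temporary" are hypothetical and not taken from the project code.

```python
import imaplib
import time
from typing import Callable


def run_with_retry(
    task: Callable[[], None],
    attempts: int = 5,
    delay_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run task once, retrying on connection-type errors. Returns True on success."""
    for attempt in range(1, attempts + 1):
        try:
            task()
            return True
        except (OSError, imaplib.IMAP4.abort) as exc:
            # OSError covers socket failures and timeouts; IMAP4.abort signals a broken session.
            print(f"attempt {attempt}/{attempts} failed: {exc}")
            if attempt < attempts:
                sleep(delay_s)
    return False


def run_forever(task: Callable[[], None], interval_s: float = 300.0) -> None:
    """Run the task every interval_s seconds (300 s = 5 minutes); a failed cycle is skipped, not fatal."""
    while True:
        run_with_retry(task)
        time.sleep(interval_s)
```

Returning False instead of raising keeps the background process alive after a failed cycle, so the next cycle simply tries again.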

The solution is built around three core components:

🖥️ Reliable IMAP Connection & Tracking

This component acts as the automated interface to the email server.

UID Tracking: Instead of counting emails, the system tracks the highest processed UID. This ensures that even if the computer restarts, no emails are processed twice and no new emails are missed.
Resilience: Features an automatic retry mechanism (up to 5 times) to handle temporary internet drops, keeping the background process alive.
Non-Destructive: Emails are left in their original read/unread state after processing.
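The three properties above can be sketched with the standard library's imaplib. This is an illustration, not the project's code: the state file name, the function names, and the use of a read-only mailbox with BODY.PEEK are assumptions about one way to get this behavior. UIDs are only stable while the mailbox's UIDVALIDITY value stays the same, which a production version would also need to check.

```python
import imaplib
import json
from pathlib import Path
from typing import Iterator, List, Tuple

STATE_FILE = Path("imap_state.json")  # hypothetical location of the persisted state


def load_last_uid(path: Path = STATE_FILE) -> int:
    """Return the highest processed UID, or 0 if no valid state exists yet."""
    try:
        return int(json.loads(path.read_text())["last_uid"])
    except (FileNotFoundError, KeyError, TypeError, ValueError):
        return 0


def save_last_uid(uid: int, path: Path = STATE_FILE) -> None:
    """Write the state to a temporary file, then rename it over the old one."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"last_uid": uid}))
    tmp.replace(path)  # a crash mid-write leaves the previous state file intact


def parse_new_uids(search_data: bytes, last_uid: int) -> List[int]:
    """Keep only UIDs above last_uid.

    A range such as "43:*" always matches the newest message, even when its
    UID is below 43, so the server's answer has to be filtered again.
    """
    return sorted(uid for uid in map(int, search_data.split()) if uid > last_uid)


def fetch_new_messages(conn: imaplib.IMAP4, last_uid: int) -> Iterator[Tuple[int, bytes]]:
    """Yield (uid, raw_message) for every message newer than last_uid."""
    conn.select("INBOX", readonly=True)  # read-only: flags such as \Seen are not modified
    _, data = conn.uid("SEARCH", "UID", f"{last_uid + 1}:*")
    for uid in parse_new_uids(data[0] or b"", last_uid):
        _, msg_data = conn.uid("FETCH", str(uid), "(BODY.PEEK[])")  # PEEK does not mark as read
        yield uid, msg_data[0][1]
```

Saving the UID only after a message has been handled successfully means a crash repeats at most the message that was in progress.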

⚙️ Data Extraction Engine (AWS Textract & S3)

To overcome the limitations of local OCR and Regex, we migrated the extraction logic to AWS Textract (Analyze Expense).

S3 Integration: To bypass the 5 MB and 1-page limits of direct byte transfers to Textract, files are first uploaded asynchronously to an AWS S3 bucket. This allows multi-page documents to be processed accurately at a highly cost-effective rate (€10 per 1000 pages).
Structured JSON Parsing: Textract returns a fully structured JSON with summary fields, eliminating the need for complex, brittle Regex rules. We extract the exact INVOICE_RECEIPT_ID and vendor name robustly, regardless of whether it's a PDF or a scanned PNG.
Fuzzy Matching: We used rapidfuzz to match the extracted names against the internal owners (Eigentümer) stored in a local JSON database. A calibrated similarity threshold absorbs slight OCR errors and name variations.
Semantic Classification: Distinguishes between invoices (Rechnung, e.g., RG) and delivery notes (Lieferschein, e.g., LS).
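The S3 hand-off and the summary-field parsing can be sketched as below. It assumes boto3 and Textract's asynchronous StartExpenseAnalysis / GetExpenseAnalysis pair; the source does not state which call was used. The function names and polling interval are hypothetical, and paging through NextToken is left out for brevity.

```python
import time
from typing import Dict, Tuple


def analyze_expense_via_s3(local_path: str, bucket: str, key: str, poll_s: float = 2.0) -> dict:
    """Upload a file to S3, start an expense analysis job, and wait for its result."""
    import boto3  # third-party; imported here so the parser below works without AWS access

    boto3.client("s3").upload_file(local_path, bucket, key)
    textract = boto3.client("textract")
    job = textract.start_expense_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    while True:
        result = textract.get_expense_analysis(JobId=job["JobId"])
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_s)
    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job ended with status {result['JobStatus']}")
    return result  # only the first page of results; NextToken paging is omitted


def summary_fields(response: dict) -> Dict[str, str]:
    """Map each summary field type to its highest-confidence detected value."""
    best: Dict[str, Tuple[str, float]] = {}
    for document in response.get("ExpenseDocuments", []):
        for field in document.get("SummaryFields", []):
            field_type = field.get("Type", {}).get("Text")
            detection = field.get("ValueDetection", {})
            text = (detection.get("Text") or "").strip()
            confidence = detection.get("Confidence", 0.0)
            if field_type and text and confidence > best.get(field_type, ("", -1.0))[1]:
                best[field_type] = (text, confidence)
    return {field_type: value for field_type, (value, _) in best.items()}
```

With a response in hand, `summary_fields(response).get("INVOICE_RECEIPT_ID")` and `.get("VENDOR_NAME")` give the two values the renaming step needs, and a missing key signals a document that needs manual review.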

📊 Automated File Organization

Once the data is accurately extracted, the local file system is structured automatically:

Files are renamed to the standardized pattern [Owner(Verw Nr)]_[DocumentType(RG/LS)]_[VendorName]_[InvoiceNumber].[Extension].
Files are organized locally and then moved to designated "completed" folders (e.g., Re_erledig), ready for the accounting software.
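The renaming and filing step can be sketched as follows. The sanitizing rules (replacing characters that Windows does not allow in file names, keeping the underscore free as the field separator) and the numeric suffix for name collisions are assumptions made for this illustration. The function names and sample values are hypothetical.

```python
import re
import shutil
from pathlib import Path


def safe_part(text: str) -> str:
    """Make one name component safe for Windows file names."""
    cleaned = re.sub(r'[<>:"/\\|?*\s]+', "-", text.strip())
    cleaned = cleaned.replace("_", "-")  # "_" is reserved as the separator between fields
    return cleaned.strip("-.") or "unknown"


def build_name(owner: str, doc_type: str, vendor: str, invoice_no: str, extension: str) -> str:
    """Assemble Owner_DocumentType_Vendor_InvoiceNumber.extension."""
    parts = [safe_part(part) for part in (owner, doc_type, vendor, invoice_no)]
    return "_".join(parts) + "." + extension.lstrip(".").lower()


def file_document(src: Path, done_dir: Path, new_name: str) -> Path:
    """Move src into done_dir under new_name without overwriting an existing file."""
    done_dir.mkdir(parents=True, exist_ok=True)
    target = done_dir / new_name
    stem, suffix, counter = target.stem, target.suffix, 1
    while target.exists():
        target = done_dir / f"{stem}-{counter}{suffix}"
        counter += 1
    return Path(shutil.move(str(src), str(target)))


print(build_name("1042", "RG", "Beispiel GmbH", "RE/2024-0815", "PDF"))
# prints: 1042_RG_Beispiel-GmbH_RE-2024-0815.pdf
```

Sanitizing each part before joining keeps the underscore count constant, so downstream software can split the file name back into its four fields.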

★ Business Results

This project significantly modernized the technical infrastructure and created measurable business value with immediate ROI.

Results for the Client

The solution was seamlessly deployed in a Windows environment without complex configurations.

80% reduction in processing time (from 5 minutes to 1 minute per invoice).
~6.6 hours saved per week, directly improving productivity and cost efficiency.
Financial impact: Based on a standard administrative base cost of €30/hour in Hamburg, this translates to savings of €198 per week and ~€10,296 annually.
Drastic reduction in error rates, ensuring standardized file naming and highly accurate data association.
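The figures above can be re-derived from the stated inputs. The short calculation below is a check on the arithmetic, assuming 52 working weeks per year. The weekly invoice volume it prints is implied by the other numbers, not a figure reported by the client.

```python
MINUTES_BEFORE = 5           # manual processing time per invoice
MINUTES_AFTER = 1            # processing time per invoice after automation
HOURS_SAVED_PER_WEEK = 6.6   # as reported
HOURLY_COST_EUR = 30         # stated administrative base cost
WEEKS_PER_YEAR = 52          # assumption: no weeks deducted for holidays

minutes_saved_per_invoice = MINUTES_BEFORE - MINUTES_AFTER
reduction = minutes_saved_per_invoice / MINUTES_BEFORE
weekly_savings = HOURS_SAVED_PER_WEEK * HOURLY_COST_EUR
annual_savings = weekly_savings * WEEKS_PER_YEAR
implied_invoices_per_week = HOURS_SAVED_PER_WEEK * 60 / minutes_saved_per_invoice

print(f"time reduction per invoice: {reduction:.0%}")
print(f"weekly savings: EUR {weekly_savings:,.2f}")
print(f"annual savings: EUR {annual_savings:,.2f}")
print(f"implied volume: about {implied_invoices_per_week:.0f} invoices per week")
```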

Fully automated processes include:

Continuous inbox monitoring and attachment downloading via IMAP.
Reliable OCR data extraction and AI analysis via AWS Textract.
Intelligent directory organization and standardized file renaming.

Results for End Users / Employees

Administrative employees now benefit from a centralized and simplified workflow.

Zero-friction workflow: Everything runs autonomously in the background via a remote console.
Reduction of human error and fatigue by eliminating repetitive manual sorting and renaming.
Better focus: Employees can dedicate their recovered time to core competencies and productive tasks.

Users benefit from:

Near-real-time invoice processing that runs invisibly in the background (5-minute cycles).
Centralized and structured local folder organization.
High reliability and bilingual document recognition for mixed-language invoices.

↻ Project Timeline

From project kick-off to production deployment, development took approximately 80 working hours.

Phase 1 - Discovery & MVP

Initial attempts using local OCR (pdfplumber, Pytesseract) and complex Regex.

Phase 2 - Architecture Pivot (AWS)

Shifted to AWS Textract (Analyze Expense) for structured JSON data extraction.

Phase 3 - S3 Integration & Fuzzy Matching

Added asynchronous S3 uploading to bypass Textract's byte limits for multi-page documents, and rapidfuzz-based matching of extracted names to internal owners.

Phase 4 - Hardening & Launch

Implemented IMAP UID tracking, automated retry loops, and unit tests with pytest, then compiled the final standalone executable using PyInstaller for deployment in the end-user environment.

Ready to transform your product?

Let's discuss how we can achieve similar results for your business. Contact me at paniagua.ian.de@gmail.com or book a call below.