Mastering AWS Textract: AI-Powered Document Extraction

May 29, 2025
10 min read

Introduction


Amazon Textract is a fully managed machine learning service that automatically extracts printed or handwritten text and structured data from scanned documents and images. Built on Amazon’s proven, scalable deep-learning technology, Textract goes beyond simple optical character recognition (OCR) by identifying text, forms, tables, and even handwriting without any manual configuration. In practice, this means businesses can ingest PDFs, images, or scans of invoices, contracts, forms, IDs, and more, and have Textract return JSON data with all the detected text blocks and data elements. Textract supports both synchronous (single-page, low-latency) and asynchronous (multi-page, batch) processing modes, integrating with S3, SNS, SQS, and Lambda for scalable workflows.

Textract’s easy-to-use API abstracts away the complexity of vision models. Users simply call the Textract API on their document (stored in S3 or passed as bytes) and receive structured output. The service continuously learns from new data, and Amazon regularly adds features, so Textract’s capabilities keep expanding. In this deep-dive guide, we’ll explain what AWS Textract is, explore its core architecture and processing flow, show why and when to use it, detail its key features (forms, tables, queries, expense, ID, and more), cover pricing (including free tier and cost examples), flag limitations/gotchas, share best practices, and provide a quick-start tutorial. We’ll also touch on a related AWS service (Comprehend) and on alternatives from other clouds (Azure Document Intelligence, Google Document AI). By the end, you’ll have everything you need to leverage AWS Textract effectively in your document-processing workflows.

What Is AWS Textract?


AWS Textract is an AI-powered document text and data extraction service. In plain language, Textract "reads" documents and images and returns the text and data contained within them. It's a fully managed ML service - there's no infrastructure to set up or models to train. You simply supply a file and call the Textract API. The service can detect and extract typed text, handwriting, and data in various document types, including PDFs, scanned images, ID cards, forms, tables, and receipts, with a single API call. Textract automatically handles complex document layouts: it identifies lines of text, words, forms (key-value fields), tables, paragraphs, titles, footers, and more. For example, Textract's new Layout feature groups words into paragraphs, headers, and titles in reading order, a task that otherwise required custom post-processing.

Under the hood, Textract uses Amazon's computer-vision deep learning models (related to those used in Rekognition) to analyze images. According to the AWS Developer Guide, Textract is "based on the same proven, highly scalable, deep-learning technology that was developed by Amazon's computer vision scientists to analyze billions of images and videos daily". This means Textract can handle large volumes of documents in parallel. You don't need ML expertise; the service provides simple API operations.

Processing flow

Textract offers two main modes. For low-latency, single-page documents or small jobs, you use synchronous APIs (DetectDocumentText or AnalyzeDocument) and get results in the same response. For multi-page PDFs, very large documents, or background batch processing, you use asynchronous APIs (like StartDocumentTextDetection or StartDocumentAnalysis), which return a Job ID. You then poll or subscribe to an SNS topic/SQS queue for job completion and call the corresponding "Get" API (e.g. GetDocumentTextDetection) to retrieve the results. In the async workflow, you upload the document to S3, call a Start operation, Textract processes the document and posts completion to SNS, and you then retrieve the JSON result with a Get call.


The result JSON from Textract is a list of "Block" objects – each block represents a word, line, table cell, key-value field, or other element, along with coordinates and confidence scores. Your application parses these blocks to obtain the text or data you need. Because Textract is integrated with AWS IAM, S3, Lambda, SNS/SQS, and other services, you can easily embed it in serverless pipelines or data processing workflows.
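
The asynchronous flow described above can be sketched in Python. This is a minimal illustration, not a production implementation: it assumes a `client` object that behaves like `boto3.client("textract")`, the function name and polling parameters are hypothetical, and it polls for completion instead of subscribing to SNS.

```python
import time


def run_async_text_detection(client, bucket, key, poll_seconds=5.0, max_polls=120):
    """Start an async text-detection job and return all Block dicts.

    `client` is assumed to behave like boto3.client("textract").
    """
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    for _ in range(max_polls):
        result = client.get_document_text_detection(JobId=job_id)
        status = result["JobStatus"]
        if status != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    if status not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise RuntimeError(f"Textract job {job_id} ended with status {status}")

    # Results are paginated: follow NextToken until it is absent.
    blocks = list(result.get("Blocks", []))
    next_token = result.get("NextToken")
    while next_token:
        page = client.get_document_text_detection(JobId=job_id, NextToken=next_token)
        blocks.extend(page.get("Blocks", []))
        next_token = page.get("NextToken")
    return blocks
```

Passing the client in as a parameter keeps the function easy to test with a stub. For large volumes, an SNS/SQS notification is usually a better fit than polling, since it avoids spending Get calls on jobs that are still running.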

Why & When to Use It

Textract shines when you need to automate the extraction of text and data from documents or images, replacing slow, error-prone manual processes. For example, banks and financial services often receive mortgage or loan application forms in scanned images. Traditionally, employees would re-type applicant names, income, loan terms, etc., into systems – a tedious, costly task. With Textract, you simply upload the scanned applications and call the Textract APIs. Textract returns structured data like applicant name, mortgage rate, and invoice total, all in minutes. One AWS case study notes that Textract can help process loan and mortgage applications in minutes by automatically extracting fields such as applicant names and mortgage rates. This eliminates manual data entry, accelerates decision-making, and reduces errors.

Other real-world examples abound: Healthcare providers can extract patient information from intake forms and insurance claims to speed up onboarding and claims processing. Retailers can automatically read and record data from invoices, receipts, or delivery notes. Government agencies use Textract to process tax forms or business applications faster. Any business that deals with scanned documents or PDFs of forms – HR paperwork, surveys, legal contracts, tax documents, etc. – can benefit. Textract essentially adds an OCR and document-understanding layer to your applications without you having to build your own ML models.

Textract is especially compelling when scale, accuracy, and no manual ML development are important. Since Textract is managed, you pay only per page processed (no licenses or upfront costs), and Amazon handles scaling. There’s no fixed fee – if you analyze 100 pages one month and 1 million the next, you only pay for those pages. This makes Textract cost-effective for both sporadic use and massive processing. In summary, use Textract whenever you have documents that need text/data extraction and you want to leverage AWS’s ML models without developing custom AI solutions.

Key Features & How They Work


AWS Textract offers a range of document analysis features. Rather than only returning raw OCR text, it can identify structure and specific data types. Here are the main features and how to use them:

  • DetectDocumentText (Simple OCR): This basic API uses OCR to detect all text and handwriting in an input document. It returns the words and lines of text along with geometry (bounding boxes) and confidence scores. Use this when you only need raw text and don’t care about forms or tables, for example for a scanned report or letter. This feature is inexpensive (just $0.0015 per page for the first million pages in US regions) and covers several languages (see Limitations). The output is a set of blocks (type PAGE, LINE, WORD) in the response JSON.
  • AnalyzeDocument – Forms, Tables, Layout & Signatures: This API (analyze-document in the CLI) has several sub-features:
    • Forms (Key-Value Pairs): Extracts form data as key-value pairs. For example, if the document is an application form with labels like “Full Name” or “Email,” Textract returns those keys with their corresponding values. This feature also includes OCR of all text.
    • Tables: Recognizes tables and their rows/columns. It returns each table cell and how cells relate to each other. If your document has a spreadsheet or tabular data (like a financial report table), Textract preserves the table structure.
    • Layout (Paragraphs & Headings): A new feature (announced late 2023) that groups words into higher-level layout elements such as paragraphs, titles, subtitles, headers, and footers. For example, instead of just lines of text, Textract can tell you that certain lines form the “Title” and others form a “Paragraph.” This is useful for downstream tasks like building document search indexes or NLP.
    • Signatures: Textract can also detect signature marks. The Signatures option finds handwritten or electronic signatures on a form. It returns the position and confidence of any signatures it detects. (This is handy for automating checks like “did the applicant sign the consent form?”)
    You can request any combination of these features in a single AnalyzeDocument call. However, note that pricing is additive: if you select multiple features, the cost includes each. For example, if you request both tables ($0.015/page) and forms ($0.05/page), you effectively pay $0.065 for that page. (The total grows further if you also use queries, etc. See Pricing below.) The output JSON includes blocks for each element: form fields become blocks with BlockType=KEY_VALUE_SET, and tables become TABLE blocks whose child CELL blocks carry row and column indexes.
  • Queries (AnalyzeDocument): Textract’s Queries feature lets you ask custom questions about a document and get answers. You define up to 15 queries per page (synchronous) or 30 (asynchronous). For example, you could ask “What is the invoice total?” or “What is the date of birth?” and Textract returns just the answer text. Queries use context and location if provided, and are useful when you need specific data fields. For best results, phrase queries naturally using words from the document, and if the document has sections with similar fields, specify the section (e.g., “What is the SSN for Borrower?”). Queries cost $0.015 per page on their own in US regions; requesting them together with Forms or Tables raises the per-page cost (see Pricing below).
  • Expense Analysis (AnalyzeExpense): This specialized API is tuned for receipts and invoices. It extracts fields such as vendor name, total amount, line-item details (quantity, price), dates, taxes, etc. AnalyzeExpense itself is a synchronous operation; StartExpenseAnalysis is its asynchronous counterpart for multi-page documents. Customers in accounting/finance use this to automate accounts payable. The output JSON groups results into summary fields and line items (SummaryFields and LineItemGroups). Unlike AnalyzeDocument, AnalyzeExpense has a fixed output schema (especially for receipts).
  • ID Document Analysis (AnalyzeID): Designed for identification documents (U.S. driver’s licenses and passports). Given an image of an ID, Textract returns extracted identity fields. For example, AnalyzeID will return normalized keys like FIRST_NAME, DATE_OF_BIRTH, and DOCUMENT_NUMBER, regardless of whether the ID said “Name” or “Full Name”. It can even combine the front and back sides of a license in one request. This lets businesses quickly extract names, ID numbers, addresses, etc., from IDs without writing a regex. (Other ID formats or non-English IDs are not supported – see Limitations.)
  • Lending Analysis (AnalyzeLending): A newer API for mortgage loan documents. It takes a multi-page loan application package, classifies each page (e.g., income docs, credit reports), and then routes pages to the appropriate Textract analysis. In effect, it automates loan-document workflows by splitting, classifying, and extracting data in one step. You call StartLendingAnalysis, and later GetLendingAnalysis and GetLendingAnalysisSummary for results. This feature is currently asynchronous-only and billed per page of the loan packet. It’s designed for financial institutions to automate the processing of loan applications with minimal manual classification.
  • Integration & APIs: Textract provides SDK support in all AWS SDK languages. It’s also available via AWS CLI and AWS Console (you can upload a PDF in the console, and Textract will show you the detected text). The API reference provides operations like DetectDocumentText, AnalyzeDocument, StartDocumentAnalysis, GetDocumentAnalysis, etc. Each returns JSON with standardized schemas and confidence scores. Textract’s API is straightforward: you specify an S3 bucket and file name, and in the case of AnalyzeDocument, list which features you want (e.g. --feature-types TABLES FORMS). A CLI call to detect-document-text on an S3 object, for example, returns JSON with all detected lines and words.

Collectively, these features allow you to extract nearly any text or form data from documents. If you only need raw text, use DetectDocumentText. If you need structure (like a form response or table data), use AnalyzeDocument with the appropriate features. Need specific fields? Use Queries or AnalyzeExpense. Need to verify an ID? Use AnalyzeID. Each feature’s output is returned in JSON “blocks” that your code can parse.
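
To make the feature selection concrete, here is a small Python sketch that assembles the arguments for an AnalyzeDocument call. The helper name `build_analyze_request` and the alias-to-question dictionary are illustrative assumptions; the argument names (`Document`, `FeatureTypes`, `QueriesConfig`) follow the boto3 Textract client, and the 15-query cap is the synchronous limit quoted in this article.

```python
MAX_SYNC_QUERIES = 15  # synchronous per-page limit, as stated in this article
ALLOWED_FEATURES = {"TABLES", "FORMS", "QUERIES", "SIGNATURES", "LAYOUT"}


def build_analyze_request(bucket, key, features, queries=None):
    """Return keyword arguments for a synchronous AnalyzeDocument call.

    `queries` is an optional dict mapping a short alias to a question.
    """
    feature_types = list(dict.fromkeys(features))  # de-duplicate, keep order
    unknown = [f for f in feature_types if f not in ALLOWED_FEATURES]
    if unknown:
        raise ValueError(f"Unsupported feature types: {unknown}")

    request = {
        "Document": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": feature_types,
    }
    if queries:
        if len(queries) > MAX_SYNC_QUERIES:
            raise ValueError(f"At most {MAX_SYNC_QUERIES} queries per page")
        # Queries only take effect when the QUERIES feature is requested.
        if "QUERIES" not in feature_types:
            feature_types.append("QUERIES")
        request["QueriesConfig"] = {
            "Queries": [
                {"Text": text, "Alias": alias} for alias, text in queries.items()
            ]
        }
    return request
```

With a real client, the result would be passed along as `client.analyze_document(**request)`. Building the request separately makes it easy to log which billable features each call asks for.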

Pricing Essentials

AWS Textract is billed strictly per page, with no minimum commitments or upfront fees. You pay only for the specific APIs and features you use. Below is a streamlined breakdown of everything you need to budget your document-processing pipelines.

1. Free Tier (First 3 Months)

New AWS customers get a Textract free tier for their first three months, with the following allowances each month:

  • DetectDocumentText: 1,000 pages free
  • AnalyzeDocument:
    • 1,000 pages free for Signatures
    • 100 pages free for Forms, Tables & Layout
    • 100 pages free for Queries
  • AnalyzeExpense: 100 pages free
  • AnalyzeID: 100 pages free
  • AnalyzeLending: 2,000 pages free

After the trial ends, standard per-page rates apply.

2. Per-Page Rates (US West – Oregon)

Tip: If you combine multiple AnalyzeDocument sub-features on the same page, you incur each feature’s fee. For example, requesting both Forms and Tables costs $0.0500 + $0.0150 = $0.0650 per page.

3. Real-World Examples

  • Extracting 100,000 pages (DetectDocumentText)
    • Pages: 100,000
    • Rate: $0.0015/page
    • Total: $150
  • Extracting 2,000,000 pages (DetectDocumentText)
    • First 1 M pages at $0.0015 = $1,500
    • Next 1 M pages at $0.0006 = $600
    • Total: $2,100
Pro Tip: Use the AWS Pricing Calculator to model your specific volumes and feature mix before you run at scale.
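
The arithmetic behind these examples can be sketched in a few lines of Python. The rates and the one-million-page tier boundary are the figures quoted in this article, not values fetched from AWS, and the function names are illustrative.

```python
def detect_text_cost(pages, tier_pages=1_000_000, first_rate=0.0015, later_rate=0.0006):
    """Tiered DetectDocumentText cost in USD, using this article's rates."""
    first = min(pages, tier_pages)
    rest = max(pages - tier_pages, 0)
    return round(first * first_rate + rest * later_rate, 2)


def combined_feature_cost(pages, per_page_rates):
    """Additive AnalyzeDocument cost: every requested feature adds its fee."""
    return round(pages * sum(per_page_rates), 2)


if __name__ == "__main__":
    print(detect_text_cost(100_000))      # the 100,000-page example
    print(detect_text_cost(2_000_000))    # the 2,000,000-page example
    print(combined_feature_cost(1_000, [0.05, 0.015]))  # Forms + Tables
```

Treat the output as a rough estimate only; current rates for your region should come from the AWS pricing page or the Pricing Calculator.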

4. Hidden Gotchas & Additional Costs

  • Additive Billing: Each feature you combine adds its own fee to the per-page cost.
  • Storage & Callbacks: Standard S3, SNS, and SQS fees still apply.
  • Result Retention: Asynchronous job results stay available from Textract for 7 days. If you need them longer, retrieve and store them yourself (or set OutputConfig on the Start call to write results to your own S3 bucket), and budget for the S3 storage.
  • Regional Variance: GovCloud and other regions can add 10–20% to these rates.

Textract’s pricing is transparent and usage-driven. Leverage the free tier to prototype, then align your feature choices and batching strategy to your budget. With clear per-page rates and regular cost monitoring (via CloudWatch or your billing dashboard), you can confidently scale document automation without surprises.

Limitations & Gotchas


While powerful, Textract has important limits and constraints to be aware of:

  • Document formats and sizes: Supported input file types are JPEG, PNG, PDF, and TIFF. (JPEG 2000 images inside a PDF are allowed too.) However, there are size limits: for synchronous calls, files must be ≤10 MB, and PDF/TIFF files must be a single page. For asynchronous jobs, images (JPEG/PNG) are ≤10 MB, but PDFs/TIFFs can be up to 500 MB and up to 3,000 pages. Also, PDFs must not be password-protected. Images can be at most 10,000 pixels on a side, and PDF pages at most 40 inches in height or width. Vertical text (like Japanese vertical writing) is not supported.
  • Languages: Textract currently only supports text detection in English, Spanish, German, Italian, French, and Portuguese. It will not label which language it detected. Queries and ID detection only work on English documents. If your document contains other languages (or vertical text), Textract may fail to extract or output gibberish.
  • Accuracy: Textract’s accuracy depends on input quality. The best practices guide recommends at least 150 DPI resolution and clear, upright text. Blurry scans, odd fonts, or text on complex backgrounds can reduce accuracy. Handwritten text is supported but only in English and typically with lower accuracy. Certain document layouts (merged table cells, rotated text, overlays) can confuse the table/field extraction. Always validate critical fields (e.g. SSNs or contract terms) manually or with confidence checks.
  • Feature limits: There are caps on queries per page (max 15 queries/page sync, 30 async). The Async APIs can only run one job per request (i.e. you get one job ID for one document); to parallelize you must launch multiple jobs or use concurrent Lambda/SQS pipelines. By default, AWS imposes low per-second quotas (e.g. a few transactions per second for Textract start/get calls) which may throttle a large volume of requests (these can be increased via Service Quotas). If you hit a quota limit (like “ThrottlingException”), you’ll need to request a higher limit or batch requests.
  • Cost “gotchas”: Besides the additive per-page costs mentioned above, note that asynchronous processing can generate large result sets (especially if you analyze forms/tables). You might incur S3 storage costs if you output to a bucket for persistence. Queries and Custom Queries are also billed per page, so check the pricing page for what the free tier covers before enabling them.
  • ID and Custom Queries: The AnalyzeID API only recognizes U.S. driver’s licenses and passports; other IDs or non-U.S. IDs aren’t supported. Custom Query training requires a dataset and has dataset size requirements (up to 2500 training docs). If you try to use queries without following best practices (see below), results can be blank or incorrect.

Textract is not magical – it has limits on input size, language, and layout. Always check the service Quotas documentation and Limits guidelines for the latest caps. When designing your application, plan for the synchronous vs asynchronous limits, and validate Textract output (using confidence scores or human review) for critical tasks.
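
A pre-flight check can catch many of these limits before a call is made. The Python sketch below hard-codes the format, size, and page limits quoted in this section, so it should be treated as an illustration and re-checked against the current quotas. The function name `choose_mode` is hypothetical, and the page count is assumed to be supplied by the caller.

```python
from pathlib import Path

MB = 1024 * 1024
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".pdf", ".tif", ".tiff"}
MULTI_PAGE_SUFFIXES = {".pdf", ".tif", ".tiff"}


def choose_mode(filename, size_bytes, page_count=1):
    """Return "sync" or "async" for a document, or raise ValueError.

    Limits mirror the figures in this article: 10 MB and one page for
    synchronous calls; 500 MB and 3,000 pages for async PDF/TIFF jobs.
    """
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_SUFFIXES:
        raise ValueError(f"Unsupported file type: {suffix or filename}")
    if page_count > 1 and suffix not in MULTI_PAGE_SUFFIXES:
        raise ValueError("Only PDF and TIFF files can have multiple pages")

    if size_bytes <= 10 * MB and page_count == 1:
        return "sync"
    if (
        suffix in MULTI_PAGE_SUFFIXES
        and size_bytes <= 500 * MB
        and page_count <= 3000
    ):
        return "async"
    raise ValueError("Document exceeds the size or page limits")
```

For example, `choose_mode("contract.pdf", 50 * MB, page_count=120)` routes the file to the asynchronous APIs, while an 11 MB JPEG is rejected outright because images are capped at 10 MB in both modes.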

Best Practices & Tips

To get the best results, follow these experience-backed tips:

  • Provide high-quality input: Ensure your scanned documents are clear. Use at least 150 DPI resolution. If you have a PDF, don’t re-compress or downsample it – feed Textract the original quality PDF or image. AWS Textract works best with straight (unrotated) text. It can handle up to 45° rotation, but avoid skew or extreme angles if possible. If you have a rotated scan, consider rotating it upright before analysis.
  • Use supported formats natively: Since AWS Textract supports PDF, TIFF, JPEG, and PNG, avoid converting your document to an unsupported or lossy format. For example, don’t convert a PDF to a low-res JPEG unnecessarily. Textract can handle multi-page PDFs/TIFFs asynchronously, which is often simpler than splitting pages manually.
  • Isolate tables from backgrounds: For table extraction, make sure tables have clear borders or separation from graphics/overlays. If a table is over an image background, extraction can be inconsistent. If tables are complex (merged cells, irregular columns), results may be rough; in extreme cases, you might consider treating them as plain text blocks or splitting them into smaller tables.
  • Leverage confidence scores: Every detected block comes with a confidence score between 0 and 100. In critical applications (finance, legal, health), discard or flag results below a threshold (e.g. below 90). For archival or bulk extraction you might accept lower confidence (50 to 70). Design your workflow so that low-confidence items trigger human review or re-processing.
  • Frame Queries carefully: When using Queries, phrase your questions using words exactly as they appear in the document. For example, if the document label is “Date of Birth,” ask “What is the date of birth?” rather than “birth date”. Avoid ambiguous or incomplete questions – be as specific as possible. If the document has sections (like “Applicant” vs “Co-Applicant”), include that context in the query. Use the Pages parameter to limit queries to relevant pages if needed. This precision greatly improves the chance of a correct answer.
  • Optimize Calls: For large workloads, use asynchronous processing with SNS/SQS. For example, you can trigger a Textract job each time a document is uploaded to S3, and have a Lambda function listening on SQS to retrieve and process results. This serverless pattern scales automatically. Also, to reduce costs and latency, group related features in one call (e.g. if you need both forms and tables, ask for both rather than two calls) – but keep in mind the additive pricing. Use the AWS Pricing Calculator to model costs beforehand.
  • Monitor Quotas: Set up CloudWatch Alarms on Textract usage and Service Quota consumption. This will alert you if you approach rate limits or incur unexpectedly high costs. Since Textract has API rate limits (e.g. GetDocumentAnalysis calls per second), catching spikes early can save headaches.
  • Post-process for your needs: The raw Textract JSON is powerful but verbose. Follow each block’s Relationships (the CHILD and VALUE ID lists) to assemble words into lines, fields, or table cells. AWS provides a Textract response parser library (for Python and other languages) that can simplify extracting key-value pairs and tables. After Textract extraction, you might feed the text into Amazon Comprehend for NLP tasks (sentiment, entity recognition) or into your own business logic. Weaving Textract into a pipeline with Lambda and other services yields the most automation.

Following these practices will improve accuracy and efficiency. For example, AWS guidance indicates that keeping input text upright and clear yields better table extraction. Similarly, many users find that filtering on confidence and re-trying low-confidence pages reduces manual correction.
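
The confidence filtering mentioned above might look like the following Python sketch. It assumes `blocks` is the "Blocks" list from a Textract response; the function name and the default threshold of 90 are illustrative choices, not AWS recommendations.

```python
def triage_by_confidence(blocks, threshold=90.0, block_types=("LINE",)):
    """Split blocks into (accepted, needs_review) by confidence score.

    Blocks whose BlockType is not in `block_types` are ignored, and a
    block without a Confidence value is treated as needing review.
    """
    accepted, needs_review = [], []
    for block in blocks:
        if block.get("BlockType") not in block_types:
            continue
        if block.get("Confidence", 0.0) >= threshold:
            accepted.append(block)
        else:
            needs_review.append(block)
    return accepted, needs_review
```

The second list can then be sent to a human-review queue or re-processed from a better scan, while the first flows straight into downstream systems.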

Quick-Start Tutorial

Let’s walk through a simple step-by-step example to get you extracting text with Textract:

1. Set up AWS: If you haven’t already, create an AWS account and pick a region where Textract is available. Then create an IAM user or role with the AmazonTextractFullAccess policy (and attach AmazonS3ReadOnlyAccess so the caller can read documents from S3). Configure your AWS CLI (aws configure) with appropriate credentials and region.

2. Upload a document to S3: Create an S3 bucket (via console or aws s3 mb s3://my-textract-bucket). Upload a sample document, for example a scanned PDF or image, to the bucket:

    aws s3 cp sample-form.png s3://my-textract-bucket/

3. Call Textract (Synchronous): Use the AWS CLI to call detect-document-text (for simple OCR) or analyze-document (for forms/tables). For example, to detect all text in sample-form.png:

    aws textract detect-document-text --document '{"S3Object":{"Bucket":"my-textract-bucket","Name":"sample-form.png"}}'

This returns JSON. You’ll see a "Blocks" array with items of BlockType: "PAGE", "LINE", and "WORD", each with "Text": "..." and geometry.

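4. Call Textract (Forms/Tables): If your document has forms or tables, use analyze-document. For instance, to analyze tables and forms:

    aws textract analyze-document --document '{"S3Object":{"Bucket":"my-textract-bucket","Name":"sample-form.png"}}' --feature-types '["TABLES","FORMS"]'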

The output JSON will include "BlockType": "KEY_VALUE_SET" (for forms) and "BlockType": "TABLE" plus cell data.
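
To turn those KEY_VALUE_SET blocks into usable pairs in code, one possible approach is sketched below in Python. It assumes `blocks` is the "Blocks" list from an AnalyzeDocument response requested with the FORMS feature; the helper names are hypothetical, and the block fields used (EntityTypes, Relationships, SelectionStatus) follow the Textract response format.

```python
def _text_of(block, by_id):
    """Join the WORD (and checkbox) children of a block into one string."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])  # SELECTED / NOT_SELECTED
    return " ".join(parts)


def extract_key_values(blocks):
    """Return {key text: value text} for every form field in `blocks`."""
    by_id = {block["Id"]: block for block in blocks}
    pairs = {}
    for block in blocks:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        values = []
        # A KEY block points at its VALUE block(s) through a VALUE relationship.
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                values.extend(_text_of(by_id[i], by_id) for i in rel["Ids"])
        pairs[_text_of(block, by_id)] = " ".join(values)
    return pairs
```

Note that repeated labels on a page would overwrite each other in this dictionary; a list of pairs is safer when a form contains duplicate field names.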

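5. View Results: The CLI output is JSON; examine it to find extracted text. For a quick look, you could pipe the output to jq. For example:

    aws textract detect-document-text --document '{"S3Object":{"Bucket":"my-textract-bucket","Name":"sample-form.png"}}' | jq -r '.Blocks[] | select(.BlockType == "LINE") | .Text'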

This will list each line of detected text.

6. Experiment Further: Try calling analyze-expense on a receipt image, or use the AWS Textract Console (AWS Management Console → Amazon Textract) and upload a document to visually inspect detection and tables. You can also try the new Layout feature by adding --feature-types '["TABLES","FORMS","LAYOUT"]'.

You’ve now successfully extracted text from a document with Textract! From here, you can integrate these steps into your applications or scripts, handle the JSON results in code (using AWS SDKs), or set up triggers (e.g., run Textract automatically on every new S3 upload).
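
As one example of handling the JSON results in code, the Python sketch below rebuilds each detected table as a list of rows. It assumes `blocks` comes from an AnalyzeDocument response requested with the TABLES feature; the function names are illustrative, and merged cells are not handled.

```python
def table_to_rows(table_block, by_id):
    """Rebuild one TABLE block as a list of rows (lists of cell strings)."""
    cells = {}
    for rel in table_block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for cell_id in rel["Ids"]:
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                by_id[word_id]["Text"]
                for cell_rel in cell.get("Relationships", [])
                if cell_rel["Type"] == "CHILD"
                for word_id in cell_rel["Ids"]
                if by_id[word_id]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based in Textract responses.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)

    if not cells:
        return []
    n_rows = max(row for row, _ in cells)
    n_cols = max(col for _, col in cells)
    return [
        [cells.get((row, col), "") for col in range(1, n_cols + 1)]
        for row in range(1, n_rows + 1)
    ]


def extract_tables(blocks):
    """Return every table in `blocks` as a list of rows."""
    by_id = {block["Id"]: block for block in blocks}
    return [
        table_to_rows(block, by_id)
        for block in blocks
        if block["BlockType"] == "TABLE"
    ]
```

Each table comes back as a plain list of lists, which can be written out with the csv module or loaded into a DataFrame for further processing.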

Conclusion

AWS Textract streamlines document processing by combining powerful OCR with intelligent data extraction. Key takeaways:

  • Automated extraction: Textract can read printed and handwritten text, detect forms, tables, and even layouts, so you can convert docs to data with minimal effort.
  • No ML expertise needed: Built on Amazon’s deep-learning tech, Textract works out-of-the-box via simple APIs.
  • Flexible pricing: You pay per page with free tier options, but be mindful of the per-feature cost model and plan for scale.
  • Integrate & extend: Use Textract as part of AWS workflows, feeding results to Comprehend or databases, or comparing it to alternatives like Azure Document Intelligence or Google Document AI depending on your ecosystem and language needs.

Next steps: Try Textract on real documents in your workflow. Start with the free tier to prototype (the Textract console makes testing quick). Assess accuracy and set confidence thresholds. If you work in a multi-cloud environment, compare Textract with alternative OCR services on your specific data. And as always, refer to the official AWS Textract documentation for details and updates. With Textract in your toolkit, you can dramatically accelerate document-centric processes and build smarter, AI-driven applications.

Monitor Your AWS Textract Spend with Cloudchipr

Setting up AWS Textract is only the beginning—actively managing cloud spend is vital to maintaining budget control. Cloudchipr offers an intuitive platform that delivers multi‑cloud cost visibility, helping you eliminate waste and optimize resources across AWS, Azure, and GCP.

Key Features of Cloudchipr

Automated Resource Management:

Easily identify and eliminate idle or underused resources with no-code automation workflows. This ensures you minimize unnecessary spending while keeping your cloud environment efficient.

Rightsizing Recommendations:

Receive actionable, data-backed advice on the best instance sizes, storage setups, and compute resources. This enables you to achieve optimal performance without exceeding your budget.

Commitments Tracking:

Keep track of your Reserved Instances and Savings Plans to maximize their use.

Live Usage & Management:

Monitor real-time usage and performance metrics across AWS, Azure, and GCP. Quickly identify inefficiencies and make proactive adjustments, enhancing your infrastructure.

DevOps as a Service:

Take advantage of Cloudchipr’s on-demand, certified DevOps team that eliminates the hiring hassles and off-boarding worries. This service provides accelerated Day 1 setup through infrastructure as code, automated deployment pipelines, and robust monitoring. On Day 2, it ensures continuous operation with 24/7 support, proactive incident management, and tailored solutions to suit your organization’s unique needs. Integrating this service means you get the expertise needed to optimize not only your cloud costs but also your overall operational agility and resilience.

Experience the advantages of integrated multi-cloud management and proactive cost optimization by signing up for a 14-day free trial today, no hidden charges, no commitments.
