Query PDF

Query PDF is a REST API tool that provides a programmatic way to retrieve a wide range of insights about a PDF document. It allows developers to check for conditional properties, metadata, and content details such as forms, fonts, security settings, and digital signatures. This tool is essential for conditional processing, enabling you to automate workflows and trigger subsequent actions based on a file’s unique characteristics.

Key Benefits of Query PDF API

  • Perform over 25 different queries in a single API call, including checks for document metadata, embedded fonts, and JavaScript, for a comprehensive overview of a PDF's properties.
  • Validate PDF/A conformance with the industry-standard veraPDF validation engine, returning a simple true or false value for easy programmatic checks without complex reporting.
  • Automate workflows with conditional processing, saving time and resources by using file properties to determine if you need to apply OCR, convert to PDF/A, or perform other operations.
  • Seamlessly audit page boundaries (MediaBox, CropBox, BleedBox, TrimBox, ArtBox) before making precise document dimension adjustments with the Set Page Boxes API.
  • Identify accessibility features by checking for the presence of structure tags, ensuring your documents meet compliance standards.
  • Retrieve and leverage custom metadata, returned as a JSON list of key:value pairs, enabling you to extract unique data properties added by other applications.
  • Extract key document information, including whether a file contains signatures, passwords, or forms (Acroforms or XFA), to drive secure and specialized workflows.
Build Your Solution

You have document processing problems; we have solutions. Explore the many ways pdfRest can align your documents with your business objectives.

Browse all solutions
Integrate pdfRest with ChatGPT to Generate PDF Info Summary
Parse PDF Files to Streamline Data Extraction
Integrate pdfRest with Microsoft Power Automate
Ensure GDPR Compliance for PDF Processing with EU-Based Cloud API
Detect and Repair Non-Conformant PDF/A Documents
Add Page Numbers to PDF Files
Why is pdfRest the best API to get info from PDF?
pdfRest offers the industry’s most efficient API for extracting PDF metadata and document properties, empowering you to drive conditional processing, validate industry-standard PDF/A compliance, and run over 25 distinct queries in a single, lightning-fast call.

Extract PDF Metadata to Automate Document Workflows

Our API delivers precise document intelligence that allows you to programmatically assess files and determine the next steps for each document. By extracting valuable metadata and file properties, you can build smart, conditional logic into your applications to solve real-world processing challenges. Common workflow automations include:

  • Preparing files for prepress: Seamlessly audit page boundaries (MediaBox, CropBox, BleedBox, TrimBox, and ArtBox) so you know exactly how to adjust margins and boundaries using our Set Page Boxes API.
  • Optimizing document storage: Conditionally split, compress, or route PDFs based on their exact page count or total file size.
  • Securing sensitive data: Automatically detect and encrypt files that do not already have the necessary security permissions applied.
  • Quality assurance routing: Confirm that inbound PDFs contain expected elements, such as accessibility tags, digital signatures, or forms, before sending them to downstream systems or intended audiences.
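
As a rough illustration of the automations above, the sketch below maps a parsed Query PDF response to follow-up actions. It assumes each requested query comes back as a field named after the query, with JSON booleans and integers as values. The thresholds and action names are hypothetical placeholders, not part of pdfRest.

```python
from typing import Any

# Hypothetical limits; tune them for your own storage policy.
MAX_PAGES = 500
MAX_BYTES = 25 * 1024 * 1024


def plan_actions(info: dict[str, Any]) -> list[str]:
    """Return follow-up actions for one parsed Query PDF response."""
    actions = []
    if info.get("page_count", 0) > MAX_PAGES:
        actions.append("split")
    if info.get("file_size", 0) > MAX_BYTES:
        actions.append("compress")
    # Only act on an explicit false; a missing field means "not queried".
    if info.get("restrict_permissions_set") is False:
        actions.append("encrypt")
    if info.get("tagged") is False:
        actions.append("flag_for_accessibility_review")
    return actions
```

Checking for an explicit false keeps the function from acting on properties that were never requested.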

Verify PDF/A Compliance and Document Archiving Standards

When preparing documents for long-term storage or legal compliance, knowing whether a file meets strict archiving standards is critical. While competitors often generate convoluted validation reports that require custom code to decipher, pdfRest produces straightforward, actionable results you can depend on. Powered by veraPDF, the industry-recognized standard for PDF/A validation, our API seamlessly checks conformance levels.

  • Instant verification: Receive a simple true or false boolean value in your JSON response to instantly confirm a document's compliance status.
  • Eliminate developer overhead: Avoid wasting valuable engineering hours trying to parse through complex XML validation logs or superfluous reporting data.
  • Smart conversion routing: Automatically trigger a conversion to PDF/A only when a document is flagged as non-conformant, saving server processing time and reducing API costs.
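
The conversion-routing idea can be sketched as a single check on the raw JSON body. This is a minimal example that assumes the result arrives in a field named pdfa holding a JSON boolean; the function name is illustrative only.

```python
import json


def needs_pdfa_conversion(response_body: str) -> bool:
    """Return True only when the response reports pdfa as false."""
    info = json.loads(response_body)
    if "pdfa" not in info:
        # Assumption: the field is named after the query. Fail loudly if absent.
        raise KeyError("response has no 'pdfa' field; was the pdfa query requested?")
    return info["pdfa"] is False
```

A caller would trigger a PDF/A conversion request only when this function returns True.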

Get PDF Properties and Document Information in a Single API Call

Many PDF processing libraries require developers to make separate, costly API requests just to check different document attributes. pdfRest eliminates this bottleneck by letting you retrieve all the information you need about a PDF and its contents simultaneously. Simply send one API request with your PDF file and a comma-separated list of your required checks.

  • Unmatched flexibility: Pick and choose exactly what matters from a list of more than 25 query options, or simply include the 'all' query to return everything at once.
  • Clean, structured data: Get a rapid response containing all requested information as easy-to-parse key:value pairs in standard JSON format.
  • Optimized performance: Reduce network latency and computational overhead by getting all the answers you need without heavy reports to parse or superfluous data to sift out.
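
One way to assemble the comma-separated list is sketched below. The query names are the ones listed on this page; the helper itself is hypothetical, and how the resulting string is attached to the request (field name, authentication header) is left out because this page does not specify it.

```python
# Query names as listed on this page.
KNOWN_QUERIES = frozenset({
    "tagged", "image_only", "title", "subject", "author", "producer",
    "creator", "creation_date", "modified_date", "keywords",
    "custom_metadata", "doc_language", "page_count", "page_boxes",
    "contains_annotations", "contains_signature", "pdf_version",
    "file_size", "filename", "restrict_permissions_set", "contains_xfa",
    "contains_acroforms", "contains_javascript", "contains_transparency",
    "contains_embedded_file", "uses_embedded_fonts",
    "uses_nonembedded_fonts", "pdfa", "pdfua_claim", "pdfe_claim",
    "pdfx_claim", "requires_password_to_open",
})


def build_queries(names):
    """Validate query names and join them into one comma-separated string."""
    names = list(dict.fromkeys(names))  # drop duplicates, keep order
    if not names:
        raise ValueError("at least one query is required")
    if "all" in names:
        if len(names) > 1:
            raise ValueError("'all' must be used alone")
        return "all"
    unknown = [n for n in names if n not in KNOWN_QUERIES]
    if unknown:
        raise ValueError(f"unknown queries: {', '.join(unknown)}")
    return ",".join(names)
```

Validating names locally catches typos before a request is spent on them.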

See Customize Your Solution below for more details about all of the supported queries.

Customize Your Solution

Learn about the parameters for this tool to create your custom solution.

Queries
  • all
    • A comprehensive query that returns the document's full profile at once. Use this alone to retrieve every supported property without having to list individual options.
  • tagged
    • Checks for presence of structure tags in the input document.
    • Returns true or false
  • image_only
    • Checks whether the document is 'image only', meaning that it consists solely of embedded images, one per page, with no text or other common PDF features apart from some metadata.
    • Returns true or false
  • title
    • The title of the PDF as listed in the metadata.
    • Returns a string which may be empty if the document does not have a title
  • subject
    • The subject of the PDF as listed in the metadata.
    • Returns a string which may be empty if the document does not have a subject
  • author
    • The author of the PDF as listed in the metadata.
    • Returns a string which may be empty if the document does not have an author
  • producer
    • The producer of the PDF as listed in the metadata.
    • Returns a string which may be empty if the document does not have a producer
  • creator
    • The creator of the PDF as listed in the metadata.
    • Returns a string which may be empty if the document does not have a creator
  • creation_date
    • The creation date of the PDF as listed in the metadata.
    • Returns a string which may be empty if the document does not have a creation date
  • modified_date
    • The most recent modification date of the PDF as listed in the metadata.
    • Returns a string which may be empty if the document does not have a modification date
  • keywords
    • The keywords of the PDF as listed in the metadata.
    • Returns a string which may be empty if the document does not have keywords
  • custom_metadata
    • Retrieves custom metadata from the PDF.
    • Returns a JSON list of key:value pairs, where each pair represents a custom property and its value.
  • doc_language
    • The language that the file claims to be written in.
    • Returns a string
  • page_count
    • The number of pages in the PDF document.
    • Returns an integer
  • page_boxes
    • Retrieves the dimensions of all page boundaries (MediaBox, CropBox, BleedBox, TrimBox, and ArtBox) for each page in the document.
    • Returns a JSON object mapping each page number to its respective box coordinates, along with a boolean indicating whether each box matches the media box for that page.
  • contains_annotations
    • Checks whether the document contains annotations, such as notes, highlighted text, file attachments, crossed out text, and text callout boxes.
    • Returns true or false
  • contains_signature
    • Checks if the document contains any digital signatures.
    • Returns true or false
  • pdf_version
    • Retrieves the version of the PDF standard that the document was created with.
    • Returns a string of the form X.Y.Z where X, Y, and Z are the major, minor, and extension versions respectively
  • file_size
    • Retrieves the size of the input file in bytes.
    • Returns an integer
  • filename
    • The name of the input file.
    • Returns a string
  • restrict_permissions_set
    • Checks whether the document has restricted permissions set to prevent actions such as printing, copying, and signing.
    • Returns true or false
  • contains_xfa
    • Checks whether the document contains XFA forms.
    • Returns true or false
  • contains_acroforms
    • Checks whether the document contains Acroforms.
    • Returns true or false
  • contains_javascript
    • Checks whether the document contains JavaScript.
    • Returns true or false
  • contains_transparency
    • Checks whether the document contains transparent objects.
    • Returns true or false
  • contains_embedded_file
    • Checks whether the document contains one or more embedded files.
    • Returns true or false
  • uses_embedded_fonts
    • Checks whether the document contains fully embedded fonts.
    • Returns true or false
  • uses_nonembedded_fonts
    • Checks whether the document contains non-embedded fonts.
    • Returns true or false
  • pdfa
    • Checks whether the document claims and conforms to a PDF/A standard.
    • Returns true or false
  • requires_password_to_open
    • Checks whether the document requires a password to open.
    • Returns true or false
    • Note: A document that requires a password cannot be opened by this route, so little other information can be returned for it.
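
Because several metadata queries return a string that may be empty, callers often want to keep only the values that are actually present. The helper below is a small sketch of that filtering; it assumes the response fields share the query names, and the function name is illustrative.

```python
from typing import Any

# Metadata queries documented above as returning a possibly empty string.
METADATA_FIELDS = (
    "title", "subject", "author", "producer", "creator",
    "creation_date", "modified_date", "keywords",
)


def present_metadata(info: dict[str, Any]) -> dict[str, str]:
    """Keep only metadata fields that are present and non-empty."""
    return {
        name: info[name]
        for name in METADATA_FIELDS
        if isinstance(info.get(name), str) and info[name].strip()
    }
```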

Safe & Secure

Confidently process your sensitive data with pdfRest. Our platform is built for robust, Enterprise-grade security and compliance. We meet rigorous standards for GDPR and HIPAA, and our controls are independently audited to ensure strict SOC 2 Type 2 compliance. Your data's protection is our commitment.

Frequently Asked Questions
Need more help? Contact Us or visit our documentation.

The Query PDF API is a REST API tool that provides a programmatic way to retrieve detailed information about a PDF document. It returns valuable insights into a file's metadata, contents, and conditional properties, such as whether it contains forms, signatures, or specific security settings.

You can get a wide range of information about a PDF's metadata, content, and security settings by specifying any of the following queries:

  • Global Shortcut:
    • all: A comprehensive query that returns the document's full profile at once. Use this alone to retrieve every supported property without having to list individual options.
  • Metadata Queries:
    • title: The title of the PDF.
    • subject: The subject of the PDF.
    • author: The author of the PDF.
    • producer: The producer of the PDF.
    • creator: The creator of the PDF.
    • creation_date: The creation date of the PDF.
    • modified_date: The most recent modification date of the PDF.
    • keywords: The keywords of the PDF.
    • doc_language: The language that the file claims to be written in.
    • custom_metadata: Retrieves any custom metadata from the PDF and presents it as a JSON list of key:value pairs.
  • Document Properties:
    • page_count: The total number of pages in the PDF document.
    • page_boxes: Retrieves the dimensions of all page boundaries (MediaBox, CropBox, BleedBox, TrimBox, and ArtBox) for each page in the document. Returns a JSON object mapping each page number to its respective box coordinates, along with a boolean indicating whether each box matches the media box for that page.
    • pdf_version: The version of the PDF standard the document was created with (e.g., "1.7").
    • file_size: The size of the input file in bytes.
    • filename: The name of the input file.
    • pdfa: Checks whether the document claims and conforms to a PDF/A standard.
    • pdfua_claim: Checks whether the document claims to conform to a PDF/UA standard.
    • pdfe_claim: Checks whether the document claims to conform to a PDF/E standard.
    • pdfx_claim: Checks whether the document claims to conform to a PDF/X standard.
  • Content & Structure Checks:
    • tagged: Checks for the presence of structure tags in the document, which are important for accessibility.
    • image_only: Checks if the document is 'image only' and lacks text or other common PDF features.
    • contains_annotations: Checks for the presence of annotations, such as notes, highlights, or attachments.
    • contains_signature: Checks if the document contains any digital signatures.
    • contains_xfa: Checks whether the document contains XFA forms.
    • contains_acroforms: Checks whether the document contains Acroforms.
    • contains_javascript: Checks whether the document contains JavaScript.
    • contains_transparency: Checks whether the document contains transparent objects.
    • contains_embedded_file: Checks for one or more embedded files.
    • uses_embedded_fonts: Checks whether the document contains fully embedded fonts.
    • uses_nonembedded_fonts: Checks whether the document contains non-embedded fonts.
  • Security & Permissions:
    • restrict_permissions_set: Checks if the document has security restrictions applied to prevent actions like printing or copying.
    • requires_password_to_open: Checks if the document requires a password to open and view.

The API returns a JSON response containing easy-to-parse key:value pairs. You will get a separate field for each query you requested, with the corresponding value. This format is simple to use in your code without needing to parse complex reports or sift through unnecessary data, allowing you to easily leverage this information to drive conditional processing. For example, check to see if a file is image-only to determine whether to apply OCR on the document, or check for PDF/A conformance to decide whether to trigger a PDF/A conversion step.
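
The two examples in the paragraph above can be expressed as a small rule table, which makes it easy to add further conditions later. This is only a sketch: it assumes boolean fields named after the queries, and the action labels are hypothetical.

```python
# Each rule: (response field, value that triggers it, action label).
RULES = [
    ("image_only", True, "apply_ocr"),
    ("pdfa", False, "convert_to_pdfa"),
]


def next_steps(info):
    """Return the actions whose trigger value matches the parsed response."""
    return [action for field, trigger, action in RULES
            if info.get(field) is trigger]
```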

You can run as many queries as you need in a single API call. The API allows you to send a comma-separated list of queries, and it will return all the requested information in a single, comprehensive response, making it highly efficient.

You can use the page_boxes query to retrieve the exact coordinates of all page boundaries (MediaBox, CropBox, BleedBox, TrimBox, and ArtBox) for every page in your document. Once you have audited these current dimensions, you can precisely adjust margins, crop pages, or define specific prepress boundaries using the Set Page Boxes API Tool.

The Query PDF tool can validate a document against the PDF/A standard using the pdfa query. This validation is powered by veraPDF, the industry-standard validation engine, ensuring accurate results. The API returns a simple true or false value, so you can programmatically check for compliance without needing to parse a full validation report.

Conditional processing involves using the information returned by the Query PDF tool to determine the next action for a document. For example, you can use the requires_password_to_open query to identify and separate protected files, or use the pdfa query to convert non-compliant documents to the PDF/A standard only when necessary. This saves time and resources by automating workflows based on a file's specific properties.

If a document is password-protected or corrupted, the API will not be able to complete all the requested queries. In these cases, the API will still provide a response that includes an allQueriesProcessed field with a false value and a human-readable warning message explaining why the queries could not be completed.
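
A caller can guard against partial results by checking allQueriesProcessed before trusting the other fields. In the sketch below, the field name "message" for the warning text is an assumption, since this page does not name it; the exception class is likewise illustrative.

```python
from typing import Any


class IncompleteQueryError(Exception):
    """Raised when the API reports that not all queries were processed."""


def require_complete(info: dict[str, Any]) -> dict[str, Any]:
    """Return the response unchanged, or raise if it is flagged incomplete."""
    if info.get("allQueriesProcessed") is False:
        # "message" is a hypothetical field name for the warning text.
        warning = info.get("message", "some queries could not be processed")
        raise IncompleteQueryError(warning)
    return info
```

Raising here keeps downstream logic from treating missing fields as if they were real answers.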

To retrieve every property at once, use the all query by itself instead of building a long list of individual parameters. This standalone shortcut instructs the Query PDF API to run every supported check and return the full document profile in one comprehensive JSON response.

pdfRest offers the best solution because it enables conditional processing to automate workflows based on document properties, provides reliable PDF/A validation powered by veraPDF, and allows you to retrieve over 25 different queries with a single, efficient API call.

Ensuring the security and privacy of your data is a top priority at pdfRest. Our platform is built for robust, enterprise-grade security and compliance, including GDPR and HIPAA. All your files are encrypted both in transit and at rest, and they are permanently deleted after the stated file retention period (30 minutes for most plans). For complete details, please refer to our Data Processing Agreement (DPA).

To facilitate GDPR compliance for your querying workflows, pdfRest processes your data within the European Union and adheres to other strict data protection requirements. You can ensure all processing occurs within the EU by sending your API calls to the dedicated EU endpoint at http://eu-api.pdfrest.com/pdf-info. Please note that a GDPR usage fee may apply for some plans. For more information, please review our Data Processing Agreement.

Integrating the Query PDF API is straightforward. We offer comprehensive API documentation and code samples in many programming languages. The API Lab also allows you to test and generate code snippets directly from your browser, simplifying the setup and ensuring a smooth integration experience.

You can easily query PDFs without writing any code. Our API Lab is a no-code tool that lets you upload files and send API calls directly from your browser. For a more conversational workflow, you can also use pdfAssistant.ai, an AI-powered chat interface, to check a PDF's metadata, whether it contains extractable text, or whether it is tagged for accessibility.

pdfRest offers two self-hosted options. The pdfRest API Toolkit on AWS allows you to deploy and manage your own backend processing infrastructure within your AWS environment with pay-as-you-go pricing through the AWS Marketplace. The pdfRest API Toolkit Container is delivered as a Docker container, giving you full control over the environment and the flexibility to run the API in on-premises data centers or public/private cloud environments under a custom licensing model.

Generate a self-service API Key now!
Create your FREE API Key to start processing PDFs in seconds, only possible with pdfRest.