Document(
shards: List[google.cloud.documentai_v1.types.document.Document],
gcs_bucket_name: Optional[str] = None,
gcs_prefix: Optional[str] = None,
gcs_input_uri: Optional[str] = None,
)

Represents a wrapped Document.

This class hides the complexity of working with the Document protobuf response returned by the BatchProcessDocuments or ProcessDocument methods and provides convenience methods for searching and extracting information within the Document.

| Attributes | |
|---|---|
| Name | Description |
| gcs_bucket_name | Optional[str]: Optional. The name of the gcs bucket. Format: gs://{bucket_name}/{optional_folder}/{target_folder}/ where gcs_bucket_name={bucket_name}. |
| gcs_prefix | Optional[str]: Optional. The prefix of the json files in the target_folder. Format: gs://{bucket_name}/{optional_folder}/{target_folder}/ where gcs_prefix={optional_folder}/{target_folder}. See https://cloud.google.com/storage/docs/json_api/v1/objects/list for more information. |
| entities | List[Entity]: A list of Entities in the Document. |
| pages | List[Page]: A list of Pages in the Document. |
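
The gs://{bucket_name}/{optional_folder}/{target_folder}/ format described above maps onto the two separate arguments. The following is a minimal sketch of that split using only the standard library. The helper name `split_gcs_uri` and the example bucket are hypothetical and are not part of the toolbox.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_gcs_uri(gcs_uri: str) -> Tuple[str, str]:
    """Split a gs:// URI into (gcs_bucket_name, gcs_prefix).

    Hypothetical helper for illustration; not part of the toolbox.
    """
    parsed = urlparse(gcs_uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Expected a gs://bucket/prefix URI, got {gcs_uri!r}")
    # The bucket is the URI host; the prefix is the path without
    # its leading and trailing slashes.
    return parsed.netloc, parsed.path.strip("/")


# Example bucket and folders are made up for this sketch.
bucket, prefix = split_gcs_uri("gs://my-bucket/optional_folder/target_folder/")
print(bucket)  # my-bucket
print(prefix)  # optional_folder/target_folder
```

The two values can then be passed as gcs_bucket_name and gcs_prefix respectively.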
Methods
convert_document_to_annotate_file_json_response
convert_document_to_annotate_file_json_response()

Convert OCR data from Document.proto to JSON str of AnnotateFileResponse for Vision API.

| Returns | |
|---|---|
| Type | Description |
| str | JSON string of TextAnnotations. |
convert_document_to_annotate_file_response
convert_document_to_annotate_file_response()

Convert OCR data from Document.proto to AnnotateFileResponse.proto for Vision API.

| Returns | |
|---|---|
| Type | Description |
| AnnotateFileResponse | Proto with TextAnnotations. |
entities_to_bigquery
entities_to_bigquery(
dataset_name: str, table_name: str, project_id: Optional[str] = None
)

Adds extracted entities to a BigQuery table.

| Parameters | |
|---|---|
| Name | Description |
| dataset_name | str: Required. Name of the BigQuery dataset. |
| table_name | str: Required. Name of the BigQuery table. |
| project_id | Optional[str]: Optional. Project ID containing the BigQuery table. If not passed, falls back to the default inferred from the environment. |

| Returns | |
|---|---|
| Type | Description |
| bigquery.job.LoadJob | The BigQuery LoadJob for adding the entities. |
entities_to_dict
entities_to_dict()

Returns a dictionary of the entities in the document.

| Returns | |
|---|---|
| Type | Description |
| Dict | The Dict of the entities indexed by type. |
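
To make "indexed by type" concrete, here is a minimal sketch of grouping entities by their type. It does not use the toolbox: the `Entity` dataclass is a stand-in, the `mention_text` field and the sample values are assumptions, and collecting repeated types into a list is a choice made for this sketch, since how the library handles repeated types is not stated here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Entity:
    """Stand-in for the toolbox Entity; only the fields used here."""

    type_: str
    mention_text: str  # assumed field name for the extracted text


def index_entities_by_type(entities: List[Entity]) -> Dict[str, List[str]]:
    """Group entity text by entity type, preserving document order."""
    indexed: Dict[str, List[str]] = {}
    for entity in entities:
        indexed.setdefault(entity.type_, []).append(entity.mention_text)
    return indexed


# Sample entities are made up for this sketch.
entities = [
    Entity("invoice_id", "INV-001"),
    Entity("line_item", "Widget"),
    Entity("line_item", "Gadget"),
]
print(index_entities_by_type(entities))
# {'invoice_id': ['INV-001'], 'line_item': ['Widget', 'Gadget']}
```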
export_images
export_images(
output_path: str, output_file_prefix: str, output_file_extension: str
)

Exports images from Document to files.

| Parameters | |
|---|---|
| Name | Description |
| output_path | str: Required. The path to the output directory. |
| output_file_prefix | str: Required. The output file name prefix. |
| output_file_extension | str: Required. The output file extension. Format: |

| Returns | |
|---|---|
| Type | Description |
| List[str] | A list of output image file names. Format: {output_path}/{output_file_prefix}_{index}_{Entity.type_}.{output_file_extension} |
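
The output file name format above can be illustrated with plain string formatting. This is a sketch only: `build_image_file_names` is a hypothetical helper, the entity type names and the "png" extension are made-up inputs, and starting the index at 0 is an assumption that this reference does not confirm.

```python
from typing import List


def build_image_file_names(
    output_path: str,
    output_file_prefix: str,
    output_file_extension: str,
    entity_types: List[str],
) -> List[str]:
    """Build names as {output_path}/{prefix}_{index}_{type}.{extension}.

    Hypothetical helper; the 0-based index is an assumption.
    """
    return [
        f"{output_path}/{output_file_prefix}_{index}_{entity_type}"
        f".{output_file_extension}"
        for index, entity_type in enumerate(entity_types)
    ]


names = build_image_file_names("out", "scan", "png", ["Portrait", "Signature"])
print(names)  # ['out/scan_0_Portrait.png', 'out/scan_1_Signature.png']
```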
form_fields_to_bigquery
form_fields_to_bigquery(
dataset_name: str, table_name: str, project_id: Optional[str] = None
)

Adds extracted form fields to a BigQuery table.

| Parameters | |
|---|---|
| Name | Description |
| dataset_name | str: Required. Name of the BigQuery dataset. |
| table_name | str: Required. Name of the BigQuery table. |
| project_id | Optional[str]: Optional. Project ID containing the BigQuery table. If not passed, falls back to the default inferred from the environment. |

| Returns | |
|---|---|
| Type | Description |
| bigquery.job.LoadJob | The BigQuery LoadJob for adding the form fields. |
form_fields_to_dict
form_fields_to_dict()

Returns a dictionary of the form fields in the document.

| Returns | |
|---|---|
| Type | Description |
| Dict | The Dict of the form fields indexed by type. |
from_batch_process_metadata
from_batch_process_metadata(
metadata: google.cloud.documentai_v1.types.document_processor_service.BatchProcessMetadata,
)

Loads Documents from Cloud Storage, using the output from BatchProcessMetadata.

```python
from google.cloud import documentai

# client, request, and timeout are assumed to be defined by the caller.
operation = client.batch_process_documents(request)
operation.result(timeout=timeout)
metadata = documentai.BatchProcessMetadata(operation.metadata)
```

| Parameter | |
|---|---|
| Name | Description |
| metadata | documentai.BatchProcessMetadata: Required. The operation metadata after a |

| Returns | |
|---|---|
| Type | Description |
| List[Document] | A list of wrapped documents from gcs. Each document corresponds to an input file. |
from_batch_process_operation
from_batch_process_operation(location: str, operation_name: str)

Loads Documents from Cloud Storage, using the operation name returned from batch_process_documents().

```python
from google.cloud import documentai

# client and request are assumed to be defined by the caller.
operation = client.batch_process_documents(request)
operation_name = operation.operation.name
```

| Parameters | |
|---|---|
| Name | Description |
| location | str: Required. The location of the processor used for |
| operation_name | str: Required. The fully qualified operation name for a |

| Returns | |
|---|---|
| Type | Description |
| List[Document] | A list of wrapped documents from gcs. Each document corresponds to an input file. |
from_document_path
from_document_path(document_path: str)

Loads Document from local document_path.

```python
from google.cloud.documentai_toolbox import document

document_path = "/path/to/local/file.json"
wrapped_document = document.Document.from_document_path(document_path)
```

| Parameter | |
|---|---|
| Name | Description |
| document_path | str: Required. The path to the document.json file. |

| Returns | |
|---|---|
| Type | Description |
| Document | A document from local document_path. |
from_documentai_document
from_documentai_document(
documentai_document: google.cloud.documentai_v1.types.document.Document,
)

Loads Document from local documentai_document.

```python
from google.cloud import documentai
from google.cloud.documentai_toolbox import document

# client and request are assumed to be defined by the caller.
documentai_document = client.process_document(request).document
wrapped_document = document.Document.from_documentai_document(documentai_document)
```

| Parameter | |
|---|---|
| Name | Description |
| documentai_document | documentai.Document: Required. The Document.proto response. |

| Returns | |
|---|---|
| Type | Description |
| Document | A document from local documentai_document. |
from_gcs
from_gcs(
gcs_bucket_name: str, gcs_prefix: str, gcs_input_uri: Optional[str] = None
)

Loads Document from Cloud Storage.

| Parameters | |
|---|---|
| Name | Description |
| gcs_bucket_name | str: Required. The gcs bucket. Format: Given |
| gcs_prefix | str: Required. The prefix to the location of the target folder. Format: Given |
| gcs_input_uri | Optional[str]: Optional. The gcs uri to the original input file. Format: |

| Returns | |
|---|---|
| Type | Description |
| Document | A document from gcs. |
get_entity_by_type
get_entity_by_type(target_type: str)

Returns the list of Entities of target_type.

| Parameter | |
|---|---|
| Name | Description |
| target_type | str: Required. The entity type to match. |

| Returns | |
|---|---|
| Type | Description |
| List[Entity] | A list of Entity matching target_type. |
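
The lookup described above amounts to filtering a list of entities by type. Below is a minimal sketch with a stand-in `Entity` dataclass. It does not call the toolbox, the `mention_text` field and sample values are assumptions, and treating the match as an exact string comparison is also an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entity:
    """Stand-in for the toolbox Entity; only the fields used here."""

    type_: str
    mention_text: str  # assumed field name for the extracted text


def filter_entities_by_type(entities: List[Entity], target_type: str) -> List[Entity]:
    """Return the entities whose type_ equals target_type (exact match)."""
    return [entity for entity in entities if entity.type_ == target_type]


# Sample entities are made up for this sketch.
sample = [
    Entity("supplier_name", "Acme Corp"),
    Entity("line_item", "Widget"),
    Entity("line_item", "Gadget"),
]
matches = filter_entities_by_type(sample, "line_item")
print([entity.mention_text for entity in matches])  # ['Widget', 'Gadget']
```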
get_form_field_by_name
get_form_field_by_name(target_field: str)

Returns the list of FormFields named target_field.

| Parameter | |
|---|---|
| Name | Description |
| target_field | str: Required. Target field name. |

| Returns | |
|---|---|
| Type | Description |
| List[FormField] | A list of FormField matching target_field. |
search_pages
search_pages(target_string: Optional[str] = None, pattern: Optional[str] = None)

Returns the list of Pages containing target_string or text matching pattern.

| Parameters | |
|---|---|
| Name | Description |
| target_string | Optional[str]: Optional. The string to search for. |
| pattern | Optional[str]: Optional. A regular expression to match against the page text. |

| Returns | |
|---|---|
| Type | Description |
| List[Page] | A list of Pages. |
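
The difference between the two search modes can be sketched without the toolbox. In the example below, `Page` is a stand-in dataclass with an assumed `text` field, `search_pages_sketch` is a hypothetical function, and requiring exactly one of the two arguments is an assumption of this sketch rather than documented behavior.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    """Stand-in for the toolbox Page; `text` is an assumed field."""

    page_number: int
    text: str


def search_pages_sketch(
    pages: List[Page],
    target_string: Optional[str] = None,
    pattern: Optional[str] = None,
) -> List[Page]:
    """Return pages containing target_string or matching the regex pattern."""
    # Assumption for this sketch: exactly one search mode is given.
    if (target_string is None) == (pattern is None):
        raise ValueError("Pass exactly one of target_string or pattern.")
    if target_string is not None:
        return [page for page in pages if target_string in page.text]
    regex = re.compile(pattern)
    return [page for page in pages if regex.search(page.text)]


# Sample pages are made up for this sketch.
pages = [
    Page(1, "Invoice date: 2023-04-01"),
    Page(2, "Terms and conditions"),
    Page(3, "Payment due: 2023-05-01"),
]
by_string = search_pages_sketch(pages, target_string="Terms")
by_pattern = search_pages_sketch(pages, pattern=r"\d{4}-\d{2}-\d{2}")
print([page.page_number for page in by_string])   # [2]
print([page.page_number for page in by_pattern])  # [1, 3]
```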
split_pdf
split_pdf(pdf_path: str, output_path: str)

Splits local PDF file into multiple PDF files based on output from a Splitter/Classifier processor.

| Parameters | |
|---|---|
| Name | Description |
| pdf_path | str: Required. The path to the PDF file. |
| output_path | str: Required. The path to the output directory. |

| Returns | |
|---|---|
| Type | Description |
| List[str] | A list of output pdf files. |