Data validation and change detection with checksums

To validate data integrity and detect changes, Cloud Storage encourages you to use checksums when transferring data to and from your buckets. This page provides information about how checksums are used within Cloud Storage and how to specify checksums when sending requests.

Prevent data corruption by using checksums

Data can sometimes get corrupted while being transferred to or from the cloud because of software or hardware bugs, memory or router errors, electrical disturbances, or changes to the source data during file uploads that take place over an extended period.

To help protect you against data corruption, Cloud Storage supports the use of CRC32C and MD5 checksums for verifying the integrity of your data and detecting changes in your data.

CRC32C is the recommended validation method for performing integrity checks. Validation using MD5 hashes is supported for single-file uploads but isn't supported for objects that are uploaded in chunks, such as composite objects and objects uploaded using an XML API multipart upload.
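
As a rough illustration of how such checksum values can be produced, the following Python sketch computes a CRC32C and an MD5 hash and encodes each in base64. It assumes that Cloud Storage expects the CRC32C as a 4-byte big-endian value before base64 encoding. The function names (`crc32c`, `crc32c_base64`, `md5_base64`) are hypothetical helpers written for this example, not part of any Cloud Storage client library.

```python
import base64
import hashlib

# Reflected form of the Castagnoli polynomial used by CRC32C.
_POLY = 0x82F63B78


def _build_table():
    """Precompute the 256-entry lookup table for byte-at-a-time CRC32C."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32c(data: bytes, crc: int = 0) -> int:
    """Return the CRC32C of data, optionally continuing from a previous value."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def crc32c_base64(data: bytes) -> str:
    """Encode the CRC32C as base64 over its 4 big-endian bytes (assumed format)."""
    return base64.b64encode(crc32c(data).to_bytes(4, "big")).decode("ascii")


def md5_base64(data: bytes) -> str:
    """Encode the 16-byte MD5 digest as base64."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


# "123456789" is the standard CRC check input; its CRC32C is 0xE3069283.
print(hex(crc32c(b"123456789")))    # 0xe3069283
print(crc32c_base64(b"123456789"))  # 4waSgw==
```

A pure-Python loop like this is slow for large files. In practice, a hardware-accelerated CRC32C library is likely a better choice, with this sketch serving only to show the encoding steps.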

Checksums for data writes

For object writes, the client calculates the checksum of the local file and attaches it to the HTTP headers of the object upload request. The server receives the data payload, calculates its own checksum, and validates the data by comparing both checksums after the upload completes. If the checksums match, the object is stored in Cloud Storage along with its checksums. If the checksums don't match, the write request is rejected with a BadRequestException: 400 error.

Server-side validation for data writes

Cloud Storage performs server-side validation in the following cases:

  • When you supply an object's MD5 or CRC32C hash in an object upload request. To learn about types of object uploads, see Object uploads.

  • When you perform a copy or rewrite request within Cloud Storage. For object copy and rewrite requests, Cloud Storage automatically performs server-side validation based on a non-editable checksum stored with the source object.

JSON API single-request (media) uploads

For JSON API media uploads, you can specify checksums in the X-Goog-Hash header of the request. For example:

curl -X POST --data-binary @Desktop/dog-pic.jpeg \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: image/jpeg" \
    -H "X-Goog-Hash: crc32c=n03x6A==" \
    "https://storage.s3nsapis.fr/upload/storage/v1/b/my-bucket/o?uploadType=media&name=dog-pic.jpeg"

JSON API multipart uploads

For JSON API multipart uploads, you can specify checksums as part of the request container, either in the object metadata section or under a third boundary string. For details on the JSON structure and valid keys of an object, see the Objects resource representation.

The following example specifies a CRC32C checksum in the object metadata portion of a request container:

--separator_string
Content-Type: application/json; charset=UTF-8

{
"name":"my-document.txt",
"crc32c": "n03x6A=="
}

--separator_string
Content-Type: text/plain

This is a text file.
--separator_string--

The following example specifies a CRC32C checksum in the third boundary string of a request container:

--separator_string
Content-Type: application/json; charset=UTF-8

{
"name":"my-document.txt"
}

--separator_string
Content-Type: text/plain

This is a text file.

--separator_string
Content-Type: application/json; charset=UTF-8

{ "crc32c": "n03x6A==" }
--separator_string--
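
The first layout above, with the checksum in the object metadata part, could be assembled programmatically along the following lines. This is a minimal sketch for text content only. The helper name `build_multipart_body` is hypothetical, and the checksum `n03x6A==` is the placeholder value from the examples on this page rather than the real CRC32C of the sample text. It also assumes that the request carries a `Content-Type: multipart/related; boundary=separator_string` header.

```python
import json


def build_multipart_body(name, content, content_type, crc32c_b64,
                         boundary="separator_string"):
    """Build a multipart body with the checksum in the metadata part."""
    metadata = {"name": name, "crc32c": crc32c_b64}
    lines = [
        f"--{boundary}",
        "Content-Type: application/json; charset=UTF-8",
        "",
        json.dumps(metadata),
        f"--{boundary}",
        f"Content-Type: {content_type}",
        "",
        content,
        f"--{boundary}--",
        "",
    ]
    # Multipart bodies separate lines with CRLF.
    return "\r\n".join(lines)


body = build_multipart_body(
    name="my-document.txt",
    content="This is a text file.",
    content_type="text/plain",
    crc32c_b64="n03x6A==",  # placeholder; use the checksum of the real content
)
print(body)
```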

JSON API resumable uploads

For JSON API resumable uploads, you can specify checksums in the X-Goog-Hash header of the final request that completes the upload. For example:

curl -i -X PUT --data-binary @Desktop/dog-pic.jpeg \
      -H "Content-Length: 2000000" \
      -H "X-Goog-Hash: crc32c=n03x6A==" \
      "SESSION_URI"

The checksum specified in the final request is calculated from the whole object, not just the object data in the final request.
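
One way to obtain a whole-object checksum without re-reading the file is to update a running CRC32C as each chunk is sent. The following sketch assumes that approach. `crc32c_update` and `final_hash_header` are hypothetical helper names, the actual chunk upload is left as a comment, and the header value assumes the base64 encoding of the 4-byte big-endian CRC32C.

```python
import base64


def crc32c_update(crc: int, chunk: bytes) -> int:
    """Continue a CRC32C computation with another chunk (bitwise, unoptimized)."""
    crc ^= 0xFFFFFFFF
    for byte in chunk:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def final_hash_header(chunks) -> str:
    """Return the X-Goog-Hash value covering every chunk of the object."""
    crc = 0
    for chunk in chunks:
        # A real client would send this chunk to the session URI here.
        crc = crc32c_update(crc, chunk)
    encoded = base64.b64encode(crc.to_bytes(4, "big")).decode("ascii")
    return f"crc32c={encoded}"


# The running value matches the checksum of the concatenated data.
print(final_hash_header([b"1234", b"56789"]))  # crc32c=4waSgw==
print(final_hash_header([b"123456789"]))       # crc32c=4waSgw==
```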

XML API single-request uploads

For XML API single-request uploads, you can specify checksums in the x-goog-hash header of the request.

For example:

curl -X PUT --data-binary @Desktop/dog-pic.jpeg \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: image/jpeg" \
    -H "x-goog-hash: crc32c=n03x6A==" \
    "https://storage.googleapis.com/my-bucket/dog-pic.jpeg"

XML API single-request uploads also accept the standard HTTP Content-MD5 header. For details, refer to the Content-MD5 specification.

XML API multipart uploads

For XML API multipart uploads, you can specify a CRC32C checksum for the entire object or an individual checksum for each upload part. To specify an individual checksum for an upload part, include the x-goog-hash header in the request for that specific part.

For example:

PUT /dog-pic.jpeg?partNumber=1&uploadId=ABgVH8 HTTP/1.1
Host: my-bucket.storage.googleapis.com
Content-Length: 1000000
x-goog-hash: crc32c=n03x6A==

Only CRC32C checksums can be used to verify the integrity of XML API multipart uploads. MD5 checksums aren't supported.

gRPC uploads

When uploading objects using gRPC, you can specify object-level checksums in the first or last WriteObject message of any upload request, whether it's a single-shot upload or a resumable upload.

Additionally, gRPC supports per-message checksums. Each WriteObject message contains data chunks of up to 2 MiB, and each chunk can include its own checksum. You can specify per-message checksums in place of or alongside an object-level checksum.

Parallel composite uploads

In the case of parallel composite uploads, you should perform an integrity check for each component upload and then use preconditions with the upload compose request to protect against race conditions. Compose requests don't get server-side validation, so you should perform client-side validation on the new composite object if you want an end-to-end integrity check.

Google Cloud CLI copies and rewrites

In the gcloud CLI, data copied to or from a Cloud Storage bucket gets automatically validated. For cp, mv, and rsync commands, the gcloud CLI uses MD5 or CRC32C checksums to determine if there is a difference between the version of an object found at the source and the version found at the destination. If the checksum of the source data doesn't match the checksum of the destination data, the gcloud CLI deletes the invalid copy and prints a warning message. This very rarely happens. If it does, you should retry the operation.

This automatic validation occurs after the object is finalized, so invalid objects are visible for 1-3 seconds before they're identified and deleted. Additionally, the gcloud CLI might be interrupted after the upload completes but before it performs the validation, leaving the invalid object in place. You can avoid these issues when uploading single files to Cloud Storage by using server-side validation, which occurs when you use the --content-md5 flag to specify an MD5 hash.

The Google Cloud CLI ignores the --content-md5 flag for objects that don't have an MD5 hash.

Change detection for rsync

The gcloud storage rsync command compares checksums in the following scenarios to determine whether to skip a transfer:

  • The source and destination are both Cloud Storage buckets and the object has an MD5 or CRC32C checksum in both buckets.

  • The object does not have a file modification time (mtime) in either the source or destination.

In cases where an object has an mtime value in both the source and destination, such as when the source and destination are file systems, the rsync command compares the objects' size and mtime value instead of using checksums. Similarly, if the source is a bucket and the destination is a local file system, the rsync command uses the time created for the source object as a substitute for mtime, and the command does not use checksums.

If neither mtime nor checksums are available, rsync only compares file sizes when determining if there is a change between the source version of an object and the destination version. For example, neither mtime nor checksums are available when comparing composite objects with objects at a cloud provider that doesn't support CRC32C, because composite objects don't have MD5 checksums.

Client-side validation for data writes

You can perform client-side validation of your uploads by issuing a request for the uploaded object's metadata, comparing the uploaded object's hash value to the expected value, and deleting the object in case of a mismatch. This method is useful if the object's MD5 or CRC32C hash isn't known at the start of the upload.
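
The compare-then-delete step described above might look like the following sketch. It assumes that the object metadata has already been fetched into a dictionary that exposes the `crc32c` key of the object resource. `validate_uploaded_object` and the `delete_object` callback are hypothetical names introduced for this example, and the callback stands in for whatever delete call your client provides.

```python
from typing import Callable, Mapping


def validate_uploaded_object(metadata: Mapping[str, str],
                             expected_crc32c: str,
                             delete_object: Callable[[], None]) -> bool:
    """Return True if the stored hash matches; otherwise delete and return False."""
    stored = metadata.get("crc32c")
    if stored == expected_crc32c:
        return True
    # Mismatch (or missing hash): remove the object so it can't be read.
    delete_object()
    return False


# Usage with stand-in values: a matching hash keeps the object.
deleted = []
ok = validate_uploaded_object(
    {"name": "dog-pic.jpeg", "crc32c": "n03x6A=="},
    expected_crc32c="n03x6A==",
    delete_object=lambda: deleted.append("dog-pic.jpeg"),
)
print(ok, deleted)  # True []
```

Treating a missing hash as a mismatch is a conservative design choice in this sketch. An application could instead choose to report that case separately.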

The following table shows the clients in Cloud Storage that support calculating checksums for object writes by default, including the client versions that support checksums.

Client | Versions that support checksums
Cloud Storage C++ client library | 2.46 and later
Cloud Storage Go client library | 1.60.0 and later
Cloud Storage Java client library | 2.62 and later
Cloud Storage Node.js client library | 7.19.0 and later
Cloud Storage PHP client library | 1.51.0 and later
Cloud Storage Python client library | 3.7.0 and later
Cloud Storage Ruby client library | 1.60.0
Cloud Storage connector | 3.0.18 and later for the 3.0.x Cloud Storage connector; 3.1.14 and later for the 3.1.x Cloud Storage connector; 4.0.3 and later for the 4.0.x Cloud Storage connector
Cloud Storage FUSE | 3.8.0 and later
Google Cloud CLI |

Checksums for data reads

For object downloads, the server sends the object along with its stored checksum in the response. The client calculates its own checksum of the downloaded file based on the bytes it received and compares the two checksums to verify data integrity.

Some client libraries don't automatically perform checksum validation on downloaded objects. Your application might need to independently calculate the checksum of the downloaded file using the received bytes and compare it against the server-supplied hash to verify data integrity.

Client-side validation for reads

To perform an integrity check for downloaded data, calculate the checksum as the data is received and compare your results to the server-supplied checksum.

Server-side checksums are based on the complete object as it's stored in Cloud Storage, which means that the following types of downloads can't be validated against server-supplied checksums:

  • Downloads that undergo decompressive transcoding: the server-supplied checksum represents the object in its compressed state, while the served data has compression removed and consequently has a different checksum value.

  • A response that contains only a portion of the object data: this type of response occurs for Range requests.

    gRPC ranged reads are an exception to this bullet and support end-to-end validation. In gRPC ranged reads, Cloud Storage validates data by including a unique CRC32C checksum inside every individual response chunk of a stream, which lets your client immediately verify that the specific block of data wasn't corrupted in transit. For broader validation, the stream also provides the entire object's full checksum, which advanced clients can use to calculate a rolling total and verify the integrity of the larger file.

    If your application needs to read object ranges instead of full objects at once, we recommend using gRPC. Otherwise, we recommend using ranged requests only for restarting the download of a full object after the last received offset, where you can calculate and validate the checksum after the full download completes.

When validating your download, a mismatch between your calculated checksum and the server-supplied checksum indicates that the data was corrupted in transit. In these cases, you should discard the corrupted data and use the recommended retry logic to retry the request.
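
The comparison step for a full-object download could be sketched as follows. The sketch assumes that the server-supplied checksums arrive as one or more `x-goog-hash` response header values in the `algorithm=base64value` form used in the upload examples, possibly comma-separated. `parse_goog_hash` and `download_is_valid` are hypothetical helper names. The example compares an MD5 hash because it's available in the Python standard library, and a CRC32C value would be compared the same way.

```python
import base64
import hashlib


def parse_goog_hash(header_values):
    """Map algorithm name to base64 digest from x-goog-hash header values."""
    hashes = {}
    for value in header_values:
        for item in value.split(","):
            # Split on the first "=" only: base64 padding also uses "=".
            algorithm, _, digest = item.strip().partition("=")
            if algorithm and digest:
                hashes[algorithm.lower()] = digest
    return hashes


def download_is_valid(header_values, computed):
    """Compare locally computed digests with every matching server digest."""
    server = parse_goog_hash(header_values)
    shared = set(server) & set(computed)
    if not shared:
        raise ValueError("no server-supplied checksum to compare against")
    return all(server[name] == computed[name] for name in shared)


# Usage with stand-in data: hash the bytes as they are received.
hasher = hashlib.md5()
for chunk in (b"This is ", b"a text file."):
    hasher.update(chunk)
local_md5 = base64.b64encode(hasher.digest()).decode("ascii")

headers = ["crc32c=n03x6A==", "md5=" + local_md5]  # crc32c is a placeholder
print(download_is_valid(headers, {"md5": local_md5}))
```

If `download_is_valid` returns False, the downloaded bytes would be discarded and the request retried, as described above.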

What's next