With our transition to Google Cloud Storage (GCS), this article will provide details about Google Cloud Storage and how it compares to our current legacy storage service. We'll also dive into how you can migrate data from your legacy storage to your new Google Cloud Storage buckets, and how to cutover your legacy storage origin to your new Google Cloud Storage bucket seamlessly.
This article also covers how to create a new Google Cloud Storage bucket, access it, upload and delete data, view the contents of your bucket, and list data and object details.
Migration FAQ
StackPath is sunsetting our legacy storage offerings and providing access to object storage powered by GCS.
Google Cloud Storage is an object storage service with storage locations worldwide. With GCS, you will enjoy the peace of mind of low latency and 99.999999999% annual durability. You can find more details about the availability of the GCS storage classes here.
More details about Google Cloud Storage in general can be found here.
There are 3 tools we recommend to access your GCS data. The first two tools, gsutil and rclone, are CLI tools, while Cyberduck is a GUI tool.
Some key features that differ between our legacy storage and GCS include the following:
- Our legacy storage does not have an option to restrict read access as it was built for internal communication from our CDN when pulling content. However, GCS allows users to create private buckets, where the public cannot read content without authorization. To access a private GCS bucket, you can use AWS V4 signing, which is discussed further in this article. More information about signing your GCS dataset can be found here.
- FTP/SFTP and RSYNC access are not available with GCS. This document contains other options to access GCS.
Your data will remain intact until January 31, 2021. StackPath is offering assistance with the migration of data from legacy platforms to GCS buckets for qualifying datasets. If desired, we will transfer your data automatically to GCS buckets.
You will need to adjust your integrations to leverage tools compatible with GCS, such as gsutil or rclone. These tools are discussed more in this article. You will also need to migrate any CDN origin configurations to use your GCS buckets. Assistance in executing CDN origin migrations will be provided and can be executed with no downtime.
You can log into StrikeTracker to view and manage your GCS buckets, service accounts, and associated service account credentials. If you choose to opt into migration assistance, buckets will be created for you and data populated from your existing storage. This information can be found in "How do I create a GCS bucket?" below.
Yes, you may choose a location that is best suited to your needs.
Currently we offer the following locations.
- europe-west4
- us-east4
- us-west4
By default, GCS assigns all objects a Cache-Control: public, max-age=3600 header. If you would like to change this header, there are a couple of options available, which are outlined below.
- Use StrikeTracker to set the Cache Expiration Method to Relative to Ingest. This will apply to all newly ingested objects.
- Set the Cache-Control metadata value on objects. This is done on an object-by-object basis, so it's recommended for objects that should have varying Cache-Control headers.
We will cover these options in "How do I configure the Cache-Control header?" below.
No, you will need to migrate your tooling to gsutil, rclone, Cyberduck, or a similar tool.
Yes. You can find more information about GCS and S3 interoperability here.
Your cost structure will not change. You will continue to be billed the same per GB-month and operation costs.
You should have received a survey asking if you would like to be migrated, or if you would like to migrate yourself. If you have opted to be migrated by StackPath and meet our criteria for an automated migration, we will create your GCS bucket and service account, then we will migrate your data. While the migration is in progress, we ask that you upload any new data to both your legacy storage and GCS. Once the automated migration is complete, you will no longer need to upload any new data to your legacy storage, and we will work with you on switching your origin in StrikeTracker to the new GCS bucket.
If you wish to migrate yourself, or do not meet our criteria for having an automated migration, this article provides details and options for you to migrate your data. As always, our support is available 24/7 to assist with any questions or concerns you may have when migrating.
If the upload speed is known for the machine you will be using to initiate the migration from legacy storage to GCS, you can use the transfer time calculator here.
We will not be able to rewrite it for you directly, but our Support team is available to assist.
Overall, you should see improved performance with GCS with faster origin pulls and more stable storage.
How does legacy storage compare to GCS?
Our legacy storage is a file-based storage system with a primary and a backup location, whereas GCS is an object-based cloud storage system. More information about GCS can be found here.
You can view your legacy storage content via FTP. If you need any assistance accessing your legacy storage, please reach out to our 24/7 support team.
Yes, you will be able to use the GCS API using HMAC keys. When using the API with HMAC keys, you will only be able to use the XML API, not the JSON API. Below is a link to Google's XML API documentation.
Each service account created will have full access to the entire GCS bucket. These service accounts will also be the only way to authenticate to GCS.
You will not be able to manage your GCS bucket's contents within StrikeTracker. The only way to manage your GCS files will be with gsutil, rclone, Cyberduck, or similar tools.
How do I access my GCS data?
Before you can access your data, you will first need to create a service account (user), if one is not created already.
GCS does not have users, but you can create a service account from StrikeTracker that will permit access to your GCS buckets.
To create a service account, you will need to go to the menu and select "Object Storage".
From there, you will want to navigate to "Service Accounts" and click "Add Service Account".
Now that you have a service account created, you can click on that service account to generate either a key or an HMAC key, depending on how you access your content. If you are using rclone, you will want to generate a key. If you use gsutil, you will need to generate an HMAC key. We will discuss authenticating these tools in more detail below.
Creating a GCS bucket can be done within StrikeTracker. To create a new bucket you will need to click the menu and select "Object Storage".
From here you will want to select "Object Storage" from the left-hand menu (if you're not already there), and click "Add Bucket".
Here, you'll have the option to name your bucket and select the region you wish to use for that bucket.
Bucket names only support lowercase letters, numbers, and hyphens (a-z, 0-9, -), and must start and end with a letter or number.
Once your bucket is created, you can click on it and view more details about the bucket, as well as modify it from private to public.
You will need to configure gsutil to access your bucket using your service account and your HMAC key. You can find installation instructions for gsutil here if you have not yet installed this tool.
Once gsutil is installed, you will want to run the following command and follow the instructions to grant access to your service account.
gsutil config -a
By default, gsutil will try to authenticate with OAuth2 credentials from the Cloud SDK, but this is not supported by StackPath, so you will need to run the following command to ensure the HMAC credentials are used to authenticate.
gcloud config set pass_credentials_to_gsutil false
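As a quick sanity check that gsutil can authenticate with your HMAC credentials, you can list one of your buckets (the <BUCKET_ID> is the Bucket Name seen in StrikeTracker in the Bucket Details):
gsutil ls gs://<BUCKET_ID>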
You will need to configure rclone to access your bucket using the key file for your service account. You can find downloads and installation instructions for rclone here.
Below are instructions to configure a common GCS profile using rclone.
- First, run rclone config
- Enter n to create a new config
- Use the defaults for most variables, except for the following:
| Variable | Value |
|---|---|
| type | google cloud storage |
| service_account_file | location of the key file that you downloaded from StrikeTracker |
| bucket_policy_only | true |
| location | the value you see in the "REGION" column in your bucket list in StrikeTracker |
| storage_class | REGIONAL |
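For reference, the resulting GCS profile in your rclone config file (typically ~/.config/rclone/rclone.conf) might look roughly like the sketch below; the remote name gcs, the key file path, and the us-east4 location are placeholders you should replace with your own values.
[gcs]
type = google cloud storage
service_account_file = /path/to/striketracker-key.json
bucket_policy_only = true
location = us-east4
storage_class = REGIONAL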
Cyberduck provides a GUI interface to navigate, upload, and delete your data. When using Cyberduck, you will need to use HMAC keys to connect. Please see "How do I create a GCS user (service account)?" above for instructions to generate an HMAC key.
- Click "Open Connection"
- Select "Amazon S3" for the service
- Update the "Server" to the
storage.googleapis.com
- The "Access Key ID" and "Secret Access Key" are the values provided when your HMAC key was generated in StrikeTracker.
Once logged in, you will see all of your GCS buckets. You can double click the bucket you wish to view or modify to enter into it.
How do I migrate my data myself?
We recommend using rclone to migrate your data from legacy storage to GCS. Rclone is a command-line tool that will allow you to migrate data from your legacy storage to your GCS buckets. This tool will require you to create two profiles for the migration: one for legacy storage and one for GCS. Once these profiles are created for each platform, you can use the sync or copy commands to migrate your data. We will cover details on getting your migration going in this section.
The migration speed depends on the upload speed of the machine initiating the transfer. This limitation is due to how rclone migrates the data: it acts as a proxy, with the transferring machine serving as the bridge between the two platforms.
What is the difference between sync and copy when using rclone?
- sync - Syncs the source to the destination, changing the destination only. Doesn't transfer unchanged files, testing by size and modification time or MD5SUM. The destination is updated to match the source, including deleting files if necessary.
- copy - Copies the source to the destination. Doesn't transfer unchanged files, testing by size and modification time or MD5SUM. Doesn't delete files from the destination.
It is always the contents of the directory that are synced, not the directory. When source:path is a directory, it's the contents of that source directory that are copied, not the directory name and contents. If dest:path doesn't exist, it is created and the source:path contents are copied there.
There are some limitations with rclone when migrating millions of files which could result in high memory consumption. You can view more information about limitations with rclone here.
Configuring rclone for your migration
Configuring rclone is a simple process where you will follow the instructions of the program to configure the profiles for each service. You can find rclone installation instructions by going here.
Below are instructions to configure a common legacy storage profile.
- First, run rclone config
- Enter n to create a new config
- Use the defaults for most variables, except for the following:
| Variable | Value |
|---|---|
| type | ftp |
| host | upload.hwcdn.net |
| user | <legacy_ftp_user> |
| pass | <legacy_ftp_pass> |
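For reference, the resulting legacy storage profile in your rclone config file might look roughly like the sketch below; the remote name legacy is a placeholder, and rclone stores the password in an obscured form when you enter it through rclone config.
[legacy]
type = ftp
host = upload.hwcdn.net
user = <legacy_ftp_user>
pass = <obscured_legacy_ftp_pass>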
You will need to configure rclone to access your bucket using the key file for your service account. You can find downloads and installation instructions for rclone here.
Below are instructions to configure a common GCS profile using rclone.
- First, run rclone config
- Enter n to create a new config
- Use the defaults for most variables, except for the following:
| Variable | Value |
|---|---|
| type | google cloud storage |
| service_account_file | location of the key file that you downloaded from StrikeTracker |
| bucket_policy_only | true |
| location | the value you see in the "REGION" column in your bucket list in StrikeTracker |
| storage_class | REGIONAL |
When performing a migration, you can either use the copy or sync commands. With both commands, we recommend using the -P flag to show the progress. The <BUCKET_ID> is the Bucket Name seen in StrikeTracker in the Bucket Details. See examples below:
- Copy
rclone copy \
<source_legacy_storage_name_in_rclone_config>: \
<destination_GCS_name_in_rclone_config>:<BUCKET_ID> -P
More details about this command can be found here.
- Sync
When performing a sync, we strongly recommend performing a dry run using the -n flag first.
rclone sync \
<source_legacy_storage_name_in_rclone_config>: \
<destination_GCS_name_in_rclone_config>:<BUCKET_ID> -n
Once you've confirmed the dry run is good, you can perform the sync.
rclone sync \
<source_legacy_storage_name_in_rclone_config>: \
<destination_GCS_name_in_rclone_config>:<BUCKET_ID> -P
More details about this command can be found here.
By default, rclone will not copy or sync empty directories from legacy storage. To create the empty directories from your legacy storage in your GCS bucket, you will need to use the following flag during the copy or sync:
--create-empty-src-dirs
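For example (a sketch using the placeholder remote names from your rclone config), a copy that also recreates empty directories would look like:
rclone copy \
<source_legacy_storage_name_in_rclone_config>: \
<destination_GCS_name_in_rclone_config>:<BUCKET_ID> --create-empty-src-dirs -P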
Below are some rclone flags we feel will be helpful in your data transfer and future uploads. You can find more available flags here.
| Flag | Flag definition |
|---|---|
| -n, --dry-run | Do a trial run with no permanent changes |
| --ignore-existing | Skip all files that exist on the destination |
| --immutable | Do not modify files. Fail if existing files have been modified. |
| -i, --interactive | Enable interactive mode |
| --log-file string | Log everything to this file |
| --max-duration duration | Maximum duration rclone will transfer data for |
| -P, --progress | Show progress during transfer |
| --retries int | Retry operations this many times if they fail (default 3) |
| --retries-sleep duration | Interval between retrying operations if they fail, e.g. 500ms, 60s, 5m (0 to disable) |
| --timeout duration | IO idle timeout (default 5m0s) |
| --transfers int | Number of file transfers to run in parallel (default 4) |
| -u, --update | Skip files that are newer on the destination |
| -v, --verbose count | Print lots more stuff (repeat for more) |
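As an illustration of combining these flags (a sketch using the placeholder remote names from your rclone config; the log file name migration.log is arbitrary), the following sync runs eight parallel transfers, writes a log file, and shows progress:
rclone sync \
<source_legacy_storage_name_in_rclone_config>: \
<destination_GCS_name_in_rclone_config>:<BUCKET_ID> \
--transfers 8 --log-file migration.log -P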
How do I cutover from legacy storage to GCS without any downtime?
As the migration of your files from legacy storage to GCS progresses, you will want to switch your origin from legacy storage to GCS gracefully to ensure that there is no downtime.
- Before your cutover, we recommend that over 50% of your dataset has been migrated to your GCS bucket.
- For a graceful cutover, you will need to configure the origins for your site(s) so that GCS is the primary origin and legacy storage is the backup origin.
- With this configuration, all origin pull attempts will start by trying to request content from GCS. If the content does not yet exist on GCS, it will then fall back to legacy storage for the missing content. You should not see any decreased performance during origin pulls.
The primary origin is the first origin we will attempt to pull content from to be cached on the CDN. If we fail to retrieve the content from your primary origin (e.g., a 404 HTTP response), we will attempt to pull the content from the backup origin.
First, you will need to add your GCS bucket to your origins in StrikeTracker. This can be done by going to Origins → Add Origin.
When adding your GCS bucket as an origin, you will want to use the External origin type and your endpoint URL as the hostname (e.g., mybucket.storage.googleapis.com). The endpoint URL can be found in the bucket details within StrikeTracker.
Then you can update your origin connections by going to Origin → Origin Selection → Origin Connections in your site editor, and select your GCS bucket as the primary origin and your legacy storage as your backup origin.
Additionally, if it's not already configured, you should use HTTPS as the origin pull protocol.
Please note that by default, new GCS buckets will be private and will require additional signing for successful origin pulls using HMAC keys from a service account. You will also need to configure the AWS Signed OriginPull V4 policy. This will sign URLs during origin pulls so the content can be pulled and cached on our CDN. This policy can be configured by going to your site editor and navigating to Origin → Uncategorized → AWS Signed OriginPull V4. Below is the configuration template you will need to use to allow access from the CDN to GCS to pull and cache content.
You will also need to add the following signed headers in this policy:
- host
- x-amz-content-sha256
- x-amz-date
You can find Google documentation on the V4 signing here.
You should continue to upload new content to both platforms until you perform the graceful cutover outlined above. Once the cutover is completed, you will no longer need to upload new content to your legacy storage.
How do I complete common tasks in GCS?
In this section, we will cover common tasks for managing your GCS content, including listing files, uploading files, deleting files, etc.
You will be able to create service accounts and interact directly with the GCS API to upload data, modify permissions, etc. To create a service account, please see "How do I create a GCS user (service account)?" above.
Once a bucket is created, you can click on that bucket and modify the permissions so that the bucket is either private, or public under the "Visibility" policy.
Listing files in your GCS bucket can be done using gsutil, rclone, or by logging into the bucket via Cyberduck. In this section, we will cover listing files via gsutil and rclone.
gsutil
List all contents in the top-level directory. This will list all files and directories in the top-level directory, but not the files in the subdirectories. The <BUCKET_ID> is the Bucket Name seen in StrikeTracker in the Bucket Details.
gsutil ls gs://<BUCKET_ID>
List all contents in a subdirectory
gsutil ls gs://<BUCKET_ID>/<SUBDIRECTORY>/
Wildcard match subdirectories and list all content in them (i.e. /example1/, /example2/, /example3/, etc.)
gsutil ls gs://<BUCKET_ID>/example*/
You may need to use double quotes around gs://<BUCKET_ID>/example*/
List all contents using a wildcard (i.e. all .txt files)
gsutil ls gs://<BUCKET_ID>/*.txt
You may need to use double quotes around gs://<BUCKET_ID>/*.txt
Recursively list all contents in the top-level directory. This will list the top-level objects and buckets, then the objects and buckets under gs://<BUCKET_ID>/example1, then those under gs://<BUCKET_ID>/example2, etc.
gsutil ls -r gs://<BUCKET_ID>
To recursively list all contents in the top-level directory or a subdirectory in a flat list, you can use the following:
gsutil ls -r gs://<BUCKET_ID>/**
gsutil ls -r gs://<BUCKET_ID>/<SUBDIRECTORY>/**
You may need to use double quotes around gs://<BUCKET_ID>/**
Print the object size, creation time stamp, and name of the object
gsutil ls -l gs://<BUCKET_ID>/<FILENAME>
This can also be done with a wildcard to match a pattern (i.e. *.txt)
gsutil ls -l gs://<BUCKET_ID>/*.txt
You may need to use double quotes around gs://<BUCKET_ID>/*.txt
Using the "-L" flag, you can print additional details about objects and buckets.
gsutil ls -L gs://<BUCKET_ID>/<SUBDIRECTORY>
rclone
There are several commands you can use to list content in your GCS buckets using rclone. The <BUCKET_ID> is the Bucket Name seen in StrikeTracker in the Bucket Details.
- ls - list size and path of objects only
- lsl - list modification time, size, and path of objects only
- lsd - list directories only
- lsf - list objects and directories in easy to parse format
- lsjson - list objects and directories in JSON format
By default, "ls" and "lsl" are recursive, but you can use the "--max-depth 1" flag to stop the recursion. The "lsd", "lsf", and "lsjson" commands are not recursive by default, but you can use the "-R" flag to make them list recursively.
Below are some example commands
rclone ls <GCS_name_in_rclone_config>:<BUCKET_ID>
rclone ls --max-depth 1 <GCS_name_in_rclone_config>:<BUCKET_ID>/<SUBDIRECTORY>
rclone lsl <GCS_name_in_rclone_config>:<BUCKET_ID>/<SUBDIRECTORY>
rclone lsd <GCS_name_in_rclone_config>:<BUCKET_ID>
rclone lsf -R <GCS_name_in_rclone_config>:<BUCKET_ID>
rclone lsjson <GCS_name_in_rclone_config>:<BUCKET_ID>/<SUBDIRECTORY>
To upload files to your GCS bucket, you can use gsutil, rclone, or Cyberduck.
gsutil
Using gsutil, the cp and rsync commands can be used to copy/sync content to your GCS bucket. You can find more details about using cp here. The <BUCKET_ID> is the Bucket Name seen in StrikeTracker in the Bucket Details.
gsutil cp /path/to/local/files gs://<BUCKET_ID>/
You can find more details about using rsync here.
gsutil rsync /path/to/local/files gs://<BUCKET_ID>/
rclone
See above on how to configure GCS in rclone. Once configured, you can transfer data from your local machine/storage server using the copy or sync commands. The <BUCKET_ID> is the Bucket Name seen in StrikeTracker in the Bucket Details.
You can find more information about using the copy command here.
rclone copy \
/path/to/file \
<destination_GCS_name_in_rclone_config>:<GCS_bucket> -P
You can find more information about using the sync command here.
rclone sync \
/path/to/file \
<destination_GCS_name_in_rclone_config>:<GCS_bucket> -P
Deleting files from your GCS bucket can be done through gsutil or rclone.
Deleting files with gsutil is done using the rm command. More details on deleting files from GCS via gsutil can be found here. The <BUCKET_ID> is the Bucket Name seen in StrikeTracker in the Bucket Details.
Delete individual files
gsutil rm gs://<BUCKET_ID>/<FILENAME>
Delete all files in a directory, but not in subdirectories (i.e. delete only files in /example1/, but not in /example1/subdirectory1/)
gsutil rm gs://<BUCKET_ID>/example1/*
You may need to use double quotes around gs://<BUCKET_ID>/example1/*
Delete all files in a directory AND all subdirectories
gsutil rm gs://<BUCKET_ID>/example1/**
gsutil rm -r gs://<BUCKET_ID>/example1
You may need to use double quotes around gs://<BUCKET_ID>/example1/**
If you have a large amount of content to be deleted, you can use the "-m" flag to perform parallel deletions
gsutil -m rm -r gs://<BUCKET_ID>/example1
You can also delete content from a list if you have a large amount of specific content to remove
cat file_list | gsutil -m rm -I
The list must contain GCS URLs or wildcards of GCS URLs, one per line. For example:
gs://<BUCKET_ID>/file
gs://<BUCKET_ID>/example/file
gs://<BUCKET_ID>/example/*
There are a couple of commands you can use to delete content in your GCS buckets using rclone. Since these commands could result in data loss, we recommend running a dry run using the --dry-run or --interactive / -i flags first.
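As a sketch of what these look like (using the placeholder remote and bucket names from this article): rclone delete removes the files under a path while leaving the directory structure in place, and rclone purge removes the path and everything in it.
rclone delete --dry-run <GCS_name_in_rclone_config>:<BUCKET_ID>/<SUBDIRECTORY>
rclone delete <GCS_name_in_rclone_config>:<BUCKET_ID>/<SUBDIRECTORY>
rclone purge --dry-run <GCS_name_in_rclone_config>:<BUCKET_ID>/<SUBDIRECTORY>
rclone purge <GCS_name_in_rclone_config>:<BUCKET_ID>/<SUBDIRECTORY>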
To configure metadata for your GCS buckets, you will need to create a .json file locally containing the metadata you wish to apply to your bucket. Once your .json file is created you can use either gsutil or the REST API to apply the metadata to your bucket. More details about using metadata with GCS can be found here. Below you can find common metadata configurations.
| Description | GCS Equivalent documentation |
|---|---|
| Origins to be allowed to make Cross-Origin Requests, space separated | https://cloud.google.com/storage/docs/configuring-cors |
| Grants the ability to perform GET and HEAD operations on objects within a container | https://cloud.google.com/storage/docs/access-control/lists |
| Grants the ability to perform PUT, POST, and DELETE operations on objects within a container | https://cloud.google.com/storage/docs/access-control/lists |
| Determines the index file (or default page served, such as index.html) for your website | https://cloud.google.com/storage/docs/static-website |
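As an example of the CORS configuration referenced above, a minimal cors.json might look like the sketch below; the origin value https://www.example.com is a placeholder for the site that needs access to your bucket.
cors.json
[
  {
    "origin": ["https://www.example.com"],
    "method": ["GET", "HEAD"],
    "responseHeader": ["Content-Type"],
    "maxAgeSeconds": 3600
  }
]
Once the file is created, you can apply it to your bucket with gsutil. The <BUCKET_ID> is the Bucket Name seen in StrikeTracker in the Bucket Details.
gsutil cors set cors.json gs://<BUCKET_ID>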
The object lifecycle policy in GCS will allow you to execute file deletion when conditions are met. Please note that this policy applies to the entire bucket, not individual files.
When configuring the lifecycle of the bucket, there are several sets of rules that can be used as criteria to meet before modifying or deleting files.
- Age
- CreatedBefore
- CustomTimeBefore
- DaysSinceCustomTime
- DaysSinceNoncurrentTime
- IsLive
- MatchesStorageClass
- NoncurrentTimeBefore
- NumberOfNewerVersions
You can configure object lifecycles via gsutil and the REST API. In both instances, you need to create a .json file locally containing the lifecycle rules before applying the configuration to the bucket. An example is provided below, which automatically deletes files after 30 days.
lifecycle.json
{
  "lifecycle": {
    "rule": [
      {
        "action": {
          "type": "Delete"
        },
        "condition": {
          "age": 30,
          "isLive": true
        }
      }
    ]
  }
}
Once your .json file is created, you can apply the configuration. In this example, we named the file "lifecycle.json"
gsutil
The <BUCKET_ID> is the Bucket Name seen in StrikeTracker in the Bucket Details.
gsutil lifecycle set lifecycle.json gs://<BUCKET_ID>
REST API
The <BUCKET_ID> is the Bucket Name seen in StrikeTracker in the Bucket Details.
curl -X PATCH --data-binary @lifecycle.json \
-H "Authorization: Bearer OAUTH2_TOKEN" \
-H "Content-Type: application/json" \
"https://storage.googleapis.com/storage/v1/b/<BUCKET_ID>?fields=lifecycle"
CDN Configuration Recommendations and Important Notes
Here, we will cover some of our CDN configuration recommendations and important notes once you have migrated to GCS.
The Cache-Control header can be customized using StrikeTracker or metadata. If your dataset should have the same Cache-Control values, you can use StrikeTracker to configure the header. However, if your dataset requires varying Cache-Control values for your files, you will need to utilize metadata on GCS to configure the values using gsutil.
StrikeTracker
To configure the Cache-Control in StrikeTracker, you will need to navigate to the site editor for the site you wish to update. Once there, go to Cache → Cache Settings → CDN Caching. There you will want to change the Cache Expiration Method to "Relative to ingest" and configure the CDN TTL to the number of seconds you wish to cache your content.
gsutil
Using gsutil, you can use the setmeta command to set headers for specific buckets and files, including the Cache-Control header. The <BUCKET_ID> is the Bucket Name seen in StrikeTracker in the Bucket Details. Below is an example to set the Cache-Control header for your bucket:
gsutil setmeta -r -h "Cache-control:public, max-age=3600" \
gs://<BUCKET_ID>
You can also configure headers on a file-by-file basis, or by file extension. Below is an example of each:
gsutil setmeta -r -h "Cache-control:public, max-age=3600" \
gs://<BUCKET_ID>/object
gsutil setmeta -r -h "Cache-control:public, max-age=3600" \
gs://<BUCKET_ID>/*.html
You can find more information and examples for setting headers on your GCS bucket using metadata here.
By default, your GCS bucket is private upon creation.
First, you will need to add your GCS bucket to your origins in StrikeTracker. This can be done by going to Origins → Add Origin.
When adding your GCS bucket as an origin, you will want to use the External origin type and your endpoint URL as the hostname (e.g., mybucket.storage.googleapis.com). The endpoint URL can be found in the bucket details within StrikeTracker.
Now that the origin is configured, you can update your site to pull from that GCS bucket. To do this, navigate to the site editor for the site you wish to pull from GCS and go to Origin → Origin Selection. Here you will need to update the Origin Connections to your GCS bucket as the primary origin.
Additionally, if it's not already configured, you should use HTTPS as the origin pull protocol.
If your bucket is configured with public permissions, you have successfully added your GCS bucket as an origin. However, if your bucket is private, which is the default behavior, you will need to configure the AWS Signed OriginPull V4 policy. This will sign URLs during origin pulls so the content can be pulled and cached on our CDN. This policy can be configured by going to your site editor and navigating to Origin → Uncategorized → AWS Signed OriginPull V4. Below is the configuration template you will need to use to allow access from the CDN to GCS to pull and cache content.
You will also need to add the following signed headers in this policy:
- host
- x-amz-content-sha256
- x-amz-date
You can find Google documentation on the V4 signing here.
When you make your GCS bucket private, it will lose the default Cache-Control header. If you wish to use Cache-Control headers, you will need to configure them in StrikeTracker or via gsutil with metadata. Please refer to "How do I configure the Cache-Control header?" above.