AWS CodeArtifact Package Origin Control toolkit

Overview

AWS CodeArtifact is a fully managed artifact repository service that makes it easy for organizations to securely store and share software packages used for application development. On [launch date] we introduced a feature called “Package Origin Control” which allows customers to protect themselves against “dependency substitution” or “dependency confusion” attacks.

While this feature protects new packages by default, packages which lived in CodeArtifact repositories prior to the feature release are not protected without explicit configuration.

The purpose of this toolkit is to provide repository administrators with an easy way to set Origin Control policies in bulk on packages that have not received the default protection because they predate the feature release. This is achieved by blocking upstream versions for internal packages. The toolkit also supports blocking the publishing of package versions, which avoids creating a potentially vulnerable mixed state for external packages.

More information can be found on the origin control feature documentation as well as in the blog post announcing the availability of this toolkit.

Structure

The toolkit consists of two scripts: generate_package_configurations.py, which creates a manifest file listing the packages in a repository alongside the origin configuration proposed for each, and apply_package_configurations.py, which reads the manifest file and applies the configuration within.

generate_package_configurations.py can operate on a whole repository, or on a subset of packages (specified either via filters or through a list), and supports two origin control resolution modes:

Manual: Supply the origin configuration yourself via a manifest file. This is an appropriate option if you already maintain a list of internal packages, or if they are published in a consistent internal namespace which allows for them to be easily selected.
Automated: Identifies which packages should have their upstreams blocked by analyzing the upstream repository graph and external connections, looking for evidence that package versions are only available from the repository at hand. In that case it determines that sourcing of upstream versions can be disabled without risk of breaking builds. This is a good option if you want a quick way to tighten your security posture without having to manually analyze your whole repository.

apply_package_configurations.py takes the manifest file generated by generate_package_configurations.py as input, and applies the origin control changes by calling the new PutPackageOriginConfiguration API.

Precisely because it is meant to set these values in bulk, this script supports backup and revert operations by default, as well as dry-run and step-by-step confirmation options. If you identify an issue after applying origin control changes, you will be able to safely revert to the original, working configuration before trying again. See the Backup and restore section for details.

Installing

The toolkit only depends on the boto3 and tqdm packages. In order to install, simply run:

pip install -r requirements.txt

Configuring

The toolkit uses the same configuration as the AWS CLI to run. This means that you can set one of the following two sets of environment variables:

AWS_PROFILE

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
AWS_SESSION_TOKEN

Alternatively, you can use the --profile flag to indicate which AWS CLI profile you want to use. Please note that this flag is only used for authentication purposes: even if you have specified a region parameter in your AWS CLI profile, you will still be required to pass in a --region flag to the script.

Additionally, the account you are using to authenticate must have at least repository-level read permissions to run the first stage in manual mode, and read permissions on all repositories upstream of the target repository in auto mode. Stage 2 requires repository read permissions if the backup feature is enabled (the default), as well as write permissions to execute, unless you use dry-run mode (see the “More options” section below).

Using

The toolkit works on a per-repository basis. It is structured in two stages: in the first one a manifest is produced consisting of a CSV listing all the packages in the target repository, alongside their desired origin control configuration. The second stage is responsible for taking the generated CSV and setting the desired origin control configuration on every package listed within.

Stage 1: Generating the changes manifest

The first stage is invoked through generate_package_configurations.py. It requires values for domain, repository as well as region to be supplied.

Specifying origin configuration

Origin configuration is always supplied as a string of the form

publish=[value],upstream=[value]

where [value] can be either ALLOW or BLOCK. So by default all existing packages will have

publish=ALLOW,upstream=ALLOW

In order to tighten security for an internally-published package, you would want to disable upstream versions like

publish=ALLOW,upstream=BLOCK

Conversely, if you wanted to prevent users from publishing new versions to a package, you would set:

publish=BLOCK,upstream=ALLOW

These settings are always supplied as a tuple and should be thought of as working in concert.
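
To make the format concrete, here is a minimal sketch of how such a string could be parsed and validated. It is an illustration only, not the toolkit's own parser: the function name parse_restrictions is hypothetical, and accepting the two keys in either order and ignoring case are assumptions made for this example.

```python
from typing import Dict

VALID_KEYS = {"publish", "upstream"}
VALID_VALUES = {"ALLOW", "BLOCK"}


def parse_restrictions(text: str) -> Dict[str, str]:
    """Parse 'publish=ALLOW,upstream=BLOCK' into a dict, requiring both keys."""
    restrictions: Dict[str, str] = {}
    for part in text.split(","):
        key, sep, value = part.strip().partition("=")
        # Assumption: keys and values are matched case-insensitively.
        key, value = key.strip().lower(), value.strip().upper()
        if not sep or key not in VALID_KEYS or value not in VALID_VALUES:
            raise ValueError(f"invalid restriction: {part!r}")
        if key in restrictions:
            raise ValueError(f"duplicate restriction: {key}")
        restrictions[key] = value
    missing = VALID_KEYS - restrictions.keys()
    if missing:
        # The two settings work in concert, so neither may be omitted.
        raise ValueError(f"missing restriction(s): {sorted(missing)}")
    return restrictions
```

Rejecting a string that names only one of the two settings reflects the point above that they are always supplied together.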

Generating from list vs generating from query

Once the repository, domain, and region values have been supplied, you must select which packages to generate origin control configurations for. It is possible to select either all packages within the repository, or a subset.

In order to select a subset of packages available in the repository, two options are available: either by supplying a list of package names or through a query. Please note that multiple namespaces and package formats aren’t supported at once and you will have to repeat this operation explicitly for each one.

Working with a supplied packages list is as easy as specifying the input file name. The file should have one package name per line. For example, if you wanted to BLOCK upstreams for some internal npm packages as listed in an inputfile.csv file:

internal-package-1
internal-package-2
internal-package-3

You would call the first stage script:

python generate_package_configurations.py \
--region us-west-2 \
--domain test-domain \
--repository test-repository \
--from-list inputfile.csv \
--format npm

Alternatively, you can select the packages through a query, for example by format and namespace:

python generate_package_configurations.py \
--region us-west-2 \
--domain test-domain \
--repository test-repository \
--format maven \
--namespace example-namespace
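
For readers who want to see what a query-based selection amounts to at the API level, the following sketch lists matching packages through the CodeArtifact ListPackages operation in boto3. It is not taken from the toolkit's source: the helper name iter_packages is hypothetical, and the client is passed in so the function can be exercised without AWS access.

```python
from typing import Any, Dict, Iterator, Optional


def iter_packages(client: Any, domain: str, repository: str, package_format: str,
                  namespace: Optional[str] = None,
                  prefix: Optional[str] = None) -> Iterator[Dict[str, Any]]:
    """Yield package summaries matching a format, namespace and name prefix."""
    params = {"domain": domain, "repository": repository, "format": package_format}
    # Optional filters are only sent when the caller supplied them.
    if namespace is not None:
        params["namespace"] = namespace
    if prefix is not None:
        params["packagePrefix"] = prefix
    # The paginator follows nextToken across pages of results.
    paginator = client.get_paginator("list_packages")
    for page in paginator.paginate(**params):
        yield from page.get("packages", [])
```

With real credentials, the client would be created with something like boto3.client("codeartifact", region_name="us-west-2") and the generator consumed in a loop.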

Automatic vs. manual origin control setting

Once you have selected a package set, you have two ways of bulk-setting the origin configuration for each package in it. The simplest is to set the policy explicitly via the --set-restrictions flag, which we refer to as “manual” mode.

For example:

python generate_package_configurations.py \
--region us-west-2 \
--domain test-domain \
--repository test-repository \
--format pypi \
--prefix some-prefix \
--set-restrictions publish=BLOCK,upstream=ALLOW

Otherwise, you can use “automatic” mode simply by omitting the above flag. This mode is meant for administrators who want the most hassle-free experience: the toolkit will try to identify packages which can have their upstreams blocked safely, and will otherwise fall back on allowing upstreams.

The heuristic blocks acquisition of new versions from upstreams if and only if the target repository doesn’t have direct access to an external connection and no versions of the package are available via any of the upstreams, either because the target repository doesn’t have any upstreams or because none of the upstreams have the package. Automatic mode therefore assumes that no external connection is attached directly to the repository for the package format(s) you are running the script against.
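
The rule can be sketched as a small pure function. This is a simplified reading of the description above, not the toolkit's implementation: the function name and its inputs (a flag for the external connection and one set of package names per upstream repository) are assumptions chosen for illustration.

```python
from typing import Iterable, Set


def should_block_upstream(package: str,
                          has_external_connection: bool,
                          upstream_package_sets: Iterable[Set[str]]) -> bool:
    """Return True only when no upstream source could supply the package."""
    # A direct external connection means new versions could still arrive
    # from a public registry, so upstreams stay allowed.
    if has_external_connection:
        return False
    # all() over an empty iterable is True, which covers the case where
    # the repository has no upstreams at all.
    return all(package not in packages for packages in upstream_package_sets)
```

Any package for which this returns False keeps upstream=ALLOW, matching the fallback behavior described for automatic mode.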

In order to generate the list of new origin control configurations for the same subset of packages as in the previous example, simply omit the --set-restrictions flag and run:

python generate_package_configurations.py \
--region us-west-2 \
--domain test-domain \
--repository test-repository \
--format pypi \
--prefix some-prefix

Saving to a file

By default the script will save its results to a file called origin_configuration.csv. You can use the --output flag to change this to a path of your liking.

Stage 2: Applying changes

Once a well-formed CSV has been produced, it can be fed to the second stage, apply_package_configurations.py.

The same parameters as before (region, domain, repository) must be provided even though they are also present in the CSV columns. This is to ensure there is no ambiguity and to confirm you are operating on the right repository.

Invoking the second stage on origin_configuration.csv therefore looks like this:

python apply_package_configurations.py \
--region us-west-2 \
--domain test-domain \
--repository test-repository \
--input origin_configuration.csv
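
The cross-check between the command-line parameters and the manifest could look roughly like the sketch below. The column names "domain" and "repository" are assumptions, since the exact CSV layout is not spelled out here, and mismatched_rows is a hypothetical helper rather than part of the toolkit.

```python
import csv
import io
from typing import List


def mismatched_rows(manifest_text: str, domain: str, repository: str) -> List[int]:
    """Return 1-based data-row numbers that disagree with the CLI values."""
    reader = csv.DictReader(io.StringIO(manifest_text))
    bad: List[int] = []
    for row_number, row in enumerate(reader, start=1):
        # Assumed column names; a missing column also counts as a mismatch.
        if row.get("domain") != domain or row.get("repository") != repository:
            bad.append(row_number)
    return bad
```

An empty result means every row targets the repository named on the command line.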

More options

--validate-only: Verifies that the CSV is well-formed
--dry-run: Doesn’t actually call the API, but shows what the script would do.
--trace: Enables a more verbose mode.
--list-failed: In case of failure, lists packages that have failed to update the origin control configuration.
--retry-failed: Tries again to set the origin control configuration for packages that have failed to do so.
--ask-confirmation: Requires step-by-step confirmation for all write actions.
--num-workers: Controls the number of parallel workers making calls to CodeArtifact (default: 4)
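
As a rough illustration of how dry-run, failure tracking and parallel workers fit together, the sketch below applies a list of manifest rows through the boto3 put_package_origin_configuration call. It is not the toolkit's code: the function name, the row keys (format, namespace, package, publish, upstream) and the broad exception handling are assumptions made to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List, Optional


def apply_restrictions(client: Any, domain: str, repository: str,
                       rows: List[Dict[str, str]], dry_run: bool = False,
                       num_workers: int = 4) -> List[str]:
    """Apply each row's restrictions; return the names of packages that failed."""

    def apply_one(row: Dict[str, str]) -> Optional[str]:
        request = {
            "domain": domain,
            "repository": repository,
            "format": row["format"],
            "package": row["package"],
            "restrictions": {"publish": row["publish"], "upstream": row["upstream"]},
        }
        if row.get("namespace"):
            request["namespace"] = row["namespace"]
        if dry_run:
            # Report the intended change without calling the API.
            print(f"[dry-run] would set {request['restrictions']} on {row['package']}")
            return None
        try:
            client.put_package_origin_configuration(**request)
            return None
        except Exception:  # assumption: a real tool would catch botocore's ClientError
            return row["package"]

    # Each worker handles one package at a time; results keep the input order.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(apply_one, rows))
    return [name for name in results if name is not None]
```

The returned list is the kind of information that options such as --list-failed and --retry-failed would rely on.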

Backup and restore

By default, before changing any origin control configurations the script will back up the existing configuration for every package it touches (this behavior can be disabled with the --no-backup flag). Should you want to revert to the previous configuration, you can simply use the --restore flag on the same input file:

python apply_package_configurations.py \
--region us-west-2 \
--domain test-domain \
--repository test-repository \
--input origin_configuration.csv \
--restore
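
Conceptually, a backup records each package's current restrictions before they are overwritten, and a restore writes them back. The sketch below shows that round trip for a single package using the boto3 describe_package and put_package_origin_configuration calls. The helper names are hypothetical, and how the toolkit actually stores its backups is not covered here.

```python
from typing import Any, Dict, Optional


def _package_request(domain: str, repository: str, package_format: str,
                     package: str, namespace: Optional[str]) -> Dict[str, Any]:
    """Build the identifying parameters shared by both API calls."""
    request: Dict[str, Any] = {"domain": domain, "repository": repository,
                               "format": package_format, "package": package}
    if namespace:
        request["namespace"] = namespace
    return request


def backup_restrictions(client: Any, domain: str, repository: str, package_format: str,
                        package: str, namespace: Optional[str] = None) -> Dict[str, str]:
    """Read the package's current restrictions so they can be restored later."""
    request = _package_request(domain, repository, package_format, package, namespace)
    response = client.describe_package(**request)
    return dict(response["package"]["originConfiguration"]["restrictions"])


def restore_restrictions(client: Any, domain: str, repository: str, package_format: str,
                         package: str, saved: Dict[str, str],
                         namespace: Optional[str] = None) -> None:
    """Write previously saved restrictions back to the package."""
    request = _package_request(domain, repository, package_format, package, namespace)
    request["restrictions"] = dict(saved)
    client.put_package_origin_configuration(**request)
```

Reading the configuration is why the second stage needs read permissions whenever backups are enabled.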

License

This software is released under the Apache 2.0 license.