BOSH Diego Performance Release

This is a release to measure the performance of Diego. See the proposal here.
Usage
Note: to deploy with a cf-deployment-style manifest using BOSH 2.0, include the ops file operations/add-diego-perf-release.yml. You will also need to modify the ops file to use your local copy of diego-perf-release.
Prerequisites

Deploy cf-release and diego-release. To deploy this release, create a BOSH
deployment manifest with as many pusher instances as you want to use for
testing.
Running Fezzik
1. Run bosh ssh stress_tests 0.
1. Run /var/vcap/jobs/caddy/bin/1_fezzik multiple times.
1. Find the output in /var/vcap/packages/fezzik/src/github.com/cloudfoundry-incubator/fezzik/reports.json.
Running Cedar
Automatically Running 10 Batches of Cedar (Preferred)
The steps mentioned in the previous section are automated by
./cedar_script. The script pushes 10 batches of apps, each batch in its own
space. To run it:
1. Run cd /var/vcap/jobs/cedar/bin.
1. Run the following command to start the experiment:

   ./cedar_script

   To resume the experiment from the nth batch (where n is a number from 1
   to 10), add n as an argument to the script. For example, to run from the
   fourth batch:

   ./cedar_script 4
Note: if the spaces are already present from a previous run of the script,
the script will not fail and will instead continue to push to those existing
spaces. Manually delete spaces or the entire CF org if required.
This script also then pushes an extra batch of apps via cedar
and monitors them with arborist. The file
/var/vcap/sys/log/cedar/cedar-arborist-output.json
contains the results from that cedar run, and the file
/var/vcap/sys/log/arborist/arborist-output.json
contains the arborist results.
The script will also output the min/max timestamp for each batch in
/var/vcap/data/cedar/min-<batch#>.json and
/var/vcap/data/cedar/max-<batch#>.json.
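The batching and resume behavior described above can be sketched as a simple loop. This is an illustrative sketch of the script's structure, not the contents of the real cedar_script:

```bash
# Illustrative sketch of cedar_script's batch loop (not the real script).
run_batches() {
  local start=${1:-1}               # optional argument: batch number to resume from
  for batch in $(seq "$start" 10); do
    # the real script would target a per-batch space and push a batch of apps here
    echo "pushing batch $batch"
  done
}

run_batches 4    # resume from the fourth batch: runs batches 4 through 10
```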
Running Cedar from a BOSH deployment
1. Run ./scripts/generate-deployment-manifest and deploy diego-perf-release
   with the generated manifest. If on BOSH-Lite, you can use
   ./scripts/generate-bosh-lite-manifests.
1. Run bosh ssh to SSH to the cedar VM in the cf-warden-diego-perf deployment.
1. Run sudo su.
Run the following commands:

```bash
# put the CF CLI on the PATH
export PATH=/var/vcap/packages/cf-cli/bin:$PATH

# target CF and create an org and space for the apps
cf api api.bosh-lite.com --skip-ssl-validation
cf auth admin admin
cf create-org o
cf create-space cedar -o o
cf target -o o -s cedar

# run cedar in the background
cd /var/vcap/packages/cedar
/var/vcap/packages/cedar/bin/cedar \
  -n 1 \
  -k 2 \
  -payload /var/vcap/packages/cedar/assets/temp-app \
  -config /var/vcap/packages/cedar/config.json \
  -domain bosh-lite.com \
  &
```
Running Cedar Locally
1. Target a CF deployment.
1. Target a chosen org and space.
1. From the root of this repo, run cd src/code.cloudfoundry.org/diego-stress-tests/cedar/assets/stress-app.
1. Precompile the stress-app to assets/temp-app by running GOOS=linux GOARCH=amd64 go build -o ../temp-app/stress-app.
1. Run cd ../.. to change back to src/code.cloudfoundry.org/diego-stress-tests/cedar.
1. Run go build to build the cedar binary.
1. Run ./cedar -h to see the list of options you can provide to cedar.
One of the most important options is a JSON-encoded config file that
provides the manifest paths for the different apps being pushed. The
default config.json can be found
here.
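For illustration only, a config of this kind might look like the following. The key names and paths below are invented for the example; the actual schema is whatever the default config.json in the repo uses:

```json
[
  { "manifestPath": "assets/manifests/manifest-light.yml" },
  { "manifestPath": "assets/manifests/manifest-medium.yml" }
]
```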
Run Arborist from a BOSH deployment
Note: Arborist depends on a successful cedar run, as it uses the output file from
cedar as an input.
Run the example below to monitor apps on a BOSH-Lite installation:
1. Run ./scripts/generate-bosh-lite-manifests and deploy diego-perf-release with the generated manifest.
1. Run bosh ssh to SSH to the cedar VM in the cf-warden-diego-perf deployment.
1. Run sudo su.
Run the following commands to run arborist from a tmux session:

```bash
# start a new tmux session
/var/vcap/packages/tmux/bin/tmux new -s arborist

# inside the tmux session, start arborist in the background
cd /var/vcap/packages/arborist
/var/vcap/packages/arborist/bin/arborist \
  -app-file <cedar-output-file> \
  -duration 10m \
  -logLevel info \
  -request-interval 10s \
  -result-file output.json \
  &
```

1. To detach from the `tmux` session, send `Ctrl-b d`.
1. To reattach to the `tmux` session, run `/var/vcap/packages/tmux/bin/tmux attach -t arborist`.
Run Arborist Locally

1. cd to `src/code.cloudfoundry.org/diego-stress-tests/arborist`.
1. Build the arborist binary with `go build`.
1. Run the following to start a test:
```bash
./arborist \
  -app-file <cedar-output-file> \
  -duration 10m \
  -logLevel info \
  -request-interval 10s \
  -result-file output.json
```
Arborist has the following usage options:

```
  -app-file string
        path to json application file
  -domain string
        domain where the applications are deployed (default "bosh-lite.com")
  -duration duration
        total duration to check routability of applications (default 10m0s)
  -logLevel string
        log level: debug, info, error or fatal (default "info")
  -request-interval duration
        interval in seconds at which to make requests to each individual app (default 1m0s)
  -result-file string
        path to result file (default "output.json")
```
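Conceptually, these options describe a polling loop: every -request-interval, a request goes to each app, for -duration in total, and routability successes and failures are tallied. The sketch below illustrates only that loop shape; check_app is a stand-in for a real HTTP request, and arborist's internals may differ:

```bash
# Sketch of arborist-style routability polling. check_app stands in for
# issuing an HTTP request to the app's route.
check_app() { [ "$1" != "b" ]; }   # pretend app "b" is unroutable

monitor_apps() {
  local apps="$1" rounds="$2" ok=0 failed=0
  for round in $(seq 1 "$rounds"); do   # one round per request interval
    for app in $apps; do
      if check_app "$app"; then ok=$((ok+1)); else failed=$((failed+1)); fi
    done
    # the real tool would sleep for the request interval here
  done
  echo "ok=$ok failed=$failed"
}

monitor_apps "a b" 3    # three polling rounds over apps "a" and "b"
```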
Monitoring the cluster
The team has created three Grafana dashboards with graphs of interesting
metrics. The name and description of each dashboard are below:
1. aggregation/bosh_influxdb_dashboard.json: system metrics (CPU usage, system load, and disk usage) across the entire cluster
1. aggregation/diego_influxdb_dashboard.json: Diego metrics (e.g. BBS API latency, BBS requests/s)
1. aggregation/golang_stats_influxdb_dashboard.json: Golang runtime metrics (e.g. number of goroutines, GC pause times)
Importing dashboards

To import any of those dashboards, from the home page:

1. Click on Home (or the dashboard search dropdown).
1. Click on Import.
1. Choose a file.
1. Save the dashboard (Ctrl+S, or the drive icon next to the dashboard dropdown).

See the Grafana export/import documentation for more information.

Exporting dashboards

To export a dashboard after editing it:

1. Click the Manage dashboard button (the gear icon next to the dashboard dropdown).
1. Click Export.

See the Grafana export/import documentation for more information.
Aggregating results

Preprocessing using perfchug

perfchug is a tool that ships with diego-perf-release. It takes log
output from cedar, bbs, and auctioneer, processes it, and converts it into
something that can be fed into InfluxDB.

To use perfchug locally:

1. Run cd <path>/diego-perf-release/src/code.cloudfoundry.org/diego-stress-tests/perfchug.
1. Run go install to build the executable.
1. Move the executable into your $PATH.
1. Once it is on the $PATH, supply lager-formatted logs to perfchug on its stdin; it emits InfluxDB-formatted metrics on stdout.
Automatic downloading and aggregation

The diego_results.sh script automates the entire process of downloading BOSH
logs and aggregating results. To use it:

1. Make sure you are on a jump box inside the deployment, e.g. the director.
1. Make sure you are bosh targeted to the right environment.
1. Make sure you have perfchug, veritas, and bosh on your PATH.
1. Create a new directory and cd into it. It will be used as the working
   directory for the script, and BOSH logs will be downloaded into it.
1. Run the script from that directory.

The output file will contain one line per query. All query results are valid
JSON. If there are no data points in InfluxDB for a query (e.g. no failures),
InfluxDB will return an empty result, e.g. {"results":[]}.

If the output file parameter is provided, diego_results.sh will also trigger
a post-processing script that condenses the output into metrics.csv, a more
human-readable format.
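Because the output file holds one JSON object per line and an empty result is exactly {"results":[]}, counting queries with no data points can be done with a one-line grep. The demo file below is fabricated just to show the shape:

```bash
# Count result lines with no data points (the /tmp demo file is fabricated).
count_empty() {
  grep -c '^{"results":\[\]}$' "$1"
}

printf '%s\n' '{"results":[{"series":[]}]}' '{"results":[]}' > /tmp/results-demo.json
count_empty /tmp/results-demo.json    # one of the two sample lines is empty
```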
Snapshotting and Restoring Influxdb (GCP Only)
Snapshotting
1. Go to the Google Cloud Platform dashboard and find the influxdb instance.
1. Find the "Additional disks" section and click on the disk to be snapshotted.
1. Click "Create Snapshot" at the top of the window that opens up.
1. Name the snapshot and click "Create".
Restoring a snapshotted InfluxDB
1. Go to the Google Cloud Platform dashboard and find the influxdb instance.
1. Click "Edit" at the top of the page.
1. Find the "Additional disks" section and add a disk from the snapshot.
1. Click "Save" at the bottom of the page. The new disk will appear as /dev/sd[a-z] (where [a-z] is the next available letter for a disk name).
1. Edit /etc/mtab on the influxdb VM to add the new filesystem from /dev/sd[a-z] to /var/vcap/store2.
1. Run mkdir -p /var/vcap/store2 && cd /var/vcap/store2 && mount /dev/sd[a-z]1.
1. Change all references from /var/vcap/store to /var/vcap/store2 in /var/vcap/jobs/influxdb.
1. Restart influxdb with monit restart influxdb.
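The reference rewrite in the restore steps can be done with sed. A sketch, assuming GNU sed and that plain-text files under /var/vcap/jobs/influxdb hold the references (the demo below uses a temporary file instead):

```bash
# Rewrite /var/vcap/store references to /var/vcap/store2 in the given files.
# \b keeps an already-rewritten /var/vcap/store2 from becoming store22 (GNU sed).
rewrite_store_refs() {
  sed -i 's|/var/vcap/store\b|/var/vcap/store2|g' "$@"
}

printf '%s\n' 'storage_dir: /var/vcap/store/influxdb' > /tmp/influxdb-demo.conf
rewrite_store_refs /tmp/influxdb-demo.conf
cat /tmp/influxdb-demo.conf
```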
Development
These tests are meant to be run against a real IaaS. However, it is possible to
run them against BOSH-Lite during development. A deployment manifest template is
in templates/bosh-lite.yml. Use
spiff to merge it with a
director_uuid stub.
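A minimal director_uuid stub for spiff might look like this; the placeholder value must be replaced with your director's UUID (with the BOSH v1 CLI, bosh status --uuid prints it):

```yaml
---
director_uuid: REPLACE-WITH-YOUR-DIRECTOR-UUID
```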