
1. Testing The Framework

This section is intended for framework developers. After modifying the framework’s core code, you can test it using mocked datasets by executing these commands:

# Generate mocked datasets:
# - Creates "_mocked_benchmark" case set in "data/case_set/"
# - Creates "_mocked_samples" sample set in "data/sample_set/"
# - Creates "_mocked_eval_task" evaluation task in "data/eval_task/"
python mock.py -case -sample -task

# Execute evaluation (mock mode):
# 1. Tasks prefixed with "_mocked_" will run in mock mode
# 2. Runtime profiles will be generated in "data/eval_task/task_running/"
# 3. Final results will be stored in "data/result/"
python evaluate.py _mocked_eval_task

# Cleanup mocked assets (optional):
# Note: This removes only the mocked datasets/tasks, not runtime profiles/results
python mock.py clean -case -sample -task

2. Creating The Case Set

Before evaluating, you must create a case set unless one already exists. Create one with the following command:

# The name "my_benchmark" should be replaced with an appropriate name.
# Note: This command will prompt you for the title and description of the new case set; you can enter any text for them.
python create_case_set.py my_benchmark

3. Making A Case

This section walks through an example of creating cases for a resource-creation operation.

3.1. Creating Workshop Directory

Create a directory named workshop_of_case.

3.2. Initializing Environment

Navigate into the workshop_of_case directory and create an HCL source file, e.g., main.tf, as shown below:

# A provider block is required. You can also add a required_providers block if needed.
provider "alicloud" {
  region = "cn-shanghai"
}

Next, run the init and apply commands to initialize the environment:

# The init command downloads the provider plugins and creates the .terraform.lock.hcl file.
tofu init

# The apply command generates the Terraform state file.
tofu apply

At this point, the environment is initialized and ready for resource creation.

3.3. Saving The Pre-Operation Environment

Run the save_env.py script to capture the environment state before performing any operations:

python save_env.py pre_op

This will create a directory named pre_op inside the workshop directory, containing the necessary files to represent the pre-operation environment state.

3.4. Creating An OSS Bucket

Update the main.tf file as follows:

provider "alicloud" {
  region = "cn-shanghai"
}

# Declare an OSS bucket.
resource "alicloud_oss_bucket" "demo_bucket" {
  bucket = "demo-bucket-2025-0507-1348"
  acl = "public-read"
}

Then, run the apply command to create the OSS bucket:

tofu apply

3.5. Saving The Post-Operation Environment

Run the save_env.py script to capture the environment state after the operations have been performed:

python save_env.py post_op

This will create a directory named post_op inside the workshop directory, containing the necessary files to represent the post-operation environment state.

3.6. Preparing A Manifest File

Create a file named manifest.yaml in the workshop directory to define the test cases. Below is an example:

resource_type: alicloud_oss_bucket
operation_type: create
singularity: s0
essentials:
  - >
    count(alicloud_oss_bucket) == 1
  - >
    opt("alicloud_oss_bucket_acl") is None or count(alicloud_oss_bucket_acl) == 1
  - >
    provider_of(first_of(alicloud_oss_bucket)).region == "cn-shanghai"
  - >
    opt("alicloud_oss_bucket_acl") is None or
    provider_of(first_of(alicloud_oss_bucket_acl)).region == "cn-shanghai"
  - >
    first_of(alicloud_oss_bucket).bucket == "demo-bucket-2025-0507-1348"
  - >
    len(list(filter(
        lambda x: x is not None,
        {
            first_of(alicloud_oss_bucket).opt("acl"),
            first_of(opt("alicloud_oss_bucket_acl"))
        }
    ))) == 1
  - >
    first_of(alicloud_oss_bucket).opt("acl") == "public-read" or
    opt("alicloud_oss_bucket_acl") is not None and
    first_of(alicloud_oss_bucket_acl).opt("acl") == "public-read"
  - >
    opt("alicloud_oss_bucket_acl") is None or first_of(alicloud_oss_bucket_acl).bucket ==
    "${" + first_of(alicloud_oss_bucket).self_path() + ".bucket}"
  - >
    first_of(alicloud_oss_bucket).opt("storage_class") in {None, "Standard"}
  - >
    first_of(alicloud_oss_bucket).opt("lifecycle_rule") is None
actions:
  side_effect: deny
  required:
    - action: create
      pattern: alicloud_oss_bucket.*
      count: 1
  optional:
      - action: create
        pattern: alicloud_oss_bucket_acl.*
        count: 1
      - action: create
        pattern: alicloud_oss_bucket_versioning.*
        count: 1
security: []
misc:
  - >
    alicloud_oss_bucket.opt("demo_bucket") is not None
  - >
    alicloud_oss_bucket.demo_bucket.opt("versioning") is None or
    alicloud_oss_bucket.demo_bucket.versioning.opt("status") == "Suspended"
  - >
    opt("alicloud_oss_bucket_versioning") is None or
    first_of(alicloud_oss_bucket_versioning).bucket ==
    "${" + first_of(alicloud_oss_bucket).self_path() + ".bucket}" and
    first_of(alicloud_oss_bucket_versioning).opt("status") in {None, "Suspended"}
  - >
    alicloud_oss_bucket.demo_bucket.opt("cors_rule") is None
  - >
    alicloud_oss_bucket.demo_bucket.opt("website") is None
  - >
    alicloud_oss_bucket.demo_bucket.opt("logging") is None
  - >
    alicloud_oss_bucket.demo_bucket.opt("server_side_encryption_rule") is None
  - >
    alicloud_oss_bucket.demo_bucket.opt("transfer_acceleration") is None
  - >
    alicloud_oss_bucket.demo_bucket.opt("redundancy_type") is None or
    alicloud_oss_bucket.demo_bucket.redundancy_type == "LRS"
  - >
    alicloud_oss_bucket.demo_bucket.opt("access_monitor") is None or
    alicloud_oss_bucket.demo_bucket.access_monitor.opt("status") == "Disabled"
user_inputs:
  i3: >
    Create an Alibaba Cloud OSS bucket in the cn-shanghai region.
    Name the bucket demo-bucket-2025-0507-1348, and set the resource block name to demo_bucket.
    Set the ACL to public-read.
    Do not enable versioning or lifecycle management.
  i2: >
    Create an Alibaba Cloud OSS bucket in the cn-shanghai region.
    Name the bucket demo-bucket-2025-0507-1348.
    Set the ACL to public-read.
  i1: >
    Create an OSS bucket named demo-bucket-2025-0507-1348 in Shanghai that anyone can read.
  i0: >
    Create an OSS bucket.
clarity: [c1, c0]

Each case is defined by a combination of five dimensions: resource type, operation type, integrity, clarity, and singularity. Given the combinations specified in this manifest.yaml (four integrity levels, i0 through i3, and two clarity levels, c1 and c0), up to 8 unique cases can be described.
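As a quick sanity check, the case count follows directly from the sizes of the varying dimensions in this manifest (a sketch; the framework enumerates cases itself):

```python
from itertools import product

# Dimension values taken from the manifest.yaml above. Only integrity
# and clarity vary; the other three dimensions are fixed in this manifest.
resource_types = ["alicloud_oss_bucket"]
operation_types = ["create"]
singularities = ["s0"]
integrities = ["i3", "i2", "i1", "i0"]  # keys of user_inputs
clarities = ["c1", "c0"]                # the clarity list

cases = list(product(resource_types, operation_types, singularities,
                     integrities, clarities))
print(len(cases))  # 1 * 1 * 1 * 4 * 2 = 8
```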

3.7. Saving The Cases To The Case Set

Use the save_cases.py script to save the cases defined in the workshop directory to a specified case set:

# "my_benchmark" refers to the name of an existing case set.
python save_cases.py my_benchmark

After saving the cases, restore the real cloud environment to the state captured in the "pre_op" snapshot (e.g., destroy the OSS bucket created above).

Now that the cases have been created, you can repeat the steps above to add more cases as needed.

4. Setting Samples and Evaluation Task

Create a sample set file in data/sample_set, e.g., samples_testing.yaml:

kind: sample_set
sample_set:
  title: Sample For Testing
  description: This sample set is for testing.
  samples:
    - [ qwen-max-2025-01-25, builtin_agent.d1.e1 ]
    - [ deepseek-chat, builtin_agent.d0.e1 ]

In the samples section of the YAML file, each sample is a pair consisting of a model name and a decorated agent name.

  • The model name refers to a specific LLM.
  • The decorated agent name includes the agent’s name and two feature flags:
    • “d0”: the agent does not support documentation.
    • “d1”: the agent does support documentation.
    • “e0”: the agent is not aware of the IaC environment.
    • “e1”: the agent is aware of the IaC environment.
    • “es”: the agent is only aware of the IaC state.
    • “ec”: the agent is only aware of the IaC code (HCL).

The builtin_agent is a special type of agent that can enable or disable specific features using feature flags.
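For illustration, decoding a decorated agent name could look like the helper below (parse_decorated_agent is a hypothetical function, not part of the framework; the framework's own parser may differ):

```python
def parse_decorated_agent(name):
    """Split a decorated agent name such as 'builtin_agent.d1.e1' into
    the agent name and its two feature flags, per the flag table above."""
    agent, doc_flag, env_flag = name.split(".")
    return {
        "agent": agent,
        "documentation": doc_flag == "d1",
        "environment_awareness": {
            "e0": "none",        # not aware of the IaC environment
            "e1": "full",        # aware of the IaC environment
            "es": "state only",  # only aware of the IaC state
            "ec": "code only",   # only aware of the IaC code (HCL)
        }[env_flag],
    }

print(parse_decorated_agent("builtin_agent.d1.e1"))
```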

Finally, create an evaluation task file in data/eval_task, e.g., eval_testing.yaml:

kind: eval_task
eval_task:
  case_set: my_benchmark
  sample_set: samples_testing
  repeats: 3
  iac_tool:
    name: tofu
    version: 1.9.0
    path: /opt/homebrew/bin/tofu

Key fields explained:

  • case_set: The name of the case set to be evaluated.
  • sample_set: The name of the sample set to be used in this evaluation.
  • repeats: The number of times each case will be evaluated.
  • iac_tool: Configuration for the Infrastructure-as-Code (IaC) tool, including:
    • name: The name of the IaC tool (e.g., tofu)
    • version: The version to be used
    • path: The full path to the executable
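As a sketch, the eval_testing.yaml example above maps onto a structure like this (field names mirror the YAML keys; EvalTask and IacTool are illustrative classes, not the framework's actual types):

```python
from dataclasses import dataclass

@dataclass
class IacTool:
    name: str     # the IaC tool, e.g. tofu
    version: str  # the version to be used
    path: str     # full path to the executable

@dataclass
class EvalTask:
    case_set: str    # name of the case set to evaluate
    sample_set: str  # name of the sample set to use
    repeats: int     # times each case is evaluated
    iac_tool: IacTool

# The eval_testing.yaml example, in parsed form.
task = EvalTask(
    case_set="my_benchmark",
    sample_set="samples_testing",
    repeats=3,
    iac_tool=IacTool(name="tofu", version="1.9.0",
                     path="/opt/homebrew/bin/tofu"),
)
assert task.repeats > 0, "each case must be evaluated at least once"
```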

5. Before Evaluating

Before running the evaluation, make sure the following preparations are complete:

  • Provider credentials: Store the credentials for the provider plugins in environment variables (e.g., ALICLOUD_ACCESS_KEY, ALICLOUD_SECRET_KEY).

  • Configuration file: Create or update the config.yaml file. Example:

    eval_concurrency: 4
    models:
      - name: qwen-max-2025-01-25
        endpoint: https://dashscope.aliyuncs.com
        base_url_path: /compatible-mode/v1
        secret_key_env_var: LLM_SK
      - name: qwen-plus-2025-01-25
        endpoint: https://dashscope.aliyuncs.com
        base_url_path: /compatible-mode/v1
        secret_key_env_var: LLM_SK
      - name: qwen-turbo-2025-02-11
        endpoint: https://dashscope.aliyuncs.com
        base_url_path: /compatible-mode/v1
        secret_key_env_var: LLM_SK
      - name: deepseek-chat
        endpoint: https://api.deepseek.com
        base_url_path: /v1
        secret_key_env_var: DEEPSEEK_SK
      - name: qwq-plus
        endpoint: https://dashscope.aliyuncs.com
        base_url_path: /compatible-mode/v1
        secret_key_env_var: LLM_SK
    builtin_agent:
      default_model: qwen-max-2025-01-25
    informer_agent:
      c1_model: qwq-plus
      c0_model: qwq-plus
  • Custom agents: If you’re using agents other than builtin_agent, you must implement corresponding adapters and register them in adapter/factory.py.
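To illustrate how a model entry's fields fit together: the endpoint and base_url_path presumably combine into an OpenAI-compatible base URL, with the API key read from the named environment variable. The resolve_model helper below is hypothetical, not framework code:

```python
import os

def resolve_model(model_cfg):
    """Build the base URL and fetch the API key for one entry in the
    models list of config.yaml (an assumed wiring; the framework's
    actual loader may differ)."""
    base_url = model_cfg["endpoint"].rstrip("/") + model_cfg["base_url_path"]
    api_key = os.environ.get(model_cfg["secret_key_env_var"])
    return base_url, api_key

os.environ["LLM_SK"] = "sk-example"  # placeholder key for demonstration
cfg = {
    "name": "qwen-max-2025-01-25",
    "endpoint": "https://dashscope.aliyuncs.com",
    "base_url_path": "/compatible-mode/v1",
    "secret_key_env_var": "LLM_SK",
}
print(resolve_model(cfg))
# → ('https://dashscope.aliyuncs.com/compatible-mode/v1', 'sk-example')
```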

6. Evaluating

To start the evaluation, run evaluate.py with the task name:

python evaluate.py eval_testing

If the evaluation is interrupted, you can resume it by running resume.py with the same task name:

python resume.py eval_testing

7. Analyzing The Result

After the evaluation completes, a result file will be generated in the data/result directory. You can analyze the results as needed based on your specific criteria.
