1. Testing The Framework
This section is intended for framework developers.
After modifying the framework’s core code, you can test it using mocked datasets by executing these commands:
# Generate mocked datasets:
# - Creates "_mocked_benchmark" case set in "data/case_set/"
# - Creates "_mocked_samples" sample set in "data/sample_set/"
# - Creates "_mocked_eval_task" evaluation task in "data/eval_task/"
python mock.py -case -sample -task
# Execute evaluation (mock mode):
# 1. Tasks prefixed with "_mocked_" will run in mock mode
# 2. Runtime profiles will be generated in "data/eval_task/task_running/"
# 3. Final results will be stored in "data/result/"
python evaluate.py _mocked_eval_task
# Cleanup mocked assets (optional):
# Note: This removes only the mocked datasets/tasks, not runtime profiles/results
python mock.py clean -case -sample -task
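For repeated runs, the three steps above can be chained from a small driver script. The following is a minimal sketch and not part of the framework: the function names and the injectable runner are illustrative assumptions, and only the command lines themselves come from this section.

```python
import subprocess
import sys

# Flags taken from the commands in this section.
MOCK_FLAGS = ["-case", "-sample", "-task"]


def mock_cycle_commands(python=sys.executable, cleanup=True):
    """Return the command lines for one mock test cycle, in execution order."""
    commands = [
        [python, "mock.py", *MOCK_FLAGS],              # generate mocked datasets
        [python, "evaluate.py", "_mocked_eval_task"],  # evaluate in mock mode
    ]
    if cleanup:
        # Optional: remove the mocked datasets/tasks again.
        commands.append([python, "mock.py", "clean", *MOCK_FLAGS])
    return commands


def run_mock_cycle(runner=subprocess.run, cleanup=True):
    """Run each step in turn; check=True stops at the first failing command."""
    for command in mock_cycle_commands(cleanup=cleanup):
        runner(command, check=True)
```

Calling run_mock_cycle() from the framework's root directory would run the full cycle; passing cleanup=False keeps the mocked assets for inspection.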
2. Creating The Case Set
Before evaluating, you need a case set. If you do not already have one,
create it with the following command:
# The name "my_benchmark" should be replaced with an appropriate name.
# Note: This command prompts for the title and description of the new case set; any text is accepted.
python create_case_set.py my_benchmark
3. Making A Case
This section walks through an example of creating cases for a resource-creation operation, using an Alibaba Cloud OSS bucket.
3.1. Creating Workshop Directory
Create a directory named workshop_of_case.
3.2. Initializing Environment
Navigate into the workshop_of_case directory and create an HCL source file, e.g. main.tf, as shown below:
# A provider block is required. You can also declare provider requirements (a required_providers block) if needed.
provider "alicloud" {
  region = "cn-shanghai"
}
Next, run the init and apply commands to initialize the environment:
# The init command downloads the provider plugins and creates the .terraform.lock.hcl file.
tofu init
# The apply command generates the Terraform state file.
tofu apply
At this point, the environment is initialized and ready for resource creation.
3.3. Saving The Pre-Operation Environment
Run the save_env.py script to capture the environment state before performing any operations:
python save_env.py pre_op
This will create a directory named pre_op inside the workshop directory,
containing the necessary files to represent the pre-operation environment state.
3.4. Creating An OSS Bucket
Update the main.tf file to add the OSS bucket resource.
Then, run the apply command to create the OSS bucket:
tofu apply
3.5. Saving The Post-Operation Environment
Run the save_env.py script to capture the environment state after the operation:
python save_env.py post_op
This will create a directory named post_op inside the workshop directory,
containing the necessary files to represent the post-operation environment state.
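To see what the operation changed, the two snapshot directories can be compared. The sketch below assumes only that pre_op and post_op are ordinary directories; it is not part of the framework, the helper name snapshot_diff is hypothetical, and it compares top-level entries only.

```python
import filecmp


def snapshot_diff(pre_dir, post_dir):
    """Summarize top-level differences between two snapshot directories."""
    comparison = filecmp.dircmp(pre_dir, post_dir)
    return {
        "added": sorted(comparison.right_only),    # present only in post_dir
        "removed": sorted(comparison.left_only),   # present only in pre_dir
        "changed": sorted(comparison.diff_files),  # present in both, contents differ
    }
```

For example, snapshot_diff("pre_op", "post_op") run inside the workshop directory returns a dictionary with the three lists.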
3.6. Preparing A Manifest File
Create a file named manifest.yaml in the workshop directory to define the test cases. Below is an example:
resource_type: alicloud_oss_bucket
operation_type: create
singularity: s0
essentials:
  - >
    count(alicloud_oss_bucket) == 1
  - >
    opt("alicloud_oss_bucket_acl") is None or count(alicloud_oss_bucket_acl) == 1
  - >
    provider_of(first_of(alicloud_oss_bucket)).region == "cn-shanghai"
  - >
    opt("alicloud_oss_bucket_acl") is None or
    provider_of(first_of(alicloud_oss_bucket_acl)).region == "cn-shanghai"
  - >
    first_of(alicloud_oss_bucket).bucket == "demo-bucket-2025-0507-1348"
  - >
    len(list(filter(
      lambda x: x is not None,
      {
        first_of(alicloud_oss_bucket).opt("acl"),
        first_of(opt("alicloud_oss_bucket_acl"))
      }
    ))) == 1
  - >
    first_of(alicloud_oss_bucket).opt("acl") == "public-read" or
    opt("alicloud_oss_bucket_acl") is not None and
    first_of(alicloud_oss_bucket_acl).opt("acl") == "public-read"
  - >
    opt("alicloud_oss_bucket_acl") is None or first_of(alicloud_oss_bucket_acl).bucket ==
    "${" + first_of(alicloud_oss_bucket).self_path() + ".bucket}"
  - >
    first_of(alicloud_oss_bucket).opt("storage_class") in {None, "Standard"}
  - >
    first_of(alicloud_oss_bucket).opt("lifecycle_rule") is None
actions:
  side_effect: deny
  required:
    - action: create
      pattern: alicloud_oss_bucket.*
      count: 1
  optional:
    - action: create
      pattern: alicloud_oss_bucket_acl.*
      count: 1
    - action: create
      pattern: alicloud_oss_bucket_versioning.*
      count: 1
security: []
misc:
  - >
    alicloud_oss_bucket.opt("demo_bucket") is not None
  - >
    alicloud_oss_bucket.demo_bucket.opt("versioning") is None or
    alicloud_oss_bucket.demo_bucket.versioning.opt("status") == "Suspended"
  - >
    opt("alicloud_oss_bucket_versioning") is None or
    first_of(alicloud_oss_bucket_versioning).bucket ==
    "${" + first_of(alicloud_oss_bucket).self_path() + ".bucket}" and
    first_of(alicloud_oss_bucket_versioning).opt("status") in {None, "Suspended"}
  - >
    alicloud_oss_bucket.demo_bucket.opt("cors_rule") is None
  - >
    alicloud_oss_bucket.demo_bucket.opt("website") is None
  - >
    alicloud_oss_bucket.demo_bucket.opt("logging") is None
  - >
    alicloud_oss_bucket.demo_bucket.opt("server_side_encryption_rule") is None
  - >
    alicloud_oss_bucket.demo_bucket.opt("transfer_acceleration") is None
  - >
    alicloud_oss_bucket.demo_bucket.opt("redundancy_type") is None or
    alicloud_oss_bucket.demo_bucket.redundancy_type == "LRS"
  - >
    alicloud_oss_bucket.demo_bucket.opt("access_monitor") is None or
    alicloud_oss_bucket.demo_bucket.access_monitor.opt("status") == "Disabled"
user_inputs:
  i3: >
    Create an Alibaba Cloud OSS bucket in the cn-shanghai region.
    Name the bucket demo-bucket-2025-0507-1348, and set the resource block name to demo_bucket.
    Set the ACL to public-read.
    Do not enable versioning or lifecycle management.
  i2: >
    Create an Alibaba Cloud OSS bucket in the cn-shanghai region.
    Name the bucket demo-bucket-2025-0507-1348.
    Set the ACL to public-read.
  i1: >
    Create an OSS bucket named demo-bucket-2025-0507-1348 in Shanghai that anyone can read.
  i0: >
    Create an OSS bucket.
clarity: [c1, c0]
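Two idioms recur in the expressions above: a guard of the form "X is None or <check on X>", which applies a check only when an optional resource is present, and an exactly-one-of test built from filter over a set. The sketch below reproduces both with plain Python values. It is not the framework's evaluator; the helper names are hypothetical, and the dictionaries merely stand in for whatever objects opt and first_of actually return.

```python
def exactly_one_set(a, b):
    """Mirror the manifest idiom: drop None from the set and expect one survivor.

    Caveat: a set collapses equal values, so two equal non-None values
    also pass. Both values must be hashable.
    """
    return len(list(filter(lambda x: x is not None, {a, b}))) == 1


def guarded_equals(resource, key, expected):
    """Pass when the optional resource is absent; otherwise require a match."""
    return resource is None or resource.get(key) == expected


# The ACL is set inline and no separate ACL resource exists.
print(exactly_one_set("public-read", None))
# No separate ACL resource, so the guarded check passes without inspecting it.
print(guarded_equals(None, "acl", "public-read"))
```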
Each case is defined by a combination of five dimensions:
resource type, operation type, integrity, clarity, and singularity.
Given the combinations specified in this manifest.yaml, up to 8 unique cases can be described.
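The count of 8 can be checked by enumerating the combinations. This sketch assumes that the integrity levels correspond to the user_inputs keys (i3, i2, i1, i0) in the manifest; the function name and the tuple layout are illustrative, not the framework's internal representation.

```python
from itertools import product


def enumerate_cases(resource_type, operation_type, integrity, clarity, singularity):
    """List every (resource, operation, integrity, clarity, singularity) combination."""
    return [
        (resource_type, operation_type, i, c, s)
        for i, c, s in product(integrity, clarity, singularity)
    ]


cases = enumerate_cases(
    "alicloud_oss_bucket",
    "create",
    integrity=["i3", "i2", "i1", "i0"],  # assumed to map to the user_inputs keys
    clarity=["c1", "c0"],
    singularity=["s0"],
)
print(len(cases))  # 4 * 2 * 1 = 8
```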
3.7. Saving The Cases To The Case Set
Use the save_cases.py script to save the cases defined in the workshop directory to a specified case set:
# "my_benchmark" refers to the name of an existing case set.
python save_cases.py my_benchmark
After saving the case, restore the real environment to the state captured in the “pre_op” snapshot.
Now that the cases have been created, you can repeat the steps above to add more cases as needed.
4. Setting Samples and Evaluation Task
Create a sample set file in data/sample_set, e.g. samples_testing.yaml:
kind: sample_set
sample_set:
  title: Sample For Testing
  description: This sample set is for testing.
  samples:
    - [ qwen-max-2025-01-25, builtin_agent.d1.e1 ]
    - [ deepseek-chat, builtin_agent.d0.e1 ]
In the samples section of the YAML file, each sample is a pair consisting of a model name and a decorated agent name.
The model name refers to a specific LLM.
The decorated agent name includes the agent’s name and two feature flags, one for documentation support and one for environment awareness:
“d0”: the agent does not support documentation.
“d1”: the agent supports documentation.
“e0”: the agent is not aware of the IaC environment.
“e1”: the agent is aware of the IaC environment.
“es”: the agent is only aware of the IaC state.
“ec”: the agent is only aware of the IaC code (HCL).
The builtin_agent is a special type of agent that can enable or disable specific features using feature flags.
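A decorated agent name can be split into its parts mechanically. The following is a minimal sketch, assuming the format is always "<agent>.<d-flag>.<e-flag>" in that order, as in the examples above; the class, the function, and the labels used for the environment modes are hypothetical and not the framework's own parser.

```python
from dataclasses import dataclass

# Flag meanings as listed above; the right-hand labels are illustrative names.
DOC_FLAGS = {"d0": False, "d1": True}
ENV_FLAGS = {"e0": "none", "e1": "full", "es": "state_only", "ec": "code_only"}


@dataclass(frozen=True)
class DecoratedAgent:
    name: str
    documentation: bool
    environment: str


def parse_decorated_agent(decorated):
    """Split '<agent>.<d-flag>.<e-flag>' into its name and feature flags."""
    try:
        # rsplit keeps any dots inside the agent name itself intact.
        name, doc_flag, env_flag = decorated.rsplit(".", 2)
    except ValueError:
        raise ValueError(
            f"expected '<agent>.<d-flag>.<e-flag>', got {decorated!r}"
        ) from None
    if doc_flag not in DOC_FLAGS or env_flag not in ENV_FLAGS:
        raise ValueError(f"unknown feature flag in {decorated!r}")
    return DecoratedAgent(name, DOC_FLAGS[doc_flag], ENV_FLAGS[env_flag])


print(parse_decorated_agent("builtin_agent.d1.e1"))
```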
Finally, create an evaluation task file in data/eval_task, e.g. eval_testing.yaml.
5. Before Evaluating
Before running the evaluation, make sure the following preparations are complete:
Provider credentials: Store the credentials for the provider plugins in environment variables (e.g., ALICLOUD_ACCESS_KEY, ALICLOUD_SECRET_KEY).
Configuration file: Create or update the config.yaml file.
Custom agents: If you’re using agents other than builtin_agent, you must implement corresponding adapters and register them in adapter/factory.py.
6. Evaluating
To start the evaluation, run evaluate.py with the task name:
python evaluate.py eval_testing
If the evaluation is interrupted, you can resume it by running resume.py with the same task name:
python resume.py eval_testing
7. Analyzing The Result
After the evaluation completes, a result file will be generated in the data/result directory.
You can analyze the results as needed based on your specific criteria.