☁️ CloudEval-YAML
🤔 Considering integrating LLMs into your cloud application but unsure about their performance? Check this out 👇
📊 The table is generated with CloudEval-YAML, a practical Large Language Model (LLM) benchmark designed specifically for cloud-native application configuration generation. It comprises a comprehensive dataset of 1011 problems covering widely deployed applications including Kubernetes, Envoy, and Istio. The benchmark is unique in its practical and realistic approach: each problem pairs a hand-written question context with a reference YAML file and a unit test script. It can be used to compare different LLMs, as well as different sampling and prompt engineering techniques.

📄 Please refer to the following technical report for more detailed information: CloudEval-YAML: A Practical Benchmark for Cloud Native YAML Configuration Generation.
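To make the per-problem structure concrete, one plausible YAML-aware scoring step is comparing a model's generated configuration against the reference key by key. The sketch below is an illustration only, not the benchmark's actual scoring code (which also runs the per-problem unit test script); all function names are hypothetical:

```python
# Illustrative sketch: score a generated config against a reference config
# by flattening nested dicts/lists into dotted key paths and counting how
# many reference paths the generation reproduces with the same value.
def flatten(obj, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf in a nested structure."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from flatten(v, f"{prefix}{k}.")
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from flatten(v, f"{prefix}{i}.")
    else:
        yield prefix.rstrip("."), obj

def key_value_score(generated: dict, reference: dict) -> float:
    """Fraction of reference key paths whose values the generation matches."""
    ref = dict(flatten(reference))
    gen = dict(flatten(generated))
    if not ref:
        return 1.0
    return sum(gen.get(k) == v for k, v in ref.items()) / len(ref)
```

In practice the YAML files would first be parsed into Python dicts (e.g. with a YAML parser) before being passed to `key_value_score`.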
🚀 Quick Start
This quick start benchmarks GPT-3.5 on all metrics except unit tests.
📋 Prerequisites:
Install the Python prerequisites
🔑 Configure OpenAI API Key
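The key is supplied through the environment (`OPENAI_API_KEY` is the standard variable name the OpenAI client reads). A minimal stdlib sanity check, with the hypothetical helper name `key_is_set`:

```python
# Sanity-check that the OpenAI API key is visible to the benchmark process.
# OPENAI_API_KEY is the environment variable the OpenAI client looks for.
import os

def key_is_set(env=os.environ) -> bool:
    """Return True if a non-empty OPENAI_API_KEY is present."""
    return bool(env.get("OPENAI_API_KEY"))

if __name__ == "__main__":
    print("OPENAI_API_KEY is set" if key_is_set() else
          "OPENAI_API_KEY is missing -- export it before running the benchmark")
```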
🏃 Run the Benchmark
Sample output from the terminal:
All prompts, generated code, and scores will be saved in outputs/ by default. You may use them as checkpoints for further analysis.

🎉 Congratulations! You have successfully completed the quick start and benchmarked GPT-3.5 (on all metrics except unit tests).
🤖 Benchmarking Different Models
OpenAI GPT-3.5/4
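As an illustrative sketch of what a single query could look like, the snippet below calls OpenAI's public chat-completions REST endpoint using only the standard library. It is an assumption for illustration (the function name `query_gpt` is hypothetical), not the benchmark's actual query code, which is defined in models/gpt.py:

```python
# Illustrative sketch: query GPT-3.5 for a YAML configuration via the
# chat-completions REST API. The real prompt templates and parameters
# used by the benchmark live in models/gpt.py.
import json
import os
import urllib.request

def query_gpt(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Send one prompt and return the generated text."""
    req = urllib.request.Request(
        "https://api.openai.com/v1/chat/completions",
        data=json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,  # deterministic output for benchmarking
        }).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```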
Models are enabled in config.json, and the model-specific query logic is defined in models/gpt.py. You may customize them if needed. Please refer to the OpenAI API documentation for more details.

Google PaLM 2
Models are enabled in config.json, and the model-specific query logic is defined in models/palm.py. You may customize them if needed. Please refer to the Google API documentation for more details.

Replicate
Replicate is a platform that allows you to run open-source or your own models at scale in the cloud. It abstracts away the complexity of deploying models locally and provides a simple API for query and response.
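As an illustrative sketch of that API, the snippet below creates a prediction through Replicate's HTTP endpoint using only the standard library. It is an assumption for illustration (the function name `create_prediction` is hypothetical), not the benchmark's actual query code, which is defined in models/replicate.py:

```python
# Illustrative sketch: start a model run via Replicate's predictions
# endpoint. The returned record can be polled until its status reaches
# "succeeded". The model version hash passed in is model-specific.
import json
import os
import urllib.request

def create_prediction(version: str, prompt: str) -> dict:
    """Start a model run on Replicate and return the prediction record."""
    req = urllib.request.Request(
        "https://api.replicate.com/v1/predictions",
        data=json.dumps({"version": version,
                         "input": {"prompt": prompt}}).encode(),
        headers={
            "Authorization": f"Token {os.environ['REPLICATE_API_TOKEN']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```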
Currently, a subset of supported open-source models is integrated into the benchmark. You may include them for benchmarking in config.json.

Configure your Replicate API token:

export REPLICATE_API_TOKEN='YOUR_API_TOKEN'

The model-specific query logic is defined in models/replicate.py. You may customize it, integrate more models, or even include your own if needed. Please refer to the Replicate API documentation for more details.

🏃 Run the Unit Tests
The easiest way to run the unit tests is to launch an AWS EC2 instance from the public AMI named cloudyaml_public (ID ami-012306c17129a3b71) in the us-west-1 region (N. California). Note that the full evaluation takes several hours. Alternatively, you can create a pull request and we can help run it on a cluster.

After launching the EC2 instance, pull this repo and enable the unit tests in config.json. Then run the benchmark:
Sample output from the terminal:
Please refer to Advanced.md for how to run your own model and other advanced features.
📄 Citation
If you find this benchmark useful, please cite our paper:

@article{xu2024cloudeval,
  title={CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation},
  author={Xu, Yifei and Chen, Yuning and Zhang, Xumiao and Lin, Xianshang and Hu, Pan and Ma, Yunfei and Lu, Songwu and Du, Wan and Mao, Zhuoqing and Zhai, Ennan and others},
  journal={Proceedings of Machine Learning and Systems},
  volume={6},
  pages={173--195},
  year={2024}
}