Foundation Model Evaluations Library
fmeval is a library to evaluate Large Language Models (LLMs) in order to help select the best LLM
for your use case. The library evaluates LLMs for the following tasks:
Open-ended generation - The production of natural, human-like responses to text that does not have a pre-defined structure.
Text summarization - The generation of a condensed summary retaining the key information contained in a longer text.
Question Answering - The generation of a relevant and accurate response to a question.
Classification - Assigning a category, such as a label or score, to text based on its content.
The library contains:
Algorithms to evaluate LLMs for Accuracy, Toxicity, Semantic Robustness and Prompt Stereotyping across different tasks.
Implementations of the ModelRunner interface. ModelRunner encapsulates the logic for invoking different types of LLMs, exposing a predict method to simplify interactions with LLMs within the eval algorithm code. We have built-in support for Amazon SageMaker Endpoints and JumpStart models. The user can extend the interface for their own model classes by implementing the predict method.
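As a rough illustration of the "implement the predict method" idea, here is a minimal, self-contained sketch. It does not import fmeval: SketchModelRunner and EchoModelRunner are hypothetical stand-ins, and the return type (generated text plus an optional log probability) is an assumption, so check the actual ModelRunner docstring before relying on it.

```python
from abc import ABC, abstractmethod
from typing import Optional, Tuple


class SketchModelRunner(ABC):
    """Hypothetical stand-in for the ModelRunner interface (not imported from fmeval)."""

    @abstractmethod
    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        """Return (generated text, log probability); either value may be None."""


class EchoModelRunner(SketchModelRunner):
    """Toy runner that 'invokes' a model by echoing the prompt in upper case."""

    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        # A real runner would call a hosted model here (HTTP request, SDK call, ...).
        return prompt.upper(), None


runner = EchoModelRunner()
output, log_prob = runner.predict("hello")
print(output)
```

The point of the pattern is that the eval algorithm code only needs to call predict and never has to know how the model is hosted.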
Installation
fmeval is developed under Python 3.10. To install the package, run:
pip install fmeval
Usage
You can see examples of running evaluations on your LLMs with built-in or custom datasets in
the examples folder.
The main steps for using fmeval are:
Create a ModelRunner which can perform invocation on your LLM. fmeval provides built-in support for Amazon SageMaker Endpoints and JumpStart LLMs. You can also extend the ModelRunner interface for any LLMs hosted anywhere.
Note: You can update the default eval config parameters for your specific use case.
Using a custom dataset for an evaluation
fmeval ships with built-in datasets, which the eval algorithms use to compute scores.
You can use a custom dataset instead as follows:
Create a DataConfig for your custom dataset.
Use an eval algorithm with the custom dataset.
Please refer to the developer guide and examples for more details on using eval algorithms.
Telemetry
fmeval has telemetry enabled for tracking the usage of AWS-provided/hosted LLMs.
This data is tracked using the number of SageMaker or JumpStart ModelRunner objects that get created.
Telemetry can be disabled by setting the DISABLE_FMEVAL_TELEMETRY environment variable to true.
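If you prefer to set the variable from Python rather than from your shell, a minimal sketch looks like the following. It assumes (this is not stated above) that the variable should be set before fmeval is imported or any ModelRunner is created.

```python
import os

# Assumption: set this before importing fmeval or creating any ModelRunner,
# so that the setting is already in place when the library looks for it.
os.environ["DISABLE_FMEVAL_TELEMETRY"] = "true"

# Sanity check that the current process sees the value.
telemetry_disabled = os.environ.get("DISABLE_FMEVAL_TELEMETRY", "").lower() == "true"
print(telemetry_disabled)
```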
Troubleshooting
Users running fmeval on a Windows machine may encounter the error OSError: [Errno 0] AssignProcessToJobObject() failed when fmeval internally calls ray.init(). This OS error is a known Ray issue, and is detailed here. Multiple users have reported that installing Python from the official Python website rather than the Microsoft store fixes this issue. You can view more details on limitations of running Ray on Windows on Ray’s webpage.
If you run into the error error: can't find Rust compiler while installing fmeval on a Mac, please try running the steps below.
If you run into out-of-memory (OOM) errors, especially while running evaluations that use LLMs as evaluators, such as toxicity and summarization accuracy, it is likely that your machine does not have enough memory to load the evaluator models. By default, fmeval loads multiple copies of the model into memory to maximize parallelization; the exact number depends on the number of cores on the machine. To reduce the number of models that get loaded in parallel, set the environment variable PARALLELIZATION_FACTOR to a value that suits your machine.
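One way to pick and set such a value from Python is sketched below. The helper choose_parallelization_factor is hypothetical, and the sketch assumes the variable takes a positive integer and is read after it has been set.

```python
import os


def choose_parallelization_factor(requested: int) -> int:
    """Clamp the requested number of model copies to the range [1, number of cores]."""
    cores = os.cpu_count() or 1
    return max(1, min(requested, cores))


# Ask for at most two copies of the evaluator model, fewer on a single-core machine.
factor = choose_parallelization_factor(2)

# Assumption: the value is a positive integer, set before the evaluation starts.
os.environ["PARALLELIZATION_FACTOR"] = str(factor)
print(os.environ["PARALLELIZATION_FACTOR"])
```

Lower values trade evaluation speed for a smaller memory footprint, so starting at 1 and increasing until memory becomes tight is a reasonable approach.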
Development
Setup and the use of devtool
Once you have created a virtual environment with Python 3.10, run the following commands to set up the development environment:
./devtool install_deps_dev
./devtool install_deps
./devtool all
Note: If you are on a Mac, the install_poetry_version devtool command may fail when running the poetry installation script. If there is a failure, you should get error logs sent to a file with a name like poetry-installer-error-cvulo5s0.log. Open the logs, and if the error message looks like the following:
dyld[10908]: Library not loaded: @loader_path/../../../../Python.framework/Versions/3.10/Python
Referenced from: <8A5DEEDB-CE8E-325F-88B0-B0397BD5A5DE> /Users/daniezh/Library/Application Support/pypoetry/venv/bin/python3
Reason: tried: '/Users/daniezh/Library/Application Support/pypoetry/venv/bin/../../../../Python.framework/Versions/3.10/Python' (no such file), '/Library/Frameworks/Python.framework/Versions/3.10/Python' (no such file), '/System/Library/Frameworks/Python.framework/Versions/3.10/Python' (no such file, not in dyld cache)
Traceback:
File "<string>", line 923, in main
File "<string>", line 562, in run
then you will need to tweak the poetry installation script and re-run it.
Steps:
curl -sSL https://install.python-poetry.org > poetry_script.py
Change the symlinks argument in builder = venv.EnvBuilder(clear=True, with_pip=True, symlinks=False) to True. See mionker’s comment here for an explanation.
python poetry_script.py --version 1.8.2 (where 1.8.2 is the version listed in devtool; this may change after the time of this writing).
Confirm installation via poetry --version
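For orientation, the line edited in the steps above is a call into Python's standard venv module. With symlinks=True, venv attempts to symlink the Python binary into the new environment rather than copying it. A minimal sketch of the edited constructor call, run outside the installer script:

```python
import venv

# Same constructor call as in the installation script, with symlinks flipped to True.
builder = venv.EnvBuilder(clear=True, with_pip=True, symlinks=True)

# Nothing is created on disk until builder.create(<target directory>) is called.
print(builder.symlinks)
```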
Additionally, if you already have an existing version of Poetry installed and want to install a new version, you will need to uninstall Poetry before you re-run the above command:
curl -sSL https://install.python-poetry.org | python3 - --uninstall
Before submitting a PR, rerun ./devtool all for testing and linting. It should run without errors.
Adding python dependencies
We use poetry to manage python dependencies in this project. If you want to add a new
dependency, please update the pyproject.toml file, and run the poetry update command to update the
poetry.lock file (which is checked in).
Other than this step to add dependencies, use devtool commands for installing dependencies, linting and testing. Execute the command ./devtool without any arguments to see a list of available options.
Adding your own evaluation algorithm and/or metrics
The evaluation algorithms and metrics provided by fmeval are implemented using Transform and TransformPipeline objects. You can leverage these existing tools to similarly implement your own metrics and algorithms in a modular manner.
Here, we provide a high-level overview of what these classes represent and how they are used. Specific implementation details can be found in their respective docstrings (see src/fmeval/transforms/transform.py and src/fmeval/transforms/transform_pipeline.py).
Preface
At a high level, an evaluation algorithm takes an initial tabular dataset consisting of a number of “records” (i.e. rows) and repeatedly transforms this dataset until the dataset either contains all the evaluation metrics, or at least all the intermediate data needed to compute said metrics. The transformations that get applied to the dataset inherently operate at a per-record level, and simply get applied to every record in the dataset to transform the dataset in full.
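The idea of lifting a record-level transformation to a whole dataset can be sketched in plain Python. This is only an illustration: it uses a list of dictionaries as the "dataset", and add_text_length and transform_dataset are hypothetical names, not part of fmeval.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def add_text_length(record: Record) -> Record:
    """Record-level transformation: add the length of the text to the record."""
    record["text_length"] = len(record["text"])
    return record


def transform_dataset(dataset: List[Record], fn: Callable[[Record], Record]) -> List[Record]:
    """Apply a record-level transformation to every record in the dataset."""
    # Copy each record so the input dataset is left untouched.
    return [fn(dict(record)) for record in dataset]


dataset = [{"text": "hello world"}, {"text": "hi"}]
dataset = transform_dataset(dataset, add_text_length)
print(dataset)
```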
The Transform class
We represent the concept of a record-level transformation using the Transform class. Transform is a callable class whose __call__ method takes a single argument, record, which represents the record to be transformed. A record is represented by a Python dictionary. To implement your own record-level transformation logic, create a concrete subclass of Transform and implement its __call__ method.
Example:
Let’s implement a Transform for a simple, toy metric.
from typing import Any, Dict

from fmeval.transforms.transform import Transform

class NumSpaces(Transform):
    """
    Augments the input record (which contains some text data)
    with the number of spaces found in the text.
    """
    def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
        input_text = record["input_text"]
        record["num_spaces"] = input_text.count(" ")
        return record
One issue with this simple example is that the keys used for the input text data and the output data are both hard-coded. This generally isn’t desirable, so let’s improve on our running example.
class NumSpaces(Transform):
    """
    Augments the input record (which contains some text data)
    with the number of spaces found in the text.
    """
    def __init__(self, text_key, output_key):
        super().__init__(text_key, output_key)  # always need to pass all init args to superclass init
        self.text_key = text_key  # the dict key corresponding to the input text data
        self.output_key = output_key  # the dict key corresponding to the output data (i.e. number of spaces)
    def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
        input_text = record[self.text_key]
        record[self.output_key] = input_text.count(" ")
        return record
Since __call__ only takes a single argument, record, we pass the information regarding which keys to use for input and output data to __init__ and save them as instance attributes. Note that all subclasses of Transform need to call super().__init__ with all of their __init__ arguments, due to low-level implementation details regarding how we apply the Transforms to the dataset.
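The text above does not say what those low-level details are. One plausible reason for such a rule, shown here purely as a hypothetical illustration and not as fmeval's actual implementation, is that a base class which records the init arguments can later rebuild an equivalent instance. RecordingBase and CountSpaces are made-up names for this sketch.

```python
from typing import Any, Dict, Tuple


class RecordingBase:
    """Hypothetical base class: remembers the arguments its subclass was built with."""

    def __init__(self, *args: Any, **kwargs: Any) -> None:
        self.init_args: Tuple[Any, ...] = args
        self.init_kwargs: Dict[str, Any] = kwargs

    def rebuild(self) -> "RecordingBase":
        """Create a fresh, equivalent instance from the recorded arguments."""
        return type(self)(*self.init_args, **self.init_kwargs)


class CountSpaces(RecordingBase):
    def __init__(self, text_key: str, output_key: str) -> None:
        super().__init__(text_key, output_key)  # forward every init argument
        self.text_key = text_key
        self.output_key = output_key

    def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
        record[self.output_key] = record[self.text_key].count(" ")
        return record


original = CountSpaces("model_output", "num_spaces")
rebuilt = original.rebuild()
print(rebuilt({"model_output": "a b c"}))
```

If CountSpaces forgot to forward its arguments, rebuild would call the constructor with nothing and fail, which is the kind of breakage the rule guards against in this sketch.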
The TransformPipeline class
While Transform encapsulates the logic for the record-level transformation, we still don’t have a mechanism for applying the transform to a dataset. This is where TransformPipeline comes in. A TransformPipeline represents a sequence, or “pipeline”, of Transform objects that you wish to apply to a dataset. After initializing a TransformPipeline with a list of Transforms, simply call its execute method on an input dataset.
Example:
Here, we implement a pipeline for a very simple evaluation. The steps are:
Construct LLM prompts from raw text inputs
Feed the prompts to a ModelRunner to get the model outputs
Compute the “number of spaces” metric we defined above
from fmeval.transforms.common import GeneratePrompt, GetModelOutputs
from fmeval.transforms.transform_pipeline import TransformPipeline

# Use the built-in utility Transform for generating prompts
gen_prompt = GeneratePrompt(
    input_keys="model_input",
    output_keys="prompt",
    prompt_template="Answer the following question: $model_input",
)
# Use the built-in utility Transform for getting model outputs
model = ...  # some ModelRunner
get_model_outputs = GetModelOutputs(
    input_to_output_keys={"prompt": ["model_output"]},
    model_runner=model,
)
# Our new metric!
compute_num_spaces = NumSpaces(
    text_key="model_output",
    output_key="num_spaces",
)
my_pipeline = TransformPipeline([gen_prompt, get_model_outputs, compute_num_spaces])
dataset = ...  # load some dataset
dataset = my_pipeline.execute(dataset)
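To experiment with the same three-step flow without a model or fmeval installed, the following self-contained sketch mimics it with plain functions. ToyPipeline, make_prompt, fake_model and count_spaces are hypothetical stand-ins, and the "model" simply echoes its prompt.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Step = Callable[[Record], Record]


class ToyPipeline:
    """Hypothetical stand-in for a pipeline: applies steps in order to each record."""

    def __init__(self, steps: List[Step]) -> None:
        self.steps = steps

    def execute(self, dataset: List[Record]) -> List[Record]:
        result = []
        for record in dataset:
            record = dict(record)  # copy so the input dataset is left untouched
            for step in self.steps:
                record = step(record)
            result.append(record)
        return result


def make_prompt(record: Record) -> Record:
    record["prompt"] = "Answer the following question: " + record["model_input"]
    return record


def fake_model(record: Record) -> Record:
    # Stand-in for a model call: echoes the prompt back as the output.
    record["model_output"] = record["prompt"]
    return record


def count_spaces(record: Record) -> Record:
    record["num_spaces"] = record["model_output"].count(" ")
    return record


pipeline = ToyPipeline([make_prompt, fake_model, count_spaces])
results = pipeline.execute([{"model_input": "What is 2 + 2?"}])
print(results[0]["num_spaces"])
```

Each step only reads keys written by earlier steps, which is why the order of the list matters.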
Conclusion
To implement new metrics, create a new Transform that encapsulates the logic for computing said metric. Since the logic for all evaluation algorithms can be represented as a sequence of different Transforms, implementing a new evaluation algorithm essentially amounts to defining a TransformPipeline. Please see the built-in evaluation algorithms for examples.
Security
See CONTRIBUTING for more information.
License
This project is licensed under the Apache-2.0 License.