Crystal Generation with Space Group Informed Transformer
CrystalFormer is a transformer-based autoregressive model specifically designed for space group-controlled generation of crystalline materials. The space group symmetry significantly simplifies the crystal space, which is crucial for data- and compute-efficient generative modeling of crystalline materials.

Generating Cs2ZnFe(CN)6 Crystal (mp-570545)
Model card
The model is an autoregressive transformer for the space group conditioned crystal probability distribution P(C|g) = P(W_1|...) P(A_1|...) P(X_1|...) P(W_2|...) ... P(L|...), where
g: space group number 1-230
W: Wyckoff letter ('a', 'b', ..., 'A')
A: atom type ('H', 'He', ..., 'Og')
X: fractional coordinates
L: lattice vector [a, b, c, alpha, beta, gamma]
P(W_i|...) and P(A_i|...) are categorical distributions.
P(X_i|...) is a mixture of von Mises distributions.
P(L|...) is a mixture of Gaussian distributions.
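To make the coordinate head concrete, here is a minimal numpy sketch (illustrative only, not the actual CrystalFormer code) of a von Mises mixture density over one fractional coordinate; the mixture weights, means, and concentrations stand in for quantities the network would predict.

```python
import numpy as np

def von_mises_mixture_pdf(x, weights, mus, kappas):
    """Mixture of von Mises densities for a fractional coordinate x in [0, 1).

    Each component exp(kappa*cos(2*pi*(x - mu))) / I0(kappa) is periodic in x
    with period 1 and integrates to 1 over one period.
    """
    x = np.asarray(x)[..., None]  # broadcast x against the mixture components
    comp = np.exp(kappas * np.cos(2 * np.pi * (x - mus))) / np.i0(kappas)
    return comp @ weights

# Hypothetical head outputs for one coordinate: two modes at x=0.25 and x=0.75.
weights = np.array([0.7, 0.3])
mus = np.array([0.25, 0.75])
kappas = np.array([20.0, 5.0])

grid = np.linspace(0.0, 1.0, 2001)
pdf = von_mises_mixture_pdf(grid, weights, mus, kappas)
integral = ((pdf[:-1] + pdf[1:]) / 2 * (grid[1] - grid[0])).sum()
print(integral)  # close to 1: the mixture is normalized over one period
```

The periodicity is why a von Mises mixture, rather than a Gaussian one, is the natural choice for fractional coordinates.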
We only consider symmetry-inequivalent atoms; the remaining atoms are restored from the space group and Wyckoff letter information. Note that there is a natural alphabetical ordering of the Wyckoff letters, starting with 'a' for a position with the site-symmetry group of maximal order and ending with the highest letter for the general position. The sampling procedure starts from higher-symmetry sites (with smaller multiplicities) and then moves on to lower-symmetry ones (with larger multiplicities). Only in cases where the discrete Wyckoff letters cannot fully determine the structure does one need to further consider the fractional coordinates in the loss or sampling.
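Restoring the symmetry-equivalent atoms amounts to applying the space group operations to each inequivalent site and keeping the distinct images. A toy sketch, with the two operations of space group P-1 written by hand rather than taken from a real symmetry library:

```python
import numpy as np

def restore_orbit(frac_coord, ops, tol=1e-5):
    """Apply symmetry operations (rotation R, translation t) to one fractional
    coordinate and keep the distinct images modulo lattice translations."""
    orbit = []
    for R, t in ops:
        image = (R @ np.asarray(frac_coord) + t) % 1.0
        if not any(np.allclose(image, q, atol=tol) for q in orbit):
            orbit.append(image)
    return orbit

# P-1 (space group 2) has two operations: identity and inversion.
ops = [(np.eye(3), np.zeros(3)), (-np.eye(3), np.zeros(3))]

print(len(restore_orbit([0.1, 0.2, 0.3], ops)))  # general position: 2 images
print(len(restore_orbit([0.0, 0.0, 0.0], ops)))  # site at the origin: 1 image
```

This also illustrates why higher-symmetry Wyckoff positions have smaller multiplicities: more operations map the site onto itself.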
Status
Major milestones are summarized below.
v0.4.2 : Add implementation of direct preference optimization.
v0.4.1 : Replace the absolute positional embedding with the Rotary Positional Embedding (RoPE).
v0.3 : Add conditional generation in the plug-and-play manner.
v0.2 : Add Markov chain Monte Carlo (MCMC) sampling for template-based structure generation.
v0.1 : Initial implementations of crystalline material generation conditioned on the space group.
Get Started
Notebooks: The quickest way to get started with CrystalFormer is our notebooks in the Google Colab and Bohrium (Chinese version) platforms:
CrystalFormer Quickstart : GUI notebook demonstrating the conditional generation of crystalline materials with CrystalFormer
CrystalFormer Application : Generating stable crystals with a given structure prototype. This workflow can be applied to tasks that are dominated by element substitution
CrystalFormer-RL : Reinforcement fine-tuning for materials design
Installation
Create a new environment and install the required packages. We recommend using Python 3.10.* and conda to create the environment.
Before installing the required packages, you need to install jax and jaxlib first.
CPU installation
pip install -U "jax[cpu]"
CUDA (GPU) installation
If you intend to use CUDA (GPU) to speed up the training, it is important to install the appropriate version of jax and jaxlib. It is recommended to check the jax docs for the installation guide. The basic installation command is given below:
pip install --upgrade pip
# NVIDIA CUDA 12 installation
# Note: wheels only available on linux.
pip install -U "jax[cuda12]"
Install required packages
pip install -r requirements.txt
Command line tools
To use the command line tools, you need to install the crystalformer package. You can use the following command to install the package:
pip install .
Available Weights
We release the weights of the model trained on the MP-20 dataset and the Alex-20 dataset. More details can be seen in the model folder.
How to run
train
folder: the folder to save the model and logs
train_path: the path to the training dataset
valid_path: the path to the validation dataset
test_path: the path to the test dataset
sample
optimizer: the optimizer to use, none means no training, only sampling
restore_path: the path to the model weights
spacegroup: the space group number to sample
num_samples: the number of samples to generate
batchsize: the batch size for sampling
temperature: the temperature for sampling
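The temperature argument rescales the model's logits before sampling in the standard way; a generic sketch of temperature-scaled categorical sampling (not the actual CrystalFormer sampling code):

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Sample a category from temperature-scaled logits.

    temperature < 1 sharpens the distribution toward the most likely token;
    temperature > 1 flattens it toward uniform.
    """
    scaled = np.asarray(logits) / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs = probs / probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.1])
samples = [sample_with_temperature(logits, 0.2, rng) for _ in range(1000)]
# At low temperature, almost all samples fall on the argmax category (index 0).
print(np.mean(np.array(samples) == 0))
```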
You can also use the elements argument to sample structures with specific elements. For example, --elements La Ni O will sample structures with La, Ni, and O atoms. The sampling results will be saved in the output_LABEL.csv file, where LABEL is the space group number g specified in the command --spacegroup.
The input for elements can also be a JSON file which specifies the atom mask for each Wyckoff site and the constraints. An example atoms.json file can be seen in the data folder. There are two keys in the atoms.json file:
atom_mask: set the atom list for each Wyckoff position; the element can only be selected from the list for the corresponding Wyckoff position
constraints: set the constraints for the Wyckoff sites in the sampling, you can specify the pair of Wyckoff sites that should have the same elements
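Purely as an illustration of these two keys, a hypothetical atoms.json could look like the dictionary below; the field shapes here are guesses, so consult the example file in the data folder for the authoritative format:

```python
import json

# Hypothetical content, illustrating only the two documented keys; the real
# schema is defined by the example atoms.json in the data folder.
atoms = {
    # one allowed-element list per Wyckoff position
    "atom_mask": [["La"], ["Ni"], ["O", "N"]],
    # pairs of Wyckoff-site indices constrained to carry the same element
    "constraints": [[0, 1]],
}
print(json.dumps(atoms, indent=2))
```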
evaluate
Before evaluating the generated structures, you need to transform the generated g, W, A, X, L into the cif format. You can use the following command to transform the generated structures into the cif format and save them as a csv file:
output_path: the path to read the generated L, W, A, X and save the cif files
label: the label to save the cif files, which is the space group number g
num_io_process: the number of processes
[!IMPORTANT]
The following evaluation script requires the SMACT, matminer, and matbench-genmetrics packages. We recommend installing them in a separate environment to avoid conflicts with other packages.
Calculate the structure and composition validity of the generated structures:
root_path: the path to the dataset
filename: the filename of the generated structures
num_io_process: the number of processes
Calculate the novelty and uniqueness of the generated structures:
train_path: the path to the training dataset
test_path: the path to the test dataset
gen_path: the path to the generated dataset
output_path: the path to save the metrics results
label: the label to save the metrics results, which is the space group number g
num_io_process: the number of processes
Note that the training, test, and generated datasets should contain structures within the same space group g, which is specified in the command --label.
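Conceptually, uniqueness is the fraction of generated structures that are distinct from one another, and novelty is the fraction absent from the training set. A schematic sketch over hashable structure fingerprints (the real metrics from matbench-genmetrics use proper structure matching, not string keys):

```python
def uniqueness(generated):
    """Fraction of generated items that are pairwise distinct."""
    return len(set(generated)) / len(generated)

def novelty(generated, training):
    """Fraction of generated items absent from the training set."""
    training = set(training)
    return sum(g not in training for g in generated) / len(generated)

# Hypothetical fingerprints of the form "composition_spacegroup".
gen = ["LaNiO3_sg221", "LaNiO3_sg221", "La2NiO4_sg139", "LaNi5_sg191"]
train = ["LaNiO3_sg221"]
print(uniqueness(gen))      # 3 distinct fingerprints out of 4
print(novelty(gen, train))  # 2 of 4 generated entries are outside the training set
```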
More details about the post-processing can be seen in the scripts folder.
Reinforcement Fine-tuning
[!IMPORTANT]
Before running the reinforcement fine-tuning, please make sure you have installed the corresponding machine learning force field model or property prediction model. The mlff_model and mlff_path arguments in the command line should be set according to the model you are using. We currently support the orb and MACE models for the $E_{hull}$ reward, and the matgl model for the dielectric FoM reward.
$E_{hull}$ Reward
folder: the folder to save the model and logs
restore_path: the path to the pre-trained model weights
valid_path: the path to the validation dataset
test_path: the path to the test dataset. The space group distribution will be loaded from this dataset and used for the sampling in the reinforcement learning fine-tuning
reward: the reward function to use, ehull means the energy above the convex hull
convex_path: the path to the convex hull data, which is used to calculate the $E_{hull}$. Only used when the reward is ehull
mlff_model: the machine learning force field model to predict the total energy. We support orb and MACE models for the $E_{hull}$ reward
mlff_path: the path to load the checkpoint of the machine learning force field model
Dielectric FoM Reward
folder: the folder to save the model and logs
restore_path: the path to the pre-trained model weights
valid_path: the path to the validation dataset
test_path: the path to the test dataset. The space group distribution will be loaded from this dataset and used for the sampling in the reinforcement learning fine-tuning
reward: the reward function to use, dielectric means the dielectric figure of merit (FoM), which is the product of the total dielectric constant and the band gap
mlff_model: the machine learning force field model to predict the total energy. We only support models in matgl for the dielectric reward
mlff_path: the path to load the checkpoint of the machine learning force field model. Note that you need to provide the model paths for the total dielectric constant and the band gap, separated by a comma (,)
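Since the dielectric FoM is defined as a product, the reward combination itself is simple; a one-line sketch with the two predicted quantities standing in for the matgl model outputs:

```python
def dielectric_fom(total_dielectric_constant, band_gap_ev):
    """Dielectric figure of merit: product of the total dielectric constant
    and the band gap. Metallic structures (zero gap) get zero reward."""
    return total_dielectric_constant * band_gap_ev

# Hypothetical predictions from the two matgl checkpoints.
print(dielectric_fom(30.0, 2.5))  # 75.0
```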
How to cite
@article{cao2024space,
title={Space Group Informed Transformer for Crystalline Materials Generation},
author={Zhendong Cao and Xiaoshan Luo and Jian Lv and Lei Wang},
year={2024},
eprint={2403.15734},
archivePrefix={arXiv},
primaryClass={cond-mat.mtrl-sci}
}
@article{cao2025crystalformerrl,
title={CrystalFormer-RL: Reinforcement Fine-Tuning for Materials Design},
author={Zhendong Cao and Lei Wang},
year={2025},
eprint={2504.02367},
archivePrefix={arXiv},
primaryClass={cond-mat.mtrl-sci},
url={https://arxiv.org/abs/2504.02367},
}
Note: This project is unrelated to https://github.com/omron-sinicx/crystalformer, which shares the same name.