Factor non-persistent param init out of __init__ into a common method that can be externally called via init_non_persistent_buffers() after meta-device init.
Oct 16-20, 2025
extra flexibility and improved handling for conv weights and fallbacks for weight shapes not suited for orthogonalization
small speedup for NS iterations by reducing allocs and using fused (b)add(b)mm ops
by default uses AdamW (or NAdamW if nesterov=True) updates if muon not suitable for parameter shape (or excluded via param group flag)
like torch impl, select from several LR scale adjustment fns via adjust_lr_fn
select from several NS coefficient presets or specify your own via ns_coefficients
First 2 steps of ‘meta’ device model initialization supported
Fix several ops that were breaking creation under ‘meta’ device context
Add device & dtype factory kwarg support to all models and modules (anything inheriting from nn.Module) in timm
License fields added to pretrained cfgs in code
Release 1.0.21
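The Newton-Schulz orthogonalization at the heart of the Muon notes above can be sketched in pure Python. This is an illustration only, not the timm implementation (which runs batched on tensors with the fused ops mentioned); the (a, b, c) preset shown is the widely used quintic from Keller Jordan's Muon repo, and the ns_coefficients arg lets you choose others.

```python
# Pure-Python sketch (NOT timm's impl) of the quintic Newton-Schulz
# iteration Muon uses to approximately orthogonalize an update matrix.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(r) for r in zip(*A)]

def add_scaled(X, Y, sx, sy):
    # elementwise sx*X + sy*Y
    return [[sx * x + sy * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def newton_schulz(G, steps=5, coefs=(3.4445, -4.7750, 2.0315)):
    a, b, c = coefs
    # normalize so singular values start <= 1
    fro = sum(x * x for row in G for x in row) ** 0.5
    X = [[x / fro for x in row] for row in G]
    for _ in range(steps):
        A = matmul(X, transpose(X))
        B = add_scaled(A, matmul(A, A), b, c)    # B = b*A + c*A@A
        X = add_scaled(X, matmul(B, X), a, 1.0)  # X = a*X + B@X
    return X

ortho = newton_schulz([[2.0, 0.5], [0.1, 1.0]])
```

Each step pushes the singular values of X toward 1 without computing an SVD; the preset trades exactness for speed, so the result is only approximately orthogonal.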
Sept 21, 2025
Remap DINOv3 ViT weight tags from lvd_1689m -> lvd1689m to match (same for sat_493m -> sat493m)
Release 1.0.20
Sept 17, 2025
DINOv3 (https://arxiv.org/abs/2508.10104) ConvNeXt and ViT models added. ConvNeXt models were mapped to existing timm model. ViT support done via the EVA base model w/ a new RotaryEmbeddingDinoV3 to match the DINOv3 specific RoPE impl
MobileCLIP-2 (https://arxiv.org/abs/2508.20691) vision encoders. New MCI3/MCI4 FastViT variants added and weights mapped to existing FastViT and B, L/14 ViTs.
July 23, 2025
Add set_input_size() method to EVA models, used by OpenCLIP 3.0.0 to allow resizing for timm based encoder models.
Release 1.0.18, needed for PE-Core S & T models in OpenCLIP 3.0.0
Fix small typing issue that broke Python 3.9 compat. 1.0.19 patch release.
July 21, 2025
ROPE support added to NaFlexViT. All models covered by the EVA base (eva.py), including EVA, EVA02, Meta PE ViT, timm SBB ViT w/ ROPE, and Naver ROPE-ViT, can now be loaded in NaFlexViT when use_naflex=True is passed at model creation time
More Meta PE ViT encoders added, including small/tiny variants, lang variants w/ tiling, and more spatial variants.
PatchDropout fixed with NaFlexViT and also w/ EVA models (regression after adding Naver ROPE-ViT)
Fix XY order with grid_indexing=’xy’, impacted non-square image use in ‘xy’ mode (only ROPE-ViT and PE impacted).
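For intuition on the bug class fixed here, a toy sketch of 'ij' vs 'xy' coordinate grids (following numpy/torch meshgrid semantics, not timm's actual grid code):

```python
# 'ij' indexing yields (row, col) pairs; 'xy' yields (x, y) = (col, row).
# On square grids a mix-up can go unnoticed; with H != W every position
# is wrong, which is what bit non-square images in 'xy' mode.

def coord_grid(h, w, indexing='ij'):
    if indexing == 'ij':
        return [[(i, j) for j in range(w)] for i in range(h)]
    if indexing == 'xy':
        return [[(j, i) for j in range(w)] for i in range(h)]
    raise ValueError(indexing)
```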
July 7, 2025
MobileNet-v5 backbone tweaks for improved Google Gemma 3n behaviour (to pair with updated official weights)
Add stem bias (zero’d in updated weights, compat break with old weights)
GELU -> GELU (tanh approx). A minor change to be closer to JAX
Add two arguments to layer-decay support, a min scale clamp and ‘no optimization’ scale threshold
Add ‘Fp32’ LayerNorm, RMSNorm, SimpleNorm variants that can be enabled to force computation of norm in float32
Some typing, argument cleanup for norm, norm+act layers done with above
Support Naver ROPE-ViT (https://github.com/naver-ai/rope-vit) in eva.py, add RotaryEmbeddingMixed module for mixed mode, weights on HuggingFace Hub
| model | img_size | top1 | top5 | param_count |
|---|---|---|---|---|
| vit_large_patch16_rope_mixed_ape_224.naver_in1k | 224 | 84.84 | 97.122 | 304.4 |
| vit_large_patch16_rope_mixed_224.naver_in1k | 224 | 84.828 | 97.116 | 304.2 |
| vit_large_patch16_rope_ape_224.naver_in1k | 224 | 84.65 | 97.154 | 304.37 |
| vit_large_patch16_rope_224.naver_in1k | 224 | 84.648 | 97.122 | 304.17 |
| vit_base_patch16_rope_mixed_ape_224.naver_in1k | 224 | 83.894 | 96.754 | 86.59 |
| vit_base_patch16_rope_mixed_224.naver_in1k | 224 | 83.804 | 96.712 | 86.44 |
| vit_base_patch16_rope_ape_224.naver_in1k | 224 | 83.782 | 96.61 | 86.59 |
| vit_base_patch16_rope_224.naver_in1k | 224 | 83.718 | 96.672 | 86.43 |
| vit_small_patch16_rope_224.naver_in1k | 224 | 81.23 | 95.022 | 21.98 |
| vit_small_patch16_rope_mixed_224.naver_in1k | 224 | 81.216 | 95.022 | 21.99 |
| vit_small_patch16_rope_ape_224.naver_in1k | 224 | 81.004 | 95.016 | 22.06 |
| vit_small_patch16_rope_mixed_ape_224.naver_in1k | 224 | 80.986 | 94.976 | 22.06 |
Some cleanup of ROPE modules, helpers, and FX tracing leaf registration
Preparing version 1.0.17 release
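The layer-decay options added above (a min scale clamp and a 'no optimization' scale threshold) can be sketched as follows; the arg names here are hypothetical, check timm's optimizer factory for the real ones.

```python
# Hypothetical sketch of layer-wise LR decay with a minimum scale clamp
# and a threshold below which a layer's params are simply frozen.

def layer_lr_scales(num_layers, decay=0.75, min_scale=0.0, no_opt_scale=None):
    scales = []
    for layer_id in range(num_layers + 1):  # id 0 ~ stem/embeddings
        s = decay ** (num_layers - layer_id)
        if no_opt_scale is not None and s < no_opt_scale:
            s = 0.0  # exclude params at this depth from optimization entirely
        else:
            s = max(s, min_scale)
        scales.append(s)
    return scales
```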
June 26, 2025
MobileNetV5 backbone (w/ encoder only variant) for Gemma 3n image encoder
Version 1.0.16 released
June 23, 2025
Add F.grid_sample based 2D and factorized pos embed resize to NaFlexViT. Faster when lots of different sizes (based on example by https://github.com/stas-sl).
Further speed up patch embed resample by replacing vmap with matmul (based on snippet by https://github.com/stas-sl).
Add 3 initial native aspect NaFlexViT checkpoints created while testing, ImageNet-1k and 3 different pos embed configs w/ same hparams.
The train script has some extra args / features worth noting
The --naflex-train-seq-lens argument specifies which sequence lengths to randomly pick from per batch during training
The --naflex-max-seq-len argument sets the target sequence length for validation
Adding --model-kwargs enable_patch_interpolator=True --naflex-patch-sizes 12 16 24 will enable random patch size selection per-batch w/ interpolation
The --naflex-loss-scale arg changes loss scaling mode per batch relative to the batch size, timm NaFlex loading changes the batch size for each seq len
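A minimal sketch of the batch-size adjustment idea in the last bullet: keeping the per-batch token budget roughly constant as the sampled sequence length changes. This is an assumption-level illustration, not timm's loader code.

```python
# Illustrative only: when the loader samples a longer sequence length,
# it can shrink the batch so tokens-per-batch stays roughly constant.

def batch_size_for_seq_len(base_batch_size, base_seq_len, seq_len):
    return max(1, int(base_batch_size * base_seq_len / seq_len))
```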
May 28, 2025
Update EVA ViT (closest match) to support Perception Encoder models (https://arxiv.org/abs/2504.13181) from Meta, loading Hub weights but I still need to push dedicated timm weights
Add some flexibility to ROPE impl
Big increase in number of models supporting forward_intermediates() and some additional fixes thanks to https://github.com/brianhou0208
Add local-dir: pretrained schema, can use local-dir:/path/to/model/folder as the model name to source a model / pretrained cfg & weights in Hugging Face Hub format (config.json + weights file) from a local folder.
Dec 31, 2024
Fix existing RmsNorm layer & fn to match standard formulation, use PT 2.5 impl when possible. Move old impl to SimpleNorm layer, it’s LN w/o centering or bias. There were only two timm models using it, and they have been updated.
Allow override of cache_dir arg for model creation
Pass through trust_remote_code for HF datasets wrapper
inception_next_atto model added by creator
Adan optimizer caution, and Lamb decoupled weight decay options
All OpenCLIP and JAX (CLIP, SigLIP, Pali, etc) model weights that used load time remapping were given their own HF Hub instances so that they work with hf-hub: based loading, and thus will work with new Transformers TimmWrapperModel
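For reference, the standard RMSNorm formulation the fix above aligns with (Zhang & Sennrich, 2019) normalizes by the root mean square of the elements, with no centering and no bias. A pure-Python sketch over a single vector, not the timm layer:

```python
def rms_norm(x, weight=None, eps=1e-6):
    # y_i = x_i / sqrt(mean(x^2) + eps), then optional elementwise scale
    rms = (sum(v * v for v in x) / len(x) + eps) ** 0.5
    y = [v / rms for v in x]
    if weight is not None:
        y = [v * w for v, w in zip(y, weight)]
    return y
```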
Introduction
PyTorch Image Models (timm) is a collection of image models, layers, utilities, optimizers, schedulers, data-loaders / augmentations, and reference training / validation scripts that aim to pull together a wide variety of SOTA models with the ability to reproduce ImageNet training results.
The work of many others is present here. I’ve tried to make sure all source material is acknowledged via links to github, arxiv papers, etc in the README, documentation, and code docstrings. Please let me know if I missed anything.
Features
Models
All model architecture families include variants with pretrained weights. Some specific model variants have no weights; that is NOT a bug. Help training new or better weights is always appreciated.
Several (less common) features that I often utilize in my projects are included. Many of their additions are the reason why I maintain my own set of models, instead of using others’ via PIP:
All models have a common default configuration interface and API for
accessing/changing the classifier - get_classifier and reset_classifier
doing a forward pass on just the features - forward_features (see documentation)
these make it easy to write consistent network wrappers that work with any of the models
All models support multi-scale feature map extraction (feature pyramids) via create_model(name, features_only=True, out_indices=..., output_stride=...) (see documentation)
out_indices creation arg specifies which feature maps to return, these indices are 0 based and generally correspond to the C(i + 1) feature level.
output_stride creation arg controls output stride of the network by using dilated convolutions. Most networks are stride 32 by default. Not all networks support this.
feature map channel counts and reduction levels (strides) can be queried AFTER model creation via the .feature_info member
All models have a consistent pretrained weight loader that adapts the last linear layer if necessary, and converts from 3-channel to 1-channel input if desired
NVIDIA DDP w/ a single GPU per process, multiple processes with APEX present (AMP mixed-precision optional)
PyTorch DistributedDataParallel w/ multi-gpu, single process (AMP disabled as it crashes when enabled)
PyTorch w/ single GPU single process (AMP optional)
A dynamic global pool implementation that allows selecting from average pooling, max pooling, average + max, or concat([average, max]) at model creation. All global pooling is adaptive average by default and compatible with pretrained weights.
A ‘Test Time Pool’ wrapper that can wrap any of the included models and usually provides improved performance when doing inference with input images larger than the training size. Idea adapted from the original DPN implementation when I ported it (https://github.com/cypw/DPNs)
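The selectable global pooling modes described above reduce to something like this per-channel sketch. It is pure Python for illustration; timm's adaptive pooling layers operate on NCHW tensors, and equal 0.5 weighting for the avg+max mode is an assumption of this sketch.

```python
def global_pool(values, mode='avg'):
    # values: flattened spatial activations for one channel
    avg = sum(values) / len(values)
    mx = max(values)
    if mode == 'avg':
        return [avg]
    if mode == 'max':
        return [mx]
    if mode == 'avgmax':
        return [0.5 * (avg + mx)]   # assumed equal weighting
    if mode == 'catavgmax':
        return [avg, mx]            # doubles the feature dim
    raise ValueError(mode)
```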
timmdocs is an alternate set of documentation for timm. A big thanks to Aman Arora for his efforts creating timmdocs.
paperswithcode is a good resource for browsing the models within timm.
Train, Validation, Inference Scripts
The root folder of the repository contains reference train, validation, and inference scripts that work with the included models and other features of this repository. They are adaptable for other datasets and use cases with a little hacking. See documentation.
Awesome PyTorch Resources
One of the greatest assets of PyTorch is the community and their contributions. A few of my favourite resources that pair well with the models and components here are listed below.
Object Detection, Instance and Semantic Segmentation
Computer Vision / Image Augmentation
Knowledge Distillation
Metric Learning
Training / Frameworks
Deployment
Licenses
Code
The code here is licensed Apache 2.0. I’ve taken care to make sure any third party code included or adapted has compatible (permissive) licenses such as MIT, BSD, etc. I’ve made an effort to avoid any GPL / LGPL conflicts. That said, it is your responsibility to ensure you comply with licenses here and conditions of any dependent licenses. Where applicable, I’ve linked the sources/references for various components in docstrings. If you think I’ve missed anything please create an issue.
Pretrained Weights
So far all of the pretrained weights available here are pretrained on ImageNet with a select few that have some additional pretraining (see extra note below). ImageNet was released for non-commercial research purposes only (https://image-net.org/download). It’s not clear what the implications of that are for the use of pretrained weights from that dataset. Any models I have trained with ImageNet are done for research purposes and one should assume that the original dataset license applies to the weights. It’s best to seek legal advice if you intend to use the pretrained weights in a commercial product.
Pretrained on more than ImageNet
Several weights included or references here were pretrained with proprietary datasets that I do not have access to. These include the Facebook WSL, SSL, SWSL ResNe(Xt) and the Google Noisy Student EfficientNet models. The Facebook models have an explicit non-commercial license (CC-BY-NC 4.0, https://github.com/facebookresearch/semi-supervised-ImageNet1K-models, https://github.com/facebookresearch/WSL-Images). The Google models do not appear to have any restriction beyond the Apache 2.0 license (and ImageNet concerns). In either case, you should contact Facebook or Google with any questions.
Citing
BibTeX
@misc{rw2019timm,
author = {Ross Wightman},
title = {PyTorch Image Models},
year = {2019},
publisher = {GitHub},
journal = {GitHub repository},
doi = {10.5281/zenodo.4414861},
howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
}
PyTorch Image Models
What’s New
Feb 23, 2026
Jan 21, 2026
… ParallelScalingBlock (& DiffParallelScalingBlock) … timm models but could impact downstream use.
Jan 5 & 6, 2026
Dec 30, 2025
dpwee, dwee, dlittle (differential) ViTs with a small boost over previous runs
timm variant of the CSATv2 model at 512x512 & 640x640
Factor non-persistent param init out of __init__ into a common method that can be externally called via init_non_persistent_buffers() after meta-device init.
Dec 12, 2025
timm Muon impl. Appears more competitive vs AdamW with familiar hparams for image tasks.
… (DiffAttention), add corresponding DiffParallelScalingBlock (for ViT), train some wee vits
… LsePlus and SimPool
… DropBlock2d (also add support to ByobNet based models)
Dec 1, 2025
Nov 4, 2025
Oct 31, 2025 🎃
… forward_intermediates and fix some checkpointing bugs. Thanks https://github.com/brianhou0208
June 5, 2025
… vision_transformer.py can be loaded into the NaFlexVit model by adding the use_naflex=True flag to create_model
train.py and validate.py add the --naflex-loader arg, must be used with a NaFlexVit model
python validate.py /imagenet --amp -j 8 --model vit_base_patch16_224 --model-kwargs use_naflex=True --naflex-loader --naflex-max-seq-len 256
May 28, 2025
Feb 21, 2025
vit_so150m2_patch16_reg1_gap_448.sbb_e200_in12k_ft_in1k - 88.1% top-1
vit_so150m2_patch16_reg1_gap_384.sbb_e200_in12k_ft_in1k - 87.9% top-1
vit_so150m2_patch16_reg1_gap_256.sbb_e200_in12k_ft_in1k - 87.3% top-1
vit_so150m2_patch16_reg4_gap_256.sbb_e200_in12k
Feb 1, 2025
Jan 27, 2025
Jan 19, 2025
vit_so150m_patch16_reg4_gap_256.sbb_e250_in12k_ft_in1k - 86.7% top-1
vit_so150m_patch16_reg4_gap_384.sbb_e250_in12k_ft_in1k - 87.4% top-1
vit_so150m_patch16_reg4_gap_256.sbb_e250_in12k
Jan 9, 2025
… bfloat16 or float16
wandb project name arg added by https://github.com/caojiaolong, use arg.experiment for name
Jan 6, 2025
torch.utils.checkpoint.checkpoint() wrapper in timm.models that defaults use_reentrant=False, unless TIMM_REENTRANT_CKPT=1 is set in env.
Dec 31, 2024
convnext_nano 384x384 ImageNet-12k pretrain & fine-tune. https://huggingface.co/models?search=convnext_nano%20r384
vit_large_patch14_clip_224.dfn2b_s39b
Optimizers
To see full list of optimizers w/ descriptions:
timm.optim.list_optimizers(with_description=True)
Included optimizers available via timm.optim.create_optimizer_v2 factory method:
adabelief an implementation of AdaBelief adapted from https://github.com/juntang-zhuang/Adabelief-Optimizer - https://arxiv.org/abs/2010.07468
adafactor adapted from FAIRSeq impl - https://arxiv.org/abs/1804.04235
adafactorbv adapted from Big Vision - https://arxiv.org/abs/2106.04560
adahessian by David Samuel - https://arxiv.org/abs/2006.00719
adamp and sgdp by Naver ClovAI - https://arxiv.org/abs/2006.08217
adamuon and nadamuon as per https://github.com/Chongjie-Si/AdaMuon - https://arxiv.org/abs/2507.11005
adan an implementation of Adan adapted from https://github.com/sail-sg/Adan - https://arxiv.org/abs/2208.06677
adopt ADOPT adapted from https://github.com/iShohei220/adopt - https://arxiv.org/abs/2411.02853
kron PSGD w/ Kronecker-factored preconditioner from https://github.com/evanatyourservice/kron_torch - https://sites.google.com/site/lixilinx/home/psgd
lamb an implementation of Lamb and LambC (w/ trust-clipping) cleaned up and modified to support use with XLA - https://arxiv.org/abs/1904.00962
laprop optimizer from https://github.com/Z-T-WANG/LaProp-Optimizer - https://arxiv.org/abs/2002.04839
lars an implementation of LARS and LARC (w/ trust-clipping) - https://arxiv.org/abs/1708.03888
lion an implementation of Lion adapted from https://github.com/google/automl/tree/master/lion - https://arxiv.org/abs/2302.06675
lookahead adapted from impl by Liam - https://arxiv.org/abs/1907.08610
madgrad an implementation of MADGRAD adapted from https://github.com/facebookresearch/madgrad - https://arxiv.org/abs/2101.11075
mars MARS optimizer from https://github.com/AGI-Arena/MARS - https://arxiv.org/abs/2411.10438
muon MUON optimizer from https://github.com/KellerJordan/Muon with numerous additions and improved non-transformer behaviour
nadam an implementation of Adam w/ Nesterov momentum
nadamw an implementation of AdamW (Adam w/ decoupled weight-decay) w/ Nesterov momentum. A simplified impl based on https://github.com/mlcommons/algorithmic-efficiency
novograd by Masashi Kimura - https://arxiv.org/abs/1905.11286
radam by Liyuan Liu - https://arxiv.org/abs/1908.03265
rmsprop_tf adapted from PyTorch RMSProp by myself. Reproduces much improved Tensorflow RMSProp behaviour
sgdw an implementation of SGD w/ decoupled weight-decay
fused<name> optimizers by name with NVIDIA Apex installed
bnb<name> optimizers by name with BitsAndBytes installed
cadamw, clion, and more ‘Cautious’ optimizers from https://github.com/kyleliang919/C-Optim - https://arxiv.org/abs/2411.16085
adam, adamw, rmsprop, adadelta, adagrad, and sgd pass through to torch.optim implementations
c suffix (eg adamc, nadamc) to implement ‘corrected weight decay’ in https://arxiv.org/abs/2506.02285
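The ‘Cautious’ variants in the list above apply a simple sign-agreement mask to the update, per the C-Optim paper; a rough sketch over flat lists, with the exact rescaling treated as an assumption of this illustration:

```python
def cautious_update(update, grad):
    # zero components whose sign disagrees with the current gradient
    mask = [1.0 if u * g > 0 else 0.0 for u, g in zip(update, grad)]
    # rescale so the surviving components keep the expected step size
    scale = len(mask) / max(sum(mask), 1.0)
    return [u * m * scale for u, m in zip(update, mask)]
```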
Augmentations
Regularization
Other
Schedulers include step, cosine w/ restarts, tanh w/ restarts, plateau
Results
Model validation results can be found in the results tables
Getting Started (Documentation)
The official documentation can be found at https://huggingface.co/docs/hub/timm. Documentation contributions are welcome.
Getting Started with PyTorch Image Models (timm): A Practitioner’s Guide by Chris Hughes is an extensive blog post covering many aspects of timm in detail.