Sequential Hidden Decoding
Scaling Sequence Length in Pretraining
WeChat AI, Tencent

Scale sequence length by n× with only Embedding parameters — same Transformer, more compute per token
Updates
2026-04-01: Released Sequential-Hidden-Decoding-8B-n8-Instruct — our first instruction-tuned model, outperforming Qwen3-8B-Instruct on all 8 benchmarks. Added Chat UI for interactive conversations.
2026-03-10: Initial release of base models (n=2, n=4, n=8) and SGLang inference patch.
Key Idea
Prepare n independent Embedding matrices to encode the same token sequence n times, interleave the results, and feed the n×-length sequence into the same Transformer. Only the last embedding of each token computes the next-token loss, while the preceding embeddings serve as implicit reasoning steps in a continuous latent space.
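The interleaving step can be sketched as follows. This is an illustrative reconstruction, not the released training code; the sizes `n`, `vocab`, `d` and the names `embed_tables`, `loss_mask` are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

n, vocab, d = 4, 1000, 16  # n embedding tables (sizes assumed for illustration)
embed_tables = [rng.normal(size=(vocab, d)) for _ in range(n)]

tokens = np.array([5, 42, 7])  # original sequence, length T

# Encode the same tokens once per table, then interleave so the layout is
# [t0_e0, t0_e1, ..., t0_e{n-1}, t1_e0, ...] — an n×-length input sequence.
per_table = np.stack([tab[tokens] for tab in embed_tables], axis=1)  # (T, n, d)
interleaved = per_table.reshape(-1, d)                               # (T*n, d)

# Only the last embedding of each token position carries the next-token
# loss; the n-1 preceding embeddings act as latent "reasoning" slots.
loss_mask = np.zeros(len(tokens) * n, dtype=bool)
loss_mask[n - 1 :: n] = True
```

The masked positions are exactly one per original token, so the loss is still computed over T targets even though the Transformer processes T×n positions.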
Results
Base Model
Evaluated on Qwen3-8B-Base with progressive Sequential Hidden Decoding scaling (non-thinking, base model):
| Benchmark        | # Shots | 8B Baseline | 8B scale n=2 | 8B scale n=4 | 8B scale n=8 |
|------------------|---------|-------------|--------------|--------------|--------------|
| BBH (EM)         | 3-shot  | 78.8        | 81.3         | 83.0         | 83.9         |
| MMLU (EM)        | 5-shot  | 79.8        | 80.9         | 81.9         | 82.2         |
| MBPP+ (Pass@1)   | 1-shot  | 66.7        | 69.4         | 68.7         | 69.4         |
| MATH (LLM-judge) | 4-shot  | 56.0        | 58.2         | 60.0         | 61.1         |
| ARC-C            | 25-shot | 93.9        | 94.3         | 94.4         | 94.7         |
| Hellaswag        | 10-shot | 79.7        | 83.1         | 85.0         | 85.3         |
| GSM8K            | 4-shot  | 92.5        | 93.3         | 93.9         | 94.6         |
Instruct Model
Instruction-tuned from Sequential-Hidden-Decoding-8B-n8, compared with:
- Qwen3-8B-Instruct: the official instruction-tuned model released by the Qwen team.
- Qwen3-8B SFT: our own SFT baseline, trained from Qwen3-8B-Base on exactly the same data as Scale-Seq n=8 SFT but without Sequential Hidden Decoding. This isolates the effect of Sequential Hidden Decoding from that of the training data.
| Benchmark    | Category  | Qwen3-8B-Instruct | Qwen3-8B SFT | Scale-Seq n=8 SFT |
|--------------|-----------|-------------------|--------------|-------------------|
| AIME24       | Math      | 29.42             | 19.09        | 30.00             |
| AIME25       | Math      | 20.58             | 17.67        | 26.25             |
| MATH500      | Math      | 85.16             | 84.19        | 89.04             |
| GPQA Diamond | Reasoning | 48.79             | 50.91        | 55.25             |
| CMMLU        | Knowledge | 78.20             | 75.05        | 81.15             |
| C-Eval       | Knowledge | 78.67             | 76.04        | 83.20             |
| MMLU-Pro     | Knowledge | 65.84             | 68.35        | 75.70             |
| SuperGPQA    | Knowledge | 36.79             | 41.20        | 48.35             |
Sequential Hidden Decoding n=8 SFT outperforms Qwen3-8B-Instruct on all 8 benchmarks, including competitive math benchmarks (AIME24/25) and knowledge-intensive tasks (MMLU-Pro, SuperGPQA). Sampling: temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5, max_tokens=4096. Judge: GPT-4o.
Models
Base Models
All models share the same 8B Transformer backbone — only the Embedding parameters grow:
Note: Sequential Hidden Decoding models process n×-length sequences internally, so --chunked-prefill-size -1 (disable chunked prefill), --attention-backend fa3, and reduced batch sizes are important for stability and performance. Adjust --tp-size for multi-GPU setups.
Usage
Chat Completions (Instruct Model)
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the concept of hidden decoding in simple terms."},
    ],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)
```
Text Completions (Base Models)
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
response = client.completions.create(
    model="default",
    prompt="The meaning of life is",
    max_tokens=128,
    temperature=0,
)
print(response.choices[0].text)
```
Chat UI
We provide a lightweight, zero-dependency web chat interface for interactive conversations:
```shell
python chat_ui.py
```
Then open http://localhost:8081 in your browser. The Chat UI connects to http://localhost:8080/v1 by default.
Features:
Streaming responses with real-time token speed display
Thinking/reasoning block visualization
Configurable system prompt, temperature, and max tokens
Zero dependencies — pure Python standard library
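As an illustration of the zero-dependency design, a client can reach the OpenAI-compatible endpoint with nothing beyond the standard library. This sketch is not the actual `chat_ui.py` implementation; the endpoint and payload shape mirror the Usage section above, and `build_chat_request` is a hypothetical helper.

```python
# Sketch: building a streaming chat request with only json + urllib,
# as a zero-dependency UI might. Illustrative, not chat_ui.py itself.
import json
import urllib.request

def build_chat_request(prompt, base_url="http://localhost:8080/v1"):
    payload = {
        "model": "default",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,  # stream tokens as they arrive, for live display
    }
    return urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("hello")
```

Sending the request (e.g. with `urllib.request.urlopen`) and reading the server-sent-event lines is all the UI needs, so no third-party HTTP or websocket library is required.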
Patch Contents
The patch adds the qwen3_scale_seq model architecture and modifies the scheduler, batch manager, and CUDA graph runner to handle the expanded sequence length.
Citation
If you find this work useful, please cite our blog post:
```bibtex
@article{hidden_decoding_2026,
  title = {Sequential Hidden Decoding: Scaling Sequence Length in Pretraining},
  year  = {2026},
  url   = {https://welm.weixin.qq.com/posts/hidden_decoding/}
}
```
Instruct Models
Installation (Inference)
Option 1: Docker Image (Recommended)
Option 2: Use the Forked SGLang
Option 3: Apply Patch Manually
Apply the patch on top of SGLang:
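The original patch-apply commands were lost in extraction; a typical sequence might look like the following, where the patch filename and target SGLang commit are placeholders to be taken from this repository:

```shell
# Hypothetical paths and filenames -- check the repository for the actual
# patch file and the SGLang revision it targets.
git clone https://github.com/sgl-project/sglang.git
cd sglang
git apply /path/to/scale_seq.patch   # placeholder patch name
pip install -e "python[all]"
```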
Serving
Launch the server:
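The original launch command was lost in extraction; a command consistent with the flags called out in the serving note above might look like this, where the model path, port, and tensor-parallel size are placeholders:

```shell
python -m sglang.launch_server \
  --model-path Sequential-Hidden-Decoding-8B-n8-Instruct \
  --port 8080 \
  --chunked-prefill-size -1 \
  --attention-backend fa3 \
  --tp-size 1
```

As the note explains, disabling chunked prefill and using the fa3 attention backend matter because the model processes n×-length sequences internally.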
Docker
Contact
Sijun Zhang (nepheloturbulence@gmail.com), Aiwei Liu (liuaiwei20@gmail.com)
License
This project is released under the License Terms of Sequential-Hidden-Decoding. The dependent open-source models and software components remain licensed under their respective original licenses — see the LICENSE file for details.