Sequential Hidden Decoding

Scaling Sequence Length in Pretraining

WeChat AI, Tencent

Blog   License   Models


Scale sequence length by n× with only Embedding parameters — same Transformer, more compute per token


Updates

  • 2026-04-01: Released Sequential-Hidden-Decoding-8B-n8-Instruct — our first instruction-tuned model, outperforming Qwen3-8B-Instruct on all 8 benchmarks. Added Chat UI for interactive conversations.
  • 2026-03-10: Initial release of base models (n=2, n=4, n=8) and SGLang inference patch.

Key Idea

Prepare n independent Embedding matrices to encode the same token sequence n times, interleave the results, and feed the n×-length sequence into the same Transformer. Only the last embedding of each token computes the next-token loss, while the preceding embeddings serve as implicit reasoning steps in a continuous latent space.
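The interleaving can be sketched in plain Python. This is an illustrative toy, not the training code; the function name, the dict-based embedding tables, and the toy vectors are all hypothetical.

```python
# Toy sketch of the key idea: n independent embedding tables encode the same
# token sequence, the results are interleaved position-wise, and only every
# n-th position carries the next-token loss.

def interleave_embeddings(tokens, embed_tables):
    """Encode `tokens` once per table and interleave the results.

    tokens:       list of token ids, length T
    embed_tables: list of n mappings from token id -> embedding vector
    returns:      (sequence of length n*T, loss mask of length n*T)
    """
    n = len(embed_tables)
    sequence, loss_mask = [], []
    for tok in tokens:
        for i, table in enumerate(embed_tables):
            sequence.append(table[tok])
            # Only the last of the n embeddings predicts the next token;
            # the first n-1 act as latent "reasoning" positions.
            loss_mask.append(i == n - 1)
    return sequence, loss_mask

# Toy example: n=2, vocab of 3 tokens, 2-dim embeddings.
tables = [
    {0: [0.1, 0.0], 1: [0.2, 0.0], 2: [0.3, 0.0]},  # embedding table 1
    {0: [0.0, 0.1], 1: [0.0, 0.2], 2: [0.0, 0.3]},  # embedding table 2
]
seq, mask = interleave_embeddings([1, 2, 0], tables)
print(len(seq))                # 6 == n * T
print([int(m) for m in mask])  # [0, 1, 0, 1, 0, 1]
```

The n×-length sequence then goes through the unchanged Transformer; the backbone sees more positions (and thus more compute) per input token without any new non-embedding parameters.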


Results

Base Model

Evaluated on Qwen3-8B-Base with progressive Sequential Hidden Decoding scaling (non-thinking, base model):

| Benchmark | # Shots | 8B Baseline | 8B scale n=2 | 8B scale n=4 | 8B scale n=8 |
|---|---|---|---|---|---|
| BBH (EM) | 3-shot | 78.8 | 81.3 | 83.0 | 83.9 |
| MMLU (EM) | 5-shot | 79.8 | 80.9 | 81.9 | 82.2 |
| MBPP+ (Pass@1) | 1-shot | 66.7 | 69.4 | 68.7 | 69.4 |
| MATH (LLM-judge) | 4-shot | 56.0 | 58.2 | 60.0 | 61.1 |
| ARC-C | 25-shot | 93.9 | 94.3 | 94.4 | 94.7 |
| HellaSwag | 10-shot | 79.7 | 83.1 | 85.0 | 85.3 |
| GSM8K | 4-shot | 92.5 | 93.3 | 93.9 | 94.6 |

Instruct Model

Instruction-tuned from Sequential-Hidden-Decoding-8B-n8, compared with:

  • Qwen3-8B-Instruct: the official instruction-tuned model released by the Qwen team.
  • Qwen3-8B SFT: our own SFT baseline trained on Qwen3-8B-Base with exactly the same data as Scale-Seq n=8 SFT, but without Sequential Hidden Decoding. This isolates the effect of scale-seq from the training data.

| Benchmark | Category | Qwen3-8B-Instruct | Qwen3-8B SFT | Scale-Seq n=8 SFT |
|---|---|---|---|---|
| AIME24 | Math | 29.42 | 19.09 | 30.00 |
| AIME25 | Math | 20.58 | 17.67 | 26.25 |
| MATH500 | Math | 85.16 | 84.19 | 89.04 |
| GPQA Diamond | Reasoning | 48.79 | 50.91 | 55.25 |
| CMMLU | Knowledge | 78.20 | 75.05 | 81.15 |
| C-Eval | Knowledge | 78.67 | 76.04 | 83.20 |
| MMLU-Pro | Knowledge | 65.84 | 68.35 | 75.70 |
| SuperGPQA | Knowledge | 36.79 | 41.20 | 48.35 |

Sequential Hidden Decoding n=8 SFT outperforms Qwen3-8B-Instruct on all 8 benchmarks, including competitive math benchmarks (AIME24/25) and knowledge-intensive tasks (MMLU-Pro, SuperGPQA). Sampling: temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5, max_tokens=4096. Judge: GPT-4o.
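As a quick sanity check, the per-benchmark gains over Qwen3-8B-Instruct can be recomputed directly from the table above (the numbers are copied verbatim; the snippet itself is just arithmetic):

```python
# Per-benchmark gain of Scale-Seq n=8 SFT over Qwen3-8B-Instruct,
# using the scores from the instruct-model table above.
scores = {
    # benchmark: (Qwen3-8B-Instruct, Scale-Seq n=8 SFT)
    "AIME24":       (29.42, 30.00),
    "AIME25":       (20.58, 26.25),
    "MATH500":      (85.16, 89.04),
    "GPQA Diamond": (48.79, 55.25),
    "CMMLU":        (78.20, 81.15),
    "C-Eval":       (78.67, 83.20),
    "MMLU-Pro":     (65.84, 75.70),
    "SuperGPQA":    (36.79, 48.35),
}
gains = {b: round(ours - base, 2) for b, (base, ours) in scores.items()}
assert all(g > 0 for g in gains.values())  # wins on all 8 benchmarks
print(max(gains, key=gains.get))           # SuperGPQA (+11.56)
```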


Models

Base Models

All models share the same 8B Transformer backbone — only the Embedding parameters grow:

| Model | Scale | Embedding Params | Training Tokens | Link |
|---|---|---|---|---|
| Sequential-Hidden-Decoding-8B-n2 | n=2 | 1.9B | 75B | HuggingFace |
| Sequential-Hidden-Decoding-8B-n4 | n=4 | 3.1B | 150B | HuggingFace |
| Sequential-Hidden-Decoding-8B-n8 | n=8 | 5.6B | 187B | HuggingFace |

Instruct Models

| Model | Scale | Base Model | Link |
|---|---|---|---|
| Sequential-Hidden-Decoding-8B-n8-Instruct | n=8 | Sequential-Hidden-Decoding-8B-n8 | HuggingFace |

Installation (Inference)

Option 1: Docker Image

docker pull aiweiliu/sglang-scale-seq:v0.5.2rc2-cu126

Option 2: Use the Forked SGLang

git clone https://github.com/exlaw/sglang.git
cd sglang
pip install -e "python[all]"

Option 3: Apply Patch Manually

Apply the patch on top of SGLang:

git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout 4efe844a2
git apply /path/to/hidden_decoding.patch
pip install -e "python[all]"

Serving

Launch the server:

python -m sglang.launch_server \
    --model-path tencent/Sequential-Hidden-Decoding-8B-n8-Instruct \
    --trust-remote-code \
    --tp-size 1 \
    --port 8080 --host 0.0.0.0 \
    --chunked-prefill-size -1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.82 \
    --max-running-requests 32 \
    --context-length 131072 \
    --cuda-graph-max-bs 128 \
    --cuda-graph-bs 1 2 4 8 16 32 64 128
Docker
docker run --gpus all -p 8080:8080 -p 8081:8081 \
  -v /path/to/models:/models \
  aiweiliu/sglang-scale-seq:v0.5.2rc2-cu126 \
  bash -c "python -m sglang.launch_server \
    --model-path /models/Sequential-Hidden-Decoding-8B-n8-Instruct \
    --trust-remote-code \
    --tp-size 1 \
    --port 8080 --host 0.0.0.0 \
    --chunked-prefill-size -1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.82 \
    --max-running-requests 32 \
    --context-length 131072 \
    --cuda-graph-max-bs 128 \
    --cuda-graph-bs 1 2 4 8 16 32 64 128"

Note: Sequential Hidden Decoding models process n×-length sequences internally, so --chunked-prefill-size -1 (disable chunked prefill), --attention-backend fa3, and reduced batch sizes are important for stability and performance. Adjust --tp-size for multi-GPU setups.


Usage

Chat Completions (Instruct Model)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the concept of hidden decoding in simple terms."},
    ],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)

Text Completions (Base Models)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
response = client.completions.create(
    model="default",
    prompt="The meaning of life is",
    max_tokens=128,
    temperature=0,
)
print(response.choices[0].text)

Chat UI

We provide a lightweight, zero-dependency web chat interface for interactive conversations:

python chat_ui.py

Then open http://localhost:8081 in your browser. The Chat UI connects to http://localhost:8080/v1 by default.

Docker (serve model + Chat UI together)
docker run --gpus all -p 8080:8080 -p 8081:8081 \
  -v /path/to/models:/models \
  -v /path/to/chat_ui.py:/app/chat_ui.py \
  aiweiliu/sglang-scale-seq:v0.5.2rc2-cu126 \
  bash -c "python -m sglang.launch_server \
    --model-path /models/Sequential-Hidden-Decoding-8B-n8-Instruct \
    --trust-remote-code \
    --tp-size 1 \
    --port 8080 --host 0.0.0.0 \
    --chunked-prefill-size -1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.82 \
    --max-running-requests 32 \
    --context-length 131072 \
    --cuda-graph-max-bs 128 \
    --cuda-graph-bs 1 2 4 8 16 32 64 128 &
  sleep 5 && python /app/chat_ui.py"

Open http://localhost:8081 — it connects to the model server automatically.

Features:

  • Streaming responses with real-time token speed display
  • Thinking/reasoning block visualization
  • Configurable system prompt, temperature, and max tokens
  • Zero dependencies — pure Python standard library
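
For readers curious what "zero dependencies" looks like in practice, here is a minimal sketch of a stdlib-only chat backend in the same spirit. This is NOT the shipped chat_ui.py; the handler class, the inline page, and the proxy logic are hypothetical illustrations.

```python
# Illustrative zero-dependency chat backend: serve a static page and proxy
# chat requests to the model server, using only the Python standard library.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

MODEL_API = "http://localhost:8080/v1/chat/completions"  # model server
PAGE = b"<!doctype html><title>Chat</title><p>Minimal chat UI placeholder</p>"

class ChatHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the (inline) chat page.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(PAGE)

    def do_POST(self):
        # Forward the browser's JSON body to the OpenAI-compatible API.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        req = Request(MODEL_API, data=body,
                      headers={"Content-Type": "application/json"})
        with urlopen(req) as resp:
            payload = resp.read()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the console quiet

# To run on port 8081 (the Chat UI's default port):
# HTTPServer(("0.0.0.0", 8081), ChatHandler).serve_forever()
```

A real UI would also stream tokens back to the browser, but the pattern is the same: one stdlib HTTP server in front of the OpenAI-compatible endpoint.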


Patch Contents

The patch adds the qwen3_scale_seq model architecture and modifies the scheduler, batch manager, and CUDA graph runner to handle the expanded sequence length.


Citation

If you find this work useful, please cite our blog post:

@article{hidden_decoding_2026,
  title   = {Sequential Hidden Decoding: Scaling Sequence Length in Pretraining},
  year    = {2026},
  url     = {https://welm.weixin.qq.com/posts/hidden_decoding/}
}

Contact

Sijun Zhang (nepheloturbulence@gmail.com), Aiwei Liu (liuaiwei20@gmail.com)

License

This project is released under the License Terms of Sequential-Hidden-Decoding. The dependent open-source models and software components remain licensed under their respective original licenses — see the LICENSE file for details.
