fix: scalar quantization can’t work with NaNs (#3476)
Address potential out-of-range issues when scaling values to u8 in the ScalarQuantizer.
Introduce a test case to handle NaN values in the scaling function.
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
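
As a rough illustration of the failure mode this commit describes, the sketch below scales floats into the u8 range while guarding against NaN and out-of-range inputs. It is written in Python rather than the Rust of the quantizer itself. The name `scale_to_u8` and the choice to map NaN to 0 are assumptions made for this example, not Lance's actual implementation.

```python
import math


def scale_to_u8(values, lower, upper):
    """Scale floats from [lower, upper] onto 0..255.

    Hypothetical helper for illustration; not part of Lance's API.
    NaN inputs map to 0 (an assumption) and out-of-range inputs are clamped.
    """
    span = upper - lower
    if math.isnan(span) or span <= 0:
        # Degenerate or NaN bounds: there is nothing meaningful to scale against.
        return [0 for _ in values]
    codes = []
    for v in values:
        if math.isnan(v):
            codes.append(0)
            continue
        scaled = (v - lower) * 255.0 / span
        # Clamp before rounding so infinities and outliers stay inside 0..255.
        scaled = max(0.0, min(255.0, scaled))
        codes.append(int(round(scaled)))
    return codes


codes = scale_to_u8([0.0, 1.0, float("nan"), 2.0, -1.0], 0.0, 1.0)
print(codes)
```

Clamping before the integer conversion is the important design choice here: it keeps every code inside the 8-bit range no matter what the input contains.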
Modern columnar data format for ML. Convert from Parquet in 2 lines of code for 100x faster random access, a vector index, data versioning, and more.
Compatible with pandas, DuckDB, Polars, and pyarrow with more integrations on the way.
Documentation • Blog • Discord • Twitter
Lance is a modern columnar data format that is optimized for ML workflows and datasets. Lance is perfect for:
The key features of Lance include:
High-performance random access: 100x faster than Parquet without sacrificing scan performance.
Vector search: find nearest neighbors in milliseconds and combine OLAP queries with vector search.
Zero-copy, automatic versioning: manage versions of your data without needing extra infrastructure.
Ecosystem integrations: Apache Arrow, Pandas, Polars, DuckDB and more on the way.
Quick Start
Installation
To install a preview release:
Converting to Lance
Reading Lance data
Pandas
DuckDB
Vector search
Download the sift1m subset
Convert it to Lance
Build the index
Search the dataset
Directory structure
What makes Lance different
Here we will highlight a few aspects of Lance’s design. For more details, see the full Lance design document.
Vector index: Vector index for similarity search over embedding space. Supports both CPUs (x86_64 and arm) and GPUs (Nvidia (cuda) and Apple Silicon (mps)).
Encodings: To achieve both fast columnar scans and sub-linear point queries, Lance uses custom encodings and layouts.
Nested fields: Lance stores each subfield as a separate column to support efficient filters like “find images where detected objects include cats”.
Versioning: A Manifest can be used to record snapshots. Currently we support creating new versions automatically via appends, overwrites, and index creation.
Fast updates (ROADMAP): Updates will be supported via write-ahead logs.
Rich secondary indices (ROADMAP):
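
To make the versioning idea above concrete, here is a minimal sketch of how one manifest per version can record snapshots without copying data. `ToyDataset`, `Manifest`, and the fragment names are hypothetical and exist only for this illustration; this is not Lance's on-disk format or its Python API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Manifest:
    """One immutable snapshot: a version number plus the fragments it references."""
    version: int
    fragments: tuple


@dataclass
class ToyDataset:
    """Hypothetical in-memory model of manifest-based versioning."""
    manifests: list = field(default_factory=lambda: [Manifest(0, ())])

    def _commit(self, fragments):
        # Every write produces a new manifest; earlier manifests are never modified.
        manifest = Manifest(len(self.manifests), tuple(fragments))
        self.manifests.append(manifest)
        return manifest.version

    def append(self, fragment):
        # Reuse all existing fragments by reference and add only the new one.
        return self._commit(self.manifests[-1].fragments + (fragment,))

    def overwrite(self, fragment):
        # The new snapshot references only the new data.
        return self._commit((fragment,))

    def checkout(self, version):
        # Reading an old version is just a lookup of its manifest.
        return self.manifests[version].fragments


ds = ToyDataset()
v1 = ds.append("frag-a")
v2 = ds.append("frag-b")
v3 = ds.overwrite("frag-c")
print(ds.checkout(v2))  # fragments visible at version 2
print(ds.checkout(v3))  # fragments visible after the overwrite
```

Because old manifests are left in place, an overwrite does not destroy history in this model: version 2 can still be checked out after version 3 is written.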
Benchmarks
Vector search
We benchmarked vector search on the SIFT dataset, using 1M vectors of 128 dimensions.
Vs. Parquet
We created a Lance dataset from the Oxford Pet dataset to do some preliminary performance testing of Lance compared to Parquet and raw image/XML files. For analytics queries, Lance is 50-100x better than reading the raw metadata. For batched random access, Lance is 100x better than both Parquet and raw files.
Why are you building yet another data format?!
The machine learning development cycle involves the steps:
graph LR
    A[Collection] --> B[Exploration];
    B --> C[Analytics];
    C --> D[Feature Engineer];
    D --> E[Training];
    E --> F[Evaluation];
    F --> C;
    E --> G[Deployment];
    G --> H[Monitoring];
    H --> A;

People use different data representations at different stages, either for performance or because of the limits of the available tooling. Academia mainly uses XML / JSON for annotations and zipped images/sensor data for deep learning, which is difficult to integrate into data infrastructure and slow to train on over cloud storage. Industry uses data lakes (Parquet-based techniques, e.g., Delta Lake, Iceberg) or data warehouses (AWS Redshift or Google BigQuery) to collect and analyze data, but then has to convert the data into training-friendly formats, such as Rikai/Petastorm or TFRecord. Multiple single-purpose data transforms, as well as syncing copies between cloud storage and local training instances, have become common practice.
While each of the existing data formats excels at the workload it was originally designed for, we need a new data format tailored for multistage ML development cycles to reduce data silos.
A comparison of different data formats at each stage of the ML development cycle.
Community Highlights
Lance is currently used in production by:
Presentations and Talks