This repository contains the implementation of AdaNDV. Our paper provides the details, including the train/validation/test dataset splits, data preprocessing, and model training. You can obtain the results presented in the paper by following the instructions below.
If you find our work useful, please cite the paper:
@article{xu2024adandv,
title={AdaNDV: Adaptive Number of Distinct Value Estimation via Learning to Select and Fuse Estimators},
author={Xu, Xianghong and Zhang, Tieying and He, Xiao and Li, Haoyang and Kang, Rong and Wang, Shuai and Xu, Linhui and Liang, Zhimin and Luo, Shangyu and Zhang, Lei and others},
journal={Proceedings of the VLDB Endowment},
volume={18},
number={4},
pages={1104--1117},
year={2024},
publisher={VLDB Endowment}
}
Instructions
Reproduce the results in the paper
We are unable to provide the implementation of statistical estimators (base estimators) and the raw data in this repository due to license issues. You can reproduce the results in the paper by following the instructions below.
Establish the experimental environment.
pip3 install -r requirement.txt
Download the TabLib dataset from Hugging Face and put the parquet files in a folder.
tablib-sample
|-0ad462e9.parquet
|-......
The train/validation/test datasets are split by:
import os

file_list = os.listdir('tablib-sample')  # all parquet files
# Ignore three files ('2d7d54b8', '8e1450ee', 'dc0e820c') due to memory issues
skipped = ('2d7d54b8', '8e1450ee', 'dc0e820c')
file_list = [f for f in file_list if not f.startswith(skipped)]
file_list = sorted(file_list)  # fix the order
train_size = int(len(file_list) * 0.6)
test_size = int(len(file_list) * 0.2)
val_size = len(file_list) - train_size - test_size
train_files = file_list[:train_size]
test_files = file_list[train_size:train_size + test_size]
val_files = file_list[-val_size:]
Implement the traditional base estimators. Refer to pydistinct and the paper for details; we cannot provide their code due to license issues.
Sample and preprocess the data. You will get three pickle files in the data/ folder.
Get the table content by following the instructions provided by TabLib.
For each column, record the number of rows (N) and the ground-truth NDV (D), and uniformly at random sample 1% of the data to build the frequency profile (f).
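As a sketch of this sampling step (the function name, seed handling, and sampling helper are illustrative assumptions, not the repository's actual preprocessing code), the frequency profile f records, for each i, how many distinct values occur exactly i times in the sample:

```python
import random
from collections import Counter

def build_profile(column, rate=0.01, seed=0):
    """Sample `rate` of a column uniformly at random and build its
    frequency profile f, where f[i-1] = number of distinct values that
    appear exactly i times in the sample (hypothetical helper)."""
    rng = random.Random(seed)
    n = max(1, int(len(column) * rate))
    sample = rng.sample(list(column), n)
    value_counts = Counter(sample)                  # value -> occurrences in sample
    freq_of_freq = Counter(value_counts.values())   # i -> #values seen exactly i times
    max_i = max(freq_of_freq)
    return [freq_of_freq.get(i, 0) for i in range(1, max_i + 1)]

col = ['a', 'a', 'b', 'b', 'b', 'c'] * 50   # toy column with 3 distinct values
f = build_profile(col, rate=0.5)
```

Note that sum(f) equals the number of distinct values observed in the sample, and sum(i * f[i-1]) equals the sample size.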
Each pickle file in the data/ folder contains four lists; the basic item of each list is shown below.
data_profile: f[1:H-3] || logn || logd || logN
rank_label: y^over || y^under
estimate_ndv: estimation results (in log) of the following base estimators: ['EB', 'GEE', 'Chao', 'Shlosser', 'ChaoLee', 'Goodman', 'Jackknife', 'Sichel', 'Method of Movement', 'Bootstrap', 'Horvitz Thompson', 'Method of Movement v2', 'Method of Movement v3', 'Smoothed Jackknife']
D_list: ground truth NDV
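A minimal sketch of how one data_profile item could be assembled from the quantities above. The cutoff H, the function name, and the zero-padding scheme are assumptions for illustration; refer to process_data.py for the actual layout:

```python
import math

def make_data_profile(f, n, d, N, H=100):
    """Assemble one data_profile item: the frequency profile truncated
    (or zero-padded) to H-3 entries, concatenated with log(sample size n),
    log(sample NDV d), and log(table rows N). H is a hypothetical cutoff."""
    feat = list(f[:H - 3]) + [0] * max(0, (H - 3) - len(f))
    feat += [math.log(n), math.log(d), math.log(N)]
    return feat

# Toy example: 3 singletons, 1 doubleton, 2 tripletons in the sample
profile = make_data_profile([3, 1, 2], n=10, d=6, N=1000, H=8)
```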
For more details, refer to process_data.py and implement it according to the comments. Then execute the script.
python3 process_data.py
Train the model and observe the results.
python3 train_adandv.py
Show the q-error distributions of the base estimators and the learned estimator (the primary results of Table 3 in the paper).
python3 show_base.py
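For reference, assuming the usual definition of q-error as the symmetric ratio between an estimate and the ground truth (max of the two ratios, so 1.0 means a perfect estimate), a minimal implementation looks like:

```python
def q_error(estimate, truth):
    """q-error = max(estimate/truth, truth/estimate); >= 1.0, lower is better."""
    assert estimate > 0 and truth > 0
    return max(estimate / truth, truth / estimate)

# Over- and under-estimates by the same factor give the same q-error
over = q_error(200, 100)
under = q_error(50, 100)
```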
LICENSE
The code in this repository is licensed under the MIT License.