This repository contains the implementation of AdaNDV. Our paper provides the details, including the train/validation/test dataset splits, data preprocessing, and model training. You can obtain the results presented in the paper by following the instructions below.
If you find our work useful, please cite the paper:
@article{xu2024adandv,
title={AdaNDV: Adaptive Number of Distinct Value Estimation via Learning to Select and Fuse Estimators},
author={Xu, Xianghong and Zhang, Tieying and He, Xiao and Li, Haoyang and Kang, Rong and Wang, Shuai and Xu, Linhui and Liang, Zhimin and Luo, Shangyu and Zhang, Lei and others},
journal={Proceedings of the VLDB Endowment},
volume={18},
number={4},
pages={1104--1117},
year={2024},
publisher={VLDB Endowment}
}
Instructions
Reproduce the results in the paper
We are unable to provide the implementation of statistical estimators (base estimators) and the raw data in this repository due to license issues. You can reproduce the results in the paper by following the instructions below.
Establish the experimental environment.
pip3 install -r requirement.txt
Download the TabLib dataset from Hugging Face and put the parquet files in a folder.
tablib-sample
|-0ad462e9.parquet
|-......
The train/validation/test datasets are split by:
import os

file_list = os.listdir('tablib-sample')  # all parquet files
# Ignore three files ('2d7d54b8', '8e1450ee', 'dc0e820c') due to memory issues
skipped = ('2d7d54b8', '8e1450ee', 'dc0e820c')
file_list = [f for f in file_list if not f.startswith(skipped)]
file_list = sorted(file_list)  # fix the order
train_size = int(len(file_list) * 0.6)
test_size = int(len(file_list) * 0.2)
val_size = len(file_list) - train_size - test_size
train_files = file_list[:train_size]
test_files = file_list[train_size:train_size + test_size]
val_files = file_list[-val_size:]
Implement the traditional base estimators. Refer to pydistinct and the paper for details; we cannot provide their code due to license issues.
Sample and preprocess the data. You will get three pickle files in the data/ folder.
Get the table content by following the instructions provided by TabLib.
For each column, record the number of rows (N) and the ground-truth NDV (D), and uniformly at random sample 1% of the data to build the frequency profile (f).
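As a sketch of this sampling step (the function name, seed handling, and sampling helper are illustrative assumptions, not the repository's actual preprocessing code), the frequency profile f records, for each i, how many distinct values occur exactly i times in the sample:

```python
import random
from collections import Counter

def build_profile(column, rate=0.01, seed=0):
    """Sample `rate` of a column uniformly at random and build its
    frequency profile f, where f[i-1] = number of distinct values that
    appear exactly i times in the sample (hypothetical helper)."""
    rng = random.Random(seed)
    n = max(1, int(len(column) * rate))
    sample = rng.sample(list(column), n)
    value_counts = Counter(sample)                  # value -> occurrences in sample
    freq_of_freq = Counter(value_counts.values())   # i -> #values seen exactly i times
    max_i = max(freq_of_freq)
    return [freq_of_freq.get(i, 0) for i in range(1, max_i + 1)]

col = ['a', 'a', 'b', 'b', 'b', 'c'] * 50   # toy column with 3 distinct values
f = build_profile(col, rate=0.5)
```

Note that sum(f) equals the number of distinct values observed in the sample, and sum(i * f[i-1]) equals the sample size.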
Each pickle file in the data/ folder contains four lists; the basic item of each list is shown below.
data_profile: f[1:H-3] || logn || logd || logN
rank_label: y^over || y^under
estimate_ndv: estimation results (in log) of the following base estimators: ['EB', 'GEE', 'Chao', 'Shlosser', 'ChaoLee', 'Goodman', 'Jackknife', 'Sichel', 'Method of Movement', 'Bootstrap', 'Horvitz Thompson', 'Method of Movement v2', 'Method of Movement v3', 'Smoothed Jackknife']
D_list: ground truth NDV
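A minimal sketch of how one data_profile item could be assembled from the quantities above. The cutoff H, the function name, and the zero-padding scheme are assumptions for illustration; refer to process_data.py for the actual layout:

```python
import math

def make_data_profile(f, n, d, N, H=100):
    """Assemble one data_profile item: the frequency profile truncated
    (or zero-padded) to H-3 entries, concatenated with log(sample size n),
    log(sample NDV d), and log(table rows N). H is a hypothetical cutoff."""
    feat = list(f[:H - 3]) + [0] * max(0, (H - 3) - len(f))
    feat += [math.log(n), math.log(d), math.log(N)]
    return feat

# Toy example: 3 singletons, 1 doubleton, 2 tripletons in the sample
profile = make_data_profile([3, 1, 2], n=10, d=6, N=1000, H=8)
```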
For more details, refer to process_data.py and implement it according to the comments. Then execute the script.
python3 process_data.py
Train the model and observe the results.
python3 train_adandv.py
Show the q-error distributions of the base estimators and the learned estimator (the primary results of Table 3 in the paper).
python3 show_base.py
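For reference, assuming the usual definition of q-error as the symmetric ratio between an estimate and the ground truth (max of the two ratios, so 1.0 means a perfect estimate), a minimal implementation looks like:

```python
def q_error(estimate, truth):
    """q-error = max(estimate/truth, truth/estimate); >= 1.0, lower is better."""
    assert estimate > 0 and truth > 0
    return max(estimate / truth, truth / estimate)

# Over- and under-estimates by the same factor give the same q-error
over = q_error(200, 100)
under = q_error(50, 100)
```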
LICENSE
The code in this repository is licensed under the MIT License.