RADAR: Towards Automatic Source Code Repository Information Recovery and Validation for PyPI Packages
Environment Setup
Folder Structure
Run Scripts
Dump PyPI package metadata
Import metadata to MongoDB. We provide the dump in the mongodb folder (release_metadata.bson.gz and distribution_file_info.bson.gz).
Obtain baseline results. We provide a MongoDB dump in the mongodb folder (baseline_results.bson.gz). You can also obtain results of a single release by passing the --name and --version arguments:
Obtain MetadataRetriever results. Since MetadataRetriever still needs to fetch webpages, we run it in stages to keep the number of HTTP requests as low as possible.
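A minimal sketch of driving the stages in order, assuming every stage is launched through the dataset.run_metadata_retriever module with a single flag. Only the --all invocation is given in this README; the other command lines, the STAGE_FLAGS list, and the helper names build_stage_command and run_stages are assumptions made for illustration. Commands are only printed unless dry_run is set to False.

```python
import subprocess
import sys

# Stage flags in the order this README describes them.
STAGE_FLAGS = ["--all", "--left_release", "--process_log", "--merge", "--redirect"]

# Module name taken from the --all command in this README.
MODULE = "dataset.run_metadata_retriever"


def build_stage_command(flag, python=sys.executable):
    """Return the argv list for one stage (hypothetical helper)."""
    if flag not in STAGE_FLAGS:
        raise ValueError(f"unknown stage flag: {flag}")
    return [python, "-m", MODULE, flag]


def run_stages(flags=STAGE_FLAGS, dry_run=True):
    """Print each stage command in order and, unless dry_run, execute it.

    check=True makes subprocess.run raise on a non-zero exit code, so a
    failed stage stops the pipeline before later stages run.
    """
    commands = [build_stage_command(flag) for flag in flags]
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: show what would be executed without touching the network.
    run_stages(dry_run=True)
```

Running the stages strictly in sequence matters here because, as described, later stages consume what earlier stages produced.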
The 1st stage: use the --all option to search for repository URLs in the home_page, download_url, project_urls, and description fields of the metadata:
python -m dataset.run_metadata_retriever --all
The 2nd stage: use the --left_release option to collect all repository URLs from the unique homepage and documentation webpages of the remaining releases whose metadata does not contain a repository URL.
The 3rd stage: use the --process_log option to reprocess the URLs that failed in the 2nd stage.
The 4th stage: use the --merge option to merge the repository URLs retrieved for each webpage in the 3rd stage into the MetadataRetriever results:
The 5th stage: use the --redirect option to get the redirected URL of each repository URL retrieved by MetadataRetriever:
You can also obtain results of a single release by passing the --name and --version arguments. There are some options:
--webpage: search the webpages pointed to by the Homepage and Documentation links in the metadata
--redirect: get the redirected URL of the retrieved repository URL
List repositories' blobs based on MetadataRetriever results:
Compare the difference between source distributions and binary distributions
Construct the dataset:
Fit machine learning models on the validator features:
11.
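As an illustration of the first MetadataRetriever stage described above, the following is a minimal sketch of pulling candidate repository URLs out of the home_page, download_url, project_urls, and description metadata fields. It is not the project's implementation: the regular expression, the list of hosts (github.com, gitlab.com, bitbucket.org), and the function names normalize and candidate_repository_urls are all assumptions chosen for the example.

```python
import re

# Assumption: only these three hosts count as repository hosts in this sketch.
REPO_URL_PATTERN = re.compile(
    r"https?://(?:www\.)?(github\.com|gitlab\.com|bitbucket\.org)/"
    r"([A-Za-z0-9_.-]+)/([A-Za-z0-9_.-]+)",
    re.IGNORECASE,
)


def normalize(match):
    """Reduce a matched URL to https://host/owner/name (hypothetical helper)."""
    host = match.group(1).lower()
    owner = match.group(2)
    # Drop sentence punctuation captured at the end, then a ".git" suffix.
    name = match.group(3).rstrip(".")
    if name.endswith(".git"):
        name = name[:-4]
    return f"https://{host}/{owner}/{name}"


def candidate_repository_urls(metadata):
    """Collect unique repository URLs from the four metadata fields, in order."""
    project_urls = metadata.get("project_urls") or {}
    texts = [metadata.get("home_page") or "", metadata.get("download_url") or ""]
    texts.extend(str(value) for value in project_urls.values())
    texts.append(metadata.get("description") or "")

    found = []
    for text in texts:
        for match in REPO_URL_PATTERN.finditer(text):
            url = normalize(match)
            if url not in found:
                found.append(url)
    return found


if __name__ == "__main__":
    example = {
        "home_page": "https://example.org",
        "download_url": None,
        "project_urls": {"Source": "https://github.com/example/demo.git"},
        "description": "Docs at https://example.org, code at https://GitHub.com/example/demo.",
    }
    print(candidate_repository_urls(example))
```

Because these fields come straight from the stored metadata, a pass like this needs no HTTP requests, which is consistent with running it before the stages that fetch webpages.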