fastText
fastText is a library for efficient learning of word representations and sentence classification.
Table of contents
Resources
Models
Supplementary data
FAQ
You can find answers to frequently asked questions on our website.
Cheatsheet
We also provide a cheatsheet full of useful one-liners.
Requirements
We continuously build and test our library, CLI and Python bindings under various Docker images using CircleCI.
Generally, fastText builds on modern Mac OS and Linux distributions. Since it uses some C++11 features, it requires a compiler with good C++11 support. These include (g++-4.7.2 or newer) or (clang-3.3 or newer).
Compilation is carried out using a Makefile, so you will need to have a working make. If you want to use cmake you need at least version 2.8.9.
One of the oldest distributions we successfully built and tested the CLI under is Debian jessie.
For the word-similarity evaluation script you will need Python 2.6 or newer, along with NumPy and SciPy.
For the python bindings (see the subdirectory python) you will need Python (version 2.7 or ≥ 3.4), NumPy, SciPy and pybind11.
One of the oldest distributions we successfully built and tested the Python bindings under is Debian jessie.
If these requirements make it impossible for you to use fastText, please open an issue and we will try to accommodate you.
Building fastText
We discuss building the latest stable version of fastText.
Getting the source code
You can find our latest stable release in the usual place.
There is also the master branch that contains all of our most recent work, but comes along with all the usual caveats of an unstable branch. You might want to use this if you are a developer or power-user.
Building fastText using make (preferred)
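The build the text describes boils down to downloading a release archive and running make; the version number below is illustrative, so check the releases page for the current one:

```shell
# Fetch and unpack a stable release (version shown is illustrative):
wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
unzip v0.9.2.zip
cd fastText-0.9.2
# Build the CLI with the default system compiler:
make
```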
This will produce object files for all the classes as well as the main binary fasttext. If you do not plan on using the default system-wide compiler, update the two macros defined at the beginning of the Makefile (CC and INCLUDES).

Building fastText using cmake
For now this is not part of a release, so you will need to clone the master branch.
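A typical cmake build of the master branch looks like the following; the out-of-source build directory is a cmake convention:

```shell
# Clone the unstable master branch:
git clone https://github.com/facebookresearch/fastText.git
cd fastText
# Configure and build out-of-source:
mkdir build && cd build && cmake ..
make && make install
```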
This will create the fasttext binary and also all relevant libraries (shared, static, PIC).
Building fastText for Python
For now this is not part of a release, so you will need to clone the master branch.
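Building the bindings from the master branch follows the usual pip source-install flow:

```shell
# Clone the unstable master branch and install the Python bindings:
git clone https://github.com/facebookresearch/fastText.git
cd fastText
pip install .
```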
For further information and introduction see python/README.md
Example use cases
This library has two main use cases: word representation learning and text classification. These were described in the two papers [1] and [2].
Word representation learning
In order to learn word vectors, as described in [1], do:
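The command is the skipgram mode of the fasttext binary (cbow is the other word-representation mode):

```shell
# Learn subword-aware word vectors from data.txt; writes model.bin and model.vec:
./fasttext skipgram -input data.txt -output model
```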
where data.txt is a training file containing UTF-8 encoded text. By default the word vectors will take into account character n-grams from 3 to 6 characters. At the end of optimization the program will save two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the parameters of the model along with the dictionary and all hyper parameters. The binary file can be used later to compute word vectors or to restart the optimization.

Obtaining word vectors for out-of-vocabulary words
The previously trained model can be used to compute word vectors for out-of-vocabulary words. Provided you have a text file queries.txt containing words for which you want to compute vectors, use the print-word-vectors command. This will output word vectors to the standard output, one vector per line; it can also be used with pipes.
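The invocations referenced here, assuming the model.bin produced by the previous step:

```shell
# Print a vector for each word listed in queries.txt:
./fasttext print-word-vectors model.bin < queries.txt

# The same command also works in a pipeline:
cat queries.txt | ./fasttext print-word-vectors model.bin
```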
See the provided scripts for an example. For instance, running the word-vector-example.sh script will compile the code, download data, compute word vectors and evaluate them on the rare words similarity dataset RW [Thang et al. 2013].
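The script is run from the root of the fastText checkout:

```shell
# End-to-end example: build, download data, train and evaluate on RW:
./word-vector-example.sh
```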
Text classification
This library can also be used to train supervised text classifiers, for instance for sentiment analysis. In order to train a text classifier using the method described in [2], use:
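The training invocation is the supervised mode of the fasttext binary:

```shell
# Train a text classifier on labeled lines in train.txt; writes model.bin and model.vec:
./fasttext supervised -input train.txt -output model
```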
where train.txt is a text file containing a training sentence per line along with the labels. By default, we assume that labels are words that are prefixed by the string __label__. This will output two files: model.bin and model.vec.

Once the model is trained, you can evaluate it by computing the precision and recall at k (P@k and R@k) on a test set. The argument k is optional, and is equal to 1 by default. In order to obtain the k most likely labels for a piece of text, use the predict command:
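The evaluation and prediction commands referenced in this section; k below is a placeholder for the number of labels to consider:

```shell
# Compute precision and recall at k on a test set (k is optional, default 1):
./fasttext test model.bin test.txt k
# Print the k most likely labels for each line of test.txt:
./fasttext predict model.bin test.txt k
# As above, but also print the probability of each label:
./fasttext predict-prob model.bin test.txt k
```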
You can also use predict-prob to get the probability for each label. In both cases, test.txt contains a piece of text to classify per line. Doing so will print to the standard output the k most likely labels for each line. The argument k is optional, and equal to 1 by default. See classification-example.sh for an example use case. In order to reproduce results from the paper [2], run classification-results.sh; this will download all the datasets and reproduce the results from Table 1.

If you want to compute vector representations of sentences or paragraphs, please use:
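The sentence-vector command reads the paragraphs from standard input:

```shell
# Print one vector per input line (sentence or paragraph):
./fasttext print-sentence-vectors model.bin < text.txt
```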
This assumes that the text.txt file contains the paragraphs that you want to get vectors for. The program will output one vector representation per line in the file.

You can also quantize a supervised model to reduce its memory usage with the following command:
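A minimal quantization invocation; the quantize mode accepts further options (see its argument list), and the quantized model is used exactly like the original:

```shell
# Quantize the trained supervised model; writes model.ftz:
./fasttext quantize -output model
# test and predict accept the quantized model directly:
./fasttext test model.ftz test.txt
```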
This will create a .ftz file with a smaller memory footprint. All the standard functionality, like test or predict, works the same way on the quantized models. The quantization procedure follows the steps described in [3]. You can run the script quantization-example.sh for an example.

Full documentation
Invoke a command without arguments to list available arguments and their default values:
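For example, invoking a mode such as supervised with no arguments prints its argument list and defaults:

```shell
# Prints usage and default hyperparameters for the supervised mode:
./fasttext supervised
```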
Defaults may vary by mode. (Word-representation modes skipgram and cbow use a default -minCount of 5.)

References
Please cite [1] if using this code for learning word representations or [2] if using for text classification.
[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models
(* These authors contributed equally.)
Join the fastText community
See the CONTRIBUTING file for information about how to help out.
License
fastText is MIT-licensed.