This is the Spanish Billion Words Corpus and Embeddings linguistic resource.
This resource consists of an unannotated corpus of the Spanish language of nearly 1.5 billion words, compiled from different corpora and resources from the web, and a set of word vectors (or embeddings) created from this corpus using the word2vec algorithm as implemented in the gensim package. These embeddings were evaluated on a Spanish translation of word2vec’s word relation test set.
The cleaned corpus is publicly available to download as a raw text file. The word vectors are also available to download in word2vec’s binary format and in text format.
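As a rough illustration of the text format, here is a minimal reader sketch. It assumes the usual word2vec text layout (a header line with the vocabulary size and the dimension, then one word per line followed by its vector components). The function name `read_word2vec_text` and the sample data are hypothetical and not part of this resource.

```python
import io


def read_word2vec_text(handle):
    """Read vectors in word2vec text format from an open text handle.

    Assumed layout: first line "<vocab_size> <dim>", then one line per
    word: the word followed by <dim> space-separated floats.
    """
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors


# Tiny in-memory sample standing in for the real file (made-up values).
sample = io.StringIO("2 3\nhola 0.1 0.2 0.3\nmundo 0.4 0.5 0.6\n")
vocab_size, dim, vectors = read_word2vec_text(sample)
```

For the real files, which hold about a million vectors, a library loader is likely a better choice. gensim's `KeyedVectors.load_word2vec_format` reads both the text and the binary format (with `binary=True` for the latter).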
I tried to find all the Spanish corpora, from different sources, that were freely available to download from the web. No copyright infringement was intended, and I also tried to correctly acknowledge all the original authors of the corpora I borrowed.
If you are an author of one of the resources and feel your work wasn’t correctly used in this resource, please feel free to contact me and I will remove your work from this corpus and from the embeddings.
Likewise, if you are the author of, or know of, other publicly available resources for the Spanish language (the corpus doesn’t need to be annotated) and want to contribute, also feel free to contact me.
The corpus was created by compiling the following resources of the Spanish language:
- Spanish portion of SenSem.
- Spanish portion of the Ancora Corpus.
- Tibidabo Treebank and IULA Spanish LSP Treebank.
- The Spanish portion of the following OPUS Project Corpora:
- The Spanish portion of the Europarl (European Parliament), compiled by Philipp Koehn.
- Dumps from the Spanish Wikipedia, Wikisource and Wikibooks on date 2015-09-01, parsed with the Wikipedia Extractor.
All the annotated corpora (like Ancora, SenSem and Tibidabo) were untagged, since word2vec works with unannotated data. The parallel corpora (most coming from the OPUS Project) were preprocessed to keep only their Spanish portions.
Once we had the whole corpus unannotated, we replaced all non-alphanumeric characters with whitespace, all numbers with the token “DIGITO”, and all runs of multiple whitespace characters with a single space.
The capitalization of the words remains unchanged.
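The cleaning steps above could be sketched roughly as follows. This is not the original script: the function name `clean_text` is hypothetical, and two details are assumptions, namely that only standalone runs of digits count as "numbers" and that leading and trailing whitespace is stripped.

```python
import re

NON_ALNUM = re.compile(r"[^\w\s]|_")   # anything that is not a letter, digit or whitespace
NUMBER = re.compile(r"\b\d+\b")        # assumption: a "number" is a standalone run of digits
MULTI_SPACE = re.compile(r"\s+")


def clean_text(text):
    """Apply the three replacements in order; capitalization is left untouched."""
    text = NON_ALNUM.sub(" ", text)      # non-alphanumeric characters -> whitespace
    text = NUMBER.sub("DIGITO", text)    # numbers -> the token DIGITO
    text = MULTI_SPACE.sub(" ", text)    # runs of whitespace -> a single space
    return text.strip()                  # assumption: trim the ends as well


cleaned = clean_text("El año 2015, ¡costó $3.50!")
```

In Python 3, `\w` matches accented letters such as "ñ" and "ó", so Spanish words survive the first step intact.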
Parameters for Embeddings Training
To train the word embeddings we used the following parameters:
- The selected algorithm was the skip-gram model with negative sampling.
- The minimum word frequency was 5.
- The number of “noise words” for the negative sampling was 20.
- The 273 most common words were downsampled.
- The dimension of the final word embedding was 300.
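For readers who want to train comparable vectors, the parameters above might map onto gensim's `Word2Vec` class as in the sketch below. It assumes gensim 4.x argument names (`vector_size` was called `size` in older releases). The helper names `word2vec_params` and `train` are hypothetical. The subsampling threshold and the context window are not stated here, so the sketch leaves them at the library defaults.

```python
def word2vec_params():
    """Keyword arguments for gensim's Word2Vec matching the listed parameters."""
    return {
        "sg": 1,             # skip-gram model
        "hs": 0,             # no hierarchical softmax, so only negative sampling is used
        "negative": 20,      # number of "noise words" per positive example
        "min_count": 5,      # minimum word frequency
        "vector_size": 300,  # dimension of the embeddings (gensim 4.x name)
    }


def train(sentences):
    """Train a model on an iterable of token lists (requires gensim to be installed)."""
    from gensim.models import Word2Vec  # imported lazily so the sketch loads without gensim
    return Word2Vec(sentences=sentences, **word2vec_params())


params = word2vec_params()
```

Calling `train` with an iterable of tokenized sentences returns the trained model. Its vectors can then be saved with `model.wv.save_word2vec_format`.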
Description of Resulting Embeddings
The original corpus had the following amount of data:
- A total of 1420665810 raw words.
- A total of 46925295 sentences.
- A total of 3817833 unique tokens.
After applying the skip-gram model, which filtered out words with fewer than 5 occurrences and downsampled the 273 most common words, the following values were obtained:
- A total of 771508817 raw words.
- A total of 1000653 unique tokens.
The final resource was a set of 1000653 word embeddings of dimension 300.
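To make the effect of the minimum frequency filter concrete, here is a small sketch that computes the same kind of counts on a toy corpus. It covers only the frequency filter, not the downsampling of frequent words. The function name `filter_vocabulary` and the toy sentences are made up for illustration.

```python
from collections import Counter


def filter_vocabulary(sentences, min_count=5):
    """Count tokens and report totals before and after the frequency filter."""
    counts = Counter(token for sentence in sentences for token in sentence)
    kept = {word: n for word, n in counts.items() if n >= min_count}
    return {
        "raw_words": sum(counts.values()),
        "unique_tokens": len(counts),
        "kept_words": sum(kept.values()),
        "kept_unique_tokens": len(kept),
    }


# Toy corpus (made-up data); min_count is lowered so that some words survive.
toy = [["la", "casa"], ["la", "mesa"], ["la", "casa", "roja"]]
stats = filter_vocabulary(toy, min_count=2)
```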
Evaluation of the Embeddings
The embeddings were evaluated using a Spanish translation of word2vec’s question words test set. Translations that introduced an ambiguity not present in the original test set were removed (for example, names of currencies that become homographs of the word “crown” when translated to Spanish), resulting in a test set that was 25% smaller than the original.
We obtained the following accuracies:
- Capital of common countries: 0.84
- Capitals of the World: 0.68
- City in state: 0.27
- Currency: 0.08
- Family: 0.80
- Adjective to adverb: 0.21
- Opposite: 0.24
- Present participle: 0.73
- Nationality adjective: 0.28
- Past tense: 0.25
- Plural: 0.51
- Plural verbs: 0.42
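The accuracies above come from analogy questions of the form "a is to b as c is to d". A minimal sketch of how such a question can be scored is shown below, assuming the usual vector offset approach: the predicted word is the one closest in cosine similarity to b - a + c, excluding the three question words. The function names and the toy vectors are made up for illustration and are not taken from the actual embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word closest to b - a + c, ignoring the question words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Fraction of (a, b, c, d) questions whose predicted word equals d."""
    correct = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in questions)
    return correct / len(questions)


# Toy 3-dimensional vectors (made-up values, not real embeddings).
toy_vectors = {
    "hombre": [1.0, 0.0, 0.0],
    "mujer": [1.0, 1.0, 0.0],
    "rey": [1.0, 0.0, 1.0],
    "reina": [1.0, 1.0, 1.0],
    "casa": [0.0, 0.0, 0.2],
}
toy_questions = [
    ("hombre", "mujer", "rey", "reina"),
    ("rey", "reina", "hombre", "mujer"),
    ("hombre", "rey", "mujer", "casa"),  # deliberately wrong expected answer
]
accuracy = analogy_accuracy(toy_vectors, toy_questions)
```

With real embeddings, each category of the test set would be scored separately in the same way.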
To cite this resource in a publication please use the following citation:
Cristian Cardellino: Spanish Billion Words Corpus and Embeddings (March 2016), http://crscardellino.me/SBWCE/
A BibTeX entry is also available.
This resource is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.