Summary

This is the Spanish Billion Words Corpus and Embeddings linguistic resource.

This resource consists of an unannotated corpus of the Spanish language of nearly 1.5 billion words, compiled from different corpora and resources from the web, and a set of word vectors (or embeddings) created from this corpus using the word2vec algorithm, as provided by the gensim package. The embeddings were evaluated using a Spanish translation of word2vec's word relation test set.

The cleaned corpus is publicly available for download as a raw text file. The word vectors are also available for download, in both word2vec's binary format and plain text format.
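
For example, assuming the download is saved locally as "SBW-vectors-300-min5.bin" (adjust the file name to your copy), the binary vectors can be loaded with gensim; this is a usage sketch, not part of the resource itself:

    from gensim.models import KeyedVectors

    # Load the pretrained vectors (word2vec binary format, gensim 4.x API);
    # the text-format file loads the same way with binary=False.
    vectors = KeyedVectors.load_word2vec_format("SBW-vectors-300-min5.bin", binary=True)

    # Nearest neighbors of "rey" (king) by cosine similarity.
    print(vectors.most_similar("rey", topn=5))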

Disclaimer

I tried to find all the Spanish corpora, from different sources, that were freely available for download from the web. No copyright infringement was intended, and I have tried to correctly acknowledge all the original authors of the corpora I borrowed.

If you are an author of one of these resources and feel your work wasn't used correctly, please feel free to contact me and I will remove it from the corpus and from the embeddings.

Likewise, if you are the author of, or know of, other publicly available resources for the Spanish language (the corpus doesn't need to be annotated) and want to contribute, also feel free to contact me.

Corpora

The corpus was created compiling the following resources of the Spanish language:

Corpus Processing

All the annotated corpora (such as AnCora, SenSem, and Tibidabo) were stripped of their annotations, since word2vec works with unannotated data. The parallel corpora (most of them coming from the OPUS Project) were preprocessed to extract only their Spanish portions.

Once we had the whole corpus unannotated, we replaced all non-alphanumeric characters with whitespace, all numbers with the token "DIGITO", and all sequences of multiple whitespaces with a single whitespace.

The capitalization of the words was left unchanged.
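
A rough Python sketch of this cleaning step (an approximation of the description above, not the exact script that was used):

    import re

    def clean_line(line: str) -> str:
        # Numbers become the token "DIGITO" (done first, so the digits
        # are not wiped out by the next substitution).
        line = re.sub(r"\d+", " DIGITO ", line)
        # Every remaining non-alphanumeric character becomes whitespace;
        # \w is Unicode-aware here, so accented letters and "ñ" survive.
        line = re.sub(r"[^\w]", " ", line)
        # Collapse whitespace runs into single spaces; case is untouched.
        return re.sub(r"\s+", " ", line).strip()

    print(clean_line("¡Hola, mundo! Año 2016."))  # -> "Hola mundo Año DIGITO"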

Parameters for Embeddings Training

To train the word embeddings we used the following parameters:
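
As an illustration, a gensim training call consistent with the settings stated elsewhere on this page (skip-gram, 300-dimensional vectors, minimum count of 5) might look as follows; the corpus path and the remaining arguments are assumptions, not the documented values:

    from gensim.models import Word2Vec

    model = Word2Vec(
        corpus_file="sbw_clean.txt",  # hypothetical path to the cleaned corpus
        sg=1,             # skip-gram model
        vector_size=300,  # dimension of the embeddings (gensim 4.x name)
        min_count=5,      # discard words with fewer than 5 occurrences
        sample=1e-5,      # subsampling of frequent words (assumed value)
        workers=4,        # number of training threads (assumed)
    )
    model.wv.save_word2vec_format("SBW-vectors-300-min5.bin", binary=True)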

Description of Resulting Embeddings

The original corpus had the following amount of data:

After applying the skip-gram model, filtering out words with fewer than 5 occurrences, and downsampling the 273 most common words, the following values were obtained:

The final resource is a set of 1,000,653 word embeddings of dimension 300.
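
These figures can be checked after downloading (a sketch, with the file name assumed as above):

    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("SBW-vectors-300-min5.bin", binary=True)
    assert len(vectors.index_to_key) == 1000653  # vocabulary size stated above
    assert vectors.vector_size == 300            # embedding dimension stated above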

Evaluation of the Embeddings

The embeddings were evaluated using a translation of word2vec's question words test set. Translations that introduced ambiguities not present in the original test set were removed (for example, names of currencies that, when translated to Spanish, became homographs of the word for "crown"), resulting in a test set about 25% smaller than the original.
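
This kind of evaluation can be reproduced with gensim's analogy scorer; "questions-words_sp.txt" is a hypothetical name for the translated test set described above:

    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("SBW-vectors-300-min5.bin", binary=True)

    # The file must follow word2vec's questions-words format:
    # ": section-name" headers followed by lines of four analogy words.
    score, sections = vectors.evaluate_word_analogies("questions-words_sp.txt")
    print(f"Overall accuracy: {score:.2%}")
    for section in sections:
        total = len(section["correct"]) + len(section["incorrect"])
        if total:
            print(f'{section["section"]}: {len(section["correct"]) / total:.2%}')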

We obtained the following accuracies:

Citation

To cite this resource in a publication, please use the following citation:

Cristian Cardellino: Spanish Billion Words Corpus and Embeddings (March 2016), http://crscardellino.me/SBWCE/

A BibTeX entry is also available.

License

Spanish Billion Words Corpus and Embeddings by Cristian Cardellino is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.