The MATLAB file contains the following variables:

  words         - cell array with the list of words (14036 words)
  docs_names    - cell array with the names of the documents (2484 docs);
                  names have the format YYYY/paper_name (e.g. 2003/AA01)
  authors_names - cell array with the authors' names (2865 authors);
                  names have the format 'Sejnowski_T'
  docs_authors  - n_docs x n_authors sparse binary matrix; has a "1" for
                  each author of a document
  counts        - n_words x n_docs count matrix
  aw_counts     - n_words x n_authors count matrix

For aw_counts, the words of each document were assigned to each of its authors, after dividing by the number of authors of that document. Each document's counts were first normalized to sum to 1, to correct for document length variability. Unnormalized counts can be obtained using counts and docs_authors.

For all documents, the publication year can be extracted from the name of the document. For documents published in 2000-2003, the section/area (AA, AP...) can also be extracted from the name of the document.
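The description above can be illustrated with a minimal sketch in plain Python on toy data (loading the actual file is not shown). The function names parse_doc_name and author_word_counts are hypothetical, the name "1995/some_paper" is made up for illustration, and two points are assumptions rather than facts from this file: that the section code is the leading run of capital letters in the paper name, and that "unnormalized" means summing raw document counts over each author's documents without any division.

```python
import re


def parse_doc_name(name):
    """Split a 'YYYY/paper_name' string into (year, section_or_None)."""
    year_text, paper = name.split("/", 1)
    year = int(year_text)
    section = None
    # Sections are only described for 2000-2003. ASSUMPTION: the section
    # code is the leading run of capital letters (e.g. 'AA' in 'AA01').
    if 2000 <= year <= 2003:
        match = re.match(r"[A-Z]+", paper)
        if match:
            section = match.group(0)
    return year, section


def author_word_counts(counts, docs_authors, normalize=True):
    """Build an n_words x n_authors matrix from nested lists.

    counts:       n_words x n_docs word counts
    docs_authors: n_docs x n_authors binary matrix
    normalize:    if True, scale each document to sum to 1 and divide by
                  its number of authors, as described for aw_counts.
                  If False (ASSUMPTION), add raw counts to every author.
    """
    n_words = len(counts)
    n_docs = len(docs_authors)
    n_authors = len(docs_authors[0])
    result = [[0.0] * n_authors for _ in range(n_words)]
    for d in range(n_docs):
        authors = [a for a in range(n_authors) if docs_authors[d][a]]
        doc_total = sum(counts[w][d] for w in range(n_words))
        if not authors or doc_total == 0:
            continue  # nothing to assign for this document
        scale = 1.0 / (doc_total * len(authors)) if normalize else 1.0
        for w in range(n_words):
            for a in authors:
                result[w][a] += counts[w][d] * scale
    return result


# Toy data: 3 words, 2 documents, 2 authors.
toy_counts = [[2, 0], [1, 3], [1, 1]]
toy_docs_authors = [[1, 1], [0, 1]]  # doc 0 has both authors, doc 1 only author 1

print(parse_doc_name("2003/AA01"))  # (2003, 'AA')
print(author_word_counts(toy_counts, toy_docs_authors))
```

In the normalized case every document contributes a total weight of 1, split evenly among its authors, so the entries of the result sum to the number of documents that have at least one author and a nonzero word count.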