Data Set Download
a. IMDB Movie Review Data Set
Download source: http://ai.stanford.edu/~amaas/data/sentiment
Downloading and decompressing the data set takes a long time, perhaps because handling the tar.gz format on Windows is relatively troublesome.
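If no archive tool is at hand, Python's standard tarfile module can unpack a tar.gz file on Windows as well. The following is a minimal sketch, assuming the archive has already been downloaded; the function name extract_archive, the file name aclImdb_v1.tar.gz, and the output folder data are placeholders for illustration, not something prescribed by the data set.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the top-level names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" opens a gzip-compressed tar file. Only extract archives you trust.
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return sorted(p.name for p in dest.iterdir())


if __name__ == "__main__":
    # Assumed local file name; change it to wherever the download was saved.
    archive = Path("aclImdb_v1.tar.gz")
    if archive.exists():
        print(extract_archive(archive, "data"))
```

Extraction can still take a while, since the archive holds a very large number of small text files.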
Once the data set is unpacked, it includes a README that describes its basic layout:
Large Movie Review Dataset v1.0
1. The core data set contains 50k reviews with sentiment labels, split evenly into 25k training reviews and 25k test reviews. The label distribution is balanced: 25k pos and 25k neg overall. The data set also contains 50k unlabeled reviews.
2. Across the whole data set there are no more than 30 reviews for any one movie, because reviews of the same movie tend to have correlated ratings.
3. The train and test sets contain disjoint sets of movies.
4. Labels come from the 10-point rating: a score of 4 or lower is negative and a score of 7 or higher is positive, so the score itself also reflects how strongly neg or pos a review is. Reviews with neutral scores are not included in train and test. In the unlabeled set the scores are spread out, and reviews of every rating are present.

i. There are also specific conventions for file format and naming. The README text is quoted here directly:

There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb. The [train/unsup/] directory has 0 for all ratings because the ratings are omitted for this portion of the dataset.

ii. The data set also ships with ready-made bag-of-words features, stored as .feat files in LIBSVM format.
File format: rating 0:7 (here 7 is the number of times the first word in the dictionary appears in the review).
About LIBSVM: LIBSVM is an integrated software for support vector classification (C-SVC, nu-SVC), regression (epsilon-SVR, nu-SVR) and distribution estimation (one-class SVM). It supports multi-class classification.
The official page is the best introduction: https://www.csie.ntu.edu.tw/~cjlin/libsvm/ (English version). My general understanding is that it is a ready-to-use SVM source code package, but I don't really understand how that relates to the file format. I would be grateful if an expert could explain.

iii. About imdbEr.txt: this file holds a rating for each token. I don't fully understand the reasoning behind it, but since it isn't needed for now I'm not too concerned.

iv. Citation: included here out of habit.

```
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}
```
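The naming convention and the .feat line layout described in the README can be read with a few lines of Python. This is only a sketch: the function names are made up for illustration, the label thresholds follow the README rules (4 or lower is negative, 7 or higher is positive, 0 means the rating was omitted), and the .feat parser assumes the usual LIBSVM sparse text layout of a label followed by index:count pairs.

```python
import re

# [id]_[rating].txt, for example 200_8.txt
FILENAME_RE = re.compile(r"(\d+)_(\d+)\.txt")


def parse_review_filename(name):
    """Return (review_id, rating) from a file name such as '200_8.txt'."""
    m = FILENAME_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return int(m.group(1)), int(m.group(2))


def rating_to_label(rating):
    """Map a 0-10 rating to a coarse label following the README thresholds."""
    if rating == 0:
        return "unsup"    # train/unsup/ stores 0 because ratings are omitted
    if rating <= 4:
        return "neg"
    if rating >= 7:
        return "pos"
    return "neutral"      # 5 and 6 do not occur in the labeled train/test sets


def parse_feat_line(line):
    """Parse one 'rating idx:count idx:count ...' line into (rating, counts)."""
    parts = line.split()
    rating = int(parts[0])
    counts = {}
    for item in parts[1:]:
        index, count = item.split(":")
        counts[int(index)] = int(count)  # index 0 is the first dictionary word
    return rating, counts


if __name__ == "__main__":
    print(parse_review_filename("200_8.txt"))  # (200, 8)
    print(rating_to_label(8))                  # pos
    print(parse_feat_line("9 0:7 3:2"))        # (9, {0: 7, 3: 2})
```

Keeping the counts in a dict mirrors the sparse nature of the format: only words that actually occur in a review are listed on its line.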
b. THUCNews Chinese News Data Set
Source: http://thuctc.thunlp.org/sendMessage (choose the appropriate version to download).
Brief introduction: There is not much to say about the Chinese data set. Before feature extraction, each file corresponds to one article. On inspection, the first line of each file contains a summary of the article, which is a detail worth taking into account in later analysis.
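Separating that first line from the rest of the article could look like the sketch below. The function names are hypothetical, and UTF-8 encoding is an assumption that should be checked against the downloaded files.

```python
def split_first_line(text):
    """Split article text into (first_line, body)."""
    first, _, rest = text.partition("\n")
    return first.strip(), rest.strip()


def read_article(path, encoding="utf-8"):
    """Read one article file and return (first_line, body)."""
    # The encoding is an assumption; adjust it if the files decode incorrectly.
    with open(path, encoding=encoding) as f:
        return split_first_line(f.read())


if __name__ == "__main__":
    sample = "Summary line\nFirst paragraph.\nSecond paragraph."
    print(split_first_line(sample))
```

Whether the first line is then used as an extra feature or dropped before feature extraction is a separate decision for the later analysis.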
c. Recall/Accuracy/F-score/Confusion Matrix
I think the Wikipedia article explains these quite well, so I'll simply borrow it:
https://en.wikipedia.org/wiki/Precision_and_recall
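For a binary pos/neg task such as the IMDB reviews, these quantities can be computed by hand from the four cells of the confusion matrix. The following is a small sketch with made-up labels and illustrative function names; it uses the standard definitions precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as the harmonic mean of the two.

```python
def confusion_counts(y_true, y_pred, positive="pos"):
    """Return (tp, fp, fn, tn) for a binary task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1, returning 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


if __name__ == "__main__":
    # Made-up example labels, only for illustration.
    y_true = ["pos", "pos", "pos", "neg", "neg", "neg"]
    y_pred = ["pos", "pos", "neg", "pos", "neg", "neg"]
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    print("confusion matrix (tp, fp, fn, tn):", (tp, fp, fn, tn))
    print("precision, recall, f1:", precision_recall_f1(tp, fp, fn))
    print("accuracy:", accuracy(tp, fp, fn, tn))
```

In this example the counts are tp=2, fp=1, fn=1, tn=2, so precision, recall and F1 all equal 2/3 and accuracy is 4/6.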
PS: Typesetting here is really hard. I would rather use OneNote, and I need to figure out how to combine the two well.