Preface
Most of this article is copied from the post "Relational Extraction Dataset NYT-10 SemEval2010".
1. What is NYT-10?
The NYT-10 dataset was published in Riedel et al. 2010. Its text comes from the New York Times annotated corpus. Named entities are tagged with the Stanford NER tool and linked to the Freebase knowledge base. The relation between each named entity pair is obtained by looking the pair up among the relations in the external Freebase knowledge base, which is the distant supervision approach.
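To make the distant supervision idea concrete, here is a minimal Python sketch. The knowledge-base triple, the sentences, and the function name `label_sentences` are illustrative assumptions and are not taken from the NYT-10 files. Every sentence that mentions both entities of a known triple receives that triple's relation, and pairs with no triple receive "NA".

```python
# Toy distant supervision: label sentences by looking up entity pairs in a KB.
# The triple and sentences below are made up for illustration only.

def label_sentences(kb_triples, sentences):
    """Assign a relation to each sentence using the knowledge base.

    kb_triples: iterable of (head, relation, tail) tuples.
    sentences:  iterable of (text, head, tail) tuples, entities already tagged.
    Returns a list of dicts with the text, the entity pair, and the relation.
    """
    # Index the KB by entity pair so each lookup is a single dict access.
    pair2rel = {(h, t): r for h, r, t in kb_triples}
    labeled = []
    for text, head, tail in sentences:
        # Pairs that are absent from the KB get the "NA" (no relation) label.
        relation = pair2rel.get((head, tail), "NA")
        labeled.append({"text": text, "h": head, "t": tail, "relation": relation})
    return labeled


kb = [("Barack Obama", "/people/person/place_of_birth", "Honolulu")]
sents = [
    ("Barack Obama was born in Honolulu.", "Barack Obama", "Honolulu"),
    ("Barack Obama visited Honolulu last week.", "Barack Obama", "Honolulu"),
    ("Barack Obama spoke in Chicago.", "Barack Obama", "Chicago"),
]
for item in label_sentences(kb, sents):
    print(item["relation"], "|", item["text"])
```

Note that the second sentence receives the place-of-birth label even though it does not express that relation. This kind of label noise is inherent to distant supervision.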
2. Data Download
1.OpenNRE
Data can be downloaded directly from this link: https://github.com/thunlp/OpenNRE/tree/master/benchmark
Take download_nyt10.sh as an example. Its contents are as follows:
mkdir nyt10
wget -P nyt10 https://thunlp.oss-cn-qingdao.aliyuncs.com/opennre/benchmark/nyt10/nyt10_rel2id.json
wget -P nyt10 https://thunlp.oss-cn-qingdao.aliyuncs.com/opennre/benchmark/nyt10/nyt10_train.txt
wget -P nyt10 https://thunlp.oss-cn-qingdao.aliyuncs.com/opennre/benchmark/nyt10/nyt10_test.txt
My steps are as follows:
- First, install wget. (Any installation guide found through a web search, e.g. on Baidu, will do.)
- Open cmd in the folder where you want the files to be downloaded. (Or open cmd and cd to that folder.)
- Change the four commands above to the following:
mkdir nyt10
wget --no-check-certificate -P nyt10 https://thunlp.oss-cn-qingdao.aliyuncs.com/opennre/benchmark/nyt10/nyt10_rel2id.json
wget --no-check-certificate -P nyt10 https://thunlp.oss-cn-qingdao.aliyuncs.com/opennre/benchmark/nyt10/nyt10_train.txt
wget --no-check-certificate -P nyt10 https://thunlp.oss-cn-qingdao.aliyuncs.com/opennre/benchmark/nyt10/nyt10_test.txt
If you do not add --no-check-certificate, you will get the following error. (My system is Windows.)
SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = D:\wget\GnuWin32/etc/wgetrc
--2021-09-25 23:26:05-- https://thunlp.oss-cn-qingdao.aliyuncs.com/opennre/benchmark/nyt10/nyt10_rel2id.json
Resolving Host thunlp.oss-cn-qingdao.aliyuncs.com... 119.167.128.167, 119.167.128.167
Connecting to thunlp.oss-cn-qingdao.aliyuncs.com|119.167.128.167|:443... Connected.
ERROR: cannot verify thunlp.oss-cn-qingdao.aliyuncs.com's certificate, issued by `/C=BE/O=GlobalSign nv-sa/CN=GlobalSign Organization Validation CA - SHA256 - G2':
  Unable to locally verify the issuer's authority.
ERROR: certificate common name `*.oss-cn-beijing.aliyuncs.com' doesn't match requested host name `thunlp.oss-cn-qingdao.aliyuncs.com'.
To connect to thunlp.oss-cn-qingdao.aliyuncs.com insecurely, use `--no-check-certificate'.
Unable to build SSL Connect.
- Run these four commands in turn. (Running the shell script directly should also work, but I don't know how to do that.)
Downloading this way gives you a little over 170 MB of data in total.
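If installing wget on Windows is inconvenient, the same three files can in principle be fetched with Python's standard library. This is only a sketch and does not come from the OpenNRE repository: the base URL and file names are copied from the wget commands above, while `build_urls` and `download_all` are hypothetical helper names. Unlike `--no-check-certificate`, this version leaves certificate verification switched on, so it may still fail if the server certificate cannot be verified on your machine.

```python
import os
import urllib.request

# Base URL and file names are taken from the wget commands in this article.
BASE_URL = "https://thunlp.oss-cn-qingdao.aliyuncs.com/opennre/benchmark/nyt10/"
FILES = ["nyt10_rel2id.json", "nyt10_train.txt", "nyt10_test.txt"]


def build_urls(base_url=BASE_URL, files=FILES):
    """Return (file_name, url) pairs for every file to download."""
    return [(name, base_url + name) for name in files]


def download_all(target_dir="nyt10"):
    """Download each file into target_dir, creating the folder if needed."""
    os.makedirs(target_dir, exist_ok=True)  # plays the role of `mkdir nyt10`
    for name, url in build_urls():
        dest = os.path.join(target_dir, name)
        print("Downloading", url, "->", dest)
        # Plays the role of `wget -P nyt10 <url>`.
        urllib.request.urlretrieve(url, dest)
```

Calling `download_all()` from the folder where you want the data creates nyt10 and saves the three files into it.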
2.Tsinghua Cloud or Google Drive
Tsinghua Cloud link: https://link.zhihu.com/?target=https%3A//cloud.tsinghua.edu.cn/f/11391e48b72749d8b60a/%3Fdl%3D1
Google Drive link (not opened): https://link.zhihu.com/?target=https%3A//drive.google.com/file/d/1eSGYObt-SRLccvYCsWaHx1ldurp9eDN_/view%3Fusp%3Dsharing
This way you download compressed files that are approximately 3 GB in size.
In protobuf2json.py, find the get_entities function, which looks like this:
def get_entities(file_name):
    print("Loading entities...")
    f = open(file_name, 'rb')
    for line in f.readlines():
        line = line.rstrip()
        guid, word, type = line.split('\t')
        guid2entity[guid] = {'id': guid, 'word': word, 'type': type}
    f.close()
    print("Finish loading, got {} entities totally".format(len(guid2entity)))
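Under Python 3, the function above fails at `line.split('\t')`: a file opened in 'rb' mode yields bytes objects, and bytes cannot be split on a str separator. The following sketch reproduces this on a made-up line (the id, word, and type are invented, not taken from the real entity file) and shows that decoding first resolves it.

```python
# A made-up line in the same tab-separated layout: guid, word, type.
raw_line = b"guid-0001\tNew York\tLOCATION\n"

# In Python 3, bytes.split() needs a bytes separator, so a str separator fails.
try:
    raw_line.rstrip().split('\t')
except TypeError as err:
    print("split on bytes failed:", type(err).__name__)

# Decoding to str first (UTF-8 by default) makes the str separator valid.
guid, word, entity_type = raw_line.rstrip().decode().split('\t')
print(guid, word, entity_type)
```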
Modify get_entities to the following (only line 5 changes, adding .decode()):
def get_entities(file_name):
    print("Loading entities...")
    f = open(file_name, 'rb')
    for line in f.readlines():
        line = line.rstrip().decode()
        guid, word, type = line.split('\t')
        guid2entity[guid] = {'id': guid, 'word': word, 'type': type}
    f.close()
    print("Finish loading, got {} entities totally".format(len(guid2entity)))
Open README.md and run as indicated:
protoc --proto_path=. --python_out=. Document.proto
python protobuf2json.py
To run the first command, you need to install protoc first. (Any installation guide found through a web search, e.g. on Baidu, will do.)
Then, run both commands.