NYT-10 Data Acquisition (1.74G)

Preface

Most of this article is adapted from the post Relation Extraction Dataset NYT-10 SemEval2010

1. What is NYT-10?

NYT-10 was released with Riedel et al. 2010. Its text comes from the New York Times Annotated Corpus. Named entities are annotated with the Stanford NER tool and aligned to the Freebase knowledge base, and the relation between each pair of named entities is obtained by linking the pair to its relation in Freebase, i.e. by distant supervision.
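To make the distant-supervision idea concrete, here is a minimal Python sketch (the triple and sentences are made-up toy examples, not actual NYT-10 data): any sentence that mentions both entities of a knowledge-base triple is assumed to express that triple's relation.

# Toy illustration of distant supervision (made-up data, not NYT-10 content).
kb_triples = [
    ("Barack Obama", "/people/person/place_of_birth", "Honolulu"),
]
sentences = [
    "Barack Obama was born in Honolulu , Hawaii .",
    "Barack Obama flew back to Honolulu last week .",  # matched but noisy: wrong relation
]

labeled = []
for head, relation, tail in kb_triples:
    for sent in sentences:
        if head in sent and tail in sent:
            labeled.append({"text": sent, "h": head, "t": tail, "relation": relation})

for instance in labeled:
    print(instance)

The second sentence shows why distantly supervised labels are noisy: it mentions both entities but does not actually express the place-of-birth relation.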

2. Data Download

1. OpenNRE

Data can be downloaded directly from this link: https://github.com/thunlp/OpenNRE/tree/master/benchmark
Take download_nyt10.sh as an example; its contents are as follows:

mkdir nyt10
wget -P nyt10 https://thunlp.oss-cn-qingdao.aliyuncs.com/opennre/benchmark/nyt10/nyt10_rel2id.json
wget -P nyt10 https://thunlp.oss-cn-qingdao.aliyuncs.com/opennre/benchmark/nyt10/nyt10_train.txt
wget -P nyt10 https://thunlp.oss-cn-qingdao.aliyuncs.com/opennre/benchmark/nyt10/nyt10_test.txt

My steps are as follows:

  1. First, install wget. (Any installation guide found online will do.)
  2. Open cmd in the folder where you want the data downloaded (or open cmd and cd into that folder).
  3. Change the four commands above to the following:
mkdir nyt10
wget --no-check-certificate -P nyt10 https://thunlp.oss-cn-qingdao.aliyuncs.com/opennre/benchmark/nyt10/nyt10_rel2id.json
wget --no-check-certificate -P nyt10 https://thunlp.oss-cn-qingdao.aliyuncs.com/opennre/benchmark/nyt10/nyt10_train.txt
wget --no-check-certificate -P nyt10 https://thunlp.oss-cn-qingdao.aliyuncs.com/opennre/benchmark/nyt10/nyt10_test.txt

If you do not add --no-check-certificate, you will get an error like the following. (My system is Windows.)

SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = D:\wget\GnuWin32/etc/wgetrc
--2021-09-25 23:26:05--  https://thunlp.oss-cn-qingdao.aliyuncs.com/opennre/benchmark/nyt10/nyt10_rel2id.json
 Resolving Host thunlp.oss-cn-qingdao.aliyuncs.com... 119.167.128.167, 119.167.128.167
Connecting to thunlp.oss-cn-qingdao.aliyuncs.com|119.167.128.167|:443... Connected.
ERROR: cannot verify thunlp.oss-cn-qingdao.aliyuncs.com's certificate, issued by `/C=BE/O=GlobalSign nv-sa/CN=GlobalSign Organization Validation CA - SHA256 - G2':
  Unable to locally verify the issuer's authority.
ERROR: certificate common name `*.oss-cn-beijing.aliyuncs.com' doesn't match requested host name `thunlp.oss-cn-qingdao.aliyuncs.com'.
To connect to thunlp.oss-cn-qingdao.aliyuncs.com insecurely, use `--no-check-certificate'.
Unable to build SSL Connect.
  4. Run these four commands in turn. (Running the shell script directly should also work, but I did not try it.)

    Downloading this way, you will get about 170+ MB of data in total.
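If wget is not convenient, the same three files can also be fetched with a short Python script. The snippet below is only a sketch using the requests library; verify=False plays the same role as wget's --no-check-certificate (it skips certificate verification, so use it only if you accept that risk).

import os
import requests

base = "https://thunlp.oss-cn-qingdao.aliyuncs.com/opennre/benchmark/nyt10/"
files = ["nyt10_rel2id.json", "nyt10_train.txt", "nyt10_test.txt"]

os.makedirs("nyt10", exist_ok=True)
for name in files:
    # verify=False mirrors wget's --no-check-certificate
    resp = requests.get(base + name, verify=False)
    resp.raise_for_status()
    with open(os.path.join("nyt10", name), "wb") as f:
        f.write(resp.content)
    print("downloaded", name, len(resp.content), "bytes")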

2. Tsinghua Cloud or Google Drive

Tsinghua Cloud link: https://cloud.tsinghua.edu.cn/f/11391e48b72749d8b60a/?dl=1
Google Drive link (I could not open it): https://drive.google.com/file/d/1eSGYObt-SRLccvYCsWaHx1ldurp9eDN_/view?usp=sharing
This way you download a compressed archive of roughly 3 GB.

In protobuf2json.py, locate the get_entities function, which originally looks like this:

def get_entities(file_name):
    print("Loading entities...")
    f = open(file_name, 'rb')
    for line in f.readlines():
        line = line.rstrip()
        guid, word, type = line.split('\t')
        guid2entity[guid] = {'id': guid, 'word': word, 'type': type}       
    f.close()
    print("Finish loading, got {} entities totally".format(len(guid2entity)))

Modify it to the following (modify line 5 only):

def get_entities(file_name):
    print("Loading entities...")
    f = open(file_name, 'rb')
    for line in f.readlines():
        line = line.rstrip().decode()
        guid, word, type = line.split('\t')
        guid2entity[guid] = {'id': guid, 'word': word, 'type': type}       
    f.close()
    print("Finish loading, got {} entities totally".format(len(guid2entity)))
Then open README.md and run the commands it indicates:
protoc --proto_path=. --python_out=. Document.proto
python protobuf2json.py

To run the first command, you need to install protoc first. (Any installation guide found online is sufficient.)
Then run both commands in order.
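To sanity-check the conversion, you can peek at the output. The snippet below is only a sketch: it assumes the converted data ends up as a JSON-lines file (one JSON object per line), and the file name train.json is a placeholder to be replaced with whatever protobuf2json.py actually writes.

import json

path = "nyt10/train.json"  # placeholder name; use the file protobuf2json.py produced

with open(path, encoding="utf-8") as f:
    for i, line in enumerate(f):
        print(json.loads(line))  # print the parsed instance
        if i >= 2:               # only look at the first few lines
            break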

Summary

I wrote this up because I could not find a direct download while searching for this dataset; along the way I even downloaded a copy from CSDN, and the result was not good. The dataset obtained with the second method is 1.74 GB, while the first method downloads only about 170 MB. Judging from size alone, I suspect the second method simply gives you more data than the first. (I don't know the details either.)
