HanLP Chinese word segmentation, person name recognition and place name recognition

Keywords: Python Big Data Data Analysis NLP

Experimental purpose

  1. Download and install the HanLP natural language processing package from the Internet;
  2. Become familiar with the basic functions of the HanLP natural language processing package;
  3. Using the text obtained by the web crawler, call the HanLP API for Chinese word segmentation, person name recognition and place name recognition.

Research background

With the rapid development of the Internet and information technology, a huge amount of text has been produced on the network, but it exists in a disordered state, which makes it inconvenient for users to quickly find and browse text and to extract valuable information. Natural language processing technology arose in response. It is an important basis for information retrieval and text mining; its core task is to segment text and obtain the part of speech and meaning of each word, which serves as the basis for further information mining and brings great convenience to researchers [1].

Experimental content

Using the Python HanLP natural language processing package, we call its API to perform Chinese word segmentation and named entity recognition: Chinese person names, transliterated foreign names, and place names. The text selected for this experiment consists of the first 40 Baidu Encyclopedia introduction entries crawled by the web crawler (I) in Experiment 3; we compare the results of different NER models.

Python implementation

HanLP: Han Language Processing

The multilingual NLP library for researchers and companies, built on PyTorch and TensorFlow 2.x, for advancing state-of-the-art deep learning techniques in both academia and industry. HanLP was designed from day one to be efficient, user-friendly and extendable. It comes with pretrained models for various human languages including English, Chinese, Japanese and many others.

Text analysis target definition

The text selected for this experiment consists of the first 40 Baidu Encyclopedia introduction entries crawled by the web crawler (I) in Experiment 3.

  • Partial text display:

    Cronbach (1916-2001) was an American psychologist and educator. He devised a widely used method for measuring the reliability of psychological and educational tests, the Cronbach coefficient (Cronbach's coefficient alpha), and on this basis established a statistical model for determining measurement error. He was elected president of the American Psychological Association in 1957, won its award for outstanding scientific contribution in 1973, and was elected to the National Academy of Sciences in 1974. Cronbach's alpha is a method for measuring the reliability of a scale or test. The alpha coefficient was not first proposed by Cronbach; essentially, Cronbach systematized the reliability coefficients proposed by his predecessors in a single article. It overcomes the shortcomings of the split-half method and is the most commonly used reliability analysis method in social science research. The United States of America (English: The United States of America, United States, "the United States" for short) is a federal constitutional republic composed of Washington, D.C., 50 states, Guam and many other overseas territories, with its main part located in central North America. In the 1989-1996 editions of the CIA world profile, the total area of the United States is given as 9.373 million square kilometers, with a population of 333 million. The United States was originally inhabited by Indians. From the end of the 15th century, Spain, the Netherlands, France and Britain immigrated there one after another. Before the 18th century, Britain had established 13 British colonies along the Atlantic coast of North America. In 1775, the American people launched the War of Independence against the colonial rule of the British Empire. The University of Chicago was founded by the oil magnate John Rockefeller and is famous for being rich in Nobel laureates (about 40% ...). As of October 2020, 100 Nobel Prize winners had appeared among the alumni, professors and researchers of the University of Chicago, ranking fourth in the world; 10 Fields Medal winners (sixth in the world), 4 Nobel Prize in Physics winners and 25 Pulitzer Prize winners have worked or studied there. The Chinese Nobel laureates Yang Zhenning, Li Zhengdao and Cui Qi all obtained their doctorates in physics from the University of Chicago. Obama, the 44th president of the United States, long taught constitutional law at the University of Chicago Law School (1992-2004).
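
In the code below, the crawled introductions are assumed to be loaded into a sample variable. A minimal sketch, assuming the Experiment 3 crawler saved one introduction per line to a local file (the file name here is hypothetical):

# Hypothetical output file of the Experiment 3 crawler: one introduction per line.
with open('baike_intros.txt', encoding='utf-8') as f:
    sample = [line.strip() for line in f if line.strip()][:40]
# HanLP's RESTful methods accept either a single str or a list of strings.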
    

Install the RESTful package

In this experiment, instead of installing the full local hanlp package, we chose its lightweight RESTful client, which is convenient and fast.
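
The client is published on PyPI and installs with pip:

pip install hanlp_restful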

HanLPClient instantiation

from hanlp_restful import HanLPClient

# auth=None uses the free public API, which is rate-limited; an auth key can
# be applied for on the HanLP website for heavier workloads.
HanLP = HanLPClient('https://www.hanlp.com/api', auth=None, language='zh')

tokenize [2]

tokenize(text: Union[str, List[str]], coarse: Optional[bool] = None, language=None) → List[List[str]]

Split a document into sentences and tokenize them. Note that it is always faster to tokenize a whole document than to tokenize each sentence one by one. So avoid calling this method sentence by sentence but put sentences into a list and pass them to the text argument.

fine-grained

# Fine-grained segmentation (the default): the document is split into
# sentences and each sentence is tokenized.
seg = HanLP.tokenize(sample)
seg

Partial output:

[['Clonbach', 'U.S.A', 'psychologist', ',', 'Educationist', '. '],
 ['he', 'establish', 'Yes', 'a set', 'Commonly used', 'of', 'measure',...],
 ...
]

coarse-grained

# Coarse-grained segmentation: coarse=True keeps longer dictionary words
# (dates, full organization names) as single tokens.
seg_coar = HanLP.tokenize(sample, coarse=True)
seg_coar

Partial output:

[ ...
  ['1957 year', 'election', 'by', 'American Psychological Association', 'chairman', ',', '1973 year', 'Obtain', 'American Psychological Association', 'award', 'of', 'outstanding', ... ], ...
]

Named Entity Recognition [3]

Each element is a tuple of (entity, type, begin, end), where begin and end are token offsets into the corresponding tokenized sentence and ends are exclusive.

pku [4]

# Request the NER model trained on the PKU corpus; its tag set marks person
# names as nr, place names as ns and organization names as nt.
doc = HanLP(sample, tasks='ner/pku', language='zh')
ner_pku = doc['ner/pku']  # one list of (entity, tag, begin, end) per sentence
ner_pku

Partial output:

[[['Clonbach', 'nr', 0, 1], ['U.S.A', 'ns', 7, 8]],
 [['Bach', 'nr', 18, 19]],
 [['American Psychological Association', 'nt', 4, 7], ['American Psychological Association', 'nt', 12, 15], ['National Academy of Sciences', 'nt', 26, 28]],
 [['Bach', 'nr', 1, 2]],...
]
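
The begin/end offsets index tokens, not characters. A quick sanity check (a sketch; it assumes the response also carries the fine tokenization under 'tok/fine', since HanLP returns the results of prerequisite tasks):

entity, tag, b, e = ner_pku[0][0]        # e.g. first entity of first sentence
tok = doc['tok/fine']                    # tokens the NER offsets refer to
assert entity == ''.join(tok[0][b:e])    # token-based, end-exclusive offsets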

Person name extraction (nr):

# Flatten the per-sentence results and keep entities tagged 'nr' (person name).
nr = [w[0] for s in ner_pku for w in s if w[1] == 'nr']
set(nr)

Output:

{'Bach', 'Wiener', 'Bourbon', 'Carl·Heinrich·Marx', 'Engels', 'Clonbach', 'Silicon', 'George·Berkeley', 'J.C', 'Hans·Morgenso', 'Cui Qi', 'Zhen Ning Yang', 'Enrico ·Fermi', 'Louis·', 'ampere', 'Obama', 'Raymond·Aron', 'Stanford', 'Lee·Clonbach', 'Philip', 'Abraham·Lincoln', 'John·Rockefeller', 'Lee', 'Hurun', 'Carl·Marx', 'Marx', 'George·Washington', 'Li Zhengdao'}

Place name extraction (ns):
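
Presumably the set below was produced by the analogous comprehension over the 'ns' (place name) tag; a sketch:

# Same flattening as above, keeping entities tagged 'ns' (place name).
ns = [w[0] for s in ner_pku for w in s if w[1] == 'ns']
set(ns)

Output: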

{'Germany', 'maxwell', 'Greece', 'North America', 'Hollywood', 'Appease', 'San Francisco', 'France', 'China', 'U.S.A', 'San Francisco Bay Area', 'California', 'White House', 'Guam', 'Europe and America', 'California', 'Thames', 'Irvine ', 'aegean sea', 'Hong Kong SAR', 'Wall Street', 'Auckland', 'turkey', 'Israel', 'North America', 'Spain', 'Philadelphia', 'Netherlands', 'Free state', 'Soviet Union', 'Palo Alto ', 'silicon valley', 'Berkeley', 'Broadway', 'Western Europe', 'San Diego', 'Berkeley', 'British Empire', 'Stanford', 'Europe', 'Slave state', 'mediterranean sea', 'Sicily', 'United States of America', 'Rome', 'ancient Greek', 'America', 'Santa Barbara ', 'Washington, D.C', 'Latin America', 'Italy', 'Los Angeles', 'Ionia', 'Paris', 'britain', 'Chicago', 'Atlantic'}

msra [5]
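
The invocation mirrors the pku task above (a sketch; the task name ner/msra follows the HanLP documentation [5]):

# Request the NER model trained on the MSRA corpus; its tag set uses
# PERSON / LOCATION / ORGANIZATION / DATE rather than pku's nr / ns / nt.
doc = HanLP(sample, tasks='ner/msra', language='zh')
ner_msra = doc['ner/msra']
ner_msra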

Partial output:

[[['Clonbach', 'PERSON', 0, 1], ['2001', 'DATE', 4, 5], ['U.S.A', 'LOCATION', 7, 8]],
 [['Bach', 'PERSON', 18, 19]],
 [['1957 year', 'DATE', 0, 2],
  ['American Psychological Association', 'ORGANIZATION', 4, 7],
  ['1973 year', 'DATE', 9, 11],...
 ],...
]

Person name extraction (PERSON):
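
As with pku, a set comprehension over the tag yields the names (a sketch); swapping 'PERSON' for 'LOCATION' gives the place name set shown after this one:

person = {w[0] for s in ner_msra for w in s if w[1] == 'PERSON'}
person

Output: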

{'Bach', 'Wiener', 'Bourbon', 'Carl·Heinrich·Marx', 'Engels', 'Clonbach', 'Pulitzer', 'Fields Medal ', 'George·Berkeley', 'J.C', 'Hans·Morgenso', 'Cui Qi', 'Li Zhengdao', 'Zhen Ning Yang', 'Enrico ·Fermi', 'Louis·', 'ampere', 'Obama', 'Raymond·Aron', 'Stanford', 'Lee·Clonbach', 'Philip', 'Abraham·Lincoln', 'John·Rockefeller', 'Nobel Prize', 'Lee', 'Hurun', 'Carl·Marx', 'Marx', 'George·Washington', 'Nobel'}

Place name extraction (LOCATION):

{'Germany', 'maxwell', 'Greece', 'North America', 'Appease', 'San Francisco', 'France', 'China', 'Washington', 'U.S.A', 'San Francisco Bay Area', 'California', 'Silicon', 'method', 'Beiwan', 'Guam', 'Europe and America', 'Berkeley', 'California', 'Britain', 'Thames', 'Renaissance', 'Hong Kong SAR', 'aegean sea', 'Auckland', 'Greek peninsula', 'turkey', 'Joseph', 'mainland', 'Israel', 'North America', 'Spain', 'Philadelphia', 'Netherlands', 'Free state', 'Soviet Union', 'Palo Alto ', 'silicon valley', 'Berkeley', 'Western Europe', 'District of Columbia', 'cybernetics', 'Berkeley', 'British Empire', 'Europe', 'beautiful', 'Cyril', 'Slave state', 'mediterranean sea', 'Sicily', 'United States of America', 'California', 'Manhattan', 'Arabic', 'Hippo', 'Rome', 'ancient Greek', 'UCSB', 'America', 'Latin America', 'Italy', 'Los Angeles', 'Ionia', 'Paris', 'britain', 'Chicago', 'Atlantic', 'earth'}

Comparative analysis

Comparing the two runs, the msra model recognizes more English place names and person names than the pku model, and it covers more entity types (DATE, for example), but it is slower. Both models also make characteristic mistakes: msra tags award names such as 'Nobel Prize' and 'Fields Medal' as persons and abstract nouns such as 'cybernetics' and 'Renaissance' as locations, while pku labels 'Silicon' and 'ampere' as person names. For a real text analysis problem, the model should be chosen according to the task at hand.
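
A direct way to quantify the coverage difference is a set difference over the extracted names (a sketch reusing the variables defined above):

# Person names found by msra but not pku, and vice versa.
pku_persons = {w[0] for s in ner_pku for w in s if w[1] == 'nr'}
msra_persons = {w[0] for s in ner_msra for w in s if w[1] == 'PERSON'}
print(msra_persons - pku_persons)
print(pku_persons - msra_persons)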

Environment: Python 3.8, with the hanlp_restful package.

References

  1. Lu Jiangkun, Wang Linlin. Python Data Mining Practice [M]. Xi'an: Xidian University Press, 2021: 190-205.
  2. TOK: https://hanlp.hankcs.com/docs/api/hanlp/pretrained/tok.html
  3. NER: https://hanlp.hankcs.com/docs/api/hanlp/pretrained/ner.html
  4. pku: https://hanlp.hankcs.com/docs/annotations/ner/pku.html
  5. msra: https://hanlp.hankcs.com/docs/annotations/ner/msra.html
  6. He, Han and Choi, Jinho D. (2021). The Stem Cell Hypothesis: Dilemma behind Multi-Task Learning with Transformer Encoders. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics.

Please indicate the source when reprinting: © Sylvan Ding
