Author: Hu Yuxiao (London School of Political Economy)
Stata Enjoyment Club: Know about | Brief book | Code cloud | CSDN
Source: Three Ways to Create Unique Identifiers (Francis, 2012)
- Lian Henghui Twitter Collection
- Stata Continuous Enjoyment Meeting Topics of Excellence || Wonderful Twitter
Continuous enjoyment of accounting methods topics... ___________.
Continuous Enjoyment Session #Golden Autumn October @ Spatial Measurement Theme (Chengdu, 2019.10.24-27)
Suppose the data to be processed contains three code variables, household_ID, city_ID, and state_ID. These three variables represent the codes of families, cities and states where different individuals live. If the three code variables are considered as a combination, the different numbers in the combination represent different individuals.
This tweet will show you how to generate unique ID s for different individuals in this case.
1. Data Generation
Firstly, 25,000 observations are set.
clear set obs 25000
Assuming that the value range of household_ID is [1, 50], the variable
The range of city_ID is [1,20], and the range of state_ID is [1,50].
*Random Generation gen household_ID = ceil(runiform()*50) gen city_ID = ceil(runiform()*20) gen state_ID = ceil(runiform()*50)
In addition, variables X1 and X2 are generated to simulate variables of interest in real data.
*Random Generation gen x1 = rnormal() gen x2 = rnormal()
The data structure is as follows:
. list household_ID city_ID state_ID x1 x2 in 1/10 +-------------------------------------------------------+ | househ~D city_ID state_ID x1 x2 | |-------------------------------------------------------| 1. | 18 10 28 2.025588 -.1922264 | 2. | 14 11 8 1.042631 -.0807038 | 3. | 7 17 21 .2977124 .6150526 | 4. | 2 4 28 -1.722132 -1.358765 | 5. | 44 8 36 -.7291995 .0929139 | |-------------------------------------------------------| 6. | 18 12 18 .8618261 .6687715 | 7. | 4 4 29 -.239354 .6361541 | 8. | 17 14 2 .516549 -.6399707 | 9. | 28 15 19 -1.812016 -1.628398 | 10. | 44 2 23 -1.015124 -.7855705 | +-------------------------------------------------------+
2. Two ways to create Unique ID
2.1 Create Unique ID with egen command
The easiest way to create a Unique ID is to use the egen command.
. egen ID = group(household_ID city_ID state_ID) . list in 1/10 +---------------------------------------+ | househ~D state_ID city_ID ID | |---------------------------------------| 1. | 18 28 10 6890 | 2. | 14 8 11 5296 | 3. | 7 21 17 2724 | 4. | 2 28 4 469 | 5. | 44 36 8 17091 | |---------------------------------------| 6. | 18 18 12 6925 | 7. | 4 29 4 1253 | 8. | 17 2 14 6553 | 9. | 28 19 15 10945 | 10. | 44 23 2 16966 | +---------------------------------------+
The Unique ID variable ID created by the egen command is sorted in the same way as the sort house_ID city_ID state_ID command, as follows:
. sort household_ID city_ID state_ID \\sort . list household_ID city_ID state_ID ID in 1/10 +------------------------------------+ | househ~D city_ID state_ID ID | |------------------------------------| 1. | 1 1 2 1 | 2. | 1 1 5 2 | 3. | 1 1 7 3 | 4. | 1 1 10 4 | 5. | 1 1 10 4 | |------------------------------------| 6. | 1 1 11 5 | 7. | 1 1 15 6 | 8. | 1 1 15 6 | 9. | 1 1 20 7 | 10. | 1 1 21 8 | +------------------------------------+
2.2 Create Character Unique ID
Another way to create Unique ID variables is to create a string variable. Although this approach is relatively complex, the Unique ID variables generated are more direct and easy to identify.
The specific orders are as follows:
. gen ID3 = "H" + string(household_ID,"%2.0f" ) + "C" + string(city_ID) + "S" + string(state_ID) . list household_ID city_ID state_ID ID3 in 1/10 +-------------------------------------------+ | househ~D city_ID state_ID ID3 | |-------------------------------------------| 1. | 18 10 28 H18C10S28 | 2. | 14 11 8 H14C11S8 | 3. | 7 17 21 H7C17S21 | 4. | 2 4 28 H2C4S28 | 5. | 44 8 36 H44C8S36 | |-------------------------------------------| 6. | 18 12 18 H18C12S18 | 7. | 4 4 29 H4C4S29 | 8. | 17 14 2 H17C14S2 | 9. | 28 15 19 H28C15S19 | 10. | 44 2 23 H44C2S23 | +-------------------------------------------+
3. Processing repeated observations
Because there are repeated observations in the data, the Unique ID created is not unique.
When you run the sum ID command, you can see that the maximum value of the variable ID is 19,757. This indicates that the sample consists of 19,757 different individuals. The results of duplicates report ID show that 4,459 individuals (19,757-15,298 = 4,459) were observed repeatedly in the data set. In empirical research, the reason for this result may be that the data sources of different variables are different, which results in some individuals being observed repeatedly.
. sum ID Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- ID | 25,000 9890.821 5709.989 1 19757 . duplicates report ID Duplicates in terms of ID -------------------------------------- copies | observations surplus ----------+--------------------------- 1 | 15298 0 2 | 7528 3764 3 | 1836 1224 4 | 312 234 5 | 20 16 6 | 6 5 --------------------------------------
At this point, a rule dealing with x1 and x2 should be established to ensure that each unique code (Unique ID) has a unique corresponding observation value. The commands for processing x1 and x2 are as follows.
collapse (mean) mean_x1=x1 mean_x2=x2 (median) /// med_x1=x1 med_x2=x2, by(ID)
For repeated observations, the command generates new variables after running, retaining the average and median values of repeated observations for each ID. The data structure is as follows
. list in 1/10 +----------------------------------------------------+ | ID mean_x1 mean_x2 med_x1 med_x2 | |----------------------------------------------------| 1. | 1 .1792767 .7344462 .1792767 .7344462 | 2. | 2 1.093224 -.1823799 1.093224 -.1823799 | 3. | 3 -.3094974 2.653506 -.3094974 2.653506 | 4. | 4 .4993488 .8985389 .4993488 .8985389 | 5. | 5 -.7429997 -.0464208 -.7429997 -.0464208 | |----------------------------------------------------| 6. | 6 .4437077 -.5161228 .4437077 -.5161228 | 7. | 7 -.7664803 1.15589 -.7664803 1.15589 | 8. | 8 .59837 .1051743 .59837 .1051743 | 9. | 9 .9568613 -.7659643 .9568613 -.7659643 | 10. | 10 -.8682789 -1.468759 -.8682789 -1.468759 | +----------------------------------------------------+
In addition, after processing, it can be found that there are no repetitive observations in the data.
. duplicates report ID Duplicates in terms of ID -------------------------------------- copies | observations surplus ----------+--------------------------- 1 | 19757 0 --------------------------------------
- The Stata Continuous Enjoyment Club (Public No. Stata China) was founded by Lian Yujun Team of Sun Yat-sen University. It aims to share various experiences and skills of Stata application with you regularly.
- Public Number Twitter Published in Synchronization CSDN-Stata Continuous Enjoyment Meeting ,Brief Book - Stata Enjoyment Meeting and Zhizhi - Lian Yujun Stata column . You can search for the keyword Stata or Stata in the above website and pay attention to us after the conference.
- Click on the bottom of the tweet to view the links in the tweet and download the relevant information.
- Stata Continuous Enjoyment Meeting Topics of Excellence || Wonderful Twitter
- Welcome contributions: You are welcome to submit your articles or notes to Stata Continuous Enjoyment Conference (Public Number: Stata China), we will retain your signature; if you employ more than five articles, you will be eligible for Stata on-site training (one of the elementary or advanced elections) free of charge. You can also from Selection Platform for Lianhe Meeting [002 Alternative Theme] Choose topics of interest to write tweets.
- Comments and information: Welcome your valuable opinions, you can also write to ask for the procedures and data mentioned in the extrapolation.
- Recruiting Talents: Welcome to join our team and learn Stata together. Collaborative editors or writers with more than five articles will be eligible for Stata on-site training (one of the junior or senior elections) free of charge.
- Contact mail: StataChina@163.com
- Stata Twitter List
- Stata Continuous Enjoyment Meeting Topics of Excellence || Wonderful Twitter