Stata: Three Ways to Generate Unique Data Encoding

Keywords: Session


Author: Hu Yuxiao (London School of Political Economy)

Stata Enjoyment Club: Know about | Brief book | Code cloud | CSDN

Source: Three Ways to Create Unique Identifiers (Francis, 2012)

Continuous enjoyment of accounting methods topics... ___________.

Continuous Enjoyment Session #Golden Autumn October @ Spatial Measurement Theme (Chengdu, 2019.10.24-27)

Suppose the data to be processed contains three code variables, household_ID, city_ID, and state_ID. These three variables represent the codes of families, cities and states where different individuals live. If the three code variables are considered as a combination, the different numbers in the combination represent different individuals.

This tweet will show you how to generate unique ID s for different individuals in this case.

1. Data Generation

Firstly, 25,000 observations are set.

clear 
set obs 25000

Assuming that the value range of household_ID is [1, 50], the variable
The range of city_ID is [1,20], and the range of state_ID is [1,50].

*Random Generation
gen household_ID = ceil(runiform()*50)
gen city_ID = ceil(runiform()*20)
gen state_ID = ceil(runiform()*50)

In addition, variables X1 and X2 are generated to simulate variables of interest in real data.

*Random Generation
gen x1 = rnormal()
gen x2 = rnormal()

The data structure is as follows:

. list household_ID city_ID state_ID x1 x2 in 1/10

     +-------------------------------------------------------+
     | househ~D   city_ID   state_ID          x1          x2 |
     |-------------------------------------------------------|
  1. |       18        10         28    2.025588   -.1922264 |
  2. |       14        11          8    1.042631   -.0807038 |
  3. |        7        17         21    .2977124    .6150526 |
  4. |        2         4         28   -1.722132   -1.358765 |
  5. |       44         8         36   -.7291995    .0929139 |
     |-------------------------------------------------------|
  6. |       18        12         18    .8618261    .6687715 |
  7. |        4         4         29    -.239354    .6361541 |
  8. |       17        14          2     .516549   -.6399707 |
  9. |       28        15         19   -1.812016   -1.628398 |
 10. |       44         2         23   -1.015124   -.7855705 |
     +-------------------------------------------------------+

2. Two ways to create Unique ID

2.1 Create Unique ID with egen command

The easiest way to create a Unique ID is to use the egen command.

. egen ID = group(household_ID city_ID state_ID)
. list in 1/10

     +---------------------------------------+
     | househ~D   state_ID   city_ID      ID |
     |---------------------------------------|
  1. |       18         28        10    6890 |
  2. |       14          8        11    5296 |
  3. |        7         21        17    2724 |
  4. |        2         28         4     469 |
  5. |       44         36         8   17091 |
     |---------------------------------------|
  6. |       18         18        12    6925 |
  7. |        4         29         4    1253 |
  8. |       17          2        14    6553 |
  9. |       28         19        15   10945 |
 10. |       44         23         2   16966 |
     +---------------------------------------+

The Unique ID variable ID created by the egen command is sorted in the same way as the sort house_ID city_ID state_ID command, as follows:

. sort household_ID city_ID state_ID \\sort
. list household_ID city_ID state_ID ID in 1/10

     +------------------------------------+
     | househ~D   city_ID   state_ID   ID |
     |------------------------------------|
  1. |        1         1          2    1 |
  2. |        1         1          5    2 |
  3. |        1         1          7    3 |
  4. |        1         1         10    4 |
  5. |        1         1         10    4 |
     |------------------------------------|
  6. |        1         1         11    5 |
  7. |        1         1         15    6 |
  8. |        1         1         15    6 |
  9. |        1         1         20    7 |
 10. |        1         1         21    8 |
     +------------------------------------+

2.2 Create Character Unique ID

Another way to create Unique ID variables is to create a string variable. Although this approach is relatively complex, the Unique ID variables generated are more direct and easy to identify.

The specific orders are as follows:

. gen ID3 = "H" + string(household_ID,"%2.0f" ) + "C" + string(city_ID) + "S" + string(state_ID)
. list household_ID city_ID state_ID ID3 in 1/10

     +-------------------------------------------+
     | househ~D   city_ID   state_ID         ID3 |
     |-------------------------------------------|
  1. |       18        10         28   H18C10S28 |
  2. |       14        11          8    H14C11S8 |
  3. |        7        17         21    H7C17S21 |
  4. |        2         4         28     H2C4S28 |
  5. |       44         8         36    H44C8S36 |
     |-------------------------------------------|
  6. |       18        12         18   H18C12S18 |
  7. |        4         4         29     H4C4S29 |
  8. |       17        14          2    H17C14S2 |
  9. |       28        15         19   H28C15S19 |
 10. |       44         2         23    H44C2S23 |
     +-------------------------------------------+

3. Processing repeated observations

Because there are repeated observations in the data, the Unique ID created is not unique.

When you run the sum ID command, you can see that the maximum value of the variable ID is 19,757. This indicates that the sample consists of 19,757 different individuals. The results of duplicates report ID show that 4,459 individuals (19,757-15,298 = 4,459) were observed repeatedly in the data set. In empirical research, the reason for this result may be that the data sources of different variables are different, which results in some individuals being observed repeatedly.

. sum ID

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          ID |     25,000    9890.821    5709.989          1      19757

. duplicates report ID

Duplicates in terms of ID

--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |        15298             0
        2 |         7528          3764
        3 |         1836          1224
        4 |          312           234
        5 |           20            16
        6 |            6             5
--------------------------------------

At this point, a rule dealing with x1 and x2 should be established to ensure that each unique code (Unique ID) has a unique corresponding observation value. The commands for processing x1 and x2 are as follows.

collapse (mean) mean_x1=x1 mean_x2=x2 (median) ///
               med_x1=x1 med_x2=x2, by(ID)

For repeated observations, the command generates new variables after running, retaining the average and median values of repeated observations for each ID. The data structure is as follows

. list in 1/10

     +----------------------------------------------------+
     | ID     mean_x1     mean_x2      med_x1      med_x2 |
     |----------------------------------------------------|
  1. |  1    .1792767    .7344462    .1792767    .7344462 |
  2. |  2    1.093224   -.1823799    1.093224   -.1823799 |
  3. |  3   -.3094974    2.653506   -.3094974    2.653506 |
  4. |  4    .4993488    .8985389    .4993488    .8985389 |
  5. |  5   -.7429997   -.0464208   -.7429997   -.0464208 |
     |----------------------------------------------------|
  6. |  6    .4437077   -.5161228    .4437077   -.5161228 |
  7. |  7   -.7664803     1.15589   -.7664803     1.15589 |
  8. |  8      .59837    .1051743      .59837    .1051743 |
  9. |  9    .9568613   -.7659643    .9568613   -.7659643 |
 10. | 10   -.8682789   -1.468759   -.8682789   -1.468759 |
     +----------------------------------------------------+

In addition, after processing, it can be found that there are no repetitive observations in the data.

. duplicates report ID

Duplicates in terms of ID
--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |        19757             0
--------------------------------------

Continuous enjoyment of accounting methods topics... ___________.

About us

Contact us

  • Welcome contributions: You are welcome to submit your articles or notes to Stata Continuous Enjoyment Conference (Public Number: Stata China), we will retain your signature; if you employ more than five articles, you will be eligible for Stata on-site training (one of the elementary or advanced elections) free of charge. You can also from Selection Platform for Lianhe Meeting [002 Alternative Theme] Choose topics of interest to write tweets.
  • Comments and information: Welcome your valuable opinions, you can also write to ask for the procedures and data mentioned in the extrapolation.
  • Recruiting Talents: Welcome to join our team and learn Stata together. Collaborative editors or writers with more than five articles will be eligible for Stata on-site training (one of the junior or senior elections) free of charge.
  • Contact mail: StataChina@163.com

Previous wonderful tweets

Twitter List

Posted by Locust on Fri, 02 Aug 2019 03:58:51 -0700