Preface
My analysis is divided into three core steps:
Step 1: Crawl the product ranking and detail-page links. Required fields: rank, product name, and detail-page link.
Step 2: Crawl the product details. Required information:
- Store: isn't that a competitor? Analyze its best sellers, keep the store links, and mine them later in a targeted way
- Price: analyze the price range of the best sellers, which helps with pricing and market segmentation
- Launch date: is it a new product? How long has it been selling?
- Star rating, review count, review tags, and the all-reviews link: for crawling review content later to analyze product strengths and weaknesses
- Size and color: also very valuable reference data, though crawling them hit some snags, discussed later
- Image link: don't you want to see what the product looks like?
Step 3: Turn the data into visual charts and analyze them.
Can't wait to see the process? Come on~
How to crawl underwear data
The crawling process is divided into the following steps.
1. Crawl the product ranking and detail-page links
Fields to crawl: Rank, item_name, item_link, img_src (a sketch follows)
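As a minimal sketch of this step (the same selectors appear in the complete code at the end of the article; Amazon's markup changes often, so they may need updating):

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # minimal browser disguise
url = 'https://www.amazon.com/Best-Sellers-Womens-Chemises-Negligees/zgbs/fashion/1044968/ref=zg_bs_pg_1?_encoding=UTF8&pg=1'
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')
for i in range(50):  # 50 items per best-seller page
    rank = soup.select('span.zg-badge-text')[i].text.strip('#')
    name = soup.select('#zg-ordered-list > li > span > div > span > a > div')[i].text.strip()
    link = 'https://www.amazon.com' + soup.select('#zg-ordered-list > li > span > div > span > a')[i].get('href')
    img = soup.select('#zg-ordered-list > li > span > div > span > a > span > div > img')[i].get('src')
    print(rank, name, link, img)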
2. Crawl more product information on the product details page
Core issues:
1) Build a function that fetches the details of a single product; 2) use a for loop over the list of detail-page links to fetch every product's details (the pattern is sketched below)
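A minimal sketch of that pattern; the price guard matters because not every detail page has every field. The URL in item_links is a stand-in for the 100 links collected in step 1:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}

def get_item_detail(item_url):
    # Fetch one detail page; guard optional fields, since a page may lack the price block
    soup = BeautifulSoup(requests.get(item_url, headers=headers).text, 'lxml')
    price = soup.select('#priceblock_ourprice')  # selector from the complete code below
    return {'price': price[0].text if price else None}

item_links = ['https://www.amazon.com/dp/B0712188H2']  # in reality, the 100 links from step 1
for link in item_links:
    print(get_item_detail(link))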
3. Crawl the reviews
Core issues:
1) Read the Rank, item_name, reviews, and reviews_link fields from the CSV produced in the previous step; 2) build a function that reads all reviews of one product; 3) use a for loop to fetch all reviews of all products; 4) store them in the database and a CSV file. The paging logic is sketched below.
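The paging detail is the crux: Amazon shows 8 reviews per page, so the page count is derived from the total review count scraped in step 2. A minimal sketch, matching the complete code at the end (the trailing partial page is skipped there too):

def review_page_urls(reviews_link, reviews):
    # 8 reviews per page; the trailing partial page is skipped, as in the full code
    pages = reviews // 8
    for page in range(1, pages + 1):
        # Strip the 'ref=...' tail and append the page number
        yield reviews_link.split('ref=')[0] + '?pageNumber={}'.format(page)

link = 'https://www.amazon.com/Avidlove-Lingerie-Babydoll-Sleepwear-Chemise/product-reviews/B0712188H2/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews'
for u in review_page_urls(link, 25):  # 25 reviews -> pages 1..3
    print(u)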
4. Crawl size and color data
The code is basically the same as in step 3; the main change is determining the number of size & color entries per page.
Data cleaning and preprocessing
1. Read and clean the data
Read the 100 products from the CSV file, select the required columns, and clean the values:
- Some fields that look numeric are actually strings, so type conversion is needed (for example, after splitting the price field, the parts must be cast to float)
- NaN values that take part in numeric calculations are replaced with the column mean (both steps are sketched after this list)
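Both steps, as done in the analysis code at the end, boil down to the following pandas sketch (the three price strings are toy rows for illustration):

import pandas as pd

item_info_c = pd.DataFrame({'price': ['$9.99-$15.99', '$12.50', None]})  # toy rows
item_info_c['price'] = item_info_c['price'].str.replace('$', '', regex=False)
item_info_c['min_price'] = item_info_c['price'].str.split('-').str[0].astype('float')   # string -> float
item_info_c['max_price'] = item_info_c['price'].str.split('-').str[-1].astype('float')
item_info_c['mean_price'] = (item_info_c['max_price'] + item_info_c['min_price']) / 2
# Replace NaN with the column mean so the rows can join numeric calculations
for col in ['min_price', 'max_price', 'mean_price']:
    item_info_c[col].fillna(item_info_c[col].mean(), inplace=True)
print(item_info_c)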
2. Aggregate the data by merchant
Compute the per-merchant figures: average star rating, total and average review counts, mean lowest price, mean highest price, mean price, product count, and share of the top 100 (a groupby sketch follows). Then standardize star rating, average review count, average price, and product count, and compute a weighted score.
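A minimal groupby sketch mirroring the analysis code at the end; the DataFrame below is a toy stand-in for the cleaned 100-product table:

import pandas as pd

item_info_c = pd.DataFrame({  # toy rows standing in for the cleaned table
    'store': ['Avidlove', 'Avidlove', 'Garmol', 'ELOVER'],
    'star': [4.5, 4.3, 4.4, 4.0],
    'reviews': [120, 80, 1500, 40],
    'mean_price': [12.0, 14.0, 16.0, 49.0],
})
g = item_info_c.groupby('store')
summary = pd.DataFrame({
    'star': g['star'].mean(),              # average star rating
    'reviews_sum': g['reviews'].sum(),     # total review count
    'reviews_mean': g['reviews'].mean(),   # average review count
    'price_mean': g['mean_price'].mean(),  # mean price
    'item_num': g['star'].count(),         # product count
})
summary['per'] = summary['item_num'] / 100  # share of the top 100
print(summary)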
Which one is stronger?
① Star-rating ranking of merchants
- The average star rating is 4.15, and more than half of the merchants (17/32) score above it
- Top-ranked LALAVAVA scores 4.9, followed by five merchants at 4.5
- Last-placed N-pearI scores only 3.2
Let me see what LALAVAVA looks like. Its products on Amazon look like ordinary swimsuits; we Chinese are still quite conservative~
But does a high score really mean a good product? Let's look at the review counts:
② Average review count ranking of merchants
- First, the average review count is only 193, and fewer than 40% of merchants (12/32) exceed it. Considering that a Taobao listing easily racks up tens of thousands of reviews, Americans must envy our population advantage;
- Star-rating leader LALAVAVA has so few reviews that its real product quality is in doubt;
- N-pearI, last in star rating, also has few reviews, which suggests its products really aren't that good;
- By contrast, Garmol, first in review count, holds a 4.4-star rating backed by plenty of reviews, which looks like a genuinely good product;
- The next few merchants all have below-average star ratings
So is Amazon's star rating determined only by the stars reviewers give? Some searching turned up three important factors in Amazon's rating: how recent a review is, how many helpful votes it gets from buyers, and whether it carries a Verified Purchase badge. Review length and view count may also affect a review's weight.
Amazon's review monitoring is clearly strict and complex! Of course, the most important thing is to see what review-count leader Garmol looks like:
Much racier than the swimsuit-like items above. No wonder the reviews are glowing: very sexy!
③ Price range ranking of merchants (by average price)
- The chart shows clearly that ELOVER targets the high-end market, priced around $49; at the other extreme, Goddessvan sells at just $0.39, a single price point, presumably a loss leader meant to boost exposure and grab the low-end market
- Average prices mostly fall between $10 and $20, evidently the lingerie market's main price band; no merchant sits in the $20-40 range, so that band is worth a closer look to see whether it is a blue ocean with greater market potential
- Judging by each merchant's price spread, most run a multi-color, multi-style strategy: it gives users more choice and shows off the merchant's ability to release new products. Only a few bet on a single hit item
The priciest merchant, ELOVER, does look the part: even its thumbnails are shot more carefully than other sellers'.
So whose strategy is more reliable, and who holds the larger market share?
④ Pie chart of merchants' product counts
- Avidlove dominates with 28% of the top 100 products
- The other merchants each hold roughly single-digit shares, with no obvious standouts
Avidlove's underwear is cool; I like it.
Still, a single dimension can hardly tell which merchant is better; it's better to combine multiple indicators~
⑤ Weighted ranking of merchants
After standardizing star rating, average review count, average price, and product count, and since a principled weighting is hard to choose, I simply multiplied each normalized value by 10, summed the four into a total score, and drew a stacked bar chart (a scoring sketch follows).
Each merchant's mix of the four indicators reveals its own strengths and weaknesses.
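A sketch of the scoring under the same even-weights assumption, reusing the toy summary table from the groupby sketch above (min-max scale each indicator to 0-10, then sum):

def nor(col):
    # min-max standardization scaled to 0-10, as in data_nor() in the full code
    return (col - col.min()) / (col.max() - col.min()) * 10

cols = ['star', 'reviews_mean', 'price_mean', 'item_num']
for col in cols:
    summary[col + '_nor'] = nor(summary[col])
summary['nor_total'] = summary[[c + '_nor' for c in cols]].sum(axis=1)
print(summary.sort_values('nor_total', ascending=False))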
- Avidlove, the cool underwear brand from earlier, takes first overall on the strength of its product count, with the other three indicators merely average; it feels like "surrounding the cities from the countryside"
- Garmol takes second mainly on word of mouth (star rating and average review count)
- ELOVER takes third mainly by precisely targeting the high-end segment
- N-pearI, with no advantage anywhere, unsurprisingly sits at the bottom
Worst-rated N-pearI also has the fewest products to be found, yet its pictures are quite explosive; better not to post them here, they're too much~
Roughly speaking, to rank near the top, your word of mouth must not be too bad: keep it at least at or above the average!
⑥ Star/price scatter chart of merchants
The x-axis is a merchant's average product price and the y-axis its star rating; point size encodes product count (more products, bigger point) and point color encodes average review count (more reviews, darker color).
The mean price and mean star rating split the plot into four quadrants (a plotting sketch follows):
(1) Upper left: cheap and good, great value; (2) upper right: a bit pricey, but you get what you pay for; (3) lower right: expensive yet poor quality; (4) lower left: cheap but not good.
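A minimal matplotlib sketch of this chart, mirroring the plotting code at the end and reusing the toy summary table from above (the mean lines create the quadrants):

import matplotlib.pyplot as plt

x, y = summary['price_mean'], summary['star']
plt.scatter(x, y,
            s=summary['item_num'] * 100,     # size: product count
            c=summary['reviews_mean'] * 10,  # color: average review count
            cmap='Reds', alpha=0.8, marker='.')
plt.axvline(x.mean(), color='r', linestyle='--')  # vertical mean-price line
plt.axhline(y.mean(), color='g', linestyle='-.')  # horizontal mean-star line
for name, px, py in zip(summary.index, x, y):
    plt.annotate(name, xy=(px, py), xytext=(0, -5), textcoords='offset points', ha='center', va='top')
plt.xlabel('price')
plt.ylabel('star')
plt.show()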
With this scatter chart, choosing a merchant to buy from becomes much easier:
- For value for money, choose Avidlove; it also has the most products, so there is plenty to pick from (the light-red merchant with the largest circle in the chart);
- For the high end, choose ELOVER; it is expensive for a reason (the merchant on the left of the chart, in the upper-left quadrant);
- For the popular choice, choose Garmol, with the most reviews and the best word of mouth (the reddest point in the chart)
Customers can pick the right merchant for their own preferences; but as a merchant, how do you improve?
⑦ Word frequency analysis
While crawling, I also collected the review tags. Word frequency analysis shows what customers care about most:
1. Fit: size, fit, and related words appear many times and rank first
2. Quality: good quality, well made, soft and comfortable, fabric, all affirmations of the material
3. Style: cut, sexy, like the picture, you know
4. Price: cheaply made, reading more as doubt about quality
5. Reputation: highly recommend(ed), a strong endorsement
Review tags are relatively few, so I went further and ran word frequency over the 24k full reviews and made a word cloud (a counting sketch follows):
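The counting is plain Python, and the cloud comes from the wordcloud package's generate_from_frequencies, the same pattern as the full word-cloud code below; the tokens list here is a toy stand-in for the cleaned review words:

from wordcloud import WordCloud

tokens = ['fit', 'fit', 'lace', 'soft', 'fit', 'lace']  # toy tokens in place of the 24k reviews
counts = {}
for w in tokens:
    counts[w] = counts.get(w, 0) + 1  # word -> frequency
top50 = sorted(counts.items(), key=lambda x: x[1], reverse=True)[:50]
print(top50)
wc = WordCloud(background_color='white').generate_from_frequencies(counts)
wc.to_file('tag_cloud.png')  # hypothetical output filename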
Most intuitively, the top themes are still fit, plus quality and style. Next, let's analyze the sizes & colors customers actually bought.
The size & color word-frequency data has several problems:
1. There are only about 6,000 records
2. Size and color cannot be separated cleanly, so they are analyzed together
3. Merchants name things differently: one may say black while another says style1 (so the odd-looking numbers are really merchants' style codes)
4. Some strange words, such as trim, may come from crawler errors or a malformed CSV export
Even so, some things are obvious:
Size: small, medium, and large are must-haves, but xlarge, xxlarge, and xxxlarge also appear. Amazon's customers are mainly European and American and tend to run large, so merchants should develop and stock more products for bigger customers.
Color: very intuitive: black > red > blue > green > white > purple... Black and red can never go wrong; green surprised me, and merchants could experiment with it boldly.
Style: both trim and lace appear in the word frequencies, and lace ranks highest!!!
Complete code
Product reviews
# 0. Import modules
from bs4 import BeautifulSoup
import requests
import random
import time
from multiprocessing import Pool
import csv
import pymongo

# 0. Create the database
client = pymongo.MongoClient('localhost', 27017)
Amazon = client['Amazon']
reviews_info_M = Amazon['reviews_info_M']

# 0. Anti-crawling measures
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'
}

# http://cn-proxy.com/
proxy_list = [
    'http://117.177.250.151:8081',
    'http://111.85.219.250:3129',
    'http://122.70.183.138:8118',
]
proxy_ip = random.choice(proxy_list)  # Pick a random proxy IP
proxies = {'http': proxy_ip}

# 1. Read 'Rank', 'item_name', 'reviews', 'reviews_link' from the csv
csv_file = csv.reader(open('C:/Users/zbd/Desktop/3.csv', 'r'))
reviews_datalst = []
for i in csv_file:
    reviews_data = {
        'Rank': i[10],
        'item_name': i[8],
        'reviews': i[6],
        'reviews_link': i[5]
    }
    reviews_datalst.append(reviews_data)
del reviews_datalst[0]  # Drop the header row
# print(reviews_datalst)
reviews_links = list(i['reviews_link'] for i in reviews_datalst)  # Collect review detail-page links

# Clean reviews: there are empty values and "1,234"-style strings
reviews = []
for i in reviews_datalst:
    if i['reviews']:
        reviews.append(int(i['reviews'].replace(',', '')))
    else:
        reviews.append(0)
print(reviews_links)
print(reviews)

# 2. Build review-page links for each item
# Item 1
# Page 1: https://www.amazon.com/Avidlove-Lingerie-Babydoll-Sleepwear-Chemise/product-reviews/B0712188H2/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews
# Page 2: https://www.amazon.com/Avidlove-Lingerie-Babydoll-Sleepwear-Chemise/product-reviews/B0712188H2/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2
# Page 3: https://www.amazon.com/Avidlove-Lingerie-Babydoll-Sleepwear-Chemise/product-reviews/B0712188H2/ref=cm_cr_getr_d_paging_btm_next_3?ie=UTF8&reviewerType=all_reviews&pageNumber=3
# Item 2
# Page 1: https://www.amazon.com/Avidlove-Women-Lingerie-Babydoll-Bodysuit/product-reviews/B077CLFWVN/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews
# Page 2: https://www.amazon.com/Avidlove-Women-Lingerie-Babydoll-Bodysuit/product-reviews/B077CLFWVN/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2
# Eight reviews per page, pages = reviews // 8 + 1
# Target format: https://www.amazon.com/Avidlove-Lingerie-Babydoll-Sleepwear-Chemise/product-reviews/B0712188H2/pageNumber=1
url = 'https://www.amazon.com/Avidlove-Lingerie-Babydoll-Sleepwear-Chemise/product-reviews/B0712188H2/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews'
counts = 0
def get_item_reviews(reviews_link, reviews):
    if reviews_link:
        pages = reviews // 8  # Eight reviews per page; the last (partial) page is not crawled
        for i in range(1, pages + 1):
            full_url = reviews_link.split('ref=')[0] + '?pageNumber={}'.format(i)
            # full_url = 'https://www.amazon.com/Avidlove-Lingerie-Babydoll-Sleepwear-Chemise/product-reviews/B0712188H2/?pageNumber=10'
            wb_data = requests.get(full_url, headers=headers, proxies=proxies)
            soup = BeautifulSoup(wb_data.text, 'lxml')
            every_page_reviews_num = len(soup.select('div.a-row.a-spacing-small.review-data > span'))
            for j in range(every_page_reviews_num):
                reviews_info = {
                    'customer_name': soup.select('div:nth-child(1) > a > div.a-profile-content > span')[j].text,
                    'star': soup.select('div.a-row>a.a-link-normal > i > span')[j].text.split('out')[0],
                    'review_date': soup.select('div.a-section.review >div>div> span.a-size-base.a-color-secondary.review-date')[j].text,
                    'review_title': soup.select('a.a-size-base.a-link-normal.review-title.a-color-base.a-text-bold')[j].text,
                    'review_text': soup.select('div.a-row.a-spacing-small.review-data > span')[j].text,
                    'item_name': soup.title.text.split(':')[-1]
                }
                yield reviews_info
                reviews_info_M.insert_one(reviews_info)
                global counts
                counts += 1
                print('Review no. {}'.format(counts), reviews_info)
    else:
        pass

'''
# This variant is for crawling size & color; much of that data was missing, so it was crawled separately
# Basically the same as the code above; the main change is counting the size & color entries per page
# Writing to the database and csv needs the matching small changes

def get_item_reviews(reviews_link, reviews):
    if reviews_link:
        pages = reviews // 8  # 8 reviews per page; the last page is skipped, which would need a "fewer than 8" check
        for i in range(1, pages + 1):
            full_url = reviews_link.split('ref=')[0] + '?pageNumber={}'.format(i)
            wb_data = requests.get(full_url, headers=headers, proxies=proxies)
            soup = BeautifulSoup(wb_data.text, 'lxml')
            every_page_reviews_num = len(soup.select('div.a-row.a-spacing-mini.review-data.review-format-strip > a'))  # size & color entries per page
            for j in range(every_page_reviews_num):
                reviews_info = {
                    'item_name': soup.title.text.split(':')[-1],
                    'size_color': soup.select('div.a-row.a-spacing-mini.review-data.review-format-strip > a')[j].text,
                }
                yield reviews_info
                print(reviews_info)
                reviews_size_color.insert_one(reviews_info)
    else:
        pass
'''

# 3. Start crawling and store the data
all_reviews = []
def get_all_reviews(reviews_links, reviews):
    for i in range(100):
        for n in get_item_reviews(reviews_links[i], reviews[i]):
            all_reviews.append(n)

get_all_reviews(reviews_links, reviews)
# print(all_reviews)

# 4. Write to csv
headers = ['_id', 'item_name', 'customer_name', 'star', 'review_date', 'review_title', 'review_text']
with open('C:/Users/zbd/Desktop/4.csv', 'w', newline='', encoding='utf-8') as f:
    f_csv = csv.DictWriter(f, headers)
    f_csv.writeheader()
    f_csv.writerows(all_reviews)
print('Finished writing!')
Product information
# 0. Import modules
from bs4 import BeautifulSoup
import requests
import random
import time
from multiprocessing import Pool
import pymongo

# 0. Create the database
client = pymongo.MongoClient('localhost', 27017)
Amazon = client['Amazon']
item_info_M = Amazon['item_info_M']

# 0. Anti-crawling measures
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'
}
# http://cn-proxy.com/
proxy_list = [
    'http://117.177.250.151:8081',
    'http://111.85.219.250:3129',
    'http://122.70.183.138:8118',
]
proxy_ip = random.choice(proxy_list)  # Pick a random proxy IP
proxies = {'http': proxy_ip}

# 1. Product ranking and detail-page links
url_page1 = 'https://www.amazon.com/Best-Sellers-Womens-Chemises-Negligees/zgbs/fashion/1044968/ref=zg_bs_pg_1?_encoding=UTF8&pg=1'  # items 1-50
url_page2 = 'https://www.amazon.com/Best-Sellers-Womens-Chemises-Negligees/zgbs/fashion/1044968/ref=zg_bs_pg_2?_encoding=UTF8&pg=2'  # items 51-100

item_info = []   # item detail dicts
item_links = []  # detail-page links
def get_item_info(url):
    wb_data = requests.get(url, headers=headers, proxies=proxies)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    for i in range(50):
        data = {
            'Rank': soup.select('span.zg-badge-text')[i].text.strip('#'),
            'item_name': soup.select('#zg-ordered-list > li > span > div > span > a > div')[i].text.strip(),
            'item_link': 'https://www.amazon.com' + soup.select('#zg-ordered-list > li > span > div > span > a')[i].get('href'),
            'img_src': soup.select('#zg-ordered-list > li> span > div > span > a > span > div > img')[i].get('src')
        }
        item_info.append(data)
        item_links.append(data['item_link'])
    print('finish!')

get_item_info(url_page1)
get_item_info(url_page2)

# 2. Crawl more product information on the detail page
# item_url = 'https://www.amazon.com/Avidlove-Lingerie-Babydoll-Sleepwear-Chemise/dp/B0712188H2/ref=zg_bs_1044968_1?_encoding=UTF8&refRID=MYWGH1W2P3HNS58R4WES'
def get_item_info_2(item_url, data):
    wb_data = requests.get(item_url, headers=headers, proxies=proxies)
    soup = BeautifulSoup(wb_data.text, 'lxml')

    # Get price (may be absent, so guard it)
    price = soup.select('#priceblock_ourprice')
    data['price'] = price[0].text if price else None

    # Get star and reviews (guarded)
    star = soup.select('div>div>span>span>span>a>i>span.a-icon-alt')
    if star:
        data['star'] = star[0].text.split(' ')[0]
        data['reviews'] = soup.select('#reviews-medley-footer > div.a-row.a-spacing-large > a')[0].text.split(' ')[2]
        data['Read reviews that mention'] = list(i.text.strip('\n').strip() for i in soup.select('span.cr-lighthouse-term'))
    else:
        data['star'] = None
        data['reviews'] = None
        data['Read reviews that mention'] = None

    data['Date_first_listed_on_Amazon'] = soup.select('#detailBullets_feature_div > ul > li> span > span:nth-child(2)')[-1].text

    # Get reviews_link (guarded)
    reviews_link = soup.select('#reviews-medley-footer > div.a-row.a-spacing-large > a')
    if reviews_link:
        data['reviews_link'] = 'https://www.amazon.com' + reviews_link[0].get('href')
    else:
        data['reviews_link'] = None

    # Get store and store_link (guarded)
    store = soup.select('#bylineInfo')
    if store:
        data['store'] = store[0].text
        data['store_link'] = 'https://www.amazon.com' + soup.select('#bylineInfo')[0].get('href')
    else:
        data['store'] = None
        data['store_link'] = None

    item_info_M.insert_one(data)  # Store in MongoDB
    print(data)

# 3. Write product details to csv
for i in range(100):
    get_item_info_2(item_links[i], item_info[i])
    print('Wrote item {}'.format(i + 1))

import csv
headers = ['_id', 'store', 'price', 'Date_first_listed_on_Amazon', 'item_link', 'reviews_link', 'reviews', 'store_link', 'item_name', 'img_src', 'Rank', 'Read reviews that mention', 'star']
with open('C:/Users/zbd/Desktop/3.csv', 'w', newline='', encoding='utf-8') as f:
    f_csv = csv.DictWriter(f, headers)
    f_csv.writeheader()
    f_csv.writerows(item_info)

print('Finished writing!')

Word cloud

path = 'C:/Users/zbd/Desktop/Amazon/fenci/'

# Read the file and tokenize
def get_text():
    f = open(path + 'reviews.txt', 'r', encoding='utf-8')
    text = f.read().lower()  # Normalize to lowercase
    for i in '!@#$%^&*()_¯+-;:`~\'"<>=./?,':  # Strip English punctuation
        text = text.replace(i, '')
    return text.split()  # Return the token list

lst_1 = get_text()  # tokenize
print('{} words in total'.format(len(lst_1)))

# Remove stop words (common words)
stop_word_text = open(path + 'stop_word.txt', 'r', encoding='utf-8')  # Read the downloaded stop-word list
stop_word = stop_word_text.read().split()
stop_word_add = ['a', 'i', 'im', "it's", "i'm", '\\u0026', '5', 'reviewdate']  # Extend as needed (some entries target mojibake tokens in the raw text)
stop_word_new = stop_word + stop_word_add
# print(stop_word_new)
lst_2 = list(word for word in lst_1 if word not in stop_word_new)
print('{} words left after removal'.format(len(lst_2)))

# Count word frequency
counts = {}
for i in lst_2:
    counts[i] = counts.get(i, 0) + 1
# print(counts)

word_counts = list(counts.items())
# print(word_counts)

word_counts.sort(key=lambda x: x[1], reverse=True)  # Descending by frequency

# Print the top 50
for i in word_counts[0:50]:
    print(i)

# Build the word cloud
from scipy.misc import imread
import matplotlib.pyplot as plt
import jieba
from wordcloud import WordCloud, ImageColorGenerator

stopwords = {}
# isCN = 0  # 0: English tokenization, 1: Chinese tokenization
path = 'C:/Users/zbd/Desktop/Amazon/fenci/'
back_coloring_path = path + 'img.jpg'    # Background image path
text_path = path + 'reviews.txt'         # Text to analyze
stopwords_path = path + 'stop_word.txt'  # Stop-word list
imgname1 = path + 'WordCloudDefautColors.png'  # Output 1 (background shape, default colors)
imgname2 = path + 'WordCloudColorsByImg.png'   # Output 2 (colors taken from the background image)
# font_path = r'./fonts\simkai.ttf'  # Chinese font path for matplotlib (only needed for Chinese text)

back_coloring = imread(back_coloring_path)  # Background image as a 3-D array

wc = WordCloud(  # font_path=font_path,  # set a font if needed
    background_color='white',  # Background color
    max_words=3000,            # Maximum number of words shown
    mask=back_coloring,        # Shape mask
    max_font_size=200,         # Largest font size
    min_font_size=5,           # Smallest font size
    random_state=42,           # N random color schemes
    width=1000, height=860, margin=2  # Default canvas size; with a mask, the saved image follows the mask size; margin is the word spacing
)

# wc.generate(text)
words = {}
for i in word_counts:
    words['{}'.format(i[0])] = i[1]

wc.generate_from_frequencies(words)
# Expected format: {word1: freq1, word2: freq2, ..., wordn: freqn}

plt.figure()

# Show the word cloud in default colors (shape follows the background image)
plt.imshow(wc)
plt.axis("off")
plt.show()  # Draw the word cloud
wc.to_file(imgname1)  # Save the image

# Show the word cloud colored from the background image
image_colors = ImageColorGenerator(back_coloring)  # Generate colors from the background image
plt.imshow(wc.recolor(color_func=image_colors))
plt.axis("off")
plt.show()
wc.to_file(imgname2)

# Show the original background image
plt.figure()
plt.imshow(back_coloring, cmap=plt.cm.gray)
plt.axis("off")
plt.show()
Data analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.colors

get_ipython().magic('matplotlib inline')  # notebook magic

# 0. Load the data
item_info = pd.read_csv('C:/Users/zbd/Desktop/Amazon/item_info.csv', engine='python')
reviews_new = pd.read_csv('C:/Users/zbd/Desktop/Amazon/reviews_new.csv', engine='python')
print(item_info.head())
print(len(item_info))
# print(reviews_new.head())

# 1. Clean the data
# Keep only the needed columns
item_info_c = item_info[['Rank', 'item_name', 'store', 'price', 'Date_first_listed_on_Amazon', 'star', 'reviews', 'Read reviews that mention']]

# Clean the price column
item_info_c['price'] = item_info_c['price'].str.replace('$', '')
item_info_c['min_price'] = item_info_c['price'].str.split('-').str[0].astype('float')
item_info_c['max_price'] = item_info_c['price'].str.split('-').str[-1].astype('float')
item_info_c['mean_price'] = (item_info_c['max_price'] + item_info_c['min_price']) / 2

# Fill NaN values with the column mean
def f_na(data, cols):
    for i in cols:
        data[i].fillna(data[i].mean(), inplace=True)
    return data

item_info_c = f_na(item_info_c, ['star', 'reviews', 'min_price', 'max_price', 'mean_price'])
item_info_c.head(5)

# 2. Aggregate by merchant
a = item_info_c.groupby('store')['star'].mean().sort_values(ascending=False)  # Average star rating
b = item_info_c.groupby('store')['reviews'].agg({'reviews_sum': np.sum, 'reviews_mean': np.mean})  # Total and average review counts
c = item_info_c.groupby('store')['min_price'].mean()  # Mean of lowest prices
d = item_info_c.groupby('store')['max_price'].mean()  # Mean of highest prices
e = item_info_c.groupby('store')['mean_price'].mean()  # Mean price
e.name = 'price_mean'
f = item_info_c.groupby('store')['star'].count()  # Product count
f.name = 'item_num'
# print(a, b, c, d, e, f)

df = pd.concat([a, b, e, f], axis=1)
df['per'] = df['item_num'] / 100  # Share of the top 100
df['per%'] = df['per'].apply(lambda x: '%.2f%%' % (x * 100))

# Min-max standardization: returns values scaled to 0-10 in new '_nor' columns
def data_nor(df, *cols):
    for col in cols:
        colname = col + '_nor'
        df[colname] = (df[col] - df[col].min()) / (df[col].max() - df[col].min()) * 10
    return df

df_re = data_nor(df, 'star', 'reviews_mean', 'price_mean', 'item_num')
print(df_re.head(5))

# 3. Draw the charts
fig, axes = plt.subplots(4, 1, figsize=(10, 15))
plt.subplots_adjust(wspace=0, hspace=0.5)

# Star-rating ranking of merchants
df_star = df['star'].sort_values(ascending=False)
df_star.plot(kind='bar', color='yellow', grid=True, alpha=0.5, ax=axes[0], width=0.7,
             ylim=[3, 5], title='Star-rating ranking of merchants')
axes[0].axhline(df_star.mean(), label='Average star rating: %.2f' % df_star.mean(), color='r', linestyle='--')
axes[0].legend(loc=1)

# Average review count ranking of merchants
df_reviews_mean = df['reviews_mean'].sort_values(ascending=False)
df_reviews_mean.plot(kind='bar', color='blue', grid=True, alpha=0.5, ax=axes[1], width=0.7,
                     title='Average review count ranking of merchants')
axes[1].axhline(df_reviews_mean.mean(), label='Average reviews: %i' % df_reviews_mean.mean(), color='r', linestyle='--')
axes[1].legend(loc=1)

# Price ranges of merchants (by average price)
avg_price = (d - c) / 2
avg_price.name = 'avg_price'
max_price = avg_price.copy()
max_price.name = 'max_price'

df_price = pd.concat([c, avg_price, max_price, df_re['price_mean']], axis=1)
df_price = df_price.sort_values(['price_mean'], ascending=False)
df_price.drop(['price_mean'], axis=1, inplace=True)
df_price.plot(kind='bar', grid=True, alpha=0.5, ax=axes[2], width=0.7, stacked=True,
              color=['white', 'red', 'blue'], ylim=[0, 55], title='Price ranges of merchants')

# Weighted ranking of merchants
df_nor = pd.concat([df_re['star_nor'], df_re['reviews_mean_nor'], df_re['price_mean_nor'], df_re['item_num_nor']], axis=1)
df_nor['nor_total'] = df_re['star_nor'] + df_re['reviews_mean_nor'] + df_re['price_mean_nor'] + df_re['item_num_nor']
df_nor = df_nor.sort_values(['nor_total'], ascending=False)
df_nor.drop(['nor_total'], axis=1, inplace=True)
df_nor.plot(kind='bar', grid=True, alpha=0.5, ax=axes[3], width=0.7, stacked=True,
            title='Weighted ranking of merchants')

# Pie chart of product counts
colors = ['aliceblue', 'antiquewhite', 'beige', 'bisque', 'blanchedalmond', 'blue', 'blueviolet', 'brown', 'burlywood',
          'cadetblue', 'chartreuse', 'chocolate', 'coral', 'cornflowerblue', 'cornsilk', 'crimson', 'cyan', 'darkblue', 'darkcyan', 'darkgoldenrod',
          'darkgreen', 'darkkhaki', 'darkviolet', 'deeppink', 'deepskyblue', 'dimgray', 'dodgerblue', 'firebrick', 'floralwhite', 'forestgreen',
          'gainsboro', 'ghostwhite', 'gold', 'goldenrod']

df_per = df_re['item_num']
fig, axes = plt.subplots(1, 1, figsize=(8, 8))
plt.axis('equal')  # Keep the pie circular
plt.pie(df_per,
        labels=df_per.index,
        autopct='%.2f%%',
        pctdistance=1.05,
        # shadow=True,
        startangle=0,
        radius=1.5,
        colors=colors,
        frame=False
        )

# Star/price scatter of merchants
plt.figure(figsize=(13, 8))
x = df_re['price_mean']         # x-axis: average price
y = df_re['star']               # y-axis: star rating
s = df_re['item_num'] * 100     # Point size: product count (more products, bigger point)
c = df_re['reviews_mean'] * 10  # Point color: average review count (more reviews, darker color)
plt.scatter(x, y, marker='.', cmap='Reds', alpha=0.8, s=s, c=c)
plt.grid()
plt.title('Star/price scatter of merchants')
plt.xlim([0, 50])
plt.ylim([3, 5])
plt.xlabel('price')
plt.ylabel('star')

# Draw the mean lines and legend
p_mean = df_re['price_mean'].mean()
s_mean = df_re['star'].mean()
plt.axvline(p_mean, label='Average price: %.2f$' % p_mean, color='r', linestyle='--')
plt.axhline(s_mean, label='Average star rating: %.2f' % s_mean, color='g', linestyle='-.')
plt.axvspan(p_mean, 50, ymin=(s_mean - 3) / (5 - 3), ymax=1, alpha=0.1, color='g')
plt.axhspan(0, s_mean, xmin=0, xmax=p_mean / 50, alpha=0.1, color='grey')
plt.legend(loc=2)

# Add merchant labels
for x, y, name in zip(df_re['price_mean'], df_re['star'], df_re.index):
    plt.annotate(name, xy=(x, y), xytext=(0, -5), textcoords='offset points', ha='center', va='top', fontsize=9)

# Clean the 'Read reviews that mention' column
df_rrtm = item_info_c['Read reviews that mention'].fillna('missing data', inplace=False)
df_rrtm = df_rrtm.str.strip('[')
df_rrtm = df_rrtm.str.rstrip(']')
df_rrtm = df_rrtm.str.replace('\'', '')

reviews_labels = []
for i in df_rrtm:
    reviews_labels = reviews_labels + i.split(',')
# print(reviews_labels)

labels = []
for j in reviews_labels:
    if j != 'missing data':
        labels.append(j)
# print(labels)

# Count tag frequency
counts = {}
for i in labels:
    counts[i] = counts.get(i, 0) + 1
# print(counts)

label_counts = list(counts.items())
# print(label_counts)

label_counts.sort(key=lambda x: x[1], reverse=True)  # Descending by frequency

print('%i review tags in total; the top 20 are:' % len(label_counts))
print('-----------------------------')
for i in label_counts[:20]:
    print(i)