2018-03-01 Data Structures and Algorithms (3)
1.11 Naming Slices
Suppose you have a piece of code that extracts specific data fields from fixed positions in a record string (for example, a record read from a flat file or a similar format):
    #         012345678901234567890123456789012345678901234567
    record = '....................100 .......513.25 ..........'
    cost = int(record[20:23]) * float(record[31:37])
Instead of writing it that way, why not name the slices like this?
    SHARES = slice(20, 23)
    PRICE = slice(31, 37)
    cost = int(record[SHARES]) * float(record[PRICE])
The built-in slice() function creates a slice object that can be used where any slice is allowed. For example:
    >>> items = [0, 1, 2, 3, 4, 5, 6]
    >>> a = slice(2, 4)
    >>> items[2:4]
    [2, 3]
    >>> items[a]
    [2, 3]
    >>> items[a] = [10, 11]
    >>> items
    [0, 1, 10, 11, 4, 5, 6]
    >>> del items[a]
    >>> items
    [0, 1, 4, 5, 6]
If you have a slice object a, you can read its a.start, a.stop, and a.step attributes to get more information. For example:
    >>> a = slice(5, 50, 2)
    >>> a.start
    5
    >>> a.stop
    50
    >>> a.step
    2
    >>>
In addition, you can map a slice onto a sequence of a specific size by calling its indices(size) method. It returns a tuple (start, stop, step) in which all values have been clipped to fit within the bounds of the sequence, which avoids IndexError exceptions when indexing. For example:
    >>> s = 'HelloWorld'
    >>> a.indices(len(s))
    (5, 10, 2)
    >>> for i in range(*a.indices(len(s))):
    ...     print(s[i])
    ...
    W
    r
    d
    >>>
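The ideas in this section can be combined into a small sketch. The parse_record helper and the field layout are assumptions made for illustration only. The record is built from pieces so that the column positions are easy to verify.

```python
# Hypothetical field layout for a fixed-width record (illustrative only).
SHARES = slice(20, 23)
PRICE = slice(31, 37)

def parse_record(record):
    """Return (shares, price) extracted from a fixed-width record string."""
    return int(record[SHARES]), float(record[PRICE])

# 20 filler characters, the shares field, 7 fillers, the price field, 10 fillers.
record = '.' * 20 + '100 ' + '.' * 7 + '513.25 ' + '.' * 10
shares, price = parse_record(record)
cost = shares * price

# indices() clips an oversized slice to the length of a concrete sequence.
clipped = slice(5, 50, 2).indices(len('HelloWorld'))
print(shares, price, cost, clipped)
```

Naming the slices keeps the column positions in one place, so a change to the record layout only touches the two constants.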
1.12 The most frequent elements in a sequence
The collections.Counter class is designed specifically for this type of problem, and it even has a useful most_common() method that gives you a direct answer.
For demonstration, suppose you have a list of words and want to find out which words appear most frequently. You can do this:
    words = [
        'look', 'into', 'my', 'eyes', 'look', 'into', 'my', 'eyes',
        'the', 'eyes', 'the', 'eyes', 'the', 'eyes', 'not', 'around', 'the',
        'eyes', "don't", 'look', 'around', 'the', 'eyes', 'look', 'into',
        'my', 'eyes', "you're", 'under'
    ]
    from collections import Counter
    word_counts = Counter(words)
    # Three words with the highest frequency
    top_three = word_counts.most_common(3)
    print(top_three)
    # Outputs [('eyes', 8), ('the', 5), ('look', 4)]
As input, a Counter object accepts any sequence of hashable elements. Under the hood, a Counter is a dictionary that maps each element to the number of times it occurs. For example:
    >>> word_counts['not']
    1
    >>> word_counts['eyes']
    8
    >>>
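As a small sketch of the point above, a string also works as input, because it is a sequence of hashable characters. The sample word is made up for illustration. Looking up an element that was never counted returns 0 instead of raising KeyError.

```python
from collections import Counter

# A string is a sequence of hashable one-character strings.
letters = Counter('mississippi')

# Counts are read with normal dictionary lookups.
count_i = letters['i']
count_p = letters['p']

# A missing element yields 0 and is not added to the counter.
missing = letters['z']
print(count_i, count_p, missing, 'z' in letters)
```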
If you want to increment the counts manually, simply use addition:
    >>> morewords = ['why','are','you','not','looking','in','my','eyes']
    >>> for word in morewords:
    ...     word_counts[word] += 1
    ...
    >>> word_counts['eyes']
    9
    >>>
Or you can use the update() method:
    >>> word_counts.update(morewords)
    >>>
A little-known feature of Counter instances is that they can be easily combined with mathematical operations. For example:
    >>> a = Counter(words)
    >>> b = Counter(morewords)
    >>> a
    Counter({'eyes': 8, 'the': 5, 'look': 4, 'into': 3, 'my': 3, 'around': 2, "you're": 1, "don't": 1, 'under': 1, 'not': 1})
    >>> b
    Counter({'eyes': 1, 'looking': 1, 'are': 1, 'in': 1, 'not': 1, 'you': 1, 'my': 1, 'why': 1})
    >>> # Combine counts
    >>> c = a + b
    >>> c
    Counter({'eyes': 9, 'the': 5, 'look': 4, 'my': 4, 'into': 3, 'not': 2, 'around': 2, "you're": 1, "don't": 1, 'in': 1, 'why': 1, 'looking': 1, 'are': 1, 'under': 1, 'you': 1})
    >>> # Subtract counts
    >>> d = a - b
    >>> d
    Counter({'eyes': 7, 'the': 5, 'look': 4, 'into': 3, 'my': 2, 'around': 2, "you're": 1, "don't": 1, 'under': 1})
    >>>
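One detail of this arithmetic is easier to see with tiny inputs: the - operator keeps only positive counts, which is why 'not' disappears from the difference above. The sketch below uses made-up data and contrasts the operator with the subtract() method, which keeps zero and negative counts.

```python
from collections import Counter

a = Counter(['x', 'x', 'y'])
b = Counter(['x', 'y', 'y', 'z'])

# The + operator adds counts element by element.
combined = a + b

# The - operator drops any element whose resulting count is zero or negative.
difference = a - b

# subtract() modifies a counter in place and keeps zero and negative counts.
in_place = Counter(a)
in_place.subtract(b)

print(combined)
print(difference)
print(in_place)
```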
Counter objects are a very useful tool in almost any situation where data needs to be tabulated or counted. For this kind of problem, reach for Counter first instead of tallying by hand with a plain dictionary.
1.13 Sorting a list of dictionaries by a common key
With the itemgetter function from the operator module, sorting this kind of data structure is very easy. Suppose a database query for website members returns the following data structure:
    rows = [
        {'fname': 'Brian', 'lname': 'Jones', 'uid': 1003},
        {'fname': 'David', 'lname': 'Beazley', 'uid': 1002},
        {'fname': 'John', 'lname': 'Cleese', 'uid': 1001},
        {'fname': 'Big', 'lname': 'Jones', 'uid': 1004}
    ]
Sorting these rows by any field the dictionaries have in common is easy. Code example:
    from operator import itemgetter
    rows_by_fname = sorted(rows, key=itemgetter('fname'))
    rows_by_uid = sorted(rows, key=itemgetter('uid'))
    print(rows_by_fname)
    print(rows_by_uid)
    # The output of the code is as follows:
    [{'fname': 'Big', 'uid': 1004, 'lname': 'Jones'}, {'fname': 'Brian', 'uid': 1003, 'lname': 'Jones'}, {'fname': 'David', 'uid': 1002, 'lname': 'Beazley'}, {'fname': 'John', 'uid': 1001, 'lname': 'Cleese'}]
    [{'fname': 'John', 'uid': 1001, 'lname': 'Cleese'}, {'fname': 'David', 'uid': 1002, 'lname': 'Beazley'}, {'fname': 'Brian', 'uid': 1003, 'lname': 'Jones'}, {'fname': 'Big', 'uid': 1004, 'lname': 'Jones'}]
The itemgetter() function also supports multiple keys, such as the following code:
    rows_by_lfname = sorted(rows, key=itemgetter('lname','fname'))
    print(rows_by_lfname)
    # The output is as follows:
    [{'fname': 'David', 'uid': 1002, 'lname': 'Beazley'}, {'fname': 'John', 'uid': 1001, 'lname': 'Cleese'}, {'fname': 'Big', 'uid': 1004, 'lname': 'Jones'}, {'fname': 'Brian', 'uid': 1003, 'lname': 'Jones'}]
In the above example, rows is passed to the built-in sorted() function, which accepts a keyword argument key. This argument must be a callable that takes a single element from rows and returns the value used for sorting. The itemgetter() function is responsible for creating that callable.
The operator.itemgetter() function takes as arguments the lookup indices used to extract values from the records in rows. An index can be a dictionary key name, an integer, or any value that can be passed to an object's __getitem__() method. If you pass multiple indices to itemgetter(), the callable it produces returns a tuple containing all of the corresponding values, and sorted() orders the records by comparing those tuples element by element. This is useful when you want to sort on several fields at the same time (for example, by last name and then first name, as in the example).
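To make the behavior of the generated callable concrete, here is a short sketch that calls itemgetter() objects directly, outside of sorted(). The variable names are chosen only for illustration.

```python
from operator import itemgetter

row = {'fname': 'Brian', 'lname': 'Jones', 'uid': 1003}

# A single key returns the bare value.
get_uid = itemgetter('uid')

# Several keys return a tuple, which sorted() compares element by element.
get_names = itemgetter('lname', 'fname')

# Integer indices work on sequences such as tuples.
get_second = itemgetter(1)

print(get_uid(row))
print(get_names(row))
print(get_second(('a', 'b', 'c')))
```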
itemgetter() can sometimes be replaced by lambda expressions, such as:
    rows_by_fname = sorted(rows, key=lambda r: r['fname'])
    rows_by_lfname = sorted(rows, key=lambda r: (r['lname'], r['fname']))
This approach works fine too. However, itemgetter() typically runs slightly faster, so prefer it if performance matters.
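A rough way to check the speed difference on your own machine is the standard timeit module. The sketch below is only an informal measurement; the iteration count is an arbitrary choice, and timings vary by interpreter and hardware, so no numbers are shown here.

```python
from operator import itemgetter
from timeit import timeit

rows = [
    {'fname': 'Brian', 'lname': 'Jones', 'uid': 1003},
    {'fname': 'David', 'lname': 'Beazley', 'uid': 1002},
    {'fname': 'John', 'lname': 'Cleese', 'uid': 1001},
    {'fname': 'Big', 'lname': 'Jones', 'uid': 1004},
]

# Both key functions produce the same ordering.
by_itemgetter = sorted(rows, key=itemgetter('lname', 'fname'))
by_lambda = sorted(rows, key=lambda r: (r['lname'], r['fname']))
same_order = by_itemgetter == by_lambda

# Time each variant; 10000 repetitions is an arbitrary, illustrative count.
t_itemgetter = timeit(lambda: sorted(rows, key=itemgetter('uid')), number=10000)
t_lambda = timeit(lambda: sorted(rows, key=lambda r: r['uid']), number=10000)

print(same_order, t_itemgetter, t_lambda)
```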
Finally, don't forget that the techniques shown in this section also apply to functions such as min() and max(). For example:
    >>> min(rows, key=itemgetter('uid'))
    {'fname': 'John', 'lname': 'Cleese', 'uid': 1001}
    >>> max(rows, key=itemgetter('uid'))
    {'fname': 'Big', 'lname': 'Jones', 'uid': 1004}
    >>>
1.14 Sorting objects that do not support native comparison
The built-in sorted() function takes a keyword argument key, a callable that returns a value for each object passed to it; sorted() then uses these values to order the objects. If your application has a sequence of User instances and you want to sort them by their user_id attribute, you can supply a callable that takes a User instance as input and returns its user_id. For example:
    class User:
        def __init__(self, user_id):
            self.user_id = user_id

        def __repr__(self):
            return 'User({})'.format(self.user_id)


    def sort_notcompare():
        users = [User(23), User(3), User(99)]
        print(users)
        print(sorted(users, key=lambda u: u.user_id))

Another way is to use operator.attrgetter() instead of a lambda function:
    >>> from operator import attrgetter
    >>> users = [User(23), User(3), User(99)]
    >>> sorted(users, key=attrgetter('user_id'))
    [User(3), User(23), User(99)]
    >>>
Note that the technique used in this section also applies to functions such as min() and max(). For example:
    >>> min(users, key=attrgetter('user_id'))
    User(3)
    >>> max(users, key=attrgetter('user_id'))
    User(99)
    >>>
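Like itemgetter(), attrgetter() accepts several names and then returns a tuple, which allows sorting on more than one attribute. The sketch below uses a hypothetical Member class with first_name and last_name attributes; neither the class nor the attribute names come from the example above.

```python
from operator import attrgetter

class Member:
    # Hypothetical class used only for this illustration.
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name

    def __repr__(self):
        return 'Member({!r}, {!r})'.format(self.first_name, self.last_name)

members = [
    Member('Brian', 'Jones'),
    Member('David', 'Beazley'),
    Member('Big', 'Jones'),
]

# Sort by last name first, then by first name to break ties.
by_name = sorted(members, key=attrgetter('last_name', 'first_name'))
names = [(m.last_name, m.first_name) for m in by_name]
print(names)
```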
1.15 Grouping records by a field
Suppose you have the following list of dictionaries:
    rows = [
        {'address': '5412 N CLARK', 'date': '07/01/2012'},
        {'address': '5148 N CLARK', 'date': '07/04/2012'},
        {'address': '5800 E 58TH', 'date': '07/02/2012'},
        {'address': '2122 N CLARK', 'date': '07/03/2012'},
        {'address': '5645 N RAVENSWOOD', 'date': '07/02/2012'},
        {'address': '1060 W ADDISON', 'date': '07/02/2012'},
        {'address': '4801 N BROADWAY', 'date': '07/01/2012'},
        {'address': '1039 W GRANVILLE', 'date': '07/04/2012'},
    ]
Now suppose you want to iterate over data blocks grouped by date. To do this, you first need to sort by the specified field (date in this case), and then call the itertools.groupby() function:
    from operator import itemgetter
    from itertools import groupby

    # Sort by the desired field first
    rows.sort(key=itemgetter('date'))

    # Iterate in groups
    for date, items in groupby(rows, key=itemgetter('date')):
        print(date)
        for i in items:
            print(' ', i)
Output:
    07/01/2012
      {'date': '07/01/2012', 'address': '5412 N CLARK'}
      {'date': '07/01/2012', 'address': '4801 N BROADWAY'}
    07/02/2012
      {'date': '07/02/2012', 'address': '5800 E 58TH'}
      {'date': '07/02/2012', 'address': '5645 N RAVENSWOOD'}
      {'date': '07/02/2012', 'address': '1060 W ADDISON'}
    07/03/2012
      {'date': '07/03/2012', 'address': '2122 N CLARK'}
    07/04/2012
      {'date': '07/04/2012', 'address': '5148 N CLARK'}
      {'date': '07/04/2012', 'address': '1039 W GRANVILLE'}
The groupby() function scans a sequence and finds runs of consecutive elements with identical values (or identical return values from the specified key function). On each iteration, it returns the value together with an iterator that produces all of the items in the group that share that value.
A very important preparation step is to sort the data by the field you are grouping on. Because groupby() only examines consecutive elements, it will not produce the desired groups if the data has not been sorted in advance.
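The effect of skipping the sort can be seen in a minimal sketch. The short list of dates below is made up for illustration. Without sorting, the repeated date ends up in two separate groups.

```python
from itertools import groupby

dates = ['07/01/2012', '07/04/2012', '07/02/2012', '07/01/2012']

# Without sorting, equal values that are not adjacent form separate groups.
unsorted_groups = [(key, len(list(group))) for key, group in groupby(dates)]

# After sorting, each distinct value forms exactly one group.
sorted_groups = [(key, len(list(group))) for key, group in groupby(sorted(dates))]

print(unsorted_groups)
print(sorted_groups)
```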
If you just want to group the data by date into one large data structure that allows random access, it is better to use defaultdict() to build a multivalued dictionary, as described in detail in section 1.6. For example:
    from collections import defaultdict
    rows_by_date = defaultdict(list)
    for row in rows:
        rows_by_date[row['date']].append(row)
In this way, you can easily access the corresponding records for each specified date:
    >>> for r in rows_by_date['07/01/2012']:
    ...     print(r)
    ...
    {'date': '07/01/2012', 'address': '5412 N CLARK'}
    {'date': '07/01/2012', 'address': '4801 N BROADWAY'}
    >>>
In the example above, we don't need to sort the records first. Therefore, if memory usage is not a major concern, this approach runs faster than sorting first and then iterating with groupby().
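As a quick check that the two approaches group the data the same way, the sketch below builds both results from a shortened, illustrative copy of the rows and compares them. Because sorted() is stable, the addresses within each date keep their original relative order in both cases.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

rows = [
    {'address': '5412 N CLARK', 'date': '07/01/2012'},
    {'address': '5148 N CLARK', 'date': '07/04/2012'},
    {'address': '5800 E 58TH', 'date': '07/02/2012'},
    {'address': '4801 N BROADWAY', 'date': '07/01/2012'},
]

# Approach 1: one pass with a multivalued dictionary, no sorting needed.
by_dict = defaultdict(list)
for row in rows:
    by_dict[row['date']].append(row['address'])

# Approach 2: sort first, then collect each consecutive run with groupby().
by_date = itemgetter('date')
by_groupby = {
    date: [r['address'] for r in items]
    for date, items in groupby(sorted(rows, key=by_date), key=by_date)
}

same = dict(by_dict) == by_groupby
print(same)
```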