- Wed 28 March 2018
- Python
- #standard-library, #beginner, #intermediate
Counting things is an apparently trivial computational problem. In this article series (more of a journey than a short single-topic read) I’ll go over the process of producing a letter-based histogram from a piece of text, as a motivation to progressively introduce several useful built-in and Standard Library tools.
Along the way I’ll share a few tips and thoughts on coding principles, on related topics like Unicode text processing and, for completeness, ideas on possible output production.
The key takeaway is that Python has many built-in and Standard Library tools that, used in the right combination, lead to simpler, shorter code that, besides being correct and easier to understand and manage, will often also be faster.
The Task
Say we want to produce a histogram representing the letter frequency, or counts, in a given piece of text. In a simple text analysis scenario, following good coding practices, we’d like to:
- Determine per-letter counts, regardless of the original letter case in the text.
  A capital “A” should be accounted for just like a lower case “a”, as yet another occurrence of the letter “A”.
- Discard any non-letters, like punctuation and whitespace.
- Produce a relatively nice human-readable result.
- Have clean, fast, robust and easy to read and modify code.
Being a “Journey on Counting”, we’ll mostly focus on counting and the many possible solutions (and challenges!) around it. Leaving the output production to a later stage, separating the counting process from the output handling, is a good mental and code design practice.
First Takes
As a first approach to a solution we create a letter_counts.py module with the letter_counts_take1 function:
def letter_counts_take1(text):
    """
    Returns a dict where:
    - keys are letters (strings of length 1).
    - values are letter counts, from the text argument.
    """
    letter_counts = {}
    for letter in text:
        if not letter in letter_counts:
            # Haven't seen this letter yet: start counting at 1.
            letter_counts[letter] = 1
        else:
            # Already tracking this letter's count: increment it.
            letter_counts[letter] += 1
    return letter_counts
We then give it a test run, in the Python interactive prompt:
$ python3 -i letter_counts.py
>>> letter_counts_take1('Hello there!')
{'H': 1, 'e': 3, 'l': 2, 'o': 1, ' ': 1, 't': 1, 'h': 1, 'r': 1, '!': 1}
>>>
It seems to be working, but it is still pretty limited:
- It’s counting 'H' and 'h' separately.
- It’s not discarding whitespace, nor punctuation.
Let’s continue by creating an improved letter_counts_take2 function in our module:
For the sake of brevity, I’ll be eliding the docstring in the remaining versions of the letter_counts_... functions.
import string

def letter_counts_take2(text):
    """ ... """
    letter_counts = {}
    for letter in text:
        if letter in string.whitespace or letter in string.punctuation:
            # Don't count whitespace and punctuation.
            continue
        # Combine lower-/upper-case letter occurrences.
        letter = letter.upper()
        if not letter in letter_counts:
            letter_counts[letter] = 1
        else:
            letter_counts[letter] += 1
    return letter_counts
A few notes:
- I’m using whitespace and punctuation, from the Standard Library’s string module, to determine whether a given letter is to be discarded or not; these are useful, predefined character strings containing common characters in the categories their names suggest.
- Converting letter to its upper-case variant, and using it in the counting process, takes care of combining the counts of upper- and lower-case letters; what you may not be aware of is that the .upper() method of Python strings handles all letters, including accented and non-Latin ones: for example, 'resumé'.upper() returns 'RESUMÉ' and 'δ'.upper() returns 'Δ'.
Some readers may now be thinking how accented letter counts could be combined with their non-accented variants. Others could highlight that string.whitespace does not account for all possible whitespace in a text string. Both very legitimate ideas, which we will briefly explore later in the text.
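Both remarks are easy to check in a few lines. Here is a minimal sketch using only built-ins and the string module; the sample characters are my own picks for illustration, not something taken from the letter counting code:

```python
import string

# str.upper() is Unicode-aware: it handles accented and non-Latin letters.
accented_upper = '\u00e9'.upper()   # 'é' -> 'É'
greek_upper = '\u03b4'.upper()      # 'δ' -> 'Δ'

# string.whitespace only lists six ASCII whitespace characters...
ascii_whitespace_count = len(string.whitespace)

# ...so a no-break space (U+00A0) is not in it, even though
# str.isspace() does classify it as whitespace.
nbsp = '\u00a0'
nbsp_in_string_whitespace = nbsp in string.whitespace
nbsp_is_space = nbsp.isspace()

print(accented_upper, greek_upper)
print(ascii_whitespace_count, nbsp_in_string_whitespace, nbsp_is_space)
```

So a text containing no-break spaces would, with the filtering used so far, have them counted as if they were letters.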
Being orthogonal to the counting process itself, these improvements on text processing reflect common, real-world constraints on counting things: they often need to be filtered and/or pre-processed before the actual counting is done. Let’s do a quick check on how letter_counts_take2 behaves, using the Python interactive prompt, and bring our attention back to counting:
$ python3 -i letter_counts.py
>>> letter_counts_take2('Hello there!')
{'H': 2, 'E': 3, 'L': 2, 'O': 1, 'T': 1, 'R': 1}
>>>
It’s definitely looking better: we no longer have non-letters in the result, and the 'H' count is now 2, as expected. Problem solved!
Can we do better than this? We certainly can.
One common source of code complexity is branching. In simple terms, the more ifs in a code block, the more complex it is, making it harder to reason about, making it more difficult (if not completely unfeasible) to test all possible execution paths, etc. We could argue that the code is small and simple enough and that it should be left as is. We could also, of course, counter-argue that, one day, in the future, it will need changes and improvements, to support new capabilities, or to address edge cases it may not handle correctly. That is the day when this small and simple code will grow. And we’d like to have it as simple as possible. All the time.
One improvement would convert this if, checking whether or not a given letter is already accounted for, …
if not letter in letter_counts:
    letter_counts[letter] = 1
else:
    letter_counts[letter] += 1
…into something simpler and linear, like:
running_count = letter_counts.get(letter, 0)
letter_counts[letter] = running_count + 1
Using the dictionary’s .get(key[, default]) method we pass in 0 as the default running count for letter, which ensures we get the actual running count for any letters already in the dictionary, and 0 for the ones not accounted for yet. Updating letter counts is then a simple addition and assignment away.
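To see .get() with a default in isolation, here is a small sketch; the starting dictionary contents are made up for illustration:

```python
letter_counts = {'A': 2}

# Existing key: .get() returns the stored value.
count_a = letter_counts.get('A', 0)

# Missing key: .get() returns the default, without adding the key.
count_b = letter_counts.get('B', 0)
b_was_added = 'B' in letter_counts

# The linear "read, add one, store" update used for counting.
letter_counts['B'] = letter_counts.get('B', 0) + 1

print(count_a, count_b, b_was_added, letter_counts)
```

Note that .get() by itself never changes the dictionary: the key only shows up once we assign to it.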
With this change, we could create letter_counts_take3 as:
import string

def letter_counts_take3(text):
    """ ... """
    letter_counts = {}
    for letter in text:
        if letter in string.whitespace or letter in string.punctuation:
            continue
        letter = letter.upper()
        running_count = letter_counts.get(letter, 0)
        letter_counts[letter] = running_count + 1
    return letter_counts
Under the same simplification tone, avoiding branches, we might feel tempted to find a solution to eliminate the remaining if in the counting loop. We will look at a possible solution, as we progress. For now, let’s dive into the exploration of built-in and Standard Library tools that can be of use.
Bringing in some Tools
The letter counting solutions we’ve created use a Python dictionary to track the per-letter counts, where the keys are letters and the associated values are the respective letter counts. This is, after all, the perfect use case for dictionaries. One thing we had to be careful about, though, was the problem of per-letter count initialization:
- We started off using an if statement.
  It initialized the per-letter count to 1, if the given letter wasn’t accounted for yet, or incremented it, otherwise.
- We later changed it to a more linear approach.
  Getting letter counts using the dictionary’s .get() method, defaulting to 0 when unaccounted for, then incrementing it.
The Standard Library’s collections module includes a very useful type (unknown to many, I’d say) with the sole purpose of simplifying dictionary value initialization: the defaultdict. Let’s take a short break from counting things and learn a little bit about it.
collections.defaultdict
defaultdict objects behave just like the built-in Python dictionaries with a single difference: whenever a non-existing key is read, instead of raising a KeyError exception, the defaultdict automatically creates a default value, associates it with the key, and returns it. In other words, when we read a key it either (i) exists and we get its value, or (ii) it’s created for us, assigned with a default value.
The default values used by defaultdict objects are obtained from a “value factory”: a function¹ that takes no arguments, passed to defaultdict at creation time; when the defaultdict needs to create a new default value, it calls that function and whichever value is returned is what it uses.
Let’s see it in action, step by step:
>>> from collections import defaultdict
>>> def my_factory():
...     return 42
...
>>> dd = defaultdict(my_factory)
In an interactive Python session, we created the my_factory function and a new defaultdict (imported from the collections module) that will use that function to create default values; note the function name passed in as an argument to defaultdict, at creation time, in the last line above. It starts with no keys, and seems to behave like a normal dictionary:
>>> len(dd) # Just created, no keys.
0
>>> dd['A'] = 21 # Add a key as in a regular dictionary.
>>> dd['A']
21
>>> len(dd) # One key, now, as expected.
1
Let’s now see what happens when non-existing keys are accessed:
>>> 'B' in dd # The dictionary does not contain the 'B' key.
False
>>> dd['B'] # It is created, with the default value, on first access.
42
>>> 'B' in dd # The dictionary now contains the 'B' key.
True
The 'B' key wasn’t initially in the dictionary (we knew that already, but confirmed it first, nonetheless). We then asked the defaultdict for the value of that non-existing key. Holding no such key, the defaultdict called its “value factory” to produce a default value, associating it to the requested key, and finally returning it. In our example, it called my_factory, passed in when creating the defaultdict object, which always returns 42. That’s why dd['B'] evaluates to 42.
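One detail worth keeping in mind: it is only the square-bracket access that triggers the “value factory”. Membership tests and the .get() method behave as in a plain dictionary. A minimal sketch, where the lambda is just a stand-in of mine for the my_factory function:

```python
from collections import defaultdict

dd = defaultdict(lambda: 42)     # the lambda stands in for my_factory

in_check = 'X' in dd             # False, and does not create 'X'.
got = dd.get('X')                # None: .get() bypasses the factory.
keys_after_get = list(dd)        # Still no keys.

value = dd['X']                  # Square-bracket access calls the factory.
keys_after_subscript = list(dd)  # Now 'X' is there.

print(in_check, got, keys_after_get, value, keys_after_subscript)
```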
An equally interesting use case is:
>>> 'C' in dd # The dictionary does not contain the 'C' key.
False
>>> dd['C'] += 1000 # Three-steps: get dd['C'], add 1000, assign result to dd['C'].
>>> dd['C']
1042
Note how the augmented assignment += operator worked with the non-existing 'C' key. Even though many would call it an “in-place” addition, that is not what’s going on under the covers (it’s more of a rebinding operation, but let’s not get distracted by that now). It can, however, be thought of as something equivalent to dd['C'] = dd['C'] + 1000, which, hopefully, helps in understanding why dd['C'] += 1000 works and results in 1042:
- The initial value of dd['C'] needs to be determined.
  Being a non-existent key, the defaultdict creates it automatically, assigning it the default value of 42.
- Then 1000 is added to that value.
- The resulting 1042 is assigned back to dd['C'].
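Spelled out as separate statements, the three steps look roughly like this. It is only a sketch to make the sequence visible; the lambda again stands in for my_factory, and the variable names are mine:

```python
from collections import defaultdict

dd = defaultdict(lambda: 42)

current = dd['C']        # Step 1: missing key, so the factory supplies 42.
total = current + 1000   # Step 2: plain integer addition.
dd['C'] = total          # Step 3: the result is assigned back to the key.

# The augmented assignment gives the same result in one statement.
other = defaultdict(lambda: 42)
other['C'] += 1000

print(dd['C'], other['C'])
```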
Very often, defaultdict objects are used with built-in types as their “value factories”, like int and list, for example. Since calling a given type returns an object of that type (int() returns 0, list() returns an empty list [], etc.) these prove to be very useful in common counting or grouping use cases. Let’s check two examples of that.
A simple counter can be created from a defaultdict(int):
>>> from collections import defaultdict
>>> int()
0
>>> counter = defaultdict(int)
>>> counter['A'] # How many 'A's have we counted so far?
0
>>> counter['B'] += 1 # Count one 'B'.
>>> counter['B'] += 1 # Count another 'B'.
>>> counter['B'] # How many 'B's have we counted so far?
2
A slightly more sophisticated object, keeping track of different groups of things, can be created from a defaultdict(list). Here’s an example using strings, grouping animal names by their first letter (but using and grouping other object types is equally possible):
>>> from collections import defaultdict
>>> list()
[]
>>> grouper = defaultdict(list)
>>> grouper['A'] # What's in group 'A'?
[]
>>> grouper['B'].append('bee') # Add 'bee' to group 'B'.
>>> grouper['B'].append('bear') # Add 'bear' to group 'B'.
>>> grouper['B'] # What's in group 'B'?
['bee', 'bear']
>>> grouper['C'].append('cat') # Add 'cat' to group 'C'.
>>> grouper['C'] # What's in group 'C'?
['cat']
Presenting defaultdicts without showing at least one other short and powerful use case would be oversimplifying a lot, and a severe understatement of their capabilities. Hopefully, this will inspire you to find more uses for them.
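In real code the grouping would typically happen in a loop rather than one call at a time. A short sketch, with a made-up list of animal names:

```python
from collections import defaultdict

animals = ['bee', 'cat', 'bear', 'ant', 'cow']   # made-up sample data

by_initial = defaultdict(list)
for name in animals:
    # The first access to a new initial creates an empty list to append to.
    by_initial[name[0].upper()].append(name)

print(dict(by_initial))
```

There is no need to check whether a group already exists before appending to it, just like there was no need to initialize the counts before incrementing them.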
With the defaultdict interlude completed, let’s get back to the letter counting function and see how we can simplify it further. Here’s what we had so far (copied here, no need to scroll up/down):
import string

def letter_counts_take3(text):
    """ ... """
    letter_counts = {}
    for letter in text:
        if letter in string.whitespace or letter in string.punctuation:
            continue
        letter = letter.upper()
        running_count = letter_counts.get(letter, 0)
        letter_counts[letter] = running_count + 1
    return letter_counts
With what we now know about defaultdicts we can make it shorter and more readable, which is always a good thing. Moreover, relying on the existing Standard Library code, widely used and tested, is also a good principle: the more we get from it, the less we need to code and, importantly, the less we need to verify and test. Here’s our take, with letter_counts_take4:
from collections import defaultdict
import string

def letter_counts_take4(text):
    """ ... """
    letter_counts = defaultdict(int)  # Better counter than a plain dictionary.
    for letter in text:
        if letter in string.whitespace or letter in string.punctuation:
            continue
        letter_counts[letter.upper()] += 1  # No need to initialize: just increment the count.
    return letter_counts
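One thing to keep in mind with this version is that it hands a defaultdict back to the caller, so merely reading a missing letter with square brackets adds that letter, with a count of 0, to the result. If that is unwanted, a possible variation converts the result to a plain dictionary before returning it. This is a sketch of mine, and letter_counts_take4_plain is a hypothetical name, not part of the module built in this article:

```python
from collections import defaultdict
import string

def letter_counts_take4_plain(text):
    """Hypothetical variant of letter_counts_take4 returning a plain dict."""
    letter_counts = defaultdict(int)
    for letter in text:
        if letter in string.whitespace or letter in string.punctuation:
            continue
        letter_counts[letter.upper()] += 1
    # Callers get a regular dictionary: lookups no longer create keys.
    return dict(letter_counts)

counts = letter_counts_take4_plain('Hello there!')
z_count = counts.get('Z', 0)     # Reading a missing letter...
z_was_added = 'Z' in counts      # ...leaves the result untouched.

print(counts, z_count, z_was_added)
```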
Ok, it’s looking better and better. Can it be improved?
Well, it probably can, in several ways. There are two intertwined things that stand out, to me:
- On one hand, we have an if statement, used to discard non-letters. Code with no ifs is always simpler; how could we change it?
- On the other, there is the fact that we have a pretty much hard-coded non-letter condition on the if. Could we avoid it?
Then, there is at least another one, not so obvious to many, but that we have alluded to before:
- We’re using a hard-coded letter transformation with letter.upper() to combine upper-/lower-case letters into common counts. This does not support, for example, conflating accented letter counts with their non-accented variations. How could it be improved?
These are possible improvements both in the code design and capabilities domains, not strictly Python things. Let’s explore them and see what we can come up with, starting with the idea of replacing the if with something else that doesn’t involve branching.
A cursory glance at the code tells us that if we are to eliminate the non-letter detection if statement in the loop, we will need to do the filtering beforehand. That means something along the lines of creating a filtered_text object, copied from text, where any non-letters are discarded. We could try to do that with a list comprehension…
filtered_text = [l for l in text if l not in string.whitespace and l not in string.punctuation]
…but that would be misleading and wrong (and not really computationally efficient, but that’s a whole other topic): the fact is that the if statement we’re trying to eliminate is still somewhat half-hidden in the list comprehension. If we are to get rid of branches, this is not it.
A possible solution is in the little-used str.translate method. From a sneak peek at the docstring we gather that:
- It takes a translation table as its single argument.
- It returns a copy of the string in which each character has been mapped through the given translation table.
- Characters mapped to None are deleted.
Let’s take another short break, now to explore and learn how str.translate can help.
str.translate and str.maketrans
The str.translate method creates a new string after performing character-based substitutions on the string it is called on. For that, it needs a translation table, defining which characters should be substituted by which.
The translation table is actually a Python dictionary where both keys and values are Unicode code points²: the keys define which characters should be substituted, the values indicate the substitution character. Characters with Unicode code points absent from the dictionary keys will be left untouched by str.translate.
Let’s give it a simple test, using the built-in ord function to get Unicode code points for individual characters:
>>> xlate_table = {
...     ord('.'): ord('!'),
...     ord('w'): ord('W'),
... }
>>> 'Hello world...'.translate(xlate_table)
'Hello World!!!'
It seems to work: the '.'s were replaced by '!'s and the lower-case 'w' by an upper-case 'W', in a single pass. But creating the translation table was pretty tedious and repetitive (imagine doing that with more than a few substitutions!). That’s precisely the point of str.maketrans: simplifying the creation of translation tables for str.translate.
str.maketrans can be invoked in various ways: one such way takes two equal-length string arguments, representing a translation table, stating that the characters in the first string should be replaced with the characters in the same position in the second string. Let’s try it out:
>>> replace_these_ones = '.w'
>>> with_these_instead = '!W'
>>> xlate_table = str.maketrans(replace_these_ones, with_these_instead)
>>> 'Hello world...'.translate(xlate_table)
'Hello World!!!'
The variables replace_these_ones and with_these_instead are there to help visualize the translation, vertically aligned in the code. Of course str.maketrans('.w', '!W') would have been perfectly acceptable, if not preferred.
One last thing we should try with str.translate is exploring its capability of removing characters during the translation process; recall, per its docstring, “Characters mapped to None are deleted”. Fortunately, str.maketrans can deal with that as well: passing in a third string argument will produce a translation table that discards any characters it contains:
>>> xlate_table = str.maketrans('.w', '!W', ' ')
>>> 'Hello world...'.translate(xlate_table)
'HelloWorld!!!'
How about creating a translation table that just removes characters, say the ' ' whitespace and the '.'?
>>> filter_table = str.maketrans('', '', ' .')
>>> 'Hello world...'.translate(filter_table)
'Helloworld'
Works perfectly, all we had to do was pass in two zero-length strings as the translation mapping strings — meaning don’t translate.
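If you are curious about what str.maketrans actually builds in this case, it is just a dictionary mapping code points to None. The sketch below compares it with a hand-written equivalent (manual_table is a name I made up for the comparison):

```python
# The table built by str.maketrans for "delete these characters"...
filter_table = str.maketrans('', '', ' .')

# ...is the same as mapping each character's code point to None by hand.
manual_table = {ord(' '): None, ord('.'): None}
tables_match = filter_table == manual_table

result = 'Hello world...'.translate(manual_table)

print(filter_table, tables_match, result)
```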
With what we’ve just seen about str.translate and str.maketrans, we’re ready to take a stab at eliminating the remaining if in our code. Here’s what letter_counts_take5 can look like:
from collections import defaultdict
import string

def letter_counts_take5(text):
    """ ... """
    letter_filter = str.maketrans('', '', string.whitespace + string.punctuation)
    filtered_text = text.translate(letter_filter)
    letter_counts = defaultdict(int)
    for letter in filtered_text:
        letter_counts[letter.upper()] += 1
    return letter_counts
Could we say it’s better now? I say we can: we have the same amount of code, in line count, but we got rid of a branching statement inside the loop, giving us simpler code that is easier to test and easier to manage. The cost is consuming additional RAM for the filtered_text string and using str.translate and str.maketrans, which may not be 100% familiar to every Pythonista out there (but are always a help(str) away, thus hardly a cost). The benefit is less branching (in our code, of course; str.translate will need to branch somehow, but we don’t care) and, probably, increased performance given that str.translate is implemented in C, very well suited to tight loop processing, as is the case.
What next? Well, expanding on the str.translate idea, we can avoid the letter.upper() call in every loop iteration, striving for a single-pass upper-casing and character filtering operation. Here’s a possible implementation:
from collections import defaultdict
import string as s

def letter_counts_take6(text):
    """ ... """
    # Upper-case the 26 English alphabet letters and discard whitespace and punctuation.
    xlate_table = str.maketrans(s.ascii_lowercase, s.ascii_uppercase, s.whitespace + s.punctuation)
    just_letters = text.translate(xlate_table)
    letter_counts = defaultdict(int)
    for letter in just_letters:
        letter_counts[letter] += 1
    return letter_counts
Whether this is a good idea or not depends on our ultimate purpose and established requirements. Attentive readers may notice that, now using str.translate from string.ascii_lowercase to string.ascii_uppercase, this version does not combine lower-/upper-case variations of accented letters: it now counts 'é' and 'É' separately, while the previous one combined them (neither one merges the 'é', 'É' and 'e' counts, though). We could build a translation table that, in a single pass, would have str.translate upper-casing and stripping accents from all letters in a given domain, while also filtering out unwanted whitespace and punctuation, but there’s no immediate solution to that (more on this later).
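The difference between the two approaches is easy to see on a short string with accented letters. This is a sketch that only reproduces the filtering and upper-casing steps of the two previous takes, on a sample string of my own:

```python
import string

# Take 6 style: one table that upper-cases ASCII letters and filters.
xlate_table = str.maketrans(
    string.ascii_lowercase,
    string.ascii_uppercase,
    string.whitespace + string.punctuation,
)
# Take 5 style: a filter-only table, followed by str.upper().
filter_table = str.maketrans('', '', string.whitespace + string.punctuation)

sample = 'Caf\u00e9, caf\u00c9!'   # 'Café, cafÉ!', a made-up sample

via_table = sample.translate(xlate_table)            # 'é' is left as is.
via_upper = sample.translate(filter_table).upper()   # 'é' becomes 'É'.

print(via_table, via_upper)
```

With the single table, the lower-case 'é' survives untouched because it is not one of the 26 letters in string.ascii_lowercase.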
What the letter_counts_take6 formulation suggests, given its barebones letter counting loop, is simplifying it further using yet another tool in the Standard Library’s collections module: the Counter.
collections.Counter
Counter objects are like dictionaries, having keys and associated values, with the single purpose of counting things. The idea is simple and pretty much in line with the letter counting dictionaries and defaultdicts we’ve been using: Counter object keys represent the things to be counted, and their values represent the respective counts.
One common use of Counter objects is creating them with a single iterable argument, whose items must be hashable (they will become keys in the dictionary-like Counter, after all). In this case, items in the iterable will immediately be counted. Here’s an example of that:
>>> from collections import Counter
>>> numbers = [1, 1, 2, 2, 2, 2, 3]
>>> c = Counter(numbers)
>>> c[2] # What's the count for 2?
4
>>> c # Give me all the counts.
Counter({2: 4, 1: 2, 3: 1})
Of course, it also works with other iterables, like strings, counting letters thus:
>>> from collections import Counter
>>> letters = 'aabbbc'
>>> c = Counter(letters)
>>> c['a'] # What's the count for 'a'?
2
>>> c # Give me all the counts.
Counter({'b': 3, 'a': 2, 'c': 1})
>>> c['x'] # The count of an unknown is 0.
0
Counter objects support several useful counting-related methods and operations, and I would recommend you take a look at them, if only for a brief moment, if you’re not familiar with them.
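As a quick taste of those methods and operations, here is a small sketch with made-up inputs, showing most_common, update and the addition of two Counter objects:

```python
from collections import Counter

c = Counter('aabbbc')

# The n most frequent items, as (item, count) pairs, highest count first.
top_two = c.most_common(2)

# Count more items into an existing Counter.
c.update('ccd')
total_counted = sum(c.values())

# Counters can be added together, combining their counts.
combined = Counter('ab') + Counter('bc')

print(top_two, c, total_counted, combined)
```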
With this knowledge, the plain letter counting loop and its associated defaultdict, from letter_counts_take6, can be replaced with a Counter object, doing precisely the same thing:
from collections import Counter
import string as s

def letter_counts_take7(text):
    """ ... """
    xlate_table = str.maketrans(s.ascii_lowercase, s.ascii_uppercase, s.whitespace + s.punctuation)
    just_letters = text.translate(xlate_table)
    return Counter(just_letters)
This is definitely shorter and simpler: there is no branching or looping in our code. It’s hidden away, and handled by existing, widely tested, probably faster code, in the built-in and Standard Library tools. What’s not to like about it?
There certainly still is the fact that the letter grouping, via upper-casing, and the filtering are hard-coded in the str.maketrans call. That’s something we’d probably like to tackle. But let’s step back for a moment, take a deep breath, and review our progress.
Progress Review
To summarize the coding journey so far, I’ve created the following table highlighting a few indicators for each successive take on the letter_counts_... function, including notes about the changes in each:
Take | Code | Branches | Loops | Notes |
---|---|---|---|---|
1 | 7 lines | 1x if | 1x for | Prototype: non-letters counted, upper-/lower-case not combined. |
2 | 10 lines | 2x if | 1x for | First working version. |
3 | 8 lines | 1x if | 1x for | No branching in initializing the letter_counts dictionary. |
4 | 6 lines | 1x if | 1x for | Using a defaultdict instead of a plain dictionary. |
5 | 6 lines | None | 1x for | No branching in letter filtering, using str.translate instead. |
6 | 6 lines | None | 1x for | Leverage str.translate to combine upper-/lower-case counts. |
7 | 3 lines | None | None | Don’t loop explicitly, use a Counter instead. |
Note how, from the first working version onwards, the line count never went up. Likewise, the number of branches and loops was also progressively reduced. This is a good indicator: short, linear code is always easier to understand and maintain. It’s also eventually faster. Let’s check that.
To measure the performance of each implementation, three differently sized text inputs were used:
- Small - The string 'Hello there!'.
- Medium - The string 'You don’t know about me, without you have read a book by the name of The Adventures of Tom Sawyer; but that ain’t no matter. That book was made by Mr. Mark Twain, and he told the truth, mainly.', the first sentences in Mark Twain’s Adventures of Huckleberry Finn.
- Large - 100 concatenations of the medium string.
Then, using the timeit module in the Standard Library, various command lines like this one (with all combinations of input text sizes against all letter_counts_... takes) were scripted and run (broken here into three different lines, for the sake of readability):
$ python3 -m timeit \
-s 'from letter_counts import letter_counts_take1; text="Hello there!"' \
-c 'letter_counts_take1(text)'
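The same kind of measurement can also be made from Python code, with the timeit module’s functions. The sketch below is not the script used for the results that follow: count_letters is a hypothetical stand-in, so that the example runs without the letter_counts module, and the number of runs is arbitrary:

```python
import timeit

def count_letters(text):
    """Hypothetical stand-in for one of the letter_counts_... takes."""
    counts = {}
    for letter in text:
        counts[letter] = counts.get(letter, 0) + 1
    return counts

text = 'Hello there!'
runs = 1000

# timeit.repeat returns one total duration, in seconds, per repetition;
# the minimum is the usual choice for estimating the per-call time.
durations = timeit.repeat(lambda: count_letters(text), number=runs, repeat=3)
best_per_call = min(durations) / runs

print(f'{best_per_call * 1e6:.2f} μs per call')
```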
Here are the results for the durations of each function call and the relative duration of subsequent takes vs. the first working take:
Take | Small | vs. Take 2 | Medium | vs. Take 2 | Large | vs. Take 2 | Notes |
---|---|---|---|---|---|---|---|
1 | 1.31μs | | 22.1μs | | 2.42ms | | Prototype: non-letters counted, upper-/lower-case not combined. |
2 | 4.21μs | 1.00x | 64.5μs | 1.00x | 6.84ms | 1.00x | First working version. |
3 | 5.18μs | 1.23x | 73.2μs | 1.13x | 7.47ms | 1.09x | No branching in initializing the letter_counts dictionary. |
4 | 5.23μs | 1.24x | 63.5μs | 0.98x | 6.13ms | 0.90x | Using a defaultdict instead of a plain dictionary. |
5 | 6.27μs | 1.49x | 52.8μs | 0.82x | 4.97ms | 0.73x | No branching in letter filtering, using str.translate instead. |
6 | 5.88μs | 1.40x | 28.3μs | 0.44x | 2.26ms | 0.33x | Leverage str.translate to combine upper-/lower-case counts. |
7 | 7.98μs | 1.90x | 22.4μs | 0.35x | 1.65ms | 0.24x | Don’t loop explicitly, use a Counter instead. |
A few general observations:
- When the input text is small, code-wise improvements do not result in performance improvements.
  Almost all code improvement steps lead to progressively slower execution times.
- When the input text grows in size to medium or large, though, improvements are clear.
  Take 7 (the simplest code-wise version so far) takes <25% of the time of Take 2, the first working version, when the input is large enough.
- This indicates that there is a constant overhead, not proportional to the input text size, being introduced at each step.
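One plausible contributor to that fixed cost, and this is an assumption of mine that was not measured for this article, is that the later takes rebuild the translation table with str.maketrans on every call. If that turns out to matter for small inputs, a possible variation builds the table once, at import time. Both _XLATE_TABLE and letter_counts_precomputed are hypothetical names, not part of the module:

```python
from collections import Counter
import string

# Built once, when the module is imported, instead of on every call.
_XLATE_TABLE = str.maketrans(
    string.ascii_lowercase,
    string.ascii_uppercase,
    string.whitespace + string.punctuation,
)

def letter_counts_precomputed(text):
    """Hypothetical variant of letter_counts_take7 with a shared table."""
    return Counter(text.translate(_XLATE_TABLE))

counts = letter_counts_precomputed('Hello there!')
print(counts)
```

Whether this pays off would have to be confirmed with timeit, like the other takes.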
In more detail, focusing on medium and large text input size measurements:
- Avoiding the branching, in Take 3, using dict.get(), introduced a performance penalty.
- Using a defaultdict, in Take 4, compensated for that and shows the first speed improvement, even if modest.
- The biggest speed jump was obtained by using str.translate to filter and conflate upper-/lower-case letters in a single pass.
  For medium and large text inputs Take 6 is more than twice as fast as Take 4.
- Using Counter in Take 7, once the counting process was simple enough, gave us an extra, effortless speed boost.
So, where do we stand? We have a very short and clean implementation in letter_counts_take7, that is also the fastest so far, as long as the input text is large enough: ~200 characters or more, per the medium sized input text measurements above. Importantly, it does meet the initially defined requirements; let’s not forget that.
Before calling the counting process done, and moving on to exploring ideas on output generation, there is one more thing we should address (yes, even though our code is short and fast and meets the requirements): the fact that the letter filtering and grouping is based on non-changeable, hard-coded values, using string.ascii_lowercase and its friends. This is unnecessarily limiting and, in general, a bad coding practice.
In the next article in this series we’ll explore just that, hopefully getting to a point where we’re satisfied with the letter-counting process, finally moving on to producing nice and accessible human-readable output. As we will see, eliminating the hard-coded parts in our function will lead us to a path of Unicode and API design exploration: two topics with a lot to be said by themselves.
We’ll see where that takes us.
[ 2018-04-29 UPDATE: Follow-up article here. ]
Wrap up
This article kicks off a Python journey, motivated by the topic of counting letters. We’ve built several versions of letter counting functions, progressively simpler, using powerful built-in and Standard Library tools. Knowing them and how they can be applied to a given problem is a valuable skill. Like with other skills, mastering it takes time; don’t feel rushed, just keep at it, at your own pace. We’ve also highlighted a few universal coding principles: these are not related to knowing about, or using, any specific tools; they’re not even Python specific.
Summarizing, I would highlight:
- Counting things, like many computational problems, looks simple.
  It often is, but the need for filtering and processing is common, which introduces complexity.
  The devil is in the detail, a very appropriate idiom here: the more requirements, the more prone to complexity a given problem becomes. Creating simple code to solve complex problems is, by definition, not a simple task, but one we should strive for. If possible, agreeing on the simplest requirements possible is a good idea, while coding with an eye into the future to accommodate change.
- The Python Standard Library includes many useful tools, always at hand.
  Learning its ways is more of a process than an objective in itself. The collections module contains, among other useful types, the defaultdict and the Counter, which we explored and learned about: they both help in writing simpler, shorter and faster code. While the Counter is very much a single-purpose thing, the defaultdict has many useful applications, including counting, grouping, and more. The string module (not to be confused with the str built-in type, also very powerful) includes a few little nuggets that, again, we can build upon and use to simplify existing code.
- The simpler the code, the better.
  Less code is always better code: there’s less to read and understand, less to be executed, less to be maintained and changed in the future. It will also tend to have fewer bugs, assuming that, statistically, bug counts per line are constant, in a given setting. Simpler code also means less branching and looping: linear code (with no ifs, fors or whiles) is easier to grasp and test. We can’t always create it like so, but trying to minimize those is a good long term practice. Simpler code is faster: for both humans and computers.
Thanks for reading. See you soon.
1. A function or, more generically, a callable: in layman’s terms, a callable is an object we can call by appending parenthesized arguments to it (a few examples: the len function is a callable, being called with len(...); classes are callable, returning objects of that class, like the built-in list which is callable with list(), returning a new list object).
2. Simply put, a Unicode code point is a number representing a character. To learn more, refer to this rather simplistic Wikipedia article and to the comprehensive Unicode documentation, maybe starting with the glossary.