A Python Journey on Counting: Foundation

Wed 28 March 2018
Python
#standard-library, #beginner, #intermediate

Counting things is an apparently trivial computational problem. In this article series — more of a journey than a short single-topic read — I’ll go over the process of producing a letter-based histogram from a piece of text, as a motive to progressively introduce several useful built-in and Standard Library tools.

Along the way I’ll share a few tips and thoughts on coding principles and related topics, like Unicode text processing, and, for completeness, ideas on possible output production.

The key takeaway is that Python has many built-in and Standard Library tools that, used in the right combination, lead to simpler, shorter code that is correct, easier to understand and manage, and often faster as well.

The Task

Say we want to produce a histogram representing the letter frequency (or counts) in a given piece of text. In a simple text analysis scenario, following good coding practices, we’d like to:

Determine per-letter counts, regardless of the original letter case in the text.
A capital “A” should be accounted for just like a lower case “a”, as yet another occurrence of the letter “A”.
Discard any non-letters, like punctuation and whitespace.
Produce a relatively nice human-readable result.
Have clean, fast, robust and easy to read and modify code.

Being a “Journey on Counting”, we’ll mostly focus on counting and the many possible solutions (and challenges!) around it. Leaving the output production to a later stage, separating the counting process from the output handling, is a good mental and code design practice.

First Takes

As a first approach to a solution we create a letter_counts.py module with the letter_counts_take1 function:

def letter_counts_take1(text):
    """
    Returns a dict where:
    - keys are letters (strings of length 1).
    - values are letter counts, from the text argument.
    """
    letter_counts = {}
    for letter in text:
        if not letter in letter_counts:
            # Haven't seen this letter yet: start counting at 1.
            letter_counts[letter] = 1
        else:
            # Already tracking this letter's count: increment it.
            letter_counts[letter] += 1
    return letter_counts

We then give it a test run, in the Python interactive prompt:

$ python3 -i letter_counts.py
>>> letter_counts_take1('Hello there!')
{'H': 1, 'e': 3, 'l': 2, 'o': 1, ' ': 1, 't': 1, 'h': 1, 'r': 1, '!': 1}
>>>

It seems to be working, but it is still pretty limited:

It’s counting 'H' and 'h' separately.
It’s not discarding whitespace, nor punctuation.

For the sake of brevity, I’ll be eliding the docstring in the remaining versions of the letter_counts_... functions.

Let’s continue by creating an improved letter_counts_take2 function in our module:

import string

def letter_counts_take2(text):
    """ ... """
    letter_counts = {}
    for letter in text:
        if letter in string.whitespace or letter in string.punctuation:
            # Don't count whitespace and punctuation.
            continue
        # Combine lower-/upper-case letter occurrences.
        letter = letter.upper()
        if not letter in letter_counts:
            letter_counts[letter] = 1
        else:
            letter_counts[letter] += 1
    return letter_counts

A few notes:

I’m using whitespace and punctuation, from the Standard Library’s string module to determine whether a given letter is to be discarded or not; these are useful, predefined character strings containing common characters in the categories their names suggest.
Converting letter to its upper-case variant, and using it in the counting process, takes care of combining the counts of upper- and lower-case letters; what you may not be aware of is that the .upper() method of Python strings handles all letters, including accented and non-Latin ones: for example, 'resumé'.upper() returns 'RESUMÉ' and 'δ'.upper() returns 'Δ'.

Some readers may now be thinking how accented letter counts could be combined with their non-accented variants. Others could highlight that string.whitespace does not account for all possible whitespace in a text string. Both very legitimate ideas, which we will briefly explore later in the text.
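As a quick illustration of the second point, here is a minimal sketch using the no-break space as one example of whitespace that string.whitespace does not contain, and a curly quote as one example of punctuation missing from string.punctuation. The variable names are only for illustration:

```python
import string

nbsp = '\u00a0'          # NO-BREAK SPACE, common in text copied from web pages
curly_quote = '\u2019'   # RIGHT SINGLE QUOTATION MARK

# string.whitespace and string.punctuation only hold ASCII characters.
in_string_whitespace = nbsp in string.whitespace
in_string_punctuation = curly_quote in string.punctuation

# str.isspace() relies on Unicode character properties instead.
is_unicode_space = nbsp.isspace()

print(in_string_whitespace)    # False
print(in_string_punctuation)   # False
print(is_unicode_space)        # True
```

With the functions in this article, such characters would end up being counted as if they were letters.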

Being orthogonal to the counting process itself, these improvements on text processing reflect common, real-world constraints on counting things: they often need to be filtered and/or pre-processed before the actual counting is done. Let’s do a quick check on how letter_counts_take2 behaves, using the Python interactive prompt, and bring our attention back to counting:

$ python3 -i letter_counts.py
>>> letter_counts_take2('Hello there!')
{'H': 2, 'E': 3, 'L': 2, 'O': 1, 'T': 1, 'R': 1}
>>>

It’s definitely looking better: we no longer have non-letters in the result, and the 'H' count is now 2, as expected — problem solved!

Can we do better than this? We certainly can.

One common source of code complexity is branching. In simple terms, the more ifs in a code block, the more complex it is, making it harder to reason about, making it more difficult — if not completely unfeasible — to test all possible execution paths, etc. We could argue that the code is small and simple enough and that it should be left as is. We could also, of course, counter argue that, one day, in the future, it will need changes and improvements, to support new capabilities, or to address edge cases it may not handle correctly. That is the day when this small and simple code will grow. And we’d like to have it as simple as possible. All the time.

One improvement would convert this if, checking whether or not a given letter is already accounted for, …

        if not letter in letter_counts:
            letter_counts[letter] = 1
        else:
            letter_counts[letter] += 1

…into something simpler and linear, like:

        running_count = letter_counts.get(letter, 0)
        letter_counts[letter] = running_count + 1

Using the dictionary’s .get(key[, default]) method we pass in 0 as the default running count for letter, which ensures we get the actual running count for any letters already in the dictionary, and 0 for the ones not accounted for yet. Updating letter counts is then a simple addition and assignment away.
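A small, self-contained sketch of that behaviour, using a hand-made dictionary rather than the one built inside the function:

```python
letter_counts = {'A': 2}

known = letter_counts.get('A', 0)     # key exists: its value is returned
unknown = letter_counts.get('B', 0)   # key missing: the default is returned
print(known)                          # 2
print(unknown)                        # 0
print('B' in letter_counts)           # False: .get() never adds keys

# The linear counting step, applied to two letters.
for letter in 'AB':
    running_count = letter_counts.get(letter, 0)
    letter_counts[letter] = running_count + 1

print(letter_counts)                  # {'A': 3, 'B': 1}
```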

With this change, we could create letter_counts_take3 as:

import string

def letter_counts_take3(text):
    """ ... """
    letter_counts = {}
    for letter in text:
        if letter in string.whitespace or letter in string.punctuation:
            continue
        letter = letter.upper()
        running_count = letter_counts.get(letter, 0)
        letter_counts[letter] = running_count + 1
    return letter_counts

Under the same simplification tone, avoiding branches, we might feel tempted to find a solution to eliminate the remaining if in the counting loop. We will look at a possible solution, as we progress. For now, let’s dive into the exploration of built-in and Standard Library tools that can be of use.

Bringing in some Tools

The letter counting solutions we’ve created use a Python dictionary to track the per-letter counts, where the keys are letters and the associated values are the respective letter counts. This is, after all, the perfect use case for dictionaries. One thing we had to be careful about, though, was the problem of per-letter count initialization:

We started off using an if statement.
It initialized the per-letter count to 1, if the given letter wasn’t accounted for yet, or incremented it, otherwise.
We later changed it to a more linear approach.
Getting letter counts using the dictionary’s .get() method, defaulting to 0 when unaccounted for, then incrementing it.

The Standard Library’s collections module includes a very useful type — I’d say, unknown to many — with the sole purpose of simplifying dictionary value initialization — the defaultdict. Let’s take a short break from counting things and learn a little bit about it.

`collections.defaultdict`

defaultdict objects behave just like the built-in Python dictionaries with a single difference: whenever a non-existing key is read, instead of raising a KeyError exception, the defaultdict automatically creates a default value, associates it with the key, and returns it. In other words, when we read a key it either (i) exists and we get its value, or (ii) it’s created for us, assigned with a default value.

The default values used by defaultdict objects are obtained from a “value factory” — a function¹ that takes no arguments, passed to defaultdict at creation time; when the defaultdict needs to create a new default value, it calls that function and, whichever value is returned, is what it uses.

Let’s see it in action, step by step:

>>> from collections import defaultdict
>>> def my_factory():
...     return 42
...
>>> dd = defaultdict(my_factory)

In an interactive Python session, we created the my_factory function and a new defaultdict (imported from the collections module) that will use that function to create default values — note the function name passed in as an argument to defaultdict, at creation time, in the last line above. It starts with no keys, and seems to behave like a normal dictionary:

>>> len(dd)                 # Just created, no keys.
0
>>> dd['A'] = 21            # Add a key as in a regular dictionary.
>>> dd['A']
21
>>> len(dd)                 # One key, now, as expected.
1

Let’s now see what happens when non-existing keys are accessed:

>>> 'B' in dd               # The dictionary does not contain the 'B' key.
False
>>> dd['B']                 # It is created, with the default value, on first access.
42
>>> 'B' in dd               # The dictionary now contains the 'B' key.
True

The 'B' key wasn’t initially in the dictionary (we knew that already, but confirmed it first, nonetheless). We then asked the defaultdict for the value of that non-existing key. Holding no such key, the defaultdict called its “value factory” to produce a default value, associated it with the requested key, and finally returned it. In our example, it called my_factory (passed in when creating the defaultdict object), which always returns 42. That’s why dd['B'] evaluates to 42.

An equally interesting use case is:

>>> 'C' in dd               # The dictionary does not contain the 'C' key.
False
>>> dd['C'] += 1000         # Three-steps: get dd['C'], add 1000, assign result to dd['C'].
>>> dd['C']
1042

Note how the augmented assignment += operator worked with the non-existing 'C' key. Even though many would call it an “in-place” addition, that is not what’s going on, under the covers (it’s more of a rebinding operation, but let’s not get distracted by that, now). It can, however, be thought of as something equivalent to dd['C'] = dd['C'] + 1000, which, hopefully, helps in understanding why dd['C'] += 1000 works and results in 1042:

The initial value of dd['C'] needs to be determined.
Being a non-existent key, the defaultdict creates it automatically, assigning it the default value of 42.
Then 1000 is added to that value.
The resulting 1042 is assigned back to dd['C'].
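The steps above can be spelled out explicitly. This is only a sketch: the calls list is a hypothetical addition, there just to record when the factory runs.

```python
from collections import defaultdict

calls = []  # hypothetical bookkeeping, to observe factory invocations

def my_factory():
    calls.append('factory called')
    return 42

dd = defaultdict(my_factory)

# What dd['C'] += 1000 amounts to, one step at a time:
current = dd['C']             # missing key: factory runs, 42 is stored and returned
new_value = current + 1000    # plain integer addition
dd['C'] = new_value           # result assigned back to the key

print(dd['C'])                # 1042
print(calls)                  # ['factory called']

# Membership tests and .get() do not trigger the factory.
print('D' in dd)              # False
print(dd.get('D'))            # None
print(len(calls))             # 1
```

Only subscript reads of a missing key, as in dd['C'], create the default value.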

Very often, defaultdict objects are used with built-in types as their “value factories”, like ints and lists, for example. Since calling a given type returns an object of that type — int() returns 0, list() returns an empty list [], etc. — these prove to be very useful in common counting or grouping use cases. Let’s check two examples of that.

A simple counter can be created from a defaultdict(int):

>>> from collections import defaultdict
>>> int()
0
>>> counter = defaultdict(int)
>>> counter['A']                    # How many 'A's have we counted so far?
0
>>> counter['B'] += 1               # Count one 'B'.
>>> counter['B'] += 1               # Count another 'B'.
>>> counter['B']                    # How many 'B's have we counted so far?
2

A slightly more sophisticated object, keeping track of different groups of things, can be created from a defaultdict(list). Here’s an example using strings, grouping animal names by their first letter (but using and grouping other object types is equally possible):

>>> from collections import defaultdict
>>> list()
[]
>>> grouper = defaultdict(list)
>>> grouper['A']                    # What's in group 'A'?
[]
>>> grouper['B'].append('bee')      # Add 'bee' to group 'B'.
>>> grouper['B'].append('bear')     # Add 'bear' to group 'B'.
>>> grouper['B']                    # What's in group 'B'?
['bee', 'bear']
>>> grouper['C'].append('cat')      # Add 'cat' to group 'C'.
>>> grouper['C']                    # What's in group 'C'?
['cat']

This example diverges a little from the “counting things” topic, agreed. Introducing defaultdicts without at least showing one other short and powerful use case, however, would be oversimplifying a lot, and a severe understatement of their capabilities. Hopefully, this will inspire you to find more uses for them.

With the defaultdict interlude completed, let’s get back to the letter counting function and see how we can simplify it further. Here’s what we had so far (copied here, no need to scroll up/down):

import string

def letter_counts_take3(text):
    """ ... """
    letter_counts = {}
    for letter in text:
        if letter in string.whitespace or letter in string.punctuation:
            continue
        letter = letter.upper()
        running_count = letter_counts.get(letter, 0)
        letter_counts[letter] = running_count + 1
    return letter_counts

With what we now know about defaultdicts we can make it shorter and more readable, which is always a good thing. Moreover, relying on the existing Standard Library code, widely used and tested, is also a good principle: the more we get from it, the less we need to code and, importantly, the less we need to verify and test. Here’s our take, with letter_counts_take4:

from collections import defaultdict
import string

def letter_counts_take4(text):
    """ ... """
    letter_counts = defaultdict(int)            # Better counter than a plain dictionary.
    for letter in text:
        if letter in string.whitespace or letter in string.punctuation:
            continue
        letter_counts[letter.upper()] += 1      # No need to initialize: just increment the count.
    return letter_counts
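One thing worth keeping in mind is that this version returns a defaultdict, not a plain dictionary. A short sketch of what that means for callers, repeating the function so the example runs on its own:

```python
from collections import defaultdict
import string

def letter_counts_take4(text):
    letter_counts = defaultdict(int)
    for letter in text:
        if letter in string.whitespace or letter in string.punctuation:
            continue
        letter_counts[letter.upper()] += 1
    return letter_counts

counts = letter_counts_take4('Hello there!')

# A defaultdict compares equal to a plain dict holding the same items.
same_as_before = counts == {'H': 2, 'E': 3, 'L': 2, 'O': 1, 'T': 1, 'R': 1}
print(same_as_before)         # True

# Converting to a plain dict freezes the keys seen so far.
plain = dict(counts)

# Reading an unknown key on the defaultdict silently adds it, with a 0 count.
print(counts['Z'])            # 0
print('Z' in counts)          # True
print('Z' in plain)           # False
```

Whether that side effect matters depends on how the result is used; returning dict(letter_counts) would be one way to avoid it.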

Ok, it’s looking better and better. Can it be improved?

Well, it probably can, in several ways. There are two intertwined things that stand out, to me:

On one hand, we have an if statement, used to discard non-letters. Code with no ifs is simpler; how could we change it?
On the other, there is the fact that we have a pretty much hard-coded non-letter condition in the if. Could we avoid it?

Then, there is at least another one, not so obvious to many, but that we have alluded to before:

We’re using a hard-coded letter transformation with letter.upper() to combine upper-/lower-case letters into common counts. This does not support, for example, conflating accented letter counts with their non-accented variations. How could it be improved?

These are possible improvements both in the code design and capabilities domains, not strictly Python things. Let’s explore them and see what we can come up with, starting with the idea of replacing the if with something else that doesn’t involve branching.

A cursory glance at the code tells us that if we are to eliminate the non-letter detection if statement in the loop, we will need to do the filtering beforehand. That means something along the lines of creating a filtered_text object, copied from text, where any non-letters are discarded. We could try to do that with a list comprehension…

filtered_text = [l for l in text if l not in string.whitespace and l not in string.punctuation]

…but that would be misleading and wrong (and not really computationally efficient, but that’s a whole other topic): the fact is that the if statement we’re trying to eliminate is still somewhat half-hidden in the list comprehension. If we are to get rid of branches, this is not it.

A possible solution is in the little-used str.translate method — from a sneak peek at the docstring we gather that:

It takes a translation table as its single argument.
Returns a copy of the string in which each character has been mapped through the given translation table.
Characters mapped to None are deleted.

Let’s take another short break, now to explore and learn how str.translate can help.

`str.translate` and `str.maketrans`

The str.translate method creates a new string after performing character based substitutions on the string it is called on. For that, it needs a translation table, defining which characters should be substituted by which.

The translation table is actually a Python dictionary where both keys and values are Unicode code points²: the keys define which characters should be substituted, the values indicate the substitution character. Characters with Unicode code points absent from the dictionary keys will be left untouched by str.translate.

Let’s give it a simple test, using the built-in ord function to get Unicode code points for individual characters:

>>> xlate_table = {
...     ord('.'): ord('!'),
...     ord('w'): ord('W'),
... }
>>> 'Hello world...'.translate(xlate_table)
'Hello World!!!'

It seems to work, the '.'s were replaced by '!'s and the lower-case 'w' by an upper-case 'W', in a single pass. But creating the translation table was pretty tedious and repetitive (imagine doing that with more than a few substitutions!). That’s precisely the point of str.maketrans — simplifying the creation of translation tables for str.translate.

str.maketrans can be invoked in various ways: one of them takes two equal-length string arguments, representing a translation table, stating that the characters in the first string should be replaced with the characters in the same position in the second string. Let’s try it out:

>>> replace_these_ones = '.w'
>>> with_these_instead = '!W'
>>> xlate_table = str.maketrans(replace_these_ones, with_these_instead)
>>> 'Hello world...'.translate(xlate_table)
'Hello World!!!'

I purposely created the variables replace_these_ones and with_these_instead to help visualize the translation, vertically aligned in the code. Of course str.maketrans('.w', '!W') would have been perfectly acceptable, if not preferred.

One last thing we should try with str.translate is exploring its capability of removing characters during the translation process (recall, per its docstring, “Characters mapped to None are deleted”). Fortunately, str.maketrans can deal with that as well: passing in a third string argument will produce a translation table that discards any characters it contains:

>>> xlate_table = str.maketrans('.w', '!W', ' ')
>>> 'Hello world...'.translate(xlate_table)
'HelloWorld!!!'

How about creating a translation table that just removes characters, say the ' ' whitespace and the '.'?

>>> filter_table = str.maketrans('', '', ' .')
>>> 'Hello world...'.translate(filter_table)
'Helloworld'

Works perfectly, all we had to do was pass in two zero-length strings as the translation mapping strings — meaning don’t translate.
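To connect str.maketrans back to the hand-written dictionary from earlier, here is a short sketch that inspects what it returns. It also shows the single-argument form, which accepts a dictionary keyed by characters:

```python
# Three-argument form: translate '.' and 'w', delete ' '.
table = str.maketrans('.w', '!W', ' ')

# The result is a plain dict keyed by code points; deletions map to None.
expected = {ord('.'): ord('!'), ord('w'): ord('W'), ord(' '): None}
print(table == expected)                        # True

# Single-argument form: a dict with characters as keys.
table_from_dict = str.maketrans({'.': '!', 'w': 'W', ' ': None})
result = 'Hello world...'.translate(table_from_dict)
print(result)                                   # HelloWorld!!!
```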

With what we’ve just seen about str.translate and str.maketrans, we’re ready to take a stab at eliminating the remaining if in our code. Here’s what letter_counts_take5 can look like:

from collections import defaultdict
import string

def letter_counts_take5(text):
    """ ... """
    letter_filter = str.maketrans('', '', string.whitespace + string.punctuation)
    filtered_text = text.translate(letter_filter)
    letter_counts = defaultdict(int)
    for letter in filtered_text:
        letter_counts[letter.upper()] += 1
    return letter_counts

Could we say it’s better now? I say we can: we have the same amount of code, in line count, but we got rid of a branching statement inside the loop — simpler code, easier to test, easier to manage. The cost is consuming additional RAM for the filtered_text string and using str.translate and str.maketrans which may not be 100% familiar to every Pythonista out there (but always a help(str) away, thus, hardly a cost). The benefit is less branching (in our code, of course, str.translate will need to branch somehow, but we don’t care) and, probably, increased performance given that str.translate is implemented in C, very well suited to tight loop processing, as is the case.

What next? Well, expanding on the str.translate idea, we can avoid the letter.upper() call in every loop iteration, striving for a single-pass, upper-casing and character filtering operation. Here’s a possible implementation:

from collections import defaultdict
import string as s

def letter_counts_take6(text):
    """ ... """
    # Upper-case the 26 English alphabet letters and discard whitespace and punctuation.
    xlate_table = str.maketrans(s.ascii_lowercase, s.ascii_uppercase, s.whitespace + s.punctuation)
    just_letters = text.translate(xlate_table)
    letter_counts = defaultdict(int)
    for letter in just_letters:
        letter_counts[letter] += 1
    return letter_counts

Whether this is a good idea or not depends on our ultimate purpose, and established requirements. Attentive readers may notice that, now using str.translate from string.ascii_lowercase to string.ascii_uppercase, this version does not combine lower-/upper-case variations of accented letters — it now counts 'é' and 'É' separately, while the previous one combined them (but neither one merges the 'é', 'É' and 'e' counts, though). We could build a translation table that, in a single pass, would have str.translate upper-casing and stripping accents from all letters in a given domain, while also filtering out unwanted whitespace and punctuation, but there’s no immediate solution to that (more on this, later).
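The difference is easy to check. The sketch below also includes strip_accents, a hypothetical helper based on Unicode decomposition: it is one possible way of merging accented and non-accented letters, not necessarily the approach this series ends up taking, and it does bring a hidden branch back.

```python
import string as s
import unicodedata

xlate_table = str.maketrans(s.ascii_lowercase, s.ascii_uppercase, s.whitespace + s.punctuation)

e_acute = '\u00e9'   # 'é' as a single, precomposed code point

upper_cased = e_acute.upper()                 # str.upper() knows about 'é'
translated = e_acute.translate(xlate_table)   # absent from the ASCII-only table: untouched
print(upper_cased == '\u00c9')                # True: upper-cased to 'É'
print(translated == e_acute)                  # True: still lower-case 'é'

def strip_accents(text):
    """Hypothetical helper: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('resum\u00e9').upper())   # RESUME
```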

What this formulation suggests, given the barebones letter counting loop, is simplifying it further using yet another tool in the Standard Library’s collections module: the Counter.

`collections.Counter`

Counter objects are like dictionaries, having keys and associated values, with the single-purpose of counting things. The idea is simple and pretty much in line with the letter counting dictionaries and defaultdicts we’ve been using: Counter object keys represent the things to be counted, its values represent the respective count.

One common use of Counter objects is creating them with a single iterable argument, whose items must be hashable (they will become keys in the dictionary-like Counter, after all). In this case, items in the iterable will immediately be counted — here’s an example of that:

>>> from collections import Counter
>>> numbers = [1, 1, 2, 2, 2, 2, 3]
>>> c = Counter(numbers)
>>> c[2]                            # What's the count for 2?
4
>>> c                               # Give me all the counts.
Counter({2: 4, 1: 2, 3: 1})

Of course, it also works with other iterables, like strings, counting letters thus:

>>> from collections import Counter
>>> letters = 'aabbbc'
>>> c = Counter(letters)
>>> c['a']                          # What's the count for 'a'?
2
>>> c                               # Give me all the counts.
Counter({'b': 3, 'a': 2, 'c': 1})
>>> c['x']                          # The count of an unknown is 0.
0

Counter objects support several useful counting-related methods and operations, and I would recommend you take a look at them, if only for a brief moment, if you’re not familiar with their operation.
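As a starting point, here is a brief sketch of a few of those methods and operations, picked for illustration only:

```python
from collections import Counter

c = Counter('aabbbc')

# The n most common items, as (item, count) pairs.
top_two = c.most_common(2)
print(top_two)                       # [('b', 3), ('a', 2)]

# Count more items into an existing Counter.
c.update('ccc')
print(c['c'])                        # 4

# Total number of things counted so far.
total = sum(c.values())
print(total)                         # 9

# Counters can be added together, combining their counts.
combined = Counter('ab') + Counter('bc')
print(combined == Counter({'a': 1, 'b': 2, 'c': 1}))   # True
```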

With this knowledge, the plain letter counting loop and its associated defaultdict — from letter_counts_take6 — can be replaced with a Counter object, doing precisely the same thing:

from collections import Counter
import string as s

def letter_counts_take7(text):
    """ ... """
    xlate_table = str.maketrans(s.ascii_lowercase, s.ascii_uppercase, s.whitespace + s.punctuation)
    just_letters = text.translate(xlate_table)
    return Counter(just_letters)

This is definitely shorter and simpler: there is no branching or looping in our code. It’s hidden away, and handled by existing, widely tested, probably faster code, in the built-in and Standard Library tools. What’s not to like about it?
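Before moving on, a quick sanity check that the shortest version still agrees with the first working one, at least for plain ASCII input. Both functions are repeated here so the sketch runs on its own:

```python
from collections import Counter
import string as s

def letter_counts_take2(text):
    letter_counts = {}
    for letter in text:
        if letter in s.whitespace or letter in s.punctuation:
            continue
        letter = letter.upper()
        if letter not in letter_counts:
            letter_counts[letter] = 1
        else:
            letter_counts[letter] += 1
    return letter_counts

def letter_counts_take7(text):
    xlate_table = str.maketrans(s.ascii_lowercase, s.ascii_uppercase, s.whitespace + s.punctuation)
    just_letters = text.translate(xlate_table)
    return Counter(just_letters)

sample = 'Hello there!'
take2_result = letter_counts_take2(sample)
take7_result = dict(letter_counts_take7(sample))   # plain dict, for a like-for-like comparison

print(take2_result == take7_result)   # True
print(take7_result)                   # {'H': 2, 'E': 3, 'L': 2, 'O': 1, 'T': 1, 'R': 1}
```

For text with accented letters the two would differ, since only Take 2 upper-cases them.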

There certainly still is the fact that the letter grouping, via upper-casing, and filtering is hard-coded in the str.maketrans call. That’s something we’d probably like to tackle. But let’s step back for a moment, take a deep breath, and review our progress.

Progress Review

To summarize the coding journey so far, I’ve created the following table highlighting a few indicators for each successive take on the letter_counts_... function, including notes about the changes in each:

Take	Code	Branches	Loops	Notes
1	7 lines	1x `if`	1x `for`	Prototype: non-letters counted, upper-/lower-case not combined.
2	10 lines	2x `if`	1x `for`	First working version.
3	8 lines	1x `if`	1x `for`	No branching in initializing the `letter_counts` dictionary.
4	6 lines	1x `if`	1x `for`	Using a `defaultdict` instead of a plain dictionary.
5	6 lines	None	1x `for`	No branching in letter filtering, using `str.translate` instead.
6	6 lines	None	1x `for`	Leverage `str.translate` to combine upper-/lower-case counts.
7	3 lines	None	None	Don’t loop explicitly, use a `Counter` instead.

Note how, from the first working version onwards, the line count kept going down. Likewise, the number of branches and loops was also progressively reduced. This is a good indicator: short, linear code is always easier to understand and maintain. It’s also eventually faster. Let’s check that.

To measure the performance of each implementation, three differently sized text inputs were used:

Small - The string 'Hello there!'.
Medium - The string 'You don’t know about me, without you have read a book by the name of The Adventures of Tom Sawyer; but that ain’t no matter. That book was made by Mr. Mark Twain, and he told the truth, mainly.', the first sentences in Mark Twain’s, Adventures of Huckleberry Finn.
Large - 100 concatenations of the medium string.

Then, using the timeit module in the Standard Library, various command lines like this one (with all combinations of input text sizes against all letter_counts_... takes) were scripted and run (broken here into three different lines, for the sake of readability):

$ python3 -m timeit \
  -s 'from letter_counts import letter_counts_take1; text="Hello there!"' \
  -c 'letter_counts_take1(text)'
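The same kind of measurement can also be taken from Python code, with timeit.timeit. This is a minimal sketch, not the script used for the results in this article: the number of runs is an arbitrary choice, and the lambda wrapper adds a small overhead of its own.

```python
import timeit
from collections import Counter
import string as s

def letter_counts_take7(text):
    xlate_table = str.maketrans(s.ascii_lowercase, s.ascii_uppercase, s.whitespace + s.punctuation)
    just_letters = text.translate(xlate_table)
    return Counter(just_letters)

text = 'Hello there!'
number = 10_000   # arbitrary number of calls to time

# Total time, in seconds, for `number` calls.
total_seconds = timeit.timeit(lambda: letter_counts_take7(text), number=number)

per_call_us = total_seconds / number * 1e6
print(f'{per_call_us:.2f} microseconds per call')
```

Absolute numbers will vary with the machine and the Python version, so only relative comparisons are meaningful.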

Here are the results for the durations of each function call and the relative duration of subsequent takes vs. the first working take:

Take	Small	Medium	Large	Notes
1	1.31μs	22.1μs	2.42ms	Prototype: non-letters counted, upper-/lower-case not combined.
2	4.21μs (1.00x)	64.5μs (1.00x)	6.84ms (1.00x)	First working version.
3	5.18μs (1.23x)	73.2μs (1.13x)	7.47ms (1.09x)	No branching in initializing the `letter_counts` dictionary.
4	5.23μs (1.24x)	63.5μs (0.98x)	6.13ms (0.90x)	Using a `defaultdict` instead of a plain dictionary.
5	6.27μs (1.49x)	52.8μs (0.82x)	4.97ms (0.73x)	No branching in letter filtering, using `str.translate` instead.
6	5.88μs (1.40x)	28.3μs (0.44x)	2.26ms (0.33x)	Leverage `str.translate` to combine upper-/lower-case counts.
7	7.98μs (1.90x)	22.4μs (0.35x)	1.65ms (0.24x)	Don’t loop explicitly, use a `Counter` instead.

A few general observations:

When the input text is small, code-wise improvements do not result in positive performance improvements.
Almost all code improvement steps lead to progressively slower execution times.
When the input text grows in size to medium or large, though, improvements are clear.
Take 7 (the simplest code-wise version so far) takes <25% of the time of Take 2 (the first working version) when the input is large enough.
This indicates that there is a constant overhead, not proportional to the input text size, being introduced at each step.
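One plausible contributor to that constant overhead (an assumption, not something measured in this article) is that the later takes rebuild the translation table on every call. A minimal sketch that builds it only once, at import time, using the hypothetical names _XLATE_TABLE and letter_counts_hoisted:

```python
from collections import Counter
import string as s

# Built once, when the module is imported, instead of on every call.
_XLATE_TABLE = str.maketrans(s.ascii_lowercase, s.ascii_uppercase, s.whitespace + s.punctuation)

def letter_counts_hoisted(text):
    """Same result as letter_counts_take7, reusing the module-level table."""
    return Counter(text.translate(_XLATE_TABLE))

hoisted_result = letter_counts_hoisted('Hello there!')
print(hoisted_result == Counter({'H': 2, 'E': 3, 'L': 2, 'O': 1, 'T': 1, 'R': 1}))   # True
```

Whether this makes a measurable difference for small inputs would have to be checked with timeit.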

In more detail, focusing on medium and large text input size measurements:

Avoiding the branching, in Take 3, using dict.get(), introduced a performance penalty.
Using a defaultdict, in Take 4, compensated for that and shows the first speed improvement, even if modest.
The biggest speed jump was obtained by using str.translate to filter and conflate upper-/lower-case letters in a single pass.
For medium and large text inputs Take 6 is more than twice as fast as Take 4.
Using Counter in Take 7, once the counting process was simple enough, gave us an extra, effortless speed boost.

So, where do we stand? We have a very short and clean implementation in letter_counts_take7, that is also the fastest so far, as long as the input text is large enough: ~200 characters or more, per the medium sized input text measurements above. Importantly, it does meet the initially defined requirements — let’s not forget that.

Before calling the counting process done, and moving on to exploring ideas on output generation, there is one more thing we should address (yes, even though our code is short and fast and meets the requirements): the fact that the letter filtering and grouping is based on non-changeable, hard-coded values, using string.ascii_lowercase and its friends. This is unnecessarily limiting and, in general, a bad coding practice.

In the next article in this series we’ll explore just that, hopefully getting to a point where we’re satisfied with the letter-counting process, finally moving on to producing nice and accessible human-readable output. As we will see, eliminating the hard-coded parts in our function will lead us to a path of Unicode and API design exploration: two topics with a lot to be said by themselves.

We’ll see where that takes us.

[ 2018-04-29 UPDATE: Follow-up article here. ]

Wrap up

This article kicks off a Python journey, motivated by the topic of counting letters. We’ve built several versions of letter counting functions, progressively simpler, using powerful built-in and Standard Library tools. Knowing them and how they can be applied to a given problem is a valuable skill; like with other skills, mastering it takes time: don’t feel rushed, just keep at it, at your own pace. We’ve also highlighted a few universal coding principles: these are not related to knowing about, or using, any specific tools, they’re not even Python specific.

Summarizing, I would highlight:

Counting things, like many computational problems, looks simple.
It often is, but the need for filtering and processing is common, which introduces complexity.
The devil is in the detail, a very appropriate idiom here: the more requirements, the more prone to complexity a given problem becomes. Creating simple code to solve complex problems is, by definition, not a simple task, but one we should strive for. If possible, agreeing on the simplest requirements possible, is a good idea, while coding with an eye into the future to accommodate change.
The Python Standard Library includes many useful tools, always at hand.
Learning its ways is more of a process than an objective in itself. The collections module contains, among other useful types, the defaultdict and the Counter which we explored and learned about: they both help writing simpler, shorter and faster code. While the Counter is very much a single-purpose thing, the defaultdict has many useful applications, including counting, grouping, and more. The string module — not to be confused with the str built-in type, also very powerful — includes a few little nuggets that, again, we can build upon and use to simplify existing code.
The simpler the code, the better.
Less code is always better code: there’s less to read and understand, less to be executed, less to be maintained and changed in the future. It will also tend to have fewer bugs, assuming that, statistically, bug counts per line are constant, in a given setting. Simpler code also means less branching and looping: linear code — with no ifs, fors or whiles — is easier to grasp and test. We can’t always create it like so, but trying to minimize those is a good long term practice. Simpler code is faster: for both humans and computers.

Thanks for reading. See you soon.

¹ A function or, more generically, a callable: in layman’s terms, a callable is an object we can call by appending parenthesized arguments to it (a few examples: the len function is a callable, being called with len(...); classes are callable, returning objects of that class, like the built-in list which is callable with list(), returning a new list object).
² Simply put, a Unicode code point is a number representing a character. To learn more, refer to this rather simplistic Wikipedia article and to the comprehensive Unicode documentation, maybe starting with the glossary.


tmont.es