A Python Journey on Counting: Refinements


In the first installment of this journey we created progressively simpler versions of a letter histogram creation function, as a motivation to introduce useful built-in and Standard Library tools, while sharing general ideas about code simplification. The resulting code was short and fast, meeting all initial requirements. But it could be better.

In this article I’ll explore further refinements to our solution, raising the bar on the requirements: this will steer us into the exploration of topics like Unicode text processing and API design, both tangential to the letter-counting problem, yet with a very wide reach of their own.


Where were we?

We wrapped the previous article with a pretty short and fast letter-counting function, about which I pointed out:

(…) there is one more thing we should address (…). The fact that the letter filtering and grouping is based on non-changeable, hard-coded values, using string.ascii_lowercase and its friends. This is unnecessarily limiting and, in general, a bad coding practice.

Let’s bring it here, as a refresher:

from collections import Counter
import string as s

def letter_counts_take7(text):
    """ ... """
    xlate_table = str.maketrans(s.ascii_lowercase, s.ascii_uppercase, s.whitespace + s.punctuation)
    just_letters = text.translate(xlate_table)
    return Counter(just_letters)

Hard-coding the string module’s objects as arguments to the str.maketrans call is, indeed, unnecessarily limiting (recall that str.maketrans creates a translation table defining the letter mapping and filtering that str.translate performs):

  • The first two arguments, ascii_lowercase and ascii_uppercase, map ASCII lower-case letters to their upper-case counterparts.
  • The third argument, the concatenation of whitespace and punctuation from the string module, filters out non-letters.

What if we’d like to process Greek or Russian text sources? Such scripts don’t use the Latin alphabet, so the ASCII lower- to upper-case conversion is useless. It is also quite limiting with languages that use accented letters (most?), where many users might prefer counting each accented variation under its non-accented, base letter. And what about Spanish text? In written Spanish, questions are wrapped in '¿?' pairs, as in “¿Cómo está?” for “How are you?”, and the whitespace and punctuation filtering doesn’t cope with the inverted question mark.
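A quick check at the interactive prompt makes these limitations concrete. Running letter_counts_take7 from above, note the untouched '¿', the still lower-case Greek letters, and the unmerged accented letters:

$ python3 -i letter_counts.py
>>> letter_counts_take7('Ελληνικά')
Counter({'λ': 2, 'Ε': 1, 'η': 1, 'ν': 1, 'ι': 1, 'κ': 1, 'ά': 1})
>>> letter_counts_take7('¿Cómo está?')
Counter({'¿': 1, 'C': 1, 'ó': 1, 'M': 1, 'O': 1, 'E': 1, 'S': 1, 'T': 1, 'á': 1})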

Let’s then raise the bar on the requirements, avoiding the currently hard-coded values:

  • Include support for non-Latin alphabets, combining lower-/upper-case letter counts.
  • Support combining the counts for accented letter variations under a common, base letter count.
  • Ignore any given set of symbols, defaulting to the currently hard-coded ones.


Refinements

The simplest approach to avoiding the hard-coded letter mapping and filtering in our counting function is to directly expose the internal str.maketrans arguments as optional function arguments, using the previously hard-coded values as their defaults. This keeps the function signature backwards compatible, while supporting different letter mapping/filtering operations depending on whether and how the new letters_from, letters_to and discard arguments are used:

from collections import Counter
import string as s

def letter_counts_take8(text,
                        letters_from=s.ascii_lowercase,
                        letters_to=s.ascii_uppercase,
                        discard=s.whitespace+s.punctuation):
    """ ... """
    xlate_table = str.maketrans(letters_from, letters_to, discard)
    just_letters = text.translate(xlate_table)
    return Counter(just_letters)

While passable as a more flexible solution, exposing the internal use of str.maketrans not only results in a somewhat odd (and debatable) function signature, but also in a subtle, yet potentially important, limitation, at least to some users, as we will see.

We could be tempted to say that this is the minimum version that satisfies our newly added requirements: it contains no hard-coded values, supports conflating all sorts of letter variation counts by passing in adequate letters_from and letters_to values, and ignores arbitrary sets of symbols via the discard argument. Let’s give it a run:

$ python3 -i letter_counts.py
>>> letter_counts_take8('Hello there!')
Counter({'E': 3, 'H': 2, 'L': 2, 'O': 1, 'T': 1, 'R': 1})

The default form works. How about Greek and Latin accented-letter handling?

>>> letter_counts_take8('Ελληνικά', letters_from='άεηικλν', letters_to='ΑΕΗΙΚΛΝ')
Counter({'Λ': 2, 'Ε': 1, 'Η': 1, 'Ν': 1, 'Ι': 1, 'Κ': 1, 'Α': 1})
>>> letter_counts_take8('Elle est née', letters_from='eélnst', letters_to='EELNST')
Counter({'E': 5, 'L': 2, 'S': 1, 'T': 1, 'N': 1})

Looks good. We can even ignore non-default punctuation, passing in a custom discard string value:

>>> letter_counts_take8('¿Bien?', discard=' ¿?')
Counter({'B': 1, 'I': 1, 'E': 1, 'N': 1})
>>> letter_counts_take8('¿Cómo está?', letters_from='áeoómst', letters_to='AEOOMST', discard=' ¿?')
Counter({'O': 2, 'C': 1, 'M': 1, 'E': 1, 'S': 1, 'T': 1, 'A': 1})

All these simple tests work but, alas, building the letters_from and letters_to string values is not necessarily an easy task: expecting callers to pass in all accented-letter variations in a given language, mapping each to its non-accented counterpart, feels error-prone, to say the least, and is very difficult (if not impossible!) to type on a keyboard; including additional all-around upper- or lower-case mappings on top of that quickly leads to huge, difficult to maintain string arguments. In other words, it works, but it’s messy.

Then, there is one more thing: the updated function assumes that the customizable letter mapping is done on a one-to-one basis, requiring the letters_from and letters_to string arguments to be of the same length. This, however, does not always hold true: the German lower-case 'ß' upper-cases to the double letter 'SS', for example, and single-letter ligatures, like 'ﬁ', upper-case to two letters, as 'FI'.
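We can confirm both facts at the prompt: str.maketrans rejects unequal-length arguments, while str.upper happily produces multi-letter results:

>>> str.maketrans('ß', 'SS')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: the first two maketrans arguments must have equal length
>>> 'ß'.upper(), 'ﬁ'.upper()
('SS', 'FI')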

We’re clearly facing two challenges ahead, both within the realm of API design:

  • On one hand, we’d like to make life easier for callers wanting to process text with accented letters and/or non-Latin alphabets.
  • On the other, we must cope with the fact that some possible letter mappings lead to more than one resulting letter.

These boil down to the letters_from and letters_to argument semantics. They don’t serve us well. Let’s look for some inspiration.

A dive into Unicode

With the purpose of standardizing all written text, Unicode is extremely valuable but unavoidably complex. This stems from the fact that written text is itself pretty convoluted, having been created and constantly evolved over many centuries, across very distinctive cultures and, like most human things, following somewhat chaotic evolution patterns. For some, letters correspond to sounds; for others, they represent things, concepts, ideas. Some languages have no upper-/lower-case distinction, and many use varying sets of accents, marks or diacritics to create variations of a base letter. Odd cases, like a given lower-case letter or two-letter ligature corresponding to two separate upper-case letters in some languages but not in others, represent challenges in Unicode and, as much as things can be normalized, challenges to those striving to correctly process written text in any language.

Python has very good Unicode support, with strings¹ being processed as Unicode text and common operations on all sorts of scripts and characters working as expected. Moreover, the unicodedata Standard Library module includes many useful functions which, as we will see, can be very handy when processing Unicode text.

Let’s play around a bit and find out what we can learn:

>>> 'resumé'.upper()        # Upper-casing accented latin letters is easy.
'RESUMÉ'
>>> 'ελληνικά'.upper()      # The same goes for 'greek' in Greek...
'ΕΛΛΗΝΙΚΆ'
>>> 'русский'.upper()        # ...and 'russian' in Russian.
'РУССКИЙ'
>>> 'straße'.upper()        # The German 'ß' correctly upper-cases to 'SS'.
'STRASSE'
>>> '日本'.upper()          # 'japanese' in Japanese: no concept of upper-case.
'日本'
>>> 'עברית'.upper()         # 'hebrew' in Hebrew: likewise, no concept of upper-case.
'עברית'

One interesting aspect defined by Unicode is the fact that letters and symbols have names of their own. The unicodedata.name function returns the name of the Unicode letter or symbol passed in as an argument:

>>> from unicodedata import name
>>> name('a')
'LATIN SMALL LETTER A'
>>> name('é')
'LATIN SMALL LETTER E WITH ACUTE'
>>> name('β')
'GREEK SMALL LETTER BETA'
>>> name('с')                       # Looks like a lowercase Latin 'C', but it's not!
'CYRILLIC SMALL LETTER ES'
>>> name('ß')                       # German Eszett: looks like Greek lowercase Beta.
'LATIN SMALL LETTER SHARP S'
>>> name('本')
'CJK UNIFIED IDEOGRAPH-672C'
>>> name('ב')
'HEBREW LETTER BET'

The reverse can be done with either unicodedata.lookup, or built-in string '\N{…}' escapes, which many people are not aware of:

>>> from unicodedata import lookup
>>> lookup('LATIN CAPITAL LETTER A WITH GRAVE')     # Raises 'KeyError' with an unknown name.
'À'
>>> '\N{LATIN CAPITAL LETTER A WITH GRAVE}'         # Raises 'SyntaxError' with an unknown name.
'À'
>>> 'R = 4.7k\N{GREEK CAPITAL LETTER OMEGA}'
'R = 4.7kΩ'
>>> '\N{WINKING FACE}'                              # Yes, this works too.
'😉'

Another Unicode concept is the category of a given character, which can be obtained with the unicodedata.category function:

>>> from unicodedata import category
>>> category('ş')
'Ll'                            # Letter, lowercase.
>>> category('T')
'Lu'                            # Letter, uppercase.
>>> category('本')
'Lo'                            # Letter, other.
>>> category('7')
'Nd'                            # Number, decimal digit.
>>> category(' ')
'Zs'                            # Separator, space.

Knowing that all latin lower-case letters have Unicode names starting with 'LATIN' and the category 'Ll', we could create a super-charged string.ascii_lowercase-like generator function, producing all the Latin lowercase letters in Unicode:

from unicodedata import name, category
import sys

def latin_lowercase_gen():
    """Generates all lowercase Latin letters in Unicode codepoint order."""
    for codepoint in range(sys.maxunicode + 1):
        letter = chr(codepoint)
        if name(letter, '').startswith('LATIN') and category(letter) == 'Ll':
            yield letter

Testing it, at the interactive prompt:

>>> latin_lower = ''.join(latin_lowercase_gen())
>>> len(latin_lower)
676
>>> latin_lower[:80]
'abcdefghijklmnopqrstuvwxyzßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿāăąćĉċčďđēĕėęěĝğġģĥħĩī'

Removing the hard-coded literals and adjusting the category match to default to letters in any sub-category, we obtain a more generic:

from unicodedata import name, category
import sys

def unicode_alphabet_gen(name_start='LATIN', category_start='L'):
    """Generates Unicode letters/symbols with name/category per the arguments."""
    for codepoint in range(sys.maxunicode + 1):
        letter = chr(codepoint)
        if name(letter, '').startswith(name_start) and category(letter).startswith(category_start):
            yield letter

It can then be used to generate alphabet strings:

>>> latin_lower = ''.join(unicode_alphabet_gen('LATIN', 'Ll'))
>>> greek_lower = ''.join(unicode_alphabet_gen('GREEK', 'Ll'))
>>> cyrillic_lower = ''.join(unicode_alphabet_gen('CYRILLIC', 'Ll'))
>>> tamil = ''.join(unicode_alphabet_gen('TAMIL'))
>>> hebrew = ''.join(unicode_alphabet_gen('HEBREW'))
>>>
>>> for alphabet in (latin_lower, greek_lower, cyrillic_lower, tamil, hebrew):
...     print(f'{len(alphabet):6}: {alphabet[:80]!r}')
...
   676: 'abcdefghijklmnopqrstuvwxyzßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿāăąćĉċčďđēĕėęěĝğġģĥħĩī'
   188: 'ͱͳͷͻͼͽΐάέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώϐϑϕϖϗϙϛϝϟϡϰϱϲϳϵϸϻϼᴦᴧᴨᴩᴪἀἁἂἃἄἅἆἇἐἑἒἓἔἕἠ'
   195: 'абвгдежзийклмнопрстуфхцчшщъыьэюяѐёђѓєѕіїјљњћќѝўџѡѣѥѧѩѫѭѯѱѳѵѷѹѻѽѿҁҋҍҏґғҕҗҙқҝҟҡңҥҧ'
    37: 'ஃஅஆஇஈஉஊஎஏஐஒஓஔகஙசஜஞடணதநனபமயரறலளழவஶஷஸஹௐ'
    74: 'אבגדהוזחטיךכלםמןנסעףפץצקרשתװױײיִײַﬠﬡﬢﬣﬤﬥﬦﬧﬨשׁשׂשּׁשּׂאַאָאּבּגּדּהּוּזּטּיּךּכּלּמּנּסּףּפּצּקּרּשּתּוֹבֿכֿפֿﭏ'

Interesting… If it weren’t for the fact that we already know our letter counting API will need to use something other than the letters_from and letters_to arguments as equal-length strings, we could use variations of this to create them. There is, recall, a fundamental limitation in that approach: upper-casing Latin letters, for example, is not a one-letter-to-one-letter process:

>>> latin_upper = latin_lower.upper()
>>> len(latin_lower), len(latin_upper)
(676, 693)

Apparently, the 676 lower-case Latin letters upper-case to 693 letters, hmmm… This doesn’t come as a complete surprise, and goes to show that text processing is not necessarily obvious.
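Where does the extra length come from? A quick probe over latin_lower hints at the culprits: the one-to-many upper-case mappings we already knew about:

>>> [l for l in latin_lower if len(l.upper()) > 1][:3]
['ß', 'ŉ', 'ǰ']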

Putting that aside for a moment, and assuming that we would use the unicode_alphabet_gen generator function to build some kind of letter source map, like a letters_from string, another challenge comes up. How could we map accented letters to their non-accented counterparts? We’ve been mentioning the topic here and there, and now is the time to explore it.

One thing the Unicode standard defines is normalization forms. In layman’s terms, such normalization is necessary because there are multiple valid representations of many letters and symbols standing for the same underlying abstract character; without normalization, comparing, matching or sorting text would be unreliable. Let’s take a look at two single-letter examples:

>>> c1 = '\N{LATIN SMALL LETTER C WITH CEDILLA}'
>>> c2 = '\N{LATIN SMALL LETTER C}\N{COMBINING CEDILLA}'
>>> c1
'ç'
>>> c2
'ç'
>>> c1 == c2
False
>>>
>>> a1 = '\N{ANGSTROM SIGN}'
>>> a2 = '\N{LATIN CAPITAL LETTER A WITH RING ABOVE}'
>>> a3 = '\N{LATIN CAPITAL LETTER A}\N{COMBINING RING ABOVE}'
>>> a1
'Å'
>>> a2
'Å'
>>> a3
'Å'
>>> a1 == a2, a2 == a3, a3 == a1
(False, False, False)
>>>

Note how c1 and c2 look the same but don’t compare equal. The same goes for a1, a2 and a3, all different forms of the same thing. Moreover, note that len(c1) is 1 but len(c2) is 2, and that both len(a1) and len(a2) are 1, while len(a3) is 2. What’s going on here? Different valid representations of the same thing, that’s what. Now let’s see Unicode normalization in action:

>>> from unicodedata import normalize
>>> normalize('NFD', c1) == normalize('NFD', c2)
True
>>> normalize('NFD', a1) == normalize('NFD', a2)
True
>>> normalize('NFD', a2) == normalize('NFD', a3)
True
>>> normalize('NFD', a3) == normalize('NFD', a1)
True

Ok, what else is there to know about it? Let’s check the len of each normalized result:

>>> len(normalize('NFD', c1)), len(normalize('NFD', c2))
(2, 2)
>>> len(normalize('NFD', a1)), len(normalize('NFD', a2)), len(normalize('NFD', a3))
(2, 2, 2)

That makes sense, given that the normalize results compare equal. Let’s use the unicodedata.name function to get a better grasp:

>>> from unicodedata import name
>>> for text in (c1, c2, a1, a2, a3):
...     normalized_text = normalize('NFD', text)
...     print([name(each) for each in normalized_text])
...
['LATIN SMALL LETTER C', 'COMBINING CEDILLA']
['LATIN SMALL LETTER C', 'COMBINING CEDILLA']
['LATIN CAPITAL LETTER A', 'COMBINING RING ABOVE']
['LATIN CAPITAL LETTER A', 'COMBINING RING ABOVE']
['LATIN CAPITAL LETTER A', 'COMBINING RING ABOVE']

The 'NFD' normalization brought each representation to a common, decomposed format, where the “base letters” are “combined” with subsequent elements. There are, as the example suggests, several “combining” symbols in Unicode which the unicodedata.combining function helps identify — in very simple terms, again, the function returns non-zero values for any symbol that is a Unicode combining symbol. Let’s check that:

>>> c1
'ç'
>>> normalized_c1 = normalize('NFD', c1)
>>> normalized_c1
'ç'
>>> len(c1), len(normalized_c1)
(1, 2)
>>> from unicodedata import combining
>>> [combining(each) for each in normalized_c1]
[0, 202]

Getting back on topic, can we automate the process of mapping accented letters to their non-accented counterparts? I’d say, generally, we can (keeping in mind that in some languages such an approach may need adjustments or not make sense at all). The idea is:

  • Normalize the text to a decomposed form.
  • Ignore all combining symbols in the normalized result.

Which could be coded as:

from unicodedata import normalize, combining

def strip_accents(text, normalization='NFD'):
    """Strips combining symbols after Unicode normalization."""
    normalized_text = normalize(normalization, text)
    return ''.join(each for each in normalized_text if not combining(each))

Let’s take it for a spin, using the strings we initially played with:

>>> strip_accents('resumé')
'resume'
>>> strip_accents('ελληνικά')
'ελληνικα'
>>> strip_accents('русский')
'русскии'
>>> strip_accents('straße')
'straße'
>>> strip_accents('日本')
'日本'
>>> strip_accents('עברית')
'עברית'

Nice! Note how it stripped accents from Latin, Greek and Cyrillic text, leaving other letters and scripts completely unchanged. One caveat, though: 'русский' became 'русскии', yet in Russian 'й' is considered a distinct letter of the alphabet, not an accented 'и', so that particular conflation may be undesired. We’ll get back to it.
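The unicodedata functions we’ve already met show exactly why this happens: under 'NFD' normalization, 'й' decomposes into 'и' plus a combining breve, which strip_accents then drops:

>>> name('й'), name('и')
('CYRILLIC SMALL LETTER SHORT I', 'CYRILLIC SMALL LETTER I')
>>> [name(c) for c in normalize('NFD', 'й')]
['CYRILLIC SMALL LETTER I', 'COMBINING BREVE']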

With what we’ve seen about Unicode and the Standard Library’s unicodedata module, we could say we are in a much better position to address the updated requirements. At least we have a pretty good idea on how to process non-Latin alphabets, map accented letters to their non-accented counterparts and, if we’d like, automate it somehow, turning the letter counting function into something easier to use.

What we haven’t solved yet is the fact that some single-letter mappings need to resolve to multi-letter results — this is something the current function signature does not support. Let’s get back to it:

from collections import Counter
import string as s

def letter_counts_take8(text,
                        letters_from=s.ascii_lowercase,
                        letters_to=s.ascii_uppercase,
                        discard=s.whitespace+s.punctuation):
    """ ... """
    xlate_table = str.maketrans(letters_from, letters_to, discard)
    just_letters = text.translate(xlate_table)
    return Counter(just_letters)

The limiting factor is the way str.maketrans is being called: its first two arguments must be strings of equal length, precluding single- to multi-letter mappings. Taking a quick look at its documentation (always available via a quick help(str.maketrans) at the Python interactive prompt) we see that there are other ways it can be called. In particular, this one, with simplifying omissions, stands out:

If there is only one argument, it must be a dictionary mapping (…) characters (strings of length 1) to (…) strings (of arbitrary lengths) (…).

This is precisely what we’re looking for. Let’s test it out:

>>> import string as s
>>> uppercase_map = dict(zip(s.ascii_lowercase, s.ascii_uppercase))
>>> uppercase_map['ß'] = 'SS'
>>> uppercase_xlate = str.maketrans(uppercase_map)
>>> 'straße'.translate(uppercase_xlate)
'STRASSE'

We first created an ASCII lower- to upper-case mapping dictionary to which we added one additional mapping, from 'ß' to 'SS'; we then built a translation table with str.maketrans, used for the 'straße' translation, which worked correctly. What about filtering? Well, if you recall, the str.translate docstring says that “Characters mapped to None are deleted” — let’s test that too, updating our custom map:

>>> uppercase_map[' '] = None
>>> complete_xlate = str.maketrans(uppercase_map)
>>> 'breite straße'.translate(complete_xlate)
'BREITESTRASSE'

It works as expected, great. So maybe the letters_from and letters_to arguments in letter_counts_take8 can be replaced with a single letter_map argument: a letter-to-letter(s) dictionary. Moreover, knowing that such a map can also convey information about letters to be discarded (mapping them to None), we no longer need the discard argument. This leads us to a very bare-bones:

def letter_counts_take9(text, letter_map=...):
    ...

The problem of creating a letter_map is still up to the caller: pretty much the same problem we had before, which required callers to create the letters_from and letters_to arguments. Even though the function signature is simplified, supporting more general use cases, the letter map creation problem still needs to be addressed. Let’s focus on that for a while.

Creating a letter map

A letter map is a dictionary mapping letters, as strings of length 1, to either (i) other strings, most often of length 1, or (ii) None. This way, any conceivable letter count grouping seems possible, including upper-casing, accent stripping, and arbitrary symbol and whitespace discarding.
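For instance, a tiny hand-built letter map, upper-casing a few letters, merging 'é' into 'E', expanding 'ß' and discarding spaces and exclamation marks, could look like:

letter_map = {'a': 'A', 'b': 'B', 'é': 'E', 'ß': 'SS', ' ': None, '!': None}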

With these use cases in mind, a possible letter map creation API could look like:

import string as s

def create_letter_map(strip_accents=True, upper_case=True, discard=s.whitespace+s.punctuation):
    ...

It would return a letter map that optionally strips accents via Unicode normalization, optionally upper-cases all letters and, per the requirements, defaults to discarding the proper characters. Is this enough? Is it too much? For starters, with no other hints, it would need to create a map for all existing Unicode letters: maybe that’s too much. Then, if you recall the distinction between the Russian 'й' and 'и', which our accent stripping approach conflated, there probably needs to be some way of specifying “exceptions” to that: so maybe this is not enough.

We could be tempted to add more arguments to the letter map creation function: one specifying the “alphabet” to be mapped, another specifying accent-stripping exceptions to handle cases like the Russian one we just referred to. Then, along the same lines, we might want to consider exceptions to upper-casing (why not?), or a lower-casing option, etc. Hmmm, complexity is creeping in… What about the discard argument? When should discarding take place? On the original text, or after the optional upper-casing or accent-stripping? And which mapping operation should take place first, if requested: upper-casing or accent-stripping? We could keep adding arguments to the letter map creation function to handle all these possibilities, but it would certainly become a long, not particularly easy to use API, where any future extension would only increase complexity. Let’s see if we can do better than that, while being flexible and striving to be simple.

Picturing the letter map creation as a sequence of operations over an “alphabet”, we will probably be better off with something like:

def create_letter_map(alphabet, operations):
    ...

Maybe alphabet should be a string, defaulting to string.ascii_letters, but that would still leave callers with the problem of creating alternative ones, partially defeating the whole purpose of our function. Then, we would like operations to be somehow generic. Will we need to break the letter map creation down into two sub-problems: alphabet creation on one hand, and operations over such alphabets on the other? That’s probably a sound idea.

A possible approach is to go with…

  • alphabet - A string representing the base alphabet for the map.
  • operations - A sequence of functions that manipulate the map.

…which we could envision being used as in…

>>> alphabet = cyrillic_alphabet()
>>> operations = [discard(), change_case(upper=False), strip_accents(exceptions='й')]
>>> letter_map = create_letter_map(alphabet, operations)

…where cyrillic_alphabet() would return a string with all letters in the Cyrillic alphabet, and the functions in the operations list would represent the respective mapping operations, each with its own default behavior, configurable via arguments. With that, the letter map creation function can create an identity map from the alphabet string argument and then consecutively apply operations to it:

def create_letter_map(alphabet, operations):
    """ ... """
    letter_map = dict(zip(alphabet, alphabet))
    for operation in operations:
        operation(letter_map)
    return letter_map

Two important notes on this approach:

  • It will probably be a good idea to have sane, usable default values for alphabet and operations.
    We’ll get back to that, once we have a better grasp of the whole picture.
  • Observe that create_letter_map calls each operation in operations as a function, passing each of them the letter_map argument to be manipulated. Then observe that the API we envisioned above, with operations = [discard(), ...], used function calls as operations themselves: this means that each operation function (discard, change_case, strip_accents, …) must return a function taking a letter_map argument to process, as the toy example below illustrates.
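To make that last point concrete, here is a toy operation factory, not part of our final code, following the convention: calling it returns a function that, given a letter_map, mutates it in place:

def exclaim(marker='!'):
    """Toy operation: appends `marker` to every non-discarded mapping."""
    def operation(letter_map):
        for letter, mapped in letter_map.items():
            if mapped is not None:
                letter_map[letter] = mapped + marker
    return operation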

Alphabet Creation

For the alphabet creation we will reuse the unicode_alphabet_gen function we played with before, copying it here and adding a short wrapper function that creates a string from it — this is important because the result of unicode_alphabet_gen can only be iterated once, but our create_letter_map function iterates twice over alphabet to create the initial identity map (with dict(zip(...))):

from unicodedata import name, category
import sys

def unicode_alphabet_gen(name_start='LATIN', category_start='L'):
    """Generates Unicode letters/symbols with name/category per the arguments."""
    for codepoint in range(sys.maxunicode + 1):
        letter = chr(codepoint)
        if name(letter, '').startswith(name_start) and category(letter).startswith(category_start):
            yield letter

def unicode_alphabet(name_start='LATIN', category_start='L'):
    """
    Returns a string with all Unicode characters with names starting
    with `name_start` and categories starting with `category_start`.
    """
    return ''.join(unicode_alphabet_gen(name_start, category_start))

Map Operations

For map operations we must create functions that return functions — a flexible construct supporting our envisioned API design. Recall, each operation function must return a function taking a single letter_map argument that, when called, manipulates it in some way:

from itertools import chain
import string as s
from unicodedata import normalize, combining

def discard(these=s.whitespace+s.punctuation, these_too=''):
    """
    Returns a function that, when called, changes `letter_map` to discard
    all characters in the `these` and `these_too` argument strings.
    """
    def operation(letter_map):
        for letter in chain(these, these_too):
            letter_map[letter] = None
    return operation

def change_case(upper=True):
    """
    Returns a function that, when called, changes `letter_map` such that it
    upper-cases all mappings if `upper`, else lower-cases all mappings.
    """
    def operation(letter_map):
        for letter, mapped in letter_map.items():
            if mapped is None:
                continue
            letter_map[letter] = mapped.upper() if upper else mapped.lower()
    return operation

def strip_accents(exceptions='', normalization='NFD'):
    """
    Returns a function that, when called, changes `letter_map` such that it
    strips Unicode combining characters via the specified `normalization` to
    all mapped letters except those mapped to a letter in `exceptions`.
    """
    def operation(letter_map):
        for letter, mapped in letter_map.items():
            if mapped is None or mapped in exceptions:
                continue
            letter_map[letter] = ''.join(
                l for l in normalize(normalization, mapped)
                if not combining(l)
            )
    return operation

Could this be enough? Let’s take it for a spin:

$ python3 -i letter_counts.py
>>> alphabet = unicode_alphabet('LATIN')
>>> operations = [discard(), change_case(), strip_accents()]
>>> letter_map = create_letter_map(alphabet, operations)
>>> mapped = [letter_map.get(letter, letter) for letter in 'aBÇ dé Rßt!']
>>> mapped
['A', 'B', 'C', None, 'D', 'E', None, 'R', 'SS', 'T', None]

It’s looking good. Let’s wrap up the letter map creation by setting plain defaults, in line with the behavior we’ve had since our first working implementation — grouping ASCII letters into their upper-case variants, discarding common whitespace and punctuation:

import string as s

def create_letter_map(alphabet=s.ascii_letters, operations=(discard(), change_case())):
    """ ... """
    letter_map = dict(zip(alphabet, alphabet))
    for operation in operations:
        operation(letter_map)
    return letter_map

Having a working solution for the letter map creation problem — not a trivial problem when striving to be generic — we can get back to our letter counting function, changing it to use a default letter map:

def letter_counts_take9(text, letter_map=create_letter_map()):
    """ ... """
    xlate_table = str.maketrans(letter_map)
    just_letters = text.translate(xlate_table)
    return Counter(just_letters)

This way, it can still be called as before…

$ python3 -i letter_counts.py
>>> letter_counts_take9('Hello there!')
Counter({'E': 3, 'H': 2, 'L': 2, 'O': 1, 'T': 1, 'R': 1})

… and now, using the letter map creation helper functions, we can ask it to process all Unicode alphabets, upper-casing all letters, stripping accents with the exception of the Russian 'Й', and discarding the Spanish '¿', with:

>>> alphabet = unicode_alphabet(name_start='')
>>> operations = [discard(these_too='¿'), change_case(), strip_accents(exceptions='Й')]
>>> letter_map = create_letter_map(alphabet, operations)
>>> letter_counts_take9('Ελληνικά', letter_map)
Counter({'Λ': 2, 'Ε': 1, 'Η': 1, 'Ν': 1, 'Ι': 1, 'Κ': 1, 'Α': 1})
>>> letter_counts_take9('Elle est née', letter_map)
Counter({'E': 5, 'L': 2, 'S': 1, 'T': 1, 'N': 1})
>>> letter_counts_take9('¿Cómo está?', letter_map)
Counter({'O': 2, 'C': 1, 'M': 1, 'E': 1, 'S': 1, 'T': 1, 'A': 1})
>>> letter_counts_take9('русский', letter_map)
Counter({'С': 2, 'Р': 1, 'У': 1, 'К': 1, 'И': 1, 'Й': 1})
>>> letter_counts_take9('straße', letter_map)
Counter({'S': 3, 'T': 1, 'R': 1, 'A': 1, 'E': 1})

So where do we stand? Does letter_counts_take9 meet the updated requirements, avoiding the hard-coded values we started off with? It certainly does, but it took quite a long ride to get there, while trying to keep the complexity of text handling under control. Changing the letter counting function itself was a simple step, but it required creating several helper functions to support useful letter count groupings, including going for generic operations in the letter map creation process; this leaves the API open to future needs without requiring changes to the existing code, which is always good.

The general use of the letter counting function became a multi-step process, however: we first need to create a letter map, which requires creating an alphabet and a sequence of operations, and only then can letters be counted. Is it more difficult to use, or less intuitive? Some would say it is. Maybe the single-purpose functions we created could be brought together under a LetterCounter class, exposing a simpler, more immediate API while still supporting the full power of using each function individually; a class-based approach could also handle arbitrarily large text strings without needing to keep them in memory all at once, perhaps being progressively fed data from a file or the network. This could certainly be useful in some cases.
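As a rough sketch of that idea (a purely hypothetical API, with made-up feed and counts names, assuming the helper functions defined above are available), such a class could look like:

from collections import Counter
import string as s

class LetterCounter:
    """Hypothetical class bundling letter map creation and incremental counting."""

    def __init__(self, alphabet=s.ascii_letters, operations=(discard(), change_case())):
        # Build the translation table once, from a freshly created letter map.
        self._xlate_table = str.maketrans(create_letter_map(alphabet, operations))
        self._counter = Counter()

    def feed(self, text):
        """Accumulates letter counts from successive chunks of text."""
        self._counter.update(text.translate(self._xlate_table))

    @property
    def counts(self):
        return self._counter

With something like this, callers would create a LetterCounter, feed it text chunk by chunk, and read the running result from counts. But that’s a topic for the next article.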

For now, we’ll wrap up our progress. We’re not yet at the point where we’re ready to explore output production options (will we ever get there?), and we’re still not totally happy with the API we built: it’s powerful, but maybe too low-level for common usage. Let’s explore the idea of bringing together what we have created so far under a Python class in the next article, and see what we can learn from it.

[ 2018-05-29 UPDATE: Follow-up article here. ]


Wrap up

We started off this article with a single motivation: eliminating hard-coded values from the letter counting function, which is always a good coding principle. In doing that, we decided to raise the bar on the requirements, looking for a more general text processing solution. We then explored Unicode and the unicodedata Standard Library module, which we used to build fundamental helper functions. Along the way, we considered the letter counting function API which, once more general, might have become a little less intuitive.

To conclude, I would highlight:

  • Text processing is not necessarily obvious, even for a task as seemingly simple as letter counting.
    Unicode is complex and powerful. Understanding its fundamentals is crucial for anyone wanting to have at least a chance of handling text correctly. That is not enough, however: as in many other cases, specific domain knowledge is also critical, as we illustrated with the Russian distinction between 'Й' and 'И'.

  • Beware of raising the bar on requirements.
    In the process of eliminating hard-coded values from our code, we decided to raise the bar on the letter counting function requirements. This, of course, made sense in our journey’s context, motivating the exploration of solutions to multiple challenges. We also raised the idea of a potentially simpler class-based API revamp, again inducing more requirement changes. Would any of these be a good idea in a real-world scenario? Maybe, maybe not. It depends on the problem at hand and on how future-proof we want our code to be. Creating single-purpose, generic functions tends to be a good coding principle. Striving to be too generic may not be worth the cost and, more often than not, we won’t be able to guess which change the world will impose on us next. This is something you will have to take into consideration every time, and decide for yourself, on a case by case basis.

  • Avoiding hard-coded values is a good principle.
    This does not mean that every such value should be changeable via function arguments or any other mechanism. While doing that with our code actually led to a more generic and useful letter counting function, in other cases it could make no sense at all: the HTTP response code for “not found” is 404, a standardized constant. What you should consider, though, is using a properly named (pseudo-)constant, like NOT_FOUND, instead of spreading 404s throughout your code, as sketched below. This will improve code readability a lot.
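A minimal illustration, leaning on the Standard Library’s http module, which already provides such named constants:

from http import HTTPStatus

def resource_exists(status_code):
    # Comparing against a named constant reads much better than a bare 404.
    return status_code != HTTPStatus.NOT_FOUND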

Thanks for reading. See you next time.


  1. This is not the case in Python 2, where the str type holds bytes rather than Unicode text: to process Unicode text in Python 2, the unicode type should be used instead.