
A Python Journey on Counting: Usability


Moving on with our journey, in this article we’ll look into possible letter counting API improvements.

To recap, we started exploring the general “counting things” problem in the first article of the series, building a letter histogram from a given text as a vehicle for introducing several techniques and useful Standard Library tools. We iterated a few times until we obtained a short, simple, and fast letter counting function.

In the second article, with the purpose of generalizing our letter counting function, we delved into unicode land, learning about the unicodedata Standard Library module and highlighting a few of the many language-specific nuances we wanted our code to handle. With that knowledge we created several utility functions that, while powerful, are not necessarily easy to use.

What can we do? Let’s see…


About “Usability”

In the previous article we ended up with a set of functions that can be used together to count letters in a generic way, supporting:

  • Any unicode alphabet, not just the Latin one.
  • Conflating upper/lower-case letter counts.
  • Conflating accented letter counts with their non-accented counterparts.
  • Ignoring any given set of symbols, including common whitespace and punctuation.

We broke down the problem into the following major steps:

  • Create a base alphabet to be processed.
  • Create a letter map with per-letter operations: ignoring, upper/lower-casing, and accent-stripping.
  • Process the input text with the letter map.
  • Count the letters in the processed text.

Looking back at the code, this is what we built:

from collections import Counter
from itertools import chain
import string as s
import sys
from unicodedata import name, category, normalize, combining


# Alphabet creation

def unicode_alphabet_gen(name_start='LATIN', category_start='L'):
    """ ... """
    for codepoint in range(sys.maxunicode):
        letter = chr(codepoint)
        if name(letter, '').startswith(name_start) and category(letter).startswith(category_start):
            yield letter


def unicode_alphabet(name_start='LATIN', category_start='L'):
    """ ... """
    return ''.join(unicode_alphabet_gen(name_start, category_start))


# Letter map operations

def discard(these=s.whitespace+s.punctuation, these_too=''):
    """ ... """
    def operation(letter_map):
        for letter in chain(these, these_too):
            letter_map[letter] = None
    return operation


def change_case(upper=True):
    """ ... """
    def operation(letter_map):
        for letter, mapped in letter_map.items():
            if mapped is None:
                continue
            letter_map[letter] = mapped.upper() if upper else mapped.lower()
    return operation


def strip_accents(exceptions='', normalization='NFD'):
    """ ... """
    def operation(letter_map):
        for letter, mapped in letter_map.items():
            if mapped is None or mapped in exceptions:
                continue
            letter_map[letter] = ''.join(
                l for l in normalize(normalization, mapped)
                if not combining(l)
            )
    return operation


# Letter map creation

def create_letter_map(alphabet=s.ascii_letters, operations=(discard(), change_case())):
    """ ... """
    letter_map = dict(zip(alphabet, alphabet))
    for operation in operations:
        operation(letter_map)
    return letter_map


# Top level letter-counting function

def letter_counts_take9(text, letter_map=create_letter_map()):
    """ ... """
    xlate_table = str.maketrans(letter_map)
    just_letters = text.translate(xlate_table)
    return Counter(just_letters)

What can we say?… Well, it’s certainly good that each function addresses one isolated task: this is always a good design principle. However, using them together, with no reference to documentation or examples, might not be obvious or intuitive to most people. Additionally, note the amount of non-trivial setup a caller may need to complete before getting to the actual letter counting call:

>>> alphabet = unicode_alphabet('CYRILLIC')
>>> operations = (discard(), change_case(), strip_accents(exceptions='Й'))
>>> letter_map = create_letter_map(alphabet, operations)
>>> letter_counts_take9('русский', letter_map)
Counter({'С': 2, 'Р': 1, 'У': 1, 'К': 1, 'И': 1, 'Й': 1})

Wouldn’t it be nice to have a different function signature, maybe using a lang argument, and have it “just work”?

>>> letter_counts_take10('русский', lang='russian')
Counter({'С': 2, 'Р': 1, 'У': 1, 'К': 1, 'И': 1, 'Й': 1})
>>> letter_counts_take10('Ελληνικά', lang='greek')
Counter({'Λ': 2, 'Ε': 1, 'Η': 1, 'Ν': 1, 'Ι': 1, 'Κ': 1, 'Α': 1})
>>> letter_counts_take10('¿Cómo está?', lang='spanish')
Counter({'O': 2, 'C': 1, 'M': 1, 'E': 1, 'S': 1, 'T': 1, 'A': 1})

This is precisely what I mean by “usability”. Finding a good balance between an API’s flexibility (much like what we have built up to now, with lots of power derived from combining several individual functions) and an API’s simplicity (in the sense of making it very clear, natural, and hopefully obvious) is rarely an easy challenge, but it is certainly one that deserves our attention.

One common solution is to provide two API layers: a higher-level one, simpler and more direct; and a lower-level one, more powerful and invariably more complex. Let’s see what we can come up with by having our letter counting function take a lang argument as a high-level indicator of the alphabet and associated letter-counting rules a caller is interested in, instead of the less intuitive letter_map used in the current version.

We will start with a module-level¹ _LANGUAGES dict:

  • Its keys will be the “languages” supported by the new letter counting function.
  • Its values will contain the associated alphabet and letter mapping operations, reusing the utility functions we have created so far.

_LANGUAGES = {
    'russian': lambda: (
        unicode_alphabet('CYRILLIC'),
        (discard(), change_case(), strip_accents(exceptions='Й')),
    ),
    'greek': lambda: (
        unicode_alphabet('GREEK'),
        (discard(), change_case(), strip_accents()),
    ),
    'spanish': lambda: (
        unicode_alphabet('LATIN'),
        (discard(these_too='¿¡'), change_case(), strip_accents()),
    ),
    None: lambda: (
        s.ascii_letters,
        (discard(), change_case()),
    )
}

Note how the values are actually lambdas that, when called, return an (alphabet, operations) tuple. Using lambdas here ensures that none of the actual, per-language alphabet/operation functions are called when the module is loaded and _LANGUAGES is defined. It will be up to the code accessing _LANGUAGES to call them, when needed. Otherwise, just loading the module and having _LANGUAGES defined would incur unnecessary processing which, looking at the unicode_alphabet implementation, would probably not be negligible at all.
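
To make the laziness concrete, here is a purely illustrative comparison, assuming the functions defined above are in scope (the _EAGER_LANGUAGES name is made up for this sketch and is not part of the code we are building):

# Illustrative only: an eager variant would pay the unicode-scanning cost for
# every language as soon as the dictionary is built, at module load time.
_EAGER_LANGUAGES = {
    'russian': (
        unicode_alphabet('CYRILLIC'),   # full code point scan runs right here
        (discard(), change_case(), strip_accents(exceptions='Й')),
    ),
    # ...one eager entry per language, each paying the same cost up front...
}

# With the lambda wrapper, the cost is deferred to the point of use:
entry = _LANGUAGES['russian']       # cheap dictionary lookup, nothing computed
alphabet, operations = entry()      # the expensive scan only happens now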

Having said that, the _LANGUAGES dictionary feels like a reasonably readable, mostly declarative approach to consolidating per-language processing information, in which the None key represents a default behavior. With that, we can write letter_counts_take10 as:

def letter_counts_take10(text, lang=None):
    """ ... """
    alphabet_ops_callable = _LANGUAGES[lang]
    alphabet, operations = alphabet_ops_callable()
    letter_map = create_letter_map(alphabet, operations)
    xlate_table = str.maketrans(letter_map)
    just_letters = text.translate(xlate_table)
    return Counter(just_letters)

It is still very linear, operating at a slightly higher level of abstraction, going through each of the major steps: creating an alphabet and its associated operations first, then creating a letter-map and processing the input text with it, and, finally, doing the actual letter counting. Of course, it starts off by looking up the passed-in lang in the module-level _LANGUAGES dictionary which, by itself, will raise a KeyError when the lang entry is not found.
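
As a side note, if that bare KeyError ever feels too raw for a high-level entry point, one purely illustrative variation would be to wrap the lookup and raise a more descriptive error instead (letter_counts_take10_checked is a hypothetical name, not something we will use going forward):

def letter_counts_take10_checked(text, lang=None):
    """Hypothetical variation of letter_counts_take10 with a friendlier failure."""
    try:
        alphabet_ops_callable = _LANGUAGES[lang]
    except KeyError:
        known = ', '.join(repr(k) for k in _LANGUAGES if k is not None)
        raise ValueError(f'unsupported lang {lang!r}, try one of: {known}') from None
    alphabet, operations = alphabet_ops_callable()
    letter_map = create_letter_map(alphabet, operations)
    xlate_table = str.maketrans(letter_map)
    return Counter(text.translate(xlate_table))

For the rest of the article, though, we will stick with the simpler behavior and let the KeyError propagate.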

To complement this simpler, higher-level API, we could now create a register_lang function, operating at a slightly lower-level, to expose the full power of the underlying alphabet generation and letter-map operations we have so far, adding/replacing entries in the module level _LANGUAGES dictionary, used by letter_counts_take10:

def register_lang(lang, alphabet, operations):
    """ ... """
    _LANGUAGES[lang] = lambda: (alphabet, operations)

Exercising both functions with a snippet of French text, we would get…

>>> letter_counts_take10('Elle est née', lang='french')
...
KeyError: 'french'

…which is expected. Using the register_lang function, however, we can leverage the full power of the lower-level API…

>>> french_alphabet = unicode_alphabet('LATIN')
>>> french_operations = (discard(), change_case(), strip_accents())
>>> register_lang('french', french_alphabet, french_operations)

…and then:

>>> letter_counts_take10('Elle est née', lang='french')
Counter({'E': 5, 'L': 2, 'S': 1, 'T': 1, 'N': 1})

Are we better off this way? While there’s no universal answer to such a question, I would argue that in the general case we are. We now have APIs at two distinct levels:

  • High-level API

    Use the letter_counts_take10 function, passing it the text and an optional lang. If it doesn’t support a given language or set of letter-counting rules, leverage the low-level API to describe the desired operations first.

  • Low-level API

    Use the register_lang function along with the alphabet generation and letter-map operations to alter or define new sets of letter-counting rules, building on top of the existing unicode_alphabet, discard, change_case, and strip_accents functions.
    In fact, nothing stops the caller from supplying their own custom letter-map operation functions. The API is still pretty wide open.

Turkish Lower-/Upper-Casing

Written Turkish uses the Latin alphabet with, at least, one particular detail that deserves our attention now:

  • The letter 'i' upper-cases to 'İ'.
  • The letter 'I' lower-cases to 'ı'.

However, Python string operations do not handle these particular case changes:

>>> 'i'.upper()         # Should be 'İ'.
'I'
>>> 'I'.lower()         # Should be 'ı'.
'i'
>>> 'ı'.upper()         # This one is correct.
'I'
>>> 'İ'.lower()         # Should be 'i'.
'i̇'

What this means is that our existing change_case function may not be fit for processing Turkish text, given that it uses the str.lower and str.upper methods. However, since we have an open-ended low-level API, we can always create a custom turkish_upper_case letter-map operation function, using it along with the existing functions. Here’s a rough take on that:

def turkish_upper_case():
    """ ... """
    exceptions = {'i': 'İ', 'ı': 'I'}
    def operation(letter_map):
        for letter, mapped in letter_map.items():
            if mapped is None:
                continue
            letter_map[letter] = exceptions.get(mapped, mapped.upper())
    return operation

…which could then be used when registering the Turkish language…

>>> turkish_alphabet = unicode_alphabet('LATIN')
>>> turkish_operations = (discard(), turkish_upper_case(), strip_accents(exceptions='ÇĞIİÖŞÜ'))
>>> register_lang('turkish', turkish_alphabet, turkish_operations)

…and then:

>>> letter_counts_take10('Günaydın!', lang='turkish')
Counter({'N': 2, 'G': 1, 'Ü': 1, 'A': 1, 'Y': 1, 'D': 1, 'I': 1})
>>> letter_counts_take10('Diyarbakır', lang='turkish')
Counter({'A': 2, 'R': 2, 'D': 1, 'İ': 1, 'Y': 1, 'B': 1, 'K': 1, 'I': 1})
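
For our counting purposes the upper-casing direction is all we need but, for completeness, a lower-casing counterpart would follow the exact same pattern. Here’s a hypothetical sketch, mirroring turkish_upper_case (nothing else in the article relies on it):

def turkish_lower_case():
    """Hypothetical mirror of turkish_upper_case, producing lower-cased counts."""
    exceptions = {'I': 'ı', 'İ': 'i'}
    def operation(letter_map):
        for letter, mapped in letter_map.items():
            if mapped is None:
                continue
            # Handle the dotted/dotless 'I' pair explicitly, defer to str.lower otherwise.
            letter_map[letter] = exceptions.get(mapped, mapped.lower())
    return operation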

This short Turkish lower-/upper-casing interlude goes to show that, indeed, our revised API design is better:

  • We have a simple way to use it, as long as we’re working with a language for which the rules are known.
  • We have a powerful way to use it, registering new languages and rules, which can even be extended with custom code.

On top of that, design-wise, we have a simple way of supporting as many “built-in” languages as we want, via the declarative approach we took with the _LANGUAGES dictionary.
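
For instance, making Portuguese a “built-in” language would amount to one more declarative entry in the _LANGUAGES literal; the hypothetical 'portuguese' entry below simply reuses the Latin alphabet and the default operations:

    # One more “built-in” language: a single declarative entry in _LANGUAGES.
    'portuguese': lambda: (
        unicode_alphabet('LATIN'),
        (discard(), change_case(), strip_accents()),
    ),

With that entry in place, the high-level API would pick it up with no further setup:

>>> letter_counts_take10('Olá, tudo bem?', lang='portuguese')
Counter({'O': 2, 'L': 1, 'A': 1, 'T': 1, 'U': 1, 'D': 1, 'B': 1, 'E': 1, 'M': 1})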


Other Directions

One other API design consideration that falls under the “usability” motto is the ability to handle arbitrarily large input.

From our very first take on the letter counting function, the API expects the full text to be passed in, counting all the letters in one go. Given that it is very plausible that any real-world text would be sourced from either a file or a network connection, does it make sense to require the full text to be loaded into memory before processing it? Isn’t that unnecessarily limiting and wasteful? Of course it all depends on the particular set of requirements and possible use cases we envision, but let’s assume we’d like to be able to drop that requirement.

A possible solution would be to create a LetterCounter class, separating the letter-processing initialization from the actual letter counting, and keeping track of the counts in a dedicated Counter object which, in turn, supports incremental updates:

class LetterCounter(object):

    def __init__(self, text='', lang=None):
        alphabet_ops_callable = _LANGUAGES[lang]
        alphabet, operations = alphabet_ops_callable()
        letter_map = create_letter_map(alphabet, operations)
        self._xlate_table = str.maketrans(letter_map)
        self._counter = Counter()
        self.update(text)

    def update(self, text):
        just_letters = text.translate(self._xlate_table)
        return self._counter.update(just_letters)

    @property
    def counts(self):
        return self._counter

Mostly, what we’ve done here was break letter_counts_take10 in two (plus one):

  • The __init__ method, with the same argument signature, handles the setup of the alphabet and letter-map operations to create a translation table, which is stored in the self._xlate_table attribute for later use by the update method. It then initializes the self._counter attribute to a freshly created collections.Counter object, which is updated with the passed-in text.

  • The update method, which actually does the counting: first by applying the pre-calculated string translation table to the passed-in text, then by using the Counter.update method to update the per-letter counts.

  • For convenience, we also added the counts property to expose the actual counts in a simple and controlled way².

Let’s take it for a spin:

>>> lc = LetterCounter('Hello')
>>> lc.counts
Counter({'L': 2, 'H': 1, 'E': 1, 'O': 1})
>>> lc.update('there!')
>>> lc.counts
Counter({'E': 3, 'H': 2, 'L': 2, 'O': 1, 'T': 1, 'R': 1})

Let’s now compare that result with the “all in one go” call:

>>> LetterCounter('Hello there!').counts
Counter({'E': 3, 'H': 2, 'L': 2, 'O': 1, 'T': 1, 'R': 1})

It works pretty much as expected, good.

This now allows us to rewrite the plain letter counting function on top of it…

def letter_counts_take11(text, lang=None):
    """ ... """
    return LetterCounter(text=text, lang=lang).counts

…as well as a higher-level function to count letters in an arbitrarily large file, for example:

def letter_counts_file(filename, encoding='UTF-8', lang=None):
    """ ... """
    lc = LetterCounter(lang=lang)
    with open(filename, encoding=encoding) as f:
        for line in f:
            lc.update(line)
    return lc.counts
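
Nothing in LetterCounter is actually file-specific, so the same pattern applies to any iterable source of text: lines read from a socket, chunks produced by a generator, and so on. A small hypothetical helper (not part of the code above) could look like this:

def letter_counts_chunks(chunks, lang=None):
    """Hypothetical helper: count letters across any iterable of text chunks."""
    lc = LetterCounter(lang=lang)
    for chunk in chunks:
        lc.update(chunk)
    return lc.counts

For example, letter_counts_chunks(sys.stdin) would count letters in whatever text is piped into the program, one line at a time.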


The LetterCounter class implementation is very simple and, note, I have opted to leave the alphabet generation and letter-map operation functions, the module-level _LANGUAGES dictionary, and the register_lang function outside of the class scope. Could each of those have been moved into the class? While they could, I found no big benefit in doing so, for two key reasons:

  • The first is that it would distract from the fundamental idea and motivation for creating the class: tracking state, in both the continuously used string translation table and the updatable letter counts.

  • The second is that, other than having everything “nice and tidy” under the same class, no other immediate benefit would be obtained. Keeping our existing functions, and the module-level _LANGUAGES dictionary, the way they are required no changes and led to no additional complexity: neither in the existing code nor in foreseeable future changes.


Wrap up

I feel I painted myself a bit into a corner with this article series, and in particular with this specific article. Usability and API design are, by themselves, very wide-ranging topics, which I’m conflating a bit here. Nonetheless, this being more of a beginner-oriented journey and less of a very directed, single-topic, all-encompassing piece, I do believe there is valuable information here, at least in motivating readers and raising awareness of the topic. Without much further ado, let’s review the key ideas:

  • Usability, as in “fitness to be used for a given purpose”, is definitely subjective.

    We explored it along two lines: first, striving to deliver a more intuitive letter counting API to casual callers, while still exposing the original, more powerful one; then, supporting incremental text input, not requiring the full text to be in memory at any given time.

  • Balancing simple APIs with power and flexibility.

    This is often achieved by having a simple, zero-boilerplate API that handles common use cases, assuming a given set of default behaviors. We did this by having letter_counts_take10 take a language argument, driving its work from an internal language registry, the _LANGUAGES dictionary. We then exposed a second, more powerful API, allowing callers with specific needs to leverage the existing code by registering new languages and their associated letter-counting operations which, due to the use of functions as arguments, can even integrate custom code (as the Turkish lower-/upper-casing example has shown).

    The fundamental aspect of simplifying the letter counting function API lies in the fact that we replaced an argument describing “how to” count letters (the letter_map argument in letter_counts_take9) with a higher-level one specifying “what” language rules we want to apply (lang in letter_counts_take10).

    In our example, using an intermediary level of indirection between the simple and the advanced API, via the module-level _LANGUAGES dictionary, resulted in a clean and easy-to-manage solution. Many times, creating a level of indirection between two complementary (or completely separate) aspects of the code is a good design principle (maybe we’ll visit this topic in the future).

  • Using a class became useful when we wanted to track state across multiple, incremental invocations.

    In particular, our implementation tracks two pertinent things: the string translation table, which only needs to be calculated once and may be computationally costly to create, and, naturally, the running letter count, kept in a Counter object.

    We left everything else out of the class: the alphabet generation and letter-map operation functions, the language registry dictionary, and the register_lang function. Sometimes, just a function is simpler and good enough for a given purpose.

So this is it for now. I hope this installment, while maybe a bit less structured than the ones before it, resonates with readers somehow, helping them develop their own awareness and acuity in topics like API design by sharing a few possible ideas on how to get there.

Thanks for reading. See you soon.


  1. A “module” is a file containing Python code, normally with a “.py” extension. Thus, by “module level”, I mean a global variable declared in the source Python file we’ve been working with. Such variables are also called “module global”. 

  2. Other options would be valid and could even be more useful and powerful, depending on the use cases, including automatically exposing the Counter object’s API at the LetterCounter level, for example. However, this being a beginner directed article, I’ll refrain from exploring more advanced API options, for now.