Moving on with our journey, in this article we’ll look into possible letter counting API improvements.
To recap, we started exploring the general “counting things” problem in the first article in the series, focused on building a letter histogram from a given text, as a motivation to introduce several techniques and useful Standard Library tools. We iterated a few times until we obtained a short, simple and fast letter counting function.
In the second article, with the purpose of generalizing our letter counting function, we delved into Unicode land, learning about the unicodedata Standard Library module, and highlighting a few of many language specific nuances we wanted our code to handle. With that knowledge we created several utility functions that, while powerful, are not necessarily easy to use.
What can we do? Let’s see…
About “Usability”
In the previous article we ended up with a set of functions that can be used together to count letters in a generic approach, supporting:
- Any unicode alphabet, not being restricted to the Latin one.
- Conflating upper/lower-case letter counts.
- Conflating accented letter counts with their non-accented counterparts.
- Ignoring any given set of symbols, including common whitespace and punctuation.
We broke down the problem into the following major steps:
- Create a base alphabet to be processed.
- Create a letter map with per-letter operations: ignoring, upper/lower-casing, and accent-stripping.
- Process the input text with the letter map.
- Count the letters in the processed text.
Looking back at the code, this is what we built:
from collections import Counter
from itertools import chain
import string as s
import sys
from unicodedata import name, category, normalize, combining

# Alphabet creation
def unicode_alphabet_gen(name_start='LATIN', category_start='L'):
    """ ... """
    for codepoint in range(sys.maxunicode):
        letter = chr(codepoint)
        if name(letter, '').startswith(name_start) and category(letter).startswith(category_start):
            yield letter

def unicode_alphabet(name_start='LATIN', category_start='L'):
    """ ... """
    return ''.join(unicode_alphabet_gen(name_start, category_start))

# Letter map operations
def discard(these=s.whitespace+s.punctuation, these_too=''):
    """ ... """
    def operation(letter_map):
        for letter in chain(these, these_too):
            letter_map[letter] = None
    return operation

def change_case(upper=True):
    """ ... """
    def operation(letter_map):
        for letter, mapped in letter_map.items():
            if mapped is None:
                continue
            letter_map[letter] = mapped.upper() if upper else mapped.lower()
    return operation

def strip_accents(exceptions='', normalization='NFD'):
    """ ... """
    def operation(letter_map):
        for letter, mapped in letter_map.items():
            if mapped is None or mapped in exceptions:
                continue
            letter_map[letter] = ''.join(
                l for l in normalize(normalization, mapped)
                if not combining(l)
            )
    return operation

# Letter map creation
def create_letter_map(alphabet=s.ascii_letters, operations=(discard(), change_case())):
    """ ... """
    letter_map = dict(zip(alphabet, alphabet))
    for operation in operations:
        operation(letter_map)
    return letter_map

# Top level letter-counting function
def letter_counts_take9(text, letter_map=create_letter_map()):
    """ ... """
    xlate_table = str.maketrans(letter_map)
    just_letters = text.translate(xlate_table)
    return Counter(just_letters)
What can we say?… Well, it’s certainly good that each function addresses one isolated task: this is always a good design principle. However, using them together, with no reference to documentation or examples, might not be obvious or intuitive to most people. Additionally, note the amount of non-trivial setup a caller may need to complete before getting to the actual letter counting call:
>>> alphabet = unicode_alphabet('CYRILLIC')
>>> operations = (discard(), change_case(), strip_accents(exceptions='Й'))
>>> letter_map = create_letter_map(alphabet, operations)
>>> letter_counts_take9('русский', letter_map)
Counter({'С': 2, 'Р': 1, 'У': 1, 'К': 1, 'И': 1, 'Й': 1})
Wouldn’t it be nice to have a different function signature, maybe using a lang argument, and have it “just work”?
>>> letter_counts_take10('русский', lang='russian')
Counter({'С': 2, 'Р': 1, 'У': 1, 'К': 1, 'И': 1, 'Й': 1})
>>> letter_counts_take10('Ελληνικά', lang='greek')
Counter({'Λ': 2, 'Ε': 1, 'Η': 1, 'Ν': 1, 'Ι': 1, 'Κ': 1, 'Α': 1})
>>> letter_counts_take10('¿Cómo está?', lang='spanish')
Counter({'O': 2, 'C': 1, 'M': 1, 'E': 1, 'S': 1, 'T': 1, 'A': 1})
This is precisely what I mean by “usability”. Finding a good balance between an API’s flexibility (much like what we have built up to now, with lots of power derived from combining several individual functions) and its simplicity (in the sense of making it clear, natural, and hopefully obvious) is rarely an easy challenge, but it is certainly one that deserves our attention.
One common solution is to provide two API layers: a higher-level one, simpler and more direct; and a lower-level one, more powerful and invariably more complex. Let’s see what we can come up with in trying to have our letter counting function take the lang argument as a high-level indicator of the alphabet and associated letter-counting rules a caller is interested in, instead of the less intuitive letter_map used in the current version.
We will start with a module[1] level _LANGUAGES dict:
- Its keys will be the “languages” supported by the new letter counting function.
- Its values will contain the associated alphabet and letter mapping operations, reusing the utility functions we have created so far.
_LANGUAGES = {
    'russian': lambda: (
        unicode_alphabet('CYRILLIC'),
        (discard(), change_case(), strip_accents(exceptions='Й')),
    ),
    'greek': lambda: (
        unicode_alphabet('GREEK'),
        (discard(), change_case(), strip_accents()),
    ),
    'spanish': lambda: (
        unicode_alphabet('LATIN'),
        (discard(these_too='¿¡'), change_case(), strip_accents()),
    ),
    None: lambda: (
        s.ascii_letters,
        (discard(), change_case()),
    ),
}
Note how the values are actually lambdas that, when called, return an (alphabet, operations) tuple. Using lambdas here ensures that none of the per-language alphabet/operation functions are called when the module is loaded and _LANGUAGES is defined. It is up to the code accessing _LANGUAGES to call them when needed. Otherwise, just loading the module and having _LANGUAGES defined would incur unnecessary processing which, looking at the unicode_alphabet implementation, would probably not be negligible at all.
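To make the deferral concrete, here is a minimal sketch. It assumes a hypothetical expensive_alphabet stand-in (not part of the code above) that simply records each call, so we can observe when the work actually happens:

```python
calls = []

def expensive_alphabet(name):
    """Hypothetical stand-in for unicode_alphabet; records each call."""
    calls.append(name)
    return name.lower()

# Wrapping each value in a lambda defers the call until lookup time.
lazy_languages = {
    'greek': lambda: (expensive_alphabet('GREEK'), ()),
    'russian': lambda: (expensive_alphabet('CYRILLIC'), ()),
}

calls_after_definition = list(calls)    # nothing has run yet
alphabet, operations = lazy_languages['greek']()
calls_after_lookup = list(calls)        # only the requested entry ran
```

Defining the dictionary triggers no calls at all, and looking up one language only pays for that language.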
Having said that, the _LANGUAGES dictionary feels like a reasonably readable, mostly declarative approach to consolidating per-language processing information, in which the None key represents a default behavior. With that, we can write letter_counts_take10 as:
def letter_counts_take10(text, lang=None):
    """ ... """
    alphabet_ops_callable = _LANGUAGES[lang]
    alphabet, operations = alphabet_ops_callable()
    letter_map = create_letter_map(alphabet, operations)
    xlate_table = str.maketrans(letter_map)
    just_letters = text.translate(xlate_table)
    return Counter(just_letters)
It is still very linear, operating at a slightly higher level of abstraction, going through each of the major steps: creating an alphabet and its associated operations first, then creating a letter map and processing the input text with it and, finally, doing the actual letter counting. Of course, it starts off by looking up the passed in lang in the _LANGUAGES module level dictionary which, by itself, will fail by raising a KeyError when the lang entry is not found.
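If a bare KeyError feels too terse for callers, one possible refinement is to wrap the lookup and report which languages are supported. This is only a sketch and not what the function above does: lookup_language is a hypothetical helper, and the registry here holds placeholder strings instead of the real lambdas, just to keep the snippet runnable on its own.

```python
# Hypothetical stand-in for the registry; the values are placeholders.
_LANGUAGES = {
    'russian': 'russian-rules',
    'greek': 'greek-rules',
    'spanish': 'spanish-rules',
    None: 'default-rules',
}

def lookup_language(lang):
    """Return the registry entry for lang, or raise a descriptive ValueError."""
    try:
        return _LANGUAGES[lang]
    except KeyError:
        # None is the default entry, so leave it out of the listing.
        supported = sorted(key for key in _LANGUAGES if key is not None)
        raise ValueError(
            f'unsupported lang {lang!r}; supported: {", ".join(supported)}'
        ) from None
```

Whether a ValueError is preferable to the plain KeyError is a judgment call; the rest of this article sticks with the simpler behavior.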
To complement this simpler, higher-level API, we could now create a register_lang function, operating at a slightly lower level, to expose the full power of the underlying alphabet generation and letter-map operations we have so far, adding/replacing entries in the module level _LANGUAGES dictionary used by letter_counts_take10:
def register_lang(lang, alphabet, operations):
    """ ... """
    _LANGUAGES[lang] = lambda: (alphabet, operations)
Exercising both functions with a snippet of French text, we would get…
>>> letter_counts_take10('Elle est née', lang='french')
...
KeyError: 'french'
…which is expected. Using the register_lang function, however, we can leverage the full power of the lower-level API…
>>> french_alphabet = unicode_alphabet('LATIN')
>>> french_operations = (discard(), change_case(), strip_accents())
>>> register_lang('french', french_alphabet, french_operations)
…and then:
>>> letter_counts_take10('Elle est née', lang='french')
Counter({'E': 5, 'L': 2, 'S': 1, 'T': 1, 'N': 1})
Are we better off this way? While there’s no universal answer to such a question, I would argue that in the general case we are. We now have APIs at two distinct levels:
- High-level API.
  Use the letter_counts_take10 function, passing it the text and optional lang. If it doesn’t support a given language or letter counting set of rules, leverage the low-level API to describe the desired operations first.
- Low-level API.
  Use the register_lang function along with the alphabet generation and letter-map operations to alter or define new letter counting sets of rules, building on top of the existing unicode_alphabet, discard, change_case, and strip_accents functions.
In fact, nothing stops the caller from supplying their own custom letter-map operation functions. The API is still pretty wide open.
Turkish Lower-/Upper-Casing
Written Turkish uses the Latin alphabet with, at least, one particular detail that will deserve our attention now:
- The letter 'i' upper-cases to 'İ'.
- The letter 'I' lower-cases to 'ı'.
However, Python string operations do not handle this particular case-change:
>>> 'i'.upper() # Should be 'İ'.
'I'
>>> 'I'.lower() # Should be 'ı'.
'i'
>>> 'ı'.upper() # This one is correct.
'I'
>>> 'İ'.lower() # Should be 'i'.
'i̇'
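The last result is worth a closer look. A quick check with unicodedata.name (a small sketch using nothing beyond the Standard Library) suggests that the lower-cased string is actually made of two code points instead of one:

```python
from unicodedata import name

dotted_capital_i = '\u0130'    # 'İ', LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = dotted_capital_i.lower()

# List each code point in the result by its official Unicode name.
code_point_names = [name(ch) for ch in lowered]
print(len(lowered), code_point_names)
```

On the Python 3 versions I am aware of, this reports a plain 'i' followed by a combining dot above, which is why the output looks like an 'i' carrying an extra dot.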
What this means is that our existing change_case function may not be fit for processing Turkish text, given that it uses the str.lower and str.upper methods. However, since we have an open ended low-level API, we can always create a custom turkish_upper_case letter-map operation function, using it along with the existing functions. Here’s a rough take on that:
def turkish_upper_case():
    """ ... """
    exceptions = {'i': 'İ', 'ı': 'I'}
    def operation(letter_map):
        for letter, mapped in letter_map.items():
            if mapped is None:
                continue
            letter_map[letter] = exceptions.get(mapped, mapped.upper())
    return operation
…which could then be used when registering the Turkish language…
>>> turkish_alphabet = unicode_alphabet('LATIN')
>>> turkish_operations = (discard(), turkish_upper_case(), strip_accents(exceptions='ÇĞIİÖŞÜ'))
>>> register_lang('turkish', turkish_alphabet, turkish_operations)
…and then:
>>> letter_counts_take10('Günaydın!', lang='turkish')
Counter({'N': 2, 'G': 1, 'Ü': 1, 'A': 1, 'Y': 1, 'D': 1, 'I': 1})
>>> letter_counts_take10('Diyarbakır', lang='turkish')
Counter({'A': 2, 'R': 2, 'D': 1, 'İ': 1, 'Y': 1, 'B': 1, 'K': 1, 'I': 1})
This short Turkish lower-/upper-casing interlude goes to show that, indeed, our revised API design is better:
- We have a simple way to use it, as long as we’re working with a language for which the rules are known.
- We have a powerful way to use it, registering new languages and rules, which can even be extended with custom code.
On top of that, design-wise, we have a simple way of supporting as many “built-in” languages as we want, via the declarative approach we took with the _LANGUAGES dictionary.
Other Directions
One other API design consideration we can file under the “usability” motto is the API’s ability to handle arbitrarily large input.
From our very first take on the letter counting function, the API expects the full text to be passed in, counting all the letters in one go.
Given that it is very plausible that any real-world text would be sourced from either a file or a network connection, does it make sense to require the full text to be loaded into memory before processing it? Isn’t that unnecessarily limiting and wasteful? Of course it all depends on the particular set of requirements and possible use cases we envision, but let’s assume we’d like to be able to drop that requirement.
A possible solution would be to create a LetterCounter class, which would separate the letter processing initialization from the actual letter counting, keeping track of it in a dedicated Counter object, in turn supporting incremental updates:
class LetterCounter(object):

    def __init__(self, text='', lang=None):
        alphabet_ops_callable = _LANGUAGES[lang]
        alphabet, operations = alphabet_ops_callable()
        letter_map = create_letter_map(alphabet, operations)
        self._xlate_table = str.maketrans(letter_map)
        self._counter = Counter()
        self.update(text)

    def update(self, text):
        just_letters = text.translate(self._xlate_table)
        return self._counter.update(just_letters)

    @property
    def counts(self):
        return self._counter
Mostly, what we’ve done here was break letter_counts_take10 in two (plus one):
- The __init__ method, with the same argument signature, handles the setup of the alphabet and letter-map operations to create a translation table, which is stored in the self._xlate_table attribute for later use by the update method. It then initializes the self._counter attribute to a freshly created collections.Counter object, which is updated with the passed in text.
- The update method, which actually does the counting: first by applying the pre-calculated string translation table to the passed in text, then by using the Counter.update method to actually update the per-letter counts.
- For convenience, we also added the counts property to expose the actual counts in a simple and controlled way[2].
Let’s take it for a spin:
>>> lc = LetterCounter('Hello')
>>> lc.counts
Counter({'L': 2, 'H': 1, 'E': 1, 'O': 1})
>>> lc.update('there!')
>>> lc.counts
Counter({'E': 3, 'H': 2, 'L': 2, 'O': 1, 'T': 1, 'R': 1})
Let’s now compare that result with the “all in one go” call:
>>> LetterCounter('Hello there!').counts
Counter({'E': 3, 'H': 2, 'L': 2, 'O': 1, 'T': 1, 'R': 1})
It works pretty much as expected, good.
This now allows us to rewrite the plain letter counting function on top of it…
def letter_counts_take11(text, lang=None):
""" ... """
return LetterCounter(text=text, lang=lang).counts
…as well as a higher level function to count letters in an arbitrarily large file, for example:
def letter_counts_file(filename, encoding='UTF-8', lang=None):
""" ... """
lc = LetterCounter(lang=lang)
with open(filename, encoding=encoding) as f:
for line in f:
lc.update(line)
return lc.counts
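Iterating line by line works well for line-oriented text, but a file with very long lines (or no newlines at all) would still be read in large pieces. A hedged alternative is to read fixed-size chunks from any text-mode stream. In the sketch below, letter_counts_stream and chunk_size are hypothetical names, and a simplified ASCII-only translation table stands in for the LetterCounter setup so that the snippet runs on its own:

```python
from collections import Counter
from io import StringIO
import string

def letter_counts_stream(stream, xlate_table, chunk_size=64 * 1024):
    """Count letters from a text stream, reading chunk_size characters at a time."""
    counter = Counter()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:    # an empty string signals the end of the stream
            break
        counter.update(chunk.translate(xlate_table))
    return counter

# Simplified ASCII-only table: upper-case letters, drop whitespace/punctuation.
letter_map = {c: c.upper() for c in string.ascii_letters}
letter_map.update({c: None for c in string.whitespace + string.punctuation})
xlate_table = str.maketrans(letter_map)

# A tiny chunk size, only to exercise the loop on a short text.
counts = letter_counts_stream(StringIO('Hello there!'), xlate_table, chunk_size=4)
```

Because str.translate works one code point at a time, splitting the text at arbitrary chunk boundaries does not change the totals.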
The LetterCounter class implementation is very simple and, note, I have opted to leave the alphabet generation and letter-map operation functions, as well as the module level _LANGUAGES dictionary, along with the register_lang function, outside of the class scope. Could each of those have been moved into the class? While they could, I found no big benefit in doing that for two key reasons:
- The first is that it would distract from the fundamental idea and motivation for creating the class: tracking state, namely the continuously reused string translation table and the updatable letter counts.
- The second is that, other than having everything “nice and tidy” under the same class, no other immediate benefit would be obtained. Keeping our existing functions and the module level _LANGUAGES dictionary the way they are required no changes and led to no additional complexity, neither in the existing code nor in foreseeable future changes.
Wrap up
I feel I painted myself a bit into a corner with this article series, and in particular with this specific article. Usability and API design are, by themselves, very wide ranging topics, which I’m conflating a bit here. Nonetheless, this being more of a beginner oriented journey and less of a very directed, single-topic, all-encompassing piece of writing, I do believe there is valuable information here, at least in motivating readers and raising awareness of the topic. Without further ado, let’s review the key ideas:
- Usability, as in “fitness to be used for a given purpose”, is definitely subjective.
  We explored it along two lines: first, striving to deliver a more intuitive letter counting API to casual callers, while still exposing the original, more powerful one; then, supporting incremental text input, not requiring the full text to be in memory at any given time.
- Balancing simple APIs with power and flexibility.
  This is often achieved by having a simple, zero-boilerplate API that handles common use cases, assuming a given set of default behaviors. We did this by having letter_counts_take10 take a language argument, driving its work from an internal language registry, the _LANGUAGES dictionary. We then exposed a second, more powerful API, allowing callers with specific needs to leverage the existing code by registering new languages and their associated letter-counting operations which, due to the use of functions as arguments, can even integrate custom code (as the Turkish lower-/upper-casing example has shown).
  The fundamental aspect we delivered in simplifying the letter counting function API lies in the fact that we replaced an argument describing “how to” count letters (the letter_map argument in letter_counts_take9) with a higher level one specifying “what” language rules we want to apply (lang in letter_counts_take10).
  In our example, using an intermediary level of indirection between the simple and the advanced API, via the module level _LANGUAGES dictionary, resulted in a clean and easy to manage solution. Many times, creating a level of indirection between two complementary (or completely separate) aspects of the code is a good design principle (maybe we’ll visit this topic in the future).
  A useful tool I found a while back is https://python.apichecklist.com. Being a very comprehensive checklist, I’ve found that going through it helps pinpoint improvements to any API I may be evaluating at any given time. Use it as you see fit, either as “strict requirements” or more as “general guidelines”.
- Using a class became useful when we wanted to track state across multiple, incremental invocations.
  In particular, our implementation tracks two pertinent things: the string translation table (which only needs to be calculated once, and may be computationally costly to create) and, naturally, the running letter count, under a Counter object.
  We left everything else out of the class: alphabet generation and letter-map operation functions, the languages registry dictionary, and the register_lang function. Sometimes, just a function is simpler and good enough for a given purpose.
  For an interesting, if somewhat provocative talk, you may want to dedicate ~20m to watching Jack Diederich’s PyCon 2012 talk, Stop Writing Classes, in which Jack highlights the importance of not writing classes until they’re really needed. As a self-note, I wonder if we could have created a letter counting API supporting incremental input without creating a class… I guess we could, but in this particular case, the need to track a given letter count along with its associated string translation table, while keeping them somehow together and not stepping into other possible concurrent letter counting, clearly justifies it.
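On that self-note, a class-free version could plausibly be built with a closure. The sketch below is only an illustration: make_letter_counter is a hypothetical name, and a simplified ASCII-only translation table stands in for the letter map so that the snippet runs on its own.

```python
from collections import Counter
import string

def make_letter_counter(xlate_table):
    """Return (update, counts) functions sharing one private Counter."""
    counter = Counter()

    def update(text):
        # Translate first, then add the surviving letters to the running count.
        counter.update(text.translate(xlate_table))

    def counts():
        return counter

    return update, counts

# Simplified ASCII-only table: upper-case letters, drop whitespace/punctuation.
letter_map = {c: c.upper() for c in string.ascii_letters}
letter_map.update({c: None for c in string.whitespace + string.punctuation})

update, counts = make_letter_counter(str.maketrans(letter_map))
update('Hello')
update('there!')
```

It works, and each call to make_letter_counter gets its own independent state, but the caller now has to carry two loose functions around. That is arguably the kind of bookkeeping a small class expresses more naturally.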
So this is it for now. I expect this installment, while maybe a bit less structured than the ones before it, resonates with the readers somehow, helping them develop their own awareness and acuity in topics like API design, by sharing a few possible ideas on how to achieve that.
Thanks for reading. See you soon.
[1] A “module” is a file containing Python code, normally with a “.py” extension. Thus, by “module level”, I mean a global variable declared in the source Python file we’ve been working with. Such variables are also called “module global”.
[2] Other options would be valid and could even be more useful and powerful, depending on the use cases, including automatically exposing the Counter object’s API at the LetterCounter level, for example. However, this being a beginner directed article, I’ll refrain from exploring more advanced API options, for now.