Outline of a trainable, streaming tokenizer for NLP with Elixir

By: on September 16, 2019

Virtually all NLP tasks require some form of tokenization, and in many cases the tokenizers provided by popular NLP libraries are adequate. If, however, the input material strays sufficiently from the norm, the available tokenizers may not be satisfactory and it may turn out that it is nearly impossible or far too costly to adapt them to a new domain. Manually crafted tokenizers, on the other hand, tend to be brittle and hard to maintain. A possible way out is the automated construction of tokenization rules from example data via machine learning techniques.

Problem representation

The tokenization task can be viewed as a sequence tagging problem. Each character in the input is to be assigned one of the following tags:

  • I – character is part of the current token
  • O – character is outside of a token (e.g. whitespace)
  • B – character marks the start of a new token

For example:

Testing 1 2 3.

The tagged character stream can then be broken up into tokens.

In general, IOB-tagging is a well established technique, especially for Named Entity Recognition and Phrase Chunking. Its application to tokenization is described very compactly in: A General-Purpose Machine Learning Method for Tokenization and Sentence Boundary Detection by Valerio Basile, Johan Bos, Kilian Evang.

Elixir—forming tokens from tagged characters

Elixir has a Stream module for lazy computations, which makes it possible to process an input (stream) piece by piece on demand (as opposed to having to load and process the entire input in one go). This keeps memory consumption under control, which is beneficial if you envisage processing multiple, potentially large, inputs in parallel on a server.

The first task is to transform a stream of tagged character tuples into Strings that represent tokens, i.e. turn:

[{?T, :b}, {?e, :i}, {?s, :i}, {?t, :i}, {?i, :i}, {?n, :i}, {?g, :i}, {? , :o}, {?1, :b}, {? , :o}, {?2, :b}, {? , :o}, {?3, :b}, {?., :b}]


["Testing", "1", "2", "3", "."]

Stream.transform/3 takes:

  • an enum (the stream that is to be transformed; tagged characters)
  • an accumulator (initial state; outside of a token)
  • a reducer (a function that takes a stream element and an accumulator and returns an enum and the updated accumulator or {:halt, accumulator})

Per se, the reducer has no way of knowing when the input stream is exhausted, but it needs to know, in order to determine that the last token’s assembly is complete. One solution is to append a sentinel element to the original stream. When the reducer encounters the sentinel element, it knows that the current token is finished.

defmodule Tokenizer do

  def form_tokens(enum) do
    |> Stream.concat([:r_sentinel])
    |> Stream.transform(
      {:outside, []},
      fn tagged_char, {state, acc} ->
        assemble_tokens(tagged_char, state, acc)

The reducer is a small state machine that can either be ‘inside’ or ‘outside’ a token. When ‘inside’ a token, it buffers up characters until a transition to ‘outside’ is made or the sentinel is encountered.

  def assemble_tokens({_, :o}, :outside, []), do: {[], {:outside, []}}
  def assemble_tokens({c, _}, :outside, []), do: {[], {:inside, [c]}}
  def assemble_tokens({c, :i}, :inside, acc), do: {[], {:inside, [c | acc]}}
  def assemble_tokens({c, :b}, :inside, acc), do: {[make_token(acc)], {:inside, [c]}}
  def assemble_tokens({c, :o}, :inside, acc), do: {[make_token(acc)], {:outside, []}}
  def assemble_tokens(:r_sentinel, :inside, acc), do: {[make_token(acc)], {:outside, []}}
  def assemble_tokens(:r_sentinel, :outside, []), do: {[], {:outside, []}}

Trying it out:

iex(1)> Tokenizer.form_tokens([{?T, :b}, {?e, :i}, {?s, :i}, {?t, :i}, {?i, :i}, {?n, :i}, {?g, :i}, {? , :o}, {?1, :b}, {? , :o}, {?2, :b}, {? , :o}, {?3, :b}, {?., :b}]) |> Enum.to_list
["Testing", "1", "2", "3", "."]

Elixir—tagging characters

For the tagging task, we slide a fixed-size window over the character stream. The character that is about to receive its tag is at the centre of the window. The left part of the window contains the three preceding characters with their assigned tags. The right part of the window contains the following three characters. The window represents the information that is given to a classifier in order to determine the tag for the centre character.

With this technique, raw training data can be generated from input texts and then corrected manually to reflect the desired tokenization. Machine learning can then be applied to construct a classifier for subsequent use in the system.

Rather than starting absolutely from scratch, it makes sense to implement a simple baseline classifier first. For the purpose of this blog, a basic whitespace tokenizer is sufficient. The corresponding tagging function will assign:

  • O when the centre character is a space or newline
  • B when the preceding character was tagged O or has no tag
  • I otherwise.

It is possible that the input character stream is shorter than the window is wide, and therefore, we prepend sentinel elements to the front of the character stream. These elements have no tags. The tagger for the whitespace tokenizer is thus:

  def baseline_tag({_, _, _, c, _, _, _}) when c == 32 or c == ?\n, do: :o
  def baseline_tag({_, _, {_, l1_tag}, _, _, _, _}) when l1_tag == nil or l1_tag == :o, do: :b
  def baseline_tag({_, _, _, _, _, _, _}), do: :i

As with the token forming task, we need to add a sentinel element to the end of the stream to mark the end of the input. When this sentinel reaches the centre of the sliding window, the stream has been processed fully.

The initial value for the sliding window is:

  def initial_window(),
    do: {{:l, nil}, {:l, nil}, {:l, nil}, {:l, nil}, {:l, nil}, {:l, nil}, {:l, nil}}

When processing a textual input stream, the next character of the current line will be pushed into the window from the right, yielding an updated window and remaining input.

  def push({_, l2, l1, c, r1, r2, r3}, <<n::utf8, rest::binary>>),
    do: {{l2, l1, c, r1, r2, r3, n}, rest}

If the sentinel value is encountered in the input stream, we push it into the window and return the updated window and the sentinel value. This way, a sequence of sentinel values will be pushed into the window one step at a time.

  def push({_, l2, l1, c, r1, r2, r3}, :r_sentinel),
    do: {{l2, l1, c, r1, r2, r3, :r}, :r_sentinel}

The input stream is processed line by line (on demand); as with the token forming task, this is driven by Stream.transform/3. The reducer iterates over the characters in the line, updates the window, tags the centre character and accumulates character-tag pairs. When the input line is exhausted, the accumulated character-tag pairs and the current window are returned. The pairs are returned one by one by the transformed stream, and when the next line needs to be processed, it is done on the basis of the previously returned window.

  # current line exhausted, return accumulated pairs + window for next round
  def tag_characters(<<>>, window, acc, tag_fun), do:
    {Enum.reverse(acc), window}

  # r-sentinel reached, return accumulated pairs
  def tag_characters(_, {_, _, _, :r, _, _, _} = window, acc, tag_fun), do:
    {Enum.reverse(acc), window}

  # l-sentinel in centre, move on
  def tag_characters(input, {_, _, _, {:l, _}, _, _, _} = window, acc, tag_fun) do
    {next_window, remaining_input} = push(window, input)
    tag_characters(remaining_input, next_window, acc, tag_fun)

  # character in centre, classify, update window, recurse
  def tag_characters(input, {l3, l2, l1, c, r1, r2, r3} = window, acc, tag_fun) do
    {tagged_window, tagged} = tag(window, tag_fun)
    {next_window, remaining_input} = push(tagged_window, input)
    tag_characters(remaining_input, next_window, [tagged | acc], tag_fun)

The last parameter in tag_characters is the tagging function to use. Tagging a window:

  def tag({l3, l2, l1, c, r1, r2, r3} = window, tag_fun) do
    t = tag_fun.(window)
    {{l3, l2, l1, {c, t}, r1, r2, r3}, {c, t}}

Tagging an input stream can be written analogously to form_tokens/1:

  def tag_chars(enum, tag_fun \\ &baseline_tag/1) do
    |> Stream.concat([:r_sentinel])
    |> Stream.transform(
      {initial_window(), []},
      fn line, {window, acc} ->
        {tagged_chars, next_window} = tag_characters(line, window, acc, tag_fun)
        {tagged_chars, {next_window, []}}

Putting it all together:

  def stream(s) when is_binary(s), do: stream([s])

  def stream(enum, tag_fun \\ &baseline_tag/1) do
    |> tag_chars(tag_fun)
    |> form_tokens()


Example invocation:

iex(2)> Tokenizer.stream("this is a test") |> 
...(2)> Enum.to_list()
["This", "is", "a", "test"]


I have outlined how, based on sequence tagging, a streaming tokenizer can be implemented in Elixir. The next step would be to construct a classifier for a particular domain via machine learning.


Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>