Reading ARFF files with Elixir

Posted on August 30, 2019

If you are implementing a machine learning approach, you are likely to want to test it on publicly available datasets. A large number of these datasets use the ARFF file format established by Weka. I am not aware of any Elixir ARFF readers, so I am going to explore writing one (‘Arfficionado’) in this blog post.

Brief outline of ARFF

ARFF (Attribute-Relation File Format) specifies a header section and a data section. The header section declares a dataset name as well as the names and types of the dataset attributes. The data section specifies one instance per line by listing the attribute values (in the attribute order specified in the header section), followed by an optional instance weight. Commas or tabs are used to separate the values, and any whitespace around values is to be ignored. Missing values are represented by ?. Any text following a % is treated as a comment to the end of the line. Values that contain spaces, commas, etc. can be quoted in single or double quotes (which can be escaped with a backslash). Empty lines and comment lines are allowed. There also exists a sparse-encoding mode for the data section, which lists non-zero attributes as index-value pairs. Attributes can be numeric (integer or real), nominal (enumeration), string (suitable for text), date (ISO-8601 or a java.text.SimpleDateFormat string), or relational (for future use).
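
To make the format concrete, here is a tiny made-up ARFF file together with a rough sketch that sorts its lines into header lines, data lines, and blank or comment lines. The module name ArffSketch.Lines and the dataset are invented for illustration and are not part of Arfficionado. The sketch also cuts corners: it treats every % as the start of a comment, which is wrong for a % inside a quoted value.

```elixir
defmodule ArffSketch.Lines do
  # Hypothetical helper, not part of Arfficionado.
  # Simplification: every % starts a comment, even inside quotes.
  def classify(line) do
    [content | _comment] = String.split(line, "%", parts: 2)

    case String.trim(content) do
      "" -> :blank_or_comment
      "@" <> _ = header -> {:header, header}
      data -> {:data, data}
    end
  end
end

# Made-up toy dataset, for illustration only.
sample = """
% weather toy dataset
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@data
sunny, 85
overcast, ?   % missing temperature
"""

sample
|> String.split("\n", trim: true)
|> Enum.map(&ArffSketch.Lines.classify/1)
|> IO.inspect()
```

The last line of the sample is classified as {:data, "overcast, ?"}: the trailing comment is dropped and the ? is still a raw string that needs to be interpreted as a missing value later on.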

Envisaged ARFF reader usage

There are a number of things you might want to do with an ARFF file, for example:

  • filter instances according to some criteria
  • collect all instances and return them as a list
  • insert instances into an ets table
  • read and process the file in batches

It seems reasonable to feed an ARFF file to the reader as a stream of lines and to pass a callback module that provides the desired specific behaviour.

File.stream!("example.arff")
|> Arfficionado.read(handler_module, initial_handler_state)

The reader will parse the ARFF file and for each relevant event invoke a corresponding handler callback. The reader keeps track of the handler state. When invoking a callback, it passes in the current state and receives the updated state in return. This allows a handler module to accumulate data, manage ets references, etc. The final handler state will be returned to the caller.

Implementation ideas

ARFF can be tokenized and parsed line by line. Recursive descent is an appropriate technique for this. The official ‘spec’ can be translated into unit tests. Attribute information is spread over multiple lines and needs to be accumulated so it can be used to cast the instance values appropriately.
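
The accumulate-then-cast idea can be sketched as follows. This is not Arfficionado's code: the module ArffSketch.Cast, the {name, type} tuples, and the :missing atom are all assumptions made for the example, and only unquoted numeric, string, and nominal attributes are handled.

```elixir
defmodule ArffSketch.Cast do
  # Hypothetical helper, not Arfficionado's implementation.
  # Parses a simple, unquoted attribute declaration into {name, type}.
  def parse_attribute(line) do
    [_keyword, name, type] = String.split(String.trim(line), ~r/\s+/, parts: 3)

    case String.downcase(type) do
      t when t in ["numeric", "integer", "real"] -> {name, :numeric}
      "string" -> {name, :string}
      "{" <> _ -> {name, {:nominal, parse_nominal(type)}}
    end
  end

  defp parse_nominal(type) do
    type
    |> String.trim_leading("{")
    |> String.trim_trailing("}")
    |> String.split(",")
    |> Enum.map(&String.trim/1)
  end

  # Casts raw string values using the accumulated attribute list.
  def cast_instance(raw_values, attributes) do
    raw_values
    |> Enum.zip(attributes)
    |> Enum.map(fn {raw, {_name, type}} -> cast(String.trim(raw), type) end)
  end

  # :missing is an arbitrary choice for this sketch.
  defp cast("?", _type), do: :missing

  defp cast(raw, :numeric) do
    case Integer.parse(raw) do
      {int, ""} ->
        int

      _ ->
        {float, ""} = Float.parse(raw)
        float
    end
  end

  defp cast(raw, :string), do: raw

  defp cast(raw, {:nominal, allowed}) do
    if raw in allowed do
      raw
    else
      raise ArgumentError, "unexpected nominal value: #{raw}"
    end
  end
end

attributes =
  Enum.map(
    ["@attribute outlook {sunny, overcast, rainy}", "@attribute temperature numeric"],
    &ArffSketch.Cast.parse_attribute/1
  )

ArffSketch.Cast.cast_instance(["sunny", " 85"], attributes)
# => ["sunny", 85]
```

The header lines are turned into a list of typed attributes once, and every data line is then zipped against that list, so the order of declarations in the header decides how each value is cast.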

The input stream can be consumed by Enum.reduce_while/3, which ultimately allows the handler module callbacks to abort processing before the stream is exhausted. The handler callbacks have to return {:cont | :halt, state} for this to work.
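
A minimal sketch of that control flow, with a plain function standing in for the handler callbacks. ReduceSketch and take_three are hypothetical names used only here.

```elixir
defmodule ReduceSketch do
  # Hypothetical driver: `on_line` stands in for the handler callbacks and
  # must return {:cont, state} or {:halt, state}.
  def run(lines, initial_state, on_line) do
    Enum.reduce_while(lines, initial_state, fn line, state ->
      on_line.(line, state)
    end)
  end
end

# Collect lines until three have been seen, then abort.
take_three = fn line, acc ->
  acc = [String.trim(line) | acc]
  if length(acc) == 3, do: {:halt, acc}, else: {:cont, acc}
end

1..1_000_000
|> Stream.map(&"line #{&1}\n")
|> ReduceSketch.run([], take_three)
|> Enum.reverse()
# => ["line 1", "line 2", "line 3"]
```

Because the input is a lazy stream, the remaining lines are never produced once the callback returns {:halt, state}.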

Once the basic structure is in place, it makes sense to test it by processing large collections of ARFF files. In the future, this could be extended to roundtrip tests where a handler creates an output ARFF file from the events it receives, and input and output have to agree (minus differences in whitespace, which are likely to be lost).

Implementation

I followed the ideas outlined above and ended up with 231 lines of Elixir code, which I consider quite compact. In its current incarnation, Arfficionado defines the following handler callbacks:

@callback line_comment(String.t(), state()) :: updated_state()
@callback relation(String.t(), comment(), state()) :: updated_state()
@callback attributes(attributes(), state()) :: updated_state()
@callback begin_data(comment(), state()) :: updated_state()
@callback instance(values(), integer(), comment(), state()) :: updated_state()
@callback close(state()) :: state()

The only purpose of line_comment/2 and begin_data/2 is to capture possible comments in the input file, to enable future roundtrip tests. In the interest of slimming down the behaviour, I might remove these callbacks at some point. The callbacks relation/3, attributes/2 and especially instance/4 provide actual ARFF information. close/1 is called when the input stream is exhausted or aborted, so as to give the handler module the opportunity to clean up.
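
For illustration, a handler that collects all instances into a list might look roughly like this. To keep the example self-contained it defines its own stand-in behaviour, HandlerSketch, instead of using Arfficionado's. It assumes that updated_state() means {:cont | :halt, state}, that the integer passed to instance/4 is the instance weight, and that comments are strings or nil. None of these details are confirmed by the listing above.

```elixir
defmodule HandlerSketch do
  # Hypothetical stand-in for the behaviour; types loosened to term().
  @callback line_comment(String.t(), term()) :: {:cont | :halt, term()}
  @callback relation(String.t(), String.t() | nil, term()) :: {:cont | :halt, term()}
  @callback attributes(list(), term()) :: {:cont | :halt, term()}
  @callback begin_data(String.t() | nil, term()) :: {:cont | :halt, term()}
  @callback instance(list(), integer(), String.t() | nil, term()) :: {:cont | :halt, term()}
  @callback close(term()) :: term()
end

defmodule CollectInstances do
  @behaviour HandlerSketch

  # Comments and header information are ignored by this handler.
  @impl true
  def line_comment(_comment, state), do: {:cont, state}

  @impl true
  def relation(_name, _comment, state), do: {:cont, state}

  @impl true
  def attributes(_attributes, state), do: {:cont, state}

  @impl true
  def begin_data(_comment, state), do: {:cont, state}

  # Prepend each instance; the second argument is assumed to be the weight.
  @impl true
  def instance(values, _weight, _comment, state), do: {:cont, [values | state]}

  # Restore the original order once the stream is done.
  @impl true
  def close(state), do: Enum.reverse(state)
end

# Simulate what a reader would do, calling the callbacks by hand.
{:cont, state} = CollectInstances.instance(["sunny", 85], 1, nil, [])
{:cont, state} = CollectInstances.instance(["rainy", 70], 1, nil, state)
CollectInstances.close(state)
# => [["sunny", 85], ["rainy", 70]]
```

The state here is just a list, but the same shape works for an ets table reference or a batch buffer, with close/1 doing the final flush or cleanup.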

While Arfficionado handles a large proportion of my test files well, it currently has some significant limitations:

  • ARFF allows custom date/time formats following the conventions of java.text.SimpleDateFormat. I spent zero cycles on this, so Arfficionado only supports the ISO-8601 format (and does not depend on any external libraries).
  • Backslash escapes in quoted strings are currently not supported.
  • ARFF allows attributes of type ‘relational’, support for which I have not yet implemented.
  • ARFF has a sparse encoding mode, which I have not yet implemented either.
  • Most ARFF files are well-formed, but some deviate from the ‘spec’. Sufficiently broken ARFF files will currently lead to exceptions when reading. An error handling / recovery mechanism would be beneficial.
  • The type specs for the handler callbacks could be made significantly tighter.
  • I have not benchmarked Arfficionado and expect there to be some room for improvement.
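
As a rough idea of what sparse-mode support could involve, the following sketch expands one sparse data line into a dense list. SparseSketch is a hypothetical module, not something Arfficionado provides. It assumes unquoted integer values only, relies on the ARFF conventions that sparse indices are 0-based and that omitted attributes have the value 0, and takes the number of attributes as a parameter.

```elixir
defmodule SparseSketch do
  # Hypothetical helper; not implemented in the reader described here.
  # Expands "{index value, ...}" into a dense list of `width` integers.
  # Assumes unquoted integer values; omitted attributes default to 0.
  def expand(line, width) do
    pairs =
      line
      |> String.trim()
      |> String.trim_leading("{")
      |> String.trim_trailing("}")
      |> String.split(",", trim: true)
      |> Map.new(fn pair ->
        [index, value] = String.split(pair, ~r/\s+/, trim: true)
        {String.to_integer(index), String.to_integer(value)}
      end)

    for i <- 0..(width - 1), do: Map.get(pairs, i, 0)
  end
end

SparseSketch.expand("{1 5, 3 7}", 5)
# => [0, 5, 0, 7, 0]
```

A real implementation would also have to handle quoted values, non-integer types, and nominal attributes, for which the omitted value is not simply the number 0.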

Conclusion and future work

It was fun to build a first version of an Elixir ARFF reader. My highest-priority improvements are adding a good error handling / recovery mechanism and making Arfficionado more lenient with respect to commonly found (benign) deviations from the ARFF spec.
