TCS Alignment Toolbox Version 3.1.1¶
Copyright (C) 2016-2018
Benjamin Paaßen
AG Theoretical Computer Science
Centre of Excellence Cognitive Interaction Technology (CITEC)
University of Bielefeld
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
Introduction¶
IMPORTANT: A NEW AND EASIER-TO-USE VERSION OF THIS TOOLBOX IN PYTHON IS AVAILABLE AT https://gitlab.ub.uni-bielefeld.de/bpaassen/python-edit-distances/
This Java Toolbox provides several edit distance algorithms between two data structures, be it lists, sets, or trees, which may contain arbitrary kinds of nodes. The toolbox also provides options to parametrize the edit distance, to compute optimal solutions via backtracing, and to compute gradients of the edit distance with respect to metric parameters. The toolbox is written in Java 1.7 and is compatible with MATLAB (R).
If you use this toolbox in academic work, please cite:
- Paaßen, B., Mokbel, B., & Hammer, B. (2015). A Toolbox for Adaptive Sequence Dissimilarity Measures for Intelligent Tutoring Systems. In O. C. Santos, J. G. Boticario, C. Romero, M. Pechenizkiy, A. Merceron, P. Mitros, J. M. Luna, et al. (Eds.), Proceedings of the 8th International Conference on Educational Data Mining (pp. 632-632). International Educational Datamining Society. (Link)
To use this software is recommended if
- you want to use an alignment distance as basis for some relational classifier on structured data, e.g.: k-nearest-neighbour-classification of biological sequence data or syntax trees. For those cases we also provide gradients to use a learning scheme for classifier optimization.
- you want to use edit distances on structured data <em>and</em> want to optimize your metric parameters using gradient-based techniques. For that we provide the gradient on the edit distances.
- you want to use an edit distance in some other way, but your data is too complex for other tools, e.g. you are using multi-modal sequences, where each node of the sequence has several features.
- you want to use a non-standard function to define the distance between elements. Our architecture supports the implementation of custom comparators that can be easily plugged in without any changes in our code base.
- you want to compute a non-standard edit distance. Our architecture also supports the abstract definition of edit distances and provides the efficient implementation automatically using Algebraic Dynamic Programming.
This toolbox is not the right tool for you if you want to do standard bulk computation of dynamic time warping on one-dimensional double sequences or simple string edit distances. In that case there are faster tools out there.
Installation¶
The TCS Alignment Toolbox can be directly used in your Matlab/Java-program as a .jar file or via maven.
Binary Distribution (.jar file)¶
The binary distribution of the full TCS Alignment Toolbox can be found here.
You can use it in Matlab using the command
javaaddpath alignment-3.1.1-full.jar;
Maven modules¶
The TCS Alignment Toolbox is available on maven central. For most standard cases it should suffice to declare the following dependencies:
<dependencies> <dependency> <groupId>de.cit-ec.tcs.alignment</groupId> <artifactId>comparators-lib</artifactId> <version>3.1.1</version> <scope>compile</scope> </dependency> <dependency> <groupId>de.cit-ec.tcs.alignment</groupId> <artifactId>algorithms-lib</artifactId> <version>3.1.1</version> <scope>compile</scope> </dependency> </dependencies>
The different modules are discussed in more detail in section Modules.
Source Code¶
If you want to inspect the source code or compile the project yourself, you can either download the sources from maven, or check out the GIT repository using the command
git clone https://openresearch.cit-ec.de/git/tcs.tcs-alignment.git
You can also inspect the GIT repository in your browser: Link.
Quickstart Guide¶
If you want to use the Toolbox as fast as possible it is recommended to use one of the code examples in the 'examples' directory and develop your application code from there on.
You can get a copy of the examples by cloning the GIT repository as described above.
If you want to apply the Toolbox in an educational datamining context, we suggest to have a look at our demo with graphical user interface as presented at the International Educational Datamining Conference.
Features¶
The idea behind the toolbox is to enable alignments of arbitrary input sequences as well as metric learning on these alignments. We approach this problem by a clean software architecture. As input for alignments we use the Java List
interface. As elements of the lists you can use arbitrary objects, as long as you provide a Comparator
that works on such objects. A Comparator
computes a distance between elements, which an AlignmentAlgorithm
can then use to compute a distance on sequences of such elements.
For convenience, we provide several standard implementations of the Comparator
interface (e.g. Euclidean distance, Manhattan-Distance, Normalized Compression Distance and Character Distances) as well as the AlignmentAlgorithm
interface (edit distance and dynamic time warping), which can be found in the comparators-lib
and the algorithms-lib
package respectively.
The toolbox supports parallel computation of many alignments via the ParallelProcessingEngine
and the SquareParallelProcessingEngine
.
To enable a detailed inspection of alignments we provide the option to not only calculate the alignment distance but to also a backtrace of the alignment operations that have been used to arrive at the alignment distance. Detailed HTML visualizations of such alignments can be generated as well using the visualization
module.
Metric learning is possible using gradient descent schemes on the alignment distance. This kind of metric learning has been applied successfully in several publications.
The toolbox is build upon the powerful theoretical framework of Algebraic Dynamic Programming which provides a generic approach to compute a wide variety of alignment schemes efficiently. Our implementation of ADP enables the definition of custom alignment schemes as well (refer to the adp
module).
Usage¶
In order to use the Toolbox, you'll need to execute (some of) the following steps:
- Generate
List
objects of your input data. Theprimitives
module provides some convenience functions in case you are dealing with primitive data types. - Create a
Comparator
object that works on elements of your sequences. Some standard implementations are provided by thecomparators-lib
module. - Create an
AlignmentAlgorithm
object of your choice and provide theComparator
in the constructor. Some standard implementations are provided by thealgorithms-lib
module. - Optionally initialize a
SquareParallelProcessingEngine
with your sequences and your algorithm to use multi-thread computation and callsetFull()
. - Calculate the alignment results for your input.
- Optionally calculate derivatives of your alignment results.
- Optionally visualize your alignment as HTML.
Some code examples for Java and Matlab are listed in the 'examples' directory.
Algorithms¶
AlignmentAlgorithm
implementations may differ in the underlying algorithm as well as the return value. In the algorithms-lib
module we provide standard implementations for the classic edit distance as well as dynamic time warping. For each of those we provide a Score
version which just returns the numeric score and a Full
version which returns an Alignment
object containing the actual Operation
objects that correspond to the alignment distance. Therefore, if you want to compute the optimum Alignment
according to a dynamic time warping distance, you should use the StrictDTWFullAlgorithm of the algorithms-lib
module.
Oftentimes not only one Alignment
corresponds to the alignment distance, but there are multiple co-optimal Alignments
. Further, there might be Alignments
which are not optimal, but near misses. This is captured by other data structures in the algorithms
module. Co-optimal alignments can be stored in an AlignmentList
, as it is returned by the StrictAlignmentAllOptimalAlgorithm
. A model that entails all co-optimal alignments in a compressed fashion is the CooptimalModel
. If sub-optimal alignments should be considered as well, they can be stored in a AlignmentMap
, which maps scores to Alignments
.
An entirely different view on co-optimal and suboptimal alignments is offered by the Soft
alignment scheme, where we do not take the strict minimum of the possible distance but a soft minimum. his is a key feature to achieve robust metric learning results and is described in more detail in our papers (see Background)
To increase maintainability we do not provide standard implementations for each of these variations in the algorithms-lib
module. Rather, we provide an abstract version in the adp
module, which then can be applied to any algorithm of your liking.
In general, if you want to construct your own AlignmentAlgorithm
implementations we recommend the adp module, which is built on Algebraic Dynamic Programming. This is explained in the adp example in the 'examples' directory and in more detail in the paper Adaptive structure metrics for automated feedback provision in intelligent tutoring systems.
Runtime¶
It is a general result in algebraic dynamic programming that alignment algorithms have the same asymptotic runtime, namely O(m ⋅ n) where m is the length of the first and n is the length of the second input sequence. However, constant factors differ depending on the implementation. The most important rules of thumb are:
- Dynamic Time Warping (DTW) is faster than other alignment schemes,
- parallel processing is much faster than iterative processing,
- hand-tailored implementations are much faster than ADP generated ones.
Also note that not only the choice of algorithm influences the runtime but also the runtime of your Comparator
.
The following figures show the runtime for one alignment in seconds versus the sequence length for some example algorithms. The sequences were random strings generated over the alphabet {a, b, c, d}. More details on the evaluation can be found in the performance evaluation script in the 'performance' directory.
The following figure shows (for the case of the global alignment strict score algorithm) how much runtime can be saved by using parallel processing instead of iterative processing. All pairwise alignments where calculated for an ascending number of sequences. The sequence length was fixed to 50.
The following figure show the runtime to compute the gradient of one alignment versus the sequence length. Asymptotically, gradient calculation over a strict alignment has linear complexity (as an alignment can have at most O(m + n) operations), while gradient calculation over a soft alignment has quadratic complexity (or O(m ⋅ n)). However, the gradient tends to vanish far away from the most likely alignment path, such that we obtain favourable constant factors.
Further Documentation¶
Further documentation on the TCS Alignment Toolbox can be found in
- the 'examples' directory in the GIT repository,
- The javadoc,
- Our research papers, in particular the paper Adaptive structure metrics for automated feedback provision in intelligent tutoring systems,
- The 'uml' directory in the GIT repository,
- The 'performance' directory in the GIT repository and
- The 'bellmans_gap' directory" in the GIT repository.
Background¶
This Toolbox has been developed in 2013-2016 in the Theoretical Computer Science for Cognitive Systems Workgroup at the Centre of Excellence 'Cognitive Interaction Technology' (CITEC), Bielefeld University. Development is part of the research project 'Learning Dynamic Feedback for Intelligent Tutoring Systems' (DynaFIT), grant number HA 2719/6-2, which is part of the DFG priority program 'Autonomous Learning'.
If you use this toolbox in academic work, please cite:
- Paaßen, B., Mokbel, B., & Hammer, B. (2015). A Toolbox for Adaptive Sequence Dissimilarity Measures for Intelligent Tutoring Systems. In O. C. Santos, J. G. Boticario, C. Romero, M. Pechenizkiy, A. Merceron, P. Mitros, J. M. Luna, et al. (Eds.), Proceedings of the 8th International Conference on Educational Data Mining (pp. 632-632). International Educational Datamining Society. (Link)
We have published several articles using this toolbox, in particular for metric learning:
- Paassen, B. (2015). Adaptive Affine Sequence Alignment Using Algebraic Dynamic Programming, Master Thesis, Bielefeld University (Preprint)
- Mokbel, B., Paassen, B., & Hammer, B. (2014). Adaptive Distance Measures for Sequential Data In M. Verleysen (Ed.), ESANN, 22nd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (pp. 265-270). Bruges, Belgium: i6doc.com. (Preprint)
- Mokbel, B., Paassen B., & Hammer B. (2014). Efficient adaptation of structure metrics in prototype-based classification In: S. Wermter, C. Weber, W. Duch, T. Honkela, P. Koprinkova-Hristova, S. Magg, G. Palm, et al. (Eds.), Lecture Notes in Computer Science: Vol. 8681. Artificial Neural Networks and Machine Learning - ICANN 2014 - 24th International Conference on Artificial Neural Networks, Hamburg, Germany, September 15-19, 2014. Proceedings (pp. 571-578). Springer. (Preprint)
- Mokbel, B., Paaßen, B., Schleif, F.-M., & Hammer, B. (2015). Metric learning for sequences in relational LVQ. In: Neurocomputing, 169, 306-322. (Preprint)
- Paassen, B., Mokbel, B., & Hammer, B. (2015). Adaptive structure metrics for automated feedback provision in Java programming In: ESANN, 23rd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (pp. 307-312). Bruges, Belgium: i6doc.com. (Preprint)
- Paaßen, B., Mokbel, B., & Hammer, B. (2016). Adaptive structure metrics for automated feedback provision in intelligent tutoring systems. Neurocomputing, 192, 3-13. (Preprint)
- Paaßenm B., Gallicchio C., Micheli A., & Hammer B. (2018). Tree Edit Distance Learning via Adaptive Symbol Embeddings. In: Dy J., Krause A. (eds.). Proceedings of the 35th International Conference on Machine Learning (ICML 2018). Proceedings of Machine Learning Research. Vol 80. (Link)
- Paaßen, B. (2018). Revisiting the tree edit distance and its backtracing: A tutorial. arXiv: 1805.06869 [cs.DS]
Modules¶
The toolbox is structured into several maven modules with different purposes. For practical applications it is ususally sufficient to declare the following dependencies:
<dependencies> <dependency> <groupId>de.cit-ec.tcs.alignment</groupId> <artifactId>comparators-lib</artifactId> <version>3.1.1</version> <scope>compile</scope> </dependency> <dependency> <groupId>de.cit-ec.tcs.alignment</groupId> <artifactId>algorithms-lib</artifactId> <version>3.1.1</version> <scope>compile</scope> </dependency> </dependencies>
The full dependency structure between modules is visualized in this UML diagram:
In more detail, the modules are:
Core Modules¶
de.cit-ec.tcs.alignment:parallel
contains a wrapper for the java standard libraryThreadPool
to enable parallel processing with reports on the computation progress. This is a very generic implementation and might be useful for applications outside the scope of this toolbox as well.de.cit-ec.tcs.alignment:comparators
defines theComparator
interface as well as related interfaces and some convenience functions to define the distance between sequence elements.de.cit-ec.tcs.alignment:algorithms
defines theAlignmentAlgorithm
interface as well as helper classes such as data structures for alignment results and theParallelProcessingEngine
.
Library Modules¶
de.cit-ec.tcs.alignment:comparators-lib
contains some standard implementations of theComparator
interface, such as the Euclidean distance for vectors, a simple character frequency distance on strings or a Kronecker-delta on characters.de.cit-ec.tcs.alignment:algorithms-lib
contains some standard implementations of theAlignmentAlgorithm
interface, in particular edit distance/global alignment and dynamic time warping.de.cit-ec.tcs.alignment:sets
contains algorithms to align unordered sequences (that is, sets) using the Hungarian algorithms.de.cit-ec.tcs.alignment:trees
contains algorithms to align trees and forests using Zhang & Shasha's (1989) tree edit distance algorithm.
Utility Modules¶
de.cit-ec.tcs.alignment:primitives
contains convenience functions to transfer primitive datastructures to lists of primitives.de.cit-ec.tcs.alignment:adp
contains an implementation of a generic dynamic programming algorithm for alignment schemes based on Algebraic Dynamic Programming. Using these capabilities you can comfortably define your own non-standard alignment schemes without having to worry about implementation details. More information on this particular topic is contained in theadp
example in the 'examples' directory and in the paper Adaptive structure metrics for automated feedback provision in intelligent tutoring systems.de.cit-ec.tcs.alignment:visualization
contains helper classes to produce a HTML visualization of Alignment objects.de.cit-ec.tcs.alignment:learning
contains support for metric learning using the Large Margin Nearest Neighbor (LMNN) approach by Weinberger, Saul et al. The details of this method are described in the master's thesis Adaptive Affine Sequence Alignment Using Algebraic Dynamic Programming.
Sequence Datastructure Modules¶
The Sequence
datastructure was mandatory in earlier versions of the TCS Alignment Toolbox and is built for multi-dimensional and multimodal sequences. It has been made fully compatible with the new toolbox version, but the use is not mandatory anymore and it is mainly kept to achieve backwards-compatibility.
de.cit-ec.tcs.alignment:sequences
defines the Sequence data structure as well as the possible ValueTypes and KeywordSpecifications. In essence it contains everything you need to define input for the alignment toolbox.de.cit-ec.tcs.alignment:sequence-comparators
contains wrapper classes forComparators
on primitive data types to be applied on Node objects.de.cit-ec.tcs.alignment:wrappers
contains helper classes for some standard applications of alignment, in particular string edit distance and simple time series data. Using these convenience classes it gets easier to convert existing data to Sequence objects or to set up an AlignmentSpecification.de.cit-ec.tcs.alignment:sequence-visualization
contains implementations of the HTMLColumn interface that make it easier to create HTML visualizations of Alignment objects resulting from the Sequence datastructure.de.cit-ec.tcs.alignment:csv
contains two classes for exporting and importing Sequence objects to and from a human-readable CSV file. Special thanks go to Bassam Mokbel for this implementation.
Dependencies¶
As general utility library we employ lombok in all modules in accordance to the MIT License.
For the normalized compression distance (NCDComparator
) in the comparators-lib
module we use the fast LZ4 implementation by jpountz. Our use is in accordance with the Apache License, Version 2.
For the export and import of NodeSpecifications
in the csv
module, we employ the JSON reference implementation of Douglas Crockford according to the JSON License.
License¶
The TCSAlignmentToolbox is licensed under the AGPLv3. The full license text can be found here.
This documentation is licensed under the conditions of the Creative Commons Attribution Share-Alike International License 4.0.