This example needs the Java plugin! You can get it here. If everything works fine, a new window with some circles in it should have popped up. (I have only tested it with Java Plugin 1.3 and IE 6, so please report if you don't see the window).
This applet shows how an M-Tree (metric tree) can be used to make search operations on data sets in metric spaces pretty fast.
It uses my MTree implementation in Java based on the C++ library by Patella, Ciaccia, and Zezula (see the project page at university of Bologna) and the underlying paper on Citeseer, and the Java port of the GiST library by the University of South Africa
Before you read on: turn of "display tree structure" before you add 1000s of points or perform a search operation! Unfortunately the VM used for browsers runs in client mode, which means faster startup but slower operation. Turning the VM to server mode can speed the demo up 80-100%.
What is it about?
In the demo you can add points to the index and perform a "range query" on it (if you click "search mode" a green circle appears around the cursor, and all points within the circle are marked green).The range query looks for the objects that are near the mouse cursor. The range query returns the answer to the question "give me all objects with distance <= 100 to the point the cursor points to".
There are a lot of index structures that support this type of query for multidimensional spaces even more efficiently. The M-Tree, however, is special in that way that only a distance function, which resturns the distance between two objects, must be known. This distance function d() must have the following properties:
d(a,b) = d(b,a) for all a, b
d(a,b) >= 0 for all a, b
d(a,b) + d(b,c) >= d(a,c) (triangle equation)
A trivial distance function, which was applied here, is the euclidian distance between two points in a 2-dimensional (or multidimensional) vector space: d(a,b) = sqrt((b1-a1)^2 + (b2-a2)^2), where sqrt is the square root and ^2 is the power of two.
But the MTree index structure can also be applied to other data structures, such as strings, with the distance function being the edit distance, or Levenshtein distance, which also follows the triangle equation. I plan to add a demo on that pretty soon.
How the tree looks like
I want to be very brief here since there are already some quite accessible papers on the MTree.
The metric tree contains all data objects in its leaf nodes. It grows bottom-up. The upper nodes contain "routing objects" (red rectangles in the demo) that describe the objects contained in the branches. Each routing object has a so called covering radius of all its enclosing objects (shown as circles in different shades of grey in the demo), and precomputed distances to each child node.
When the root node (or any other nodes) is full, an insert operation will lead to a split of that node. Two of the objects contained in that node are chosen as routing objects, the other Objects are each assigned to one of them, and a new root is created. If the split node is not the root, one of the two new nodes is inserted into the parent node, which can lead to recursive calls of splits if they are overflowing.
When a range search is done, subtrees are pruned if the distance between the query object and the routing object is larger than the routing object's covering radius + the query radius. This and the fact that a lot of distances are already precomputed leads to the relatively high speed of query operations.
Ciaccia et al. evaluated different split strategies. In this implementation, only the rule "minRAD/MinMax" was implemented, which is slower when creating the tree than other strategies, but makes the tree most efficient for querying. The rule says: Those two objects will be promoted to routing objects whose sum of covering radii are the smallest. Therefore, all possible combinations of routing objects and their covering radii must be computed, an operation with complexity O(n^3).
The Demo
The default node size of the demo tree is 5 points. (When you clear the tree, a new tree with 8 points per node is created).
When starting the demo, some points are added to the index automatically, leaving you in a state right before a split.
This means: In the beginning there is only one node. When you add the sixth point, this node is split: Two routing objects appear. If a new point is added (by pressing the left mouse button) it will be added to the nearest node (if within a covering radius) or the node whose covering radius has to be extended the least.
You can add random points by clicking on one of the "100/1000/10000" buttons. Make sure you turn of the display of the tree structure before (also before turning on the search), because it will be slow the whole thing down a lot and will sometime account for 99.9% of the time.
You can watch the system drawing the tree if you press the right mouse button.
What do the circles mean? Imagine you look on a tree from below (you're lying under the earth...). The thicker and the darker the lines are, and the larger the red boxes are, the nearer the routing objects are. Routing objects may be placed on top of each other, in which case you see a lighter circle with a darker border (lower objects are painted later). Each routing object contains all of its childs within the bounding circle.
The "print tree" button prints the tree into a console window. Since a Java plugin applet doesn't have a console, this won't do anything.
When the tree works best (and when not)
The tree works best the more subtrees can be pruned. This will definitely be the case if you have different distinct clusters of objects. The worst case is random points as shown here.
But the MTree is the only known index structure for metric spaces. Its main problem is the overlap between different routing objects in the same level. An extension of the MTree exists, the SlimTree, which aims for reducing this amount of overlap. For n-dimensional spaces, there are better indexing methods, such as the R* tree.
Comparing this Implementation to other ones
The only Java implementation of an MTree that I know of was done in the XXL library project of the University of Marburg. This implementation, however, stresses an "elegant" design over speed. I found myself waiting for 24 hours to cluster 80.000 strings with the dbScan algorithm, which was unacceptable. My choice was to move to C++ (which would have caused me a lot of headache) or translate the C++ code to Java (which I did). I haven't made exact comparisons yet, but from what I have seen I conclude that the implementation shown here is about 10-15 times faster than Marburg's.
Indexing speed: The speed of the index is largely dependent on the speed of the distance function. On my 1000 MHz laptop and a 1.3.1 server VM I was able to add 10.000 points of the above example in about 2.5 seconds, after the JIT compiler was through. The levenshtein application, however, was a lot slower: It needed about 1.7 ms to add one string to the index with a nodesize of 5 (this number grows logarithmically as the tree grows). The time goes up very fast if the node size is increased (about 17 ms with node size 40)
Query speed: For queries with 2-dimensional objects, the query speed is very fast, as you can see with the above example (in the millisecond range even with a lot of objects). A similarity search in a collection of 300.000 strings and the configuration above took between 100 and 700 ms, which is still by far slower than I would have expected. I see room for improvement only in the indexing part.
I will have to compile numbers more thoroughly sometimes to provide more exact results.
Note that these figures can be a little worse in the applet because the client VM is used. Also, the display accounts for much of the delay, since it is recomputed at every mouse move (in search mode) or mouse click.
Weaknesses of the current Java implementation
object deletion is not implemented
nearest neighbor queries are not implemented
the tree does not work with secondary storage. It can only be made persistent alltogether with the Java persistance mechanism
Since there are differences between the Java- and C++-GiST implementations (on which the C++ MTree was based on) much of the GiST framework was not used. Some methods were reimplemented with different semantics, other GiST methods were not tested
The way how the C++ MTee library performed split operations proved not to be very efficient, especially in Java. For every possible combination of points in a routing object, two new objects are created and the keys contained in the original nodes are copied. Then the covering radius of these two objects is calculated, and afterwards they are deleted if they are not candidates. This leads to a high amount of objects that have to be created, one of Java's main sources for performance problems. This could be avoided by using some special form of array that can be reused
the pickSplit function in the C++ implementation had the notion of a distance matrix that was used to minimize the number of distance computations. In the actual split function, however, this distance matrix is not used, and distances are computed from scratch. This could be optimized