The iOrca outlier detection algorithm is based on the K-nearest neighbors approach. It calculates the distances across the dataset to determine if data points are near or far away from their set of neighbors. If a point has an unusually large distance to its neighbors, it is considered an outlier and of interest to the user. Calculating the full set of distances is computationally expensive.
The original Orca code introduced a pruning technique used to cut down the number of operations, and make the algorithm more efficient and run in near-linear time. iOrca builds upon this algorithm by introducing an indexing scheme before running the Orca algorithm, and reordering the data to reduce the time complexity. iOrca's indexing scheme reorders the data to analyze the most anomalous examples first, thus increasing the cutoff value earlier. The cutoff value affects the pruning technique implemented in the original version of Orca. If the cutoff value is increased earlier, the pruning becomes more effective and therefore the runtime is greatly reduced.
The key algorithmic improvements made to the code include 1) building an index upon the data to allow the algorithm to more efficiently process data points and converge to the results much quicker than the previous version of the code, and 2) using the index distances to determine a stopping criteria for early termination. During the indexing process, an initial set of distances is already calculated for all points, and since the data is ordered in descending order, the remaining distances are guaranteed to be smaller. Thus, if a known distance for a given point is less than the cutoff, it is not necessary to calculate the rest of the points because they are guaranteed not to be outliers and therefore the algorithm can complete early.