TY - BOOK
T1 - Clustering, Weighted Sequences, and Shortest Paths
AU - Kipouridis, Evangelos
PY - 2022
Y1 - 2022
N2 - In this thesis we focus on clustering problems where the input is the ideal relationship between all pairs of objects in the final clustering. More particularly, we concern ourselves with the following problems. L1-fitting tree metrics and ultrametrics: We are given the ideal distance between all pairs of n objects, and the goal is to output a weighted tree (resp. ultrametric) which spans the set of objects and minimizes the sum of pairwise distance errors. Both problems are closely related to evolutionary biology and the reconstruction of the tree of life. In fact, discussions related to the reconstruction of the optimal tree are traced back to Plato and Aristotle (350 BC), in the context of classification. Both problems were known to be APX-Hard and the best known approximation factor was O((log n)(log log n)) by Ailon and Charikar [FOCS ’05]. We design asymptotically optimal constant factor approximations for both problems. Our paper “Fitting Distances by Tree Metrics Minimizing the Total Error within a Constant Factor” appeared in FOCS ’21. Constrained Correlation Clustering: For each pair of objects, we are given a preference related to whether the two objects should be in the same cluster or not. Furthermore, we are also given hard constraints for certain pairs. The output clustering must satisfy all hard constraint, and minimize the number of violated preferences. We design a deterministic combinatorial algorithm with a constant approximation factor. A key ingredient in our approach is a novel nearly-optimal pivoting algorithm for Correlation Clustering. This is a deterministic combinatorial algorithm achieving the best approximation factor among all known deterministic combinatorial algorithms for Correlation Clustering, not just pivoting ones. Part of these results have been submitted to ICALP ’22. Apart from clustering, we also study graph and string problems. Multiple Source Shortest Paths in Planar Graphs: Given an embedded planar digraph with positive edge weights and a face f , we are interested in a data structure supporting shortest path queries, where the source is in f . The best known data structure by Klein [SODA ’05] requires O(n log n) time for preprocessing and O(log n) time for queries, where n is the number of nodes. We improve the preprocessing/query time to O(n log |f |)/O(log |f |), where |f | is the number of nodes in f . More importantly, our approach is much simpler, requiring only single source shortest path computations and contractions. In contrast, Klein’s solution required persistency, dynamic trees, and an interplay between the primal and the dual graph. Our paper “A Simple Algorithm for Multiple-Source Shortest Paths in Planar Digraphs” appeared in SOSA ’22. Longest Common Subsequence on Weighted Sequences: Weighted sequences generalize the concept of strings, so that in each position we have a probability distribution over the alphabet, rather than a single character. The motivation comes from the inherent uncertainty of the actual methods used for “reading” a DNA sequence. We suggest that the alphabet size is a crucial parameter for this problem, and provide optimal results both in the case of bounded and unbounded alphabets. Furthermore, this is the first work on Weighted Sequences avoiding the Log-Probability model, a simplifying assumption related to exact computations of reals. Our paper “Longest Common Subsequence on Weighted Sequences” received the Best Paper Award at CPM ’20.
AB - In this thesis we focus on clustering problems where the input is the ideal relationship between all pairs of objects in the final clustering. More particularly, we concern ourselves with the following problems. L1-fitting tree metrics and ultrametrics: We are given the ideal distance between all pairs of n objects, and the goal is to output a weighted tree (resp. ultrametric) which spans the set of objects and minimizes the sum of pairwise distance errors. Both problems are closely related to evolutionary biology and the reconstruction of the tree of life. In fact, discussions related to the reconstruction of the optimal tree are traced back to Plato and Aristotle (350 BC), in the context of classification. Both problems were known to be APX-Hard and the best known approximation factor was O((log n)(log log n)) by Ailon and Charikar [FOCS ’05]. We design asymptotically optimal constant factor approximations for both problems. Our paper “Fitting Distances by Tree Metrics Minimizing the Total Error within a Constant Factor” appeared in FOCS ’21. Constrained Correlation Clustering: For each pair of objects, we are given a preference related to whether the two objects should be in the same cluster or not. Furthermore, we are also given hard constraints for certain pairs. The output clustering must satisfy all hard constraint, and minimize the number of violated preferences. We design a deterministic combinatorial algorithm with a constant approximation factor. A key ingredient in our approach is a novel nearly-optimal pivoting algorithm for Correlation Clustering. This is a deterministic combinatorial algorithm achieving the best approximation factor among all known deterministic combinatorial algorithms for Correlation Clustering, not just pivoting ones. Part of these results have been submitted to ICALP ’22. Apart from clustering, we also study graph and string problems. Multiple Source Shortest Paths in Planar Graphs: Given an embedded planar digraph with positive edge weights and a face f , we are interested in a data structure supporting shortest path queries, where the source is in f . The best known data structure by Klein [SODA ’05] requires O(n log n) time for preprocessing and O(log n) time for queries, where n is the number of nodes. We improve the preprocessing/query time to O(n log |f |)/O(log |f |), where |f | is the number of nodes in f . More importantly, our approach is much simpler, requiring only single source shortest path computations and contractions. In contrast, Klein’s solution required persistency, dynamic trees, and an interplay between the primal and the dual graph. Our paper “A Simple Algorithm for Multiple-Source Shortest Paths in Planar Digraphs” appeared in SOSA ’22. Longest Common Subsequence on Weighted Sequences: Weighted sequences generalize the concept of strings, so that in each position we have a probability distribution over the alphabet, rather than a single character. The motivation comes from the inherent uncertainty of the actual methods used for “reading” a DNA sequence. We suggest that the alphabet size is a crucial parameter for this problem, and provide optimal results both in the case of bounded and unbounded alphabets. Furthermore, this is the first work on Weighted Sequences avoiding the Log-Probability model, a simplifying assumption related to exact computations of reals. Our paper “Longest Common Subsequence on Weighted Sequences” received the Best Paper Award at CPM ’20.
M3 - Ph.D. thesis
BT - Clustering, Weighted Sequences, and Shortest Paths
PB - Department of Computer Science, Faculty of Science, University of Copenhagen
ER -