-
When processing large amounts of data on a computer,
-
need to be minimized.
-
When trying to reduce computation time, it is common to create indexes.
- This increases memory usage.
- To improve this trade-off, the data structure needs to be designed carefully.
-
Methods to efficiently perform string searches (compression of indexes)
- Suffix tree (a tree structure)
- Stores a list of suffixes in a tree structure where each node represents one character.
- It is efficient but uses a lot of memory.
- Suffix array
- A simple array that contains all suffixes sorted in lexicographical order.
- It is lighter than a suffix tree.
- Compressed suffix array
- Features
- Can retrieve the i-th element of the suffix array and its reverse array in time.
- Can retrieve a substring of a string in time.
- With these two features, the position of a suffix that is one character shorter than a given suffix can be determined.
- Together with the original string T, this enables quick searches.
- Features
- Suffix tree (a tree structure)
-
-
A representation that almost matches the lower bound of information theory in terms of size.
-
If there are L possible inputs, the lower bound is approximately bits.
-
Succinct index
- A bit vector/array (e.g., {1,0,0,1,1,0,1}) with
- rank operation: returns the number of 1s in a given range
- Succinct index: Divides the vector into blocks to minimize its size and precomputes the answers to queries for each block?
- (blu3mo) I’m not sure if I understand this correctly.
- Succinct index: Divides the vector into blocks to minimize its size and precomputes the answers to queries for each block?
- select operation: returns the position of the i-th 1.
- rank operation: returns the number of 1s in a given range
- A bit vector/array (e.g., {1,0,0,1,1,0,1}) with
-
Succinct representation of trees
- For ordered trees (trees with a defined order among siblings)
- LOUDS representation
- Encodes the degrees of nodes in a breadth-first order using unary coding.
- Unary coding: represents a number using a sequence of consecutive 1s.
- Why? —> 0 is used as a separator.
- This becomes a bit vector, so various tree operations can be performed using the rank and select operations of succinct index.
- BP representation
- A representation like
(()((())())())(()())())
. - Represents the hierarchical structure using parentheses, where
(
represents 0 and)
represents 1. - To perform operations like findclose, a data structure called interval min-max tree is used.
- It divides the structure into blocks and stores the maximum and minimum values for each block.
- This allows the construction of a tree structure within each block.
- This enables fwd_search.
- A representation like
- LOUDS representation
- For ordered trees (trees with a defined order among siblings)
-
(Information Science Expert)