PostgreSQL

PostgreSQL

Some important points from the PostgreSQL documentation include:

Chapter-1

  • The introduction to the architecture of PostgreSQL in the book "The Internals of PostgreSQL" provides a high-level overview of how the PostgreSQL database management system is structured and how it works.
  • A PostgreSQL server runs on a single host and manages a single database cluster, a collection of databases.
  • Each database objects in PostgreSQL are internally managed by respective object identifiers (OIDs), which are unsigned 4-byte integers.
  • Data is stored in tables as a collection of pages, where each page is a fixed-size block of memory typically 8KB and contains a header and a collection of tuples.
  • Each tuple represents a row in the table and contains the data for each column in the row.
  • The introduction explains the concept of multi-version concurrency control (MVCC) which allows multiple transactions to read and write to the same table simultaneously without conflicts.
  • It also explains the role of indexes in speeding up data retrieval and the importance of performance-related features like tablespaces.

Chapter-2
  • The storage system in PostgreSQL is responsible for managing the physical storage of data on disk. It includes several key components, such as:
  • Pages: Data is stored in tables as a collection of pages, where each page is a fixed-size block of memory, typically 8KB. Each page contains a header and a collection of tuples (rows in the table)
  • Tuples: Each tuple represents a row in the table and contains the data for each column in the row.
  • Heap: The heap is the main storage area for table data and is organized as a collection of pages. The heap stores all the tuples for a table in no particular order.
  • Free Space Map (FSM): Each table has an associated FSM that stores information about the free space available in the heap. This is used to quickly find space to add new tuples to the table.
  • Visibility Map (VM): Each table has an associated VM that stores information about the visibility of each tuple in the heap. This is used to quickly find which tuples are visible to a given transaction and which are not.
  • Indexes: Indexes are used to speed up data retrieval by providing a way to quickly find specific rows in a table based on the values in one or more columns.
  • TOAST (The Oversized-Attribute Storage Technique): TOAST is a technique used to store large data types (like text or binary data) in a separate table, rather than in the main table. This allows the main table to remain small and fast, while still allowing large data types to be stored.
  • The storage system in PostgreSQL is designed to be efficient and highly concurrent, allowing multiple transactions to access and modify the same data at the same time. It uses a technique called Multi-Version Concurrency Control (MVCC) which allows multiple transactions to read and write to the same table simultaneously without conflicts.

Chapter-3

The query processing system in PostgreSQL is responsible for taking SQL statements and executing them to retrieve data from the database. The system follows a set of steps to process a query, including:

  • Parsing: The SQL statement is parsed and checked for syntactic and semantic correctness.
  • Optimization: The parsed statement is optimized to determine the most efficient execution plan. This includes selecting the best indexes to use, joining tables in the most efficient order, and evaluating any subqueries.
  • Execution: The optimized plan is executed, and the results are retrieved from the data storage system.
  • Returning Results: The results are returned to the client application.
  • Caching: Results are also cached in memory so that subsequent queries that use the same execution plan can be executed faster.
PostgreSQL uses a query optimizer called the planner-optimizer, which generates a query plan using a cost-based approach. The planner-optimizer estimates the cost of each possible plan based on the statistics collected on the table and indexes, and then chooses the plan with the lowest estimated cost.

The planner-optimizer uses several techniques to generate the execution plan for a query, including:

  • Seq-scan: Scans the entire table to find the matching rows.
  • Index-scan: Scans the index to find the matching rows, and then retrieves the corresponding rows from the table.
  • Join: Combines the results of two or more tables based on a common column.
  • Sort: Sorts the result set by one or more columns.
  • Aggregate: Groups the result set by one or more columns and performs calculations such as sum, average, and count.
  • Subquery: Executes a subquery and uses the results in the main query.

The query processing system in PostgreSQL is highly configurable and allows for fine-tuning to optimize performance for specific workloads. The planner-optimizer can also use additional information such as statistics and constraints to generate better execution plans.

The index system in PostgreSQL is a key component in the efficient retrieval of data from tables. It includes several types of indexes, each with its own strengths and weaknesses. Here is a more detailed explanation of each of the index types:
  • B-tree indexes: B-tree indexes are the most common type of index in PostgreSQL. They use a balanced tree structure to store data, and are efficient for both equality and range queries. The B-tree index stores data in a sorted order, which allows for fast lookups of specific rows. B-tree indexes are used for most data types, including integers, floating-point numbers, dates, and timestamps.
To create a B-tree index, use the following syntax:

CREATE INDEX index_name ON table_name (column_name);

  • Hash indexes: Hash indexes are used for equality queries on data types that have a hash function defined for them. They are typically faster than B-tree indexes for small tables, but can become less efficient as the table grows. Hash indexes are best suited for small tables with a high number of distinct values in the indexed column. 
To create a Hash index, use the following syntax: 
 
CREATE INDEX index_name ON table_name USING HASH (column_name);
  • GiST (Generalized Search Tree) indexes: GiST indexes are used for complex data types, such as geometric or text data. They are efficient for both equality and range queries, and can be used to support full-text search and spatial data queries. GiST indexes use a tree-based structure to store data, similar to a B-tree, but with the added flexibility to handle more complex data types. 
To create a GiST index, use the following syntax:

CREATE INDEX index_name ON table_name USING GIST (column_name);

  • SP-GiST (Space-partitioned Generalized Search Tree) indexes: SP-GiST indexes are a variation of GiST indexes that are optimized for data types with a large number of distinct values and efficient for both equality and range queries. They are particularly well suited for data types with a high number of distinct values such as IP addresses, MAC addresses and UUIDs.
To create a SP-GiST index, use the following syntax:   
 
CREATE INDEX index_name ON table_name USING SP-GIST (column_name);
  • GIN (Generalized Inverted Index) indexes: GIN indexes are used for complex data types such as full-text search and arrays. They are efficient for containment queries, but not for range queries. GIN indexes use an inverted index data structure, which allows for fast lookups of specific values within a column. 
To create a GIN index, use the following syntax:

CREATE INDEX index_name ON table_name USING GIN (column_name);
  • BRIN (Block Range INdex) indexes: BRIN indexes are used for large tables with data that is naturally sorted by a certain column, such as time-series data. They are efficient for range queries on the indexed column, but not for equality queries. BRIN indexes use a combination of a B-tree and a bitmap to store data, which allows for fast lookups of specific ranges of data within a column.
To create a BRIN index, use the following syntax: 

CREATE INDEX index_name ON table_name USING BRIN (column_name);

In PostgreSQL, the two main ways to retrieve data from a table are by using a sequential scan (seq-scan) or an index scan (index-scan).

Sequential Scan:

A seq-scan, also known as a table scan, is the process of reading every row in a table sequentially to find the matching rows for a query. The seq-scan reads the entire table, regardless of the size of the table, to find the rows that match the query. This can be a slow process for large tables with millions of rows. However, it can be faster for small tables or for queries that return a high percentage of the rows in the table.

Index Scan:

An index-scan, on the other hand, uses an index to quickly locate the matching rows for a query. An index is a separate data structure that contains a copy of the data from one or more columns of a table, along with a pointer to the location of the corresponding row in the table. When a query is executed, the index-scan reads the index rather than the table, and then uses the pointers to retrieve the matching rows. This can be faster than a seq-scan for large tables, or for queries that return a small percentage of the rows in the table.

It is worth noting that both seq-scan and index-scan have their own advantages and disadvantages. The seq-scan is best suited for queries that return a high percentage of the rows in the table. The index-scan, on the other hand, is best suited for queries that return a small percentage of the rows in the table, or for queries that filter data on indexed column.

To decide which method to use, PostgreSQL uses a query optimizer called the planner-optimizer, which generates a query plan using a cost-based approach. The planner-optimizer estimates the cost of each possible plan based on the statistics collected on the table and indexes, and then chooses the plan with the lowest estimated cost.

In summary, a seq-scan reads every row in a table to find the matching rows for a query, while an index-scan uses an index to quickly locate the matching rows. The choice of which method to use depends on the specific use-case and the nature of the data stored in the table. It is important to note that the query processing system in PostgreSQL is highly configurable and allows for fine-tuning to optimize performance for specific workloads.

When creating a table, it's important to consider which columns should be indexed and how the data will be queried to ensure optimal performance. By indexing the columns that are frequently used in WHERE clauses, the database can quickly retrieve the matching rows without having to scan the entire table.

It's also important to keep in mind that indexes do come with some overhead. They take up additional space on disk, and can slow down the performance of INSERT, UPDATE and DELETE statements. So, it's a balance between the performance gain from using an index-scan and the overhead of maintaining the index.

In general, it's a good practice to index columns that have high cardinality, meaning that they have a large number of distinct values. These columns are likely to be used in WHERE clauses and will benefit from having an index. It's also important to keep an eye on the size of the table and indexes, as they can grow over time, and this can affect the performance of the queries.

In short, the choice of whether to use a seq-scan or index-scan depends on the specific use-case, the nature of the data stored in the table and the nature of the queries being executed. It's important to understand the trade-offs and make informed decisions about which columns to index and when to use a seq-scan or index-scan.

Comments

Popular posts from this blog

Community Detection Algorithm (Leiden Algorithm in Python)

Stack and Queues

How To Earn Money Online