
Elasticsearch Final Term Report



<b>HANOI UNIVERSITY OF SCIENCE</b>
<b>MATHEMATICS FACULTY</b>

Credit class: MAT3507

Nguyễn Sỹ Nam Trung (20001588) Vũ Đức Duy (22001246) Phạm Hữu Hưng (22001597)

Tutor: Mr. Vũ Tiến Dũng

Ha Noi,


We are living in an era of science and technology in which everything, especially artificial intelligence, is developing rapidly. Human civilization is also advancing, raising the demands of daily life, education, work, and other areas. In addition, the roughly 25% annual increase in the number of people entering science and engineering has rapidly expanded the number of scientists; as a result, the amount of scientific literature and research output has grown significantly. All of this has created a huge and constantly growing amount of information, leading to an information explosion.

To manage this ever-increasing and complex data, software and digital products have been developed. We have chosen Elasticsearch, a powerful search application, as the topic of our report, as a contribution to expanding access to knowledge and addressing the urgent problem described above.

We thank Mr. Vu Tien Dung for providing us with the basic, foundational knowledge, for guiding us in how to conduct the report, and for directing our scientific thinking and work. All of his teachings are important knowledge that helped us complete this essay.

Thanks also to the whole group for always working hard and taking responsibility for the work. Due to the limits of our knowledge and theoretical ability, our work still has many shortcomings and limitations. We sincerely hope that you can provide guidance and contributions to make this report more complete. Thank you very much!

<b>Big Data</b>

<b>1. The concept of Big Data</b>

Big Data is a term commonly used for collections of data so large and complex that traditional processing tools and applications cannot collect, manage, and process them. These large data sets typically include structured, unstructured, and semi-structured data, so each set is a little different.

In practice, large data sets and large projects can reach the exabyte range. Big Data is generally characterized by the following three properties:

The data volumes are extremely large.
There are many diverse types of data.
Data must be received, processed, and analyzed at high velocity.

This data is assembled into ever-larger data warehouses and can come from many sources, including websites, social media, desktop and mobile applications, scientific and technical experiments, and other internet-connected devices.


<b>2. The origin of Big Data</b>

The term Big Data is usually traced back to the 1960s and 1970s, when the world of data was just getting started with the first data centers and the development of SQL databases.

In 1984, Teradata Corporation released the DBC 1012 parallel data processing system, one of the first systems capable of storing and analyzing 1 terabyte of data. By 2017, there were dozens of databases running on Teradata systems with capacities of up to petabytes.

The largest of these has exceeded the 50-petabyte threshold. Around 2005, people began to realize just how much data users were generating through YouTube, Facebook, and other online services.

During this time, NoSQL databases and the frameworks built around them were increasingly used because they were needed to support the growth of Big Data; according to users, these frameworks make Big Data easier to store and operate on. At present, the volume of Big Data is growing ever more rapidly as users generate an extremely large amount of data. However, this data is created not only by humans but also by machines. In addition, the advent of the IoT, with its many connected devices, makes products easier to use and improves their performance while generating even more data.

<b>3. Characteristics of Big Data</b>

Here are some outstanding features of Big Data:

<b>1. Volume:</b>

With Big Data, organizations process large volumes of low-density, unstructured data. This data can be of unknown value, such as Twitter feeds, clickstreams on a web page, or data from mobile applications.

For some organizations, this might be tens of terabytes of data; for others, it may be hundreds of petabytes.

<b>2. Velocity: </b>

Velocity is the rate at which data is received and acted on. Normally, the highest-velocity data streams directly into memory rather than being written to disk. Some internet-enabled smart products operate in real time or near real time and therefore require real-time evaluation and action.

<b>3. Variety: </b>

Variety refers to the many types of data that are available. Traditional data types are different: they usually have a more consistent, compact structure and fit into conventional databases. With Big Data, data is unstructured or, more commonly, semi-structured, and requires additional processing to derive meaning and to support metadata.

These large data warehouses are built from data that can come from a number of sources, such as mobile applications, desktop applications, social networks, and websites, as well as scientific experiments, sensors, and other devices on the Internet of Things (IoT).


Big Data, when combined with the relevant components, allows organizations to put data to practical use and solve business problems. These components include:

Analytics that can be applied to the data.
IT infrastructure to support Big Data.
The technologies and related skill sets needed for Big Data projects.
Real-world use cases involving Big Data.

When data is analyzed, the value it brings to organizations is extremely large; without analysis, it is merely data of limited business use. By analyzing big data, businesses gain a number of benefits related to customer service.

From there, it brings more efficiency to the business as well as increases competitiveness and revenue for the company.

<b>4. The role of big data in business</b>

<b>Targeting customers correctly: Big Data's data is collected from many different sources, </b>

including social networks, one of the channels most frequently used by customers. Through Big Data analysis, businesses can therefore understand customer behavior, preferences, and needs, while classifying and selecting the customers best suited to the business's products and services.

<b>Security prevention and risk mitigation: Big Data is used by businesses as a tool to probe, </b>

prevent and detect threats, risks, theft of confidential information, and system intrusions.

<b>Price optimization: Pricing any product is an important thing as well as a challenge for </b>

businesses, because the company needs to carefully research customer demand and the prices that competitors charge for the same product.

<b>Quantify and optimize personal performance: Due to the emergence of smart mobile </b>

devices such as laptops, tablets, and smartphones, collecting information and personal data has become easier than ever. Collecting personal data from users gives businesses a clear view of each customer's trends and needs, which helps them map out future development directions and strategies.

<b>Capture financial transactions: Financial interfaces on websites or e-commerce apps are </b>

increasing due to the strong development of e-commerce worldwide. Therefore, Big Data algorithms are used by businesses to suggest and make transaction decisions for customers.

<b>V. Operational process of Big Data</b>

<b>1. Building a Big Data strategy</b>

A Big Data strategy is a plan designed to help you monitor and improve the way your business collects, stores, manages, shares, and uses data. When developing a Big Data strategy, it is important to consider the business's current and future goals.

<b> 2. Identify Big data sources</b>

Data from social networks: Big Data in the form of images, videos, audio and text will be very useful for marketing and sales functions. This data is often unstructured or semi-structured.

Publicly available data: Information and data that has been publicly released.
Live streaming data: Data from the Internet of Things and connected devices that streams into information technology systems from smart devices.

Other: Data that comes from various other sources.

<b> 3. Access, manage and store Big Data</b>


Modern computer systems provide the speed, power and flexibility needed to quickly access data. Along with reliable access, companies also need methods for data integration, quality assurance, and the ability to manage and store data for analysis.

<b> 4. Conduct data analysis</b>

With high-performance technologies such as in-memory analytics or grid computing, businesses will choose to use all of their big data for analysis. Another approach is to predetermine what data is relevant before analysis. Big data analytics is how companies gain value and insights from data.

<b> 5. Make decisions based on data</b>

Reliable, well-managed data leads to informed judgments and decisions. To stay competitive, businesses need to grasp the full value of big data and data-driven operations to make decisions based on well-proven data.


<b>Introduction Of Elasticsearch</b>

Elasticsearch is an open-source search engine that has become increasingly popular in modern data management. It is designed to handle large amounts of data and provides a powerful query language and real-time search capabilities. In this essay, we will explore the definition and overview of Elasticsearch and the importance of Elasticsearch in modern data management.

<b>5. Definition and Overview of Elasticsearch</b>

Elasticsearch is a search engine that is built on top of the Apache Lucene search engine library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is designed to be highly scalable and can handle petabytes of data across thousands of servers.

One of the key features of Elasticsearch is its ability to handle complex queries. Elasticsearch provides a powerful query language that allows users to search for specific terms, phrases, or patterns within their data. This makes it easy for users to find the information they need quickly and efficiently.

Another key feature of Elasticsearch is its real-time search capabilities. Elasticsearch is designed to provide real-time search results, which means that users can get up-to-date information as soon as it becomes available. This makes Elasticsearch ideal for applications that require real-time data analysis, such as e-commerce sites, social media platforms, and financial services.

<b>6. Importance of Elasticsearch in Modern Data Management</b>

Elasticsearch is becoming increasingly important in modern data management because of its ability to handle large amounts of data and provide real-time search results. With the explosion of data in recent years, traditional databases are no longer sufficient to handle the volume and complexity of data that organizations are generating. Elasticsearch provides a scalable and efficient solution for managing large amounts of data.

In addition, Elasticsearch is highly flexible and can be used in a wide range of applications. It can be used for log analysis, e-commerce search, geospatial analysis, and more.


This versatility makes Elasticsearch an attractive option for organizations that need a search engine that can handle a variety of use cases.

<b>7. Conclusion</b>

In conclusion, Elasticsearch is an open-source search engine that is designed to handle large amounts of data. It provides a powerful query language and real-time search capabilities, making it ideal for modern data management. Its scalability, efficiency, and flexibility make it an attractive option for organizations that need to manage large amounts of data across multiple servers. As data continues to grow in volume and complexity, Elasticsearch is likely to become even more important in modern data management.

<b>Key features of elasticsearch</b>

Several outstanding features of Elasticsearch mark its strong influence on data processing:

<b>1. Full-text search capabilities</b>

- Full-text search is a technique for searching information in a database, allowing you to search for words or phrases throughout the content of a document. This technique makes finding information easier and faster. Simply put, full text search (FTS for short) is the most natural way to search for information, just like Google, we just need to type in keywords and press "enter" to get results.

- Specifically, Elasticsearch supports searching and evaluating the relevance of text based on content, using the TF-IDF (Term Frequency - Inverse Document Frequency) technique to evaluate the importance of words in the text. In addition, the key factor that determines the difference between Full-text search technique (in Elasticsearch in particular and other search engines in general) and other common search techniques is "Inverted Index".
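
To illustrate the analysis step that feeds the inverted index, the sketch below (assuming a local cluster reachable at `http://localhost:9200`) asks the `_analyze` API how the standard analyzer breaks a sentence into terms; it is these terms that are stored in the inverted index and weighted with TF-IDF-style relevance scoring.

```bash
# Show how the standard analyzer tokenizes a sentence; the resulting
# tokens are the terms that end up in the inverted index.
curl -X POST "http://localhost:9200/_analyze" -H "Content-Type: application/json" -d '{
  "analyzer": "standard",
  "text": "Elasticsearch makes full-text search fast and easy"
}'
```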

<b>2. Distributed architecture and scalability</b>

- Elasticsearch uses a distributed architecture that allows large volumes of data to be stored, queried, and processed across multiple nodes in a cluster. At the same time, data is indexed and stored in shards, distributed across nodes to simplify maintenance, improve scalability and fault tolerance.

- Scalability: As the database grows, it becomes difficult to search. But Elasticsearch scales as your database grows, so search speeds don't slow down. Elasticsearch allows for high-scale expansion up to thousands of servers and holds petabytes (PB) of data. Users do not need to manage the complexity of the distributed design because these features are implemented automatically.

<b>3. Near real-time data processing</b>

- Elasticsearch offers near real-time search, which means that document changes are not visible to search immediately but become visible within about one second. Near real-time search is undoubtedly one of Elasticsearch's most important features.

- The key point behind near real-time processing is that every shard is refreshed automatically once per second by default. Besides, Elasticsearch has fault tolerance and high resiliency: its clusters have primary and replica shards to provide failover in case a node goes down.


When a node leaves the cluster for whatever reason, intentional or otherwise, the master node reacts by replacing the node with a replica and rebalancing the shards. These actions are intended to protect the cluster against data loss by ensuring that every shard is fully replicated as soon as possible. In short, if a shard fails, its replica will be used instead. Elasticsearch is therefore very effective when working with big data, handling hundreds or even thousands of records in near real time.
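
As an illustration of this refresh behavior, the commands below are a minimal sketch (the index name `logs` and the local address are assumptions) showing how to trigger a refresh manually and how to change the automatic refresh interval.

```bash
# Force a refresh so recently indexed documents become searchable immediately
curl -X POST "http://localhost:9200/logs/_refresh"

# Change how often the index is refreshed automatically (the default is 1s)
curl -X PUT "http://localhost:9200/logs/_settings" -H "Content-Type: application/json" -d '{
  "index": { "refresh_interval": "5s" }
}'
```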

However, it is not always necessary to use Elasticsearch. If your project doesn't require fast search, complex data analysis, or big data management, there may be other solutions that are more suitable. Tool selection depends on the specific requirements of the project and the available resources.

<b>4. Support for various data types and formats</b>

- Elasticsearch supports a variety of field data types, such as:

Textual: Text data, such as titles, content, summaries, comments, reviews, etc.
Numeric: Numerical data, such as prices, scores, times, etc.

Geospatial: Geographic data, such as coordinates, addresses, distances, etc.
Structured: Structured data, such as tables, charts, lists, etc.

Unstructured: Unstructured data, such as images, videos, sounds, etc.
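
For illustration, a minimal index mapping covering these categories might look like the sketch below; the index name `catalog` and the field names are only examples, not part of any particular application.

```bash
# Create an index whose fields cover textual, numeric, date, and geospatial data
curl -X PUT "http://localhost:9200/catalog" -H "Content-Type: application/json" -d '{
  "mappings": {
    "properties": {
      "title":       { "type": "text" },
      "price":       { "type": "float" },
      "released_at": { "type": "date" },
      "location":    { "type": "geo_point" }
    }
  }
}'
```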

- Elasticsearch operates as a server (often provided as a cloud service); its response format is JSON and it is written in Java, so it can operate on many different platforms. It is interactive and supports many programming languages, such as Java, JavaScript, Node.js, Go, .NET (C#), PHP, Perl, Python, and Ruby. (Note, however, that its default security is not high.)

- Elasticsearch uses JSON to process requests and responses, so you need to convert other data to JSON format before storing or querying. This is also a disadvantage of ES compared to other search engines also built on Lucene such as Apache Solr (supports JSON, CSV, XML).

<b>Architecture of Elasticsearch</b>

Elasticsearch is a distributed search and analytics engine that is widely used for indexing, searching, and analyzing large volumes of data. Its architecture is designed to be highly scalable and fault-tolerant. Here's an overview of the key components of Elasticsearch's architecture:

<b>1. Nodes, Clusters, and Shards:</b>


- Coordinating Node: Routes client requests to the appropriate data nodes, providing a more efficient way to interact with the cluster

<b>Let's explore the concept of nodes in Elasticsearch with a concrete example:</b>

Example of Nodes in Elasticsearch:

Suppose you are setting up an Elasticsearch cluster to store and search log data for a web application. In this example, you have three nodes in your Elasticsearch cluster.

a. Node 1:
- Role: Data Node

- Hardware: Physical server with high storage capacity

- Responsibilities: Stores and manages the log data; handles indexing and search operations.
- Configuration: It is configured as a dedicated data node to ensure optimal storage and search performance for log data.

c. Node 3:
- Role: Coordinating Node
- Hardware: Virtual machine with moderate resources

- Responsibilities: Primarily handles incoming client requests, routing them to the appropriate data nodes. It doesn't store data but improves the efficiency of search operations.
- Configuration: It is configured as a coordinating node to reduce the load on data nodes and facilitate efficient query routing.

All three nodes share a common cluster name, for instance, "web-logs-cluster," which means they form a single Elasticsearch cluster. This cluster enables you to distribute and search your web application log data efficiently.

Data nodes (Node 1 and Node 2) handle the actual storage and retrieval of log data, while the coordinating node (Node 3) ensures that client requests are efficiently routed to the appropriate data nodes. In the case of any cluster management tasks, such as creating new indices or managing shard allocation, the master-eligible node (Node 2) assists with these operations.
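
If you want to verify how such a cluster is laid out, one quick way (assuming you can reach any node at `http://localhost:9200`) is the `_cat/nodes` API, which lists every node together with its roles and whether it is the elected master.

```bash
# List the nodes in the cluster with their roles (e.g. "d" = data, "m" = master-eligible)
curl -X GET "http://localhost:9200/_cat/nodes?v&h=name,node.role,master"
```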

This example illustrates how Elasticsearch's distributed architecture uses nodes with different roles to work together in a cluster, allowing for scalability, fault tolerance, and efficient data management and search capabilities.

<b>1.2. Cluster:</b>

In Elasticsearch, a cluster is a fundamental component of its architecture. It represents a group of interconnected nodes that work together to store and manage data, ensuring high availability, scalability, and fault tolerance.


Here's a detailed explanation of a cluster in Elasticsearch:

- If the master node fails, another node is automatically elected as the new master to ensure cluster continuity.

g. Fault Tolerance:

- Elasticsearch clusters are designed for high availability and fault tolerance. Data is replicated by creating replica shards for each primary shard, allowing for data recovery in the event of node failures.

- Data distribution across nodes further enhances fault tolerance.

h. Scalability:

- Elasticsearch clusters are scalable and can handle growing data volumes and query loads. You can add more nodes to a cluster to increase its capacity and performance.

- Elasticsearch's distributed nature enables it to scale horizontally as new nodes are added.

i. Search and Query Distribution:


- When you send a query to a cluster, coordinating nodes route the query to the relevant data nodes, distributing the search workload across multiple nodes for improved query performance.

<b>Let's look at a concrete example of a cluster in Elasticsearch to illustrate how it works:</b>

Example of a Cluster in Elasticsearch:

Imagine you are managing an e-commerce website, and you decide to implement Elasticsearch to power the search functionality for your online store. You set up an Elasticsearch cluster to handle the product data, and it consists of multiple nodes working together.

Cluster Name: Your Elasticsearch cluster is named "ecommerce-products-cluster." All nodes in the cluster must use this common cluster name.

1.2.1. Nodes:

a. Node 1:

- Role: Data Node and Master-eligible Node
- Hardware: Physical server with high storage capacity

- Responsibilities: Stores product data, manages indexing and searching, and assists with cluster management tasks.

- Configuration: This node is both a data node and a master-eligible node to efficiently manage both data and cluster operations.

b. Node 2:
- Role: Data Node

- Hardware: Virtual machine with moderate resources

- Responsibilities: Stores product data and handles indexing and searching operations.
- Configuration: This node is focused on data storage and retrieval.

c. Node 3:

- Role: Coordinating Node

- Hardware: Virtual machine with moderate resources

- Responsibilities: Serves as a coordinating node to efficiently route client search requests to the appropriate data nodes.

- Configuration: It doesn't store data but optimizes query routing.

Data Distribution:

- Product data, including product descriptions, prices, and availability, is divided into smaller units called shards. You have defined an index called "products" with 5 primary shards and 1 replica for each primary shard. This means you have 5 primary shards and 5 replica shards for the "products" index.

- The shards are distributed across the data nodes, ensuring data redundancy and improving query performance.
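
As a sketch of how this index could be created and inspected (assuming a cluster reachable at `http://localhost:9200`), the commands below define the "products" index with 5 primary shards and 1 replica per primary shard, then list where the resulting shards were allocated.

```bash
# Create the "products" index with 5 primary shards and 1 replica per primary shard
curl -X PUT "http://localhost:9200/products" -H "Content-Type: application/json" -d '{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}'

# Show how the 10 resulting shards are distributed across the data nodes
curl -X GET "http://localhost:9200/_cat/shards/products?v"
```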

Cluster State:

- The cluster maintains a cluster state that tracks the health of the cluster, shard allocation, and node statuses. It continuously updates as nodes join or leave the cluster, and as data is indexed or queried.
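
A simple way to look at this state in practice is the cluster health API; the one-liner below (assuming a local cluster) reports the cluster status, the number of nodes, and shard allocation counters.

```bash
# Summarize cluster health: status (green/yellow/red), node count, and shard counts
curl -X GET "http://localhost:9200/_cluster/health?pretty"
```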


Master Node:

- Among the nodes, one of them is automatically elected as the master node. In this example, Node 1 is elected as the master node. It is responsible for managing cluster-wide operations, such as creating new indices and managing shard allocation.

Search and Query Distribution:

- When customers search for products on your website, the coordinating node (Node 3) efficiently routes their queries to the relevant data nodes (Node 1 and Node 2), ensuring quick and accurate search results.

This example demonstrates how Elasticsearch clusters are used to efficiently manage, store, and search product data for your e-commerce website. Elasticsearch's distributed architecture, with its cluster, nodes, and shards, ensures scalability, fault tolerance, and high-performance search capabilities.

In summary, an Elasticsearch cluster is a collection of nodes that work together to store, manage, and search data. Clusters provide high availability, scalability, and fault tolerance, making Elasticsearch a powerful tool for handling large volumes of data and delivering efficient search and analytics capabilities. The cluster ensures that data is distributed across nodes, and the master node oversees cluster-wide operations, while the data nodes handle the storage and retrieval of data.

<b>1.3. Shard:</b>

In Elasticsearch, a shard is a fundamental unit of data storage and distribution. It plays a central role in the distributed nature of the system, enabling horizontal scalability, data redundancy, and parallel processing. Here's a detailed explanation of a shard in Elasticsearch:

b. Replica Shard:

- For each primary shard, Elasticsearch can create one or more replica shards.
- Replica shards serve as copies of the primary data, providing data redundancy and enhancing query performance. They act as failover instances in case of primary shard failures.
- The number of replica shards can be adjusted after index creation, allowing you to control the level of redundancy and query performance.
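
For example, the replica count can be changed on a live index through the index settings API; the snippet below is a sketch that assumes an existing index named "products" on a local cluster.

```bash
# Raise the number of replicas for the "products" index from 1 to 2
curl -X PUT "http://localhost:9200/products/_settings" -H "Content-Type: application/json" -d '{
  "index": { "number_of_replicas": 2 }
}'
```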


c. Data Distribution:

- Shards are distributed across the nodes in an Elasticsearch cluster. This distribution ensures that data is spread evenly and efficiently across the cluster, allowing for horizontal scalability and improved performance.

d. Parallelism:

- The presence of multiple shards enables parallel processing of data and queries. When you search for data, queries can be distributed across all available shards, leading to faster search responses.

e. Data Resilience:

- The combination of primary and replica shards provides data resilience. If a node or a primary shard fails, Elasticsearch can promote a replica shard to primary, ensuring that data remains available and reliable.

f. Query Performance:

- Replica shards can be used to balance query loads. Elasticsearch can distribute queries to both primary and replica shards, improving the query performance by parallelizing the search process.

g. Index Creation:

- When you create an index in Elasticsearch, you specify the number of primary shards and replica shards as part of the index settings. For example, you might create an index with 5 primary shards and 1 replica for each primary shard.

h. Shard Allocation:

- Elasticsearch's shard allocation algorithm takes into account factors like shard size, node health, and data distribution to decide where to place shards for optimal cluster performance and data balance.

In summary, shards in Elasticsearch are the building blocks for storing and distributing data across the nodes in a cluster. They enable data scalability, redundancy, and parallelism, ensuring that Elasticsearch can efficiently manage and search large volumes of data while providing high availability and fault tolerance. The careful allocation and configuration of primary and replica shards are essential aspects of optimizing Elasticsearch's performance and resilience.

<b>Example 1:</b>

<b>Let's walk through a concrete example of how shards work in Elasticsearch:</b>

Example of Shards in Elasticsearch:

Imagine you have a blogging platform where users can write and publish articles. You've decided to use Elasticsearch to power the search and retrieval of these articles, and you've created an index named "blog-articles" to store the articles. In this index, you've configured the number of primary shards and replica shards to illustrate how shards function:

1. Primary Shards:

- You decide to create the "blog-articles" index with 4 primary shards.

- When an article is published, Elasticsearch's indexing process stores it in one of these primary shards. Each primary shard is responsible for a subset of the data.

2. Replica Shards:

- For each primary shard, you've configured 1 replica shard.


- As a result, you have an equal number of replica shards (4) that mirror the data stored in the primary shards.

- Suppose a node in your cluster experiences a failure, causing the primary shard 2 to become temporarily unavailable.

- Elasticsearch's shard allocation mechanism detects this and promotes one of the replica shards (e.g., replica shard 2) to be the new primary shard for data continuity.

7. Query Performance:

- Queries from users are also distributed to replica shards.

- This parallelism enhances query performance, as both primary and replica shards participate in serving search requests.

8. Scaling:

- As your blogging platform grows, you can scale your Elasticsearch cluster by adding more nodes.
- Elasticsearch's shard distribution ensures that new data is spread across the new nodes, supporting your platform's scalability.

In this example, the "blog-articles" index is divided into primary and replica shards, allowing Elasticsearch to distribute data efficiently, achieve fault tolerance, and enhance search performance. The distribution of shards across the nodes in the cluster ensures that your blogging platform can handle the growing volume of articles while providing a robust and responsive search experience for users.

<b>Example 2: </b>

Here's an example of shards in Elasticsearch, with a code snippet illustrating how to create an index with a specified number of primary and replica shards using the Elasticsearch REST API. In this example, we'll use the `curl` command-line tool to send HTTP requests to an Elasticsearch cluster.

# Create an index named "blog-articles" with 4 primary shards and 1 replica per primary shard
curl -X PUT "http://localhost:9200/blog-articles" -H "Content-Type: application/json" -d '{
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 1
  }
}'

In this code example:


- We use the `curl` command to make an HTTP PUT request to create an index named "blog-articles" in Elasticsearch.

- Within the request body, we specify the settings for the index, including the number of primary shards and the number of replica shards. In this case, we have set 4 primary shards and 1 replica per primary shard.

Once you execute this `curl` command, Elasticsearch will create the "blog-articles" index with the specified shard settings. You can then proceed to index and search documents within this index. The primary and replica shards for the index will be automatically allocated across the nodes in your Elasticsearch cluster.

<b>1.4. Master and Data Nodes:</b>

1.4.1. Master Node:

In Elasticsearch, the "master node" is a critical component of the cluster responsible for managing cluster-wide operations and maintaining the overall state of the Elasticsearch cluster. Here's a detailed explanation of the master node in Elasticsearch:

Role of the Master Node:

a. Cluster Coordination:

- The master node is responsible for cluster coordination and management. It ensures that all nodes within the cluster are working together cohesively.

b. Index and Shard Management:

- It handles the creation, deletion, and management of indices within the cluster. When you create a new index, the master node decides how to distribute its primary and replica shards across data nodes.

c. Shard Allocation:

- The master node oversees shard allocation, determining which data nodes will store primary and replica shards. This includes optimizing data distribution for load balancing and fault tolerance.

d. Cluster State:

- The master node maintains the cluster state, which includes information about the nodes in the cluster, their roles, the indices, shard allocation, and other critical cluster metadata.

e. Election of New Master:

- In the event of a master node failure, the remaining master-eligible nodes initiate a leader election process to choose a new master node. This ensures that cluster operations continue smoothly.

<b>Master-eligible Nodes:</b>

a. Master-eligible Nodes:

- Not all nodes in an Elasticsearch cluster are master nodes. Instead, some nodes are designated as "master-eligible" nodes, meaning they have the potential to be elected as the master node.


b. Configuring Master-eligibility:

- In Elasticsearch, you can configure whether a node is master-eligible by adjusting its configuration in the `elasticsearch.yml` file

c. Odd Number of Master-eligible Nodes:

- It's recommended to have an odd number of master-eligible nodes (usually 3 or 5) to avoid split-brain scenarios and ensure a stable master election process.

<b>Cluster Health and Stability:</b>

a. Cluster Monitoring:

- The master node plays a crucial role in monitoring the health and status of the cluster, including the availability of data nodes, indices, and shards.

b. Cluster State Recovery:

- In the case of failures or issues in the cluster, the master node is responsible for taking corrective actions, such as reassigning shards or promoting replica shards to primaries to maintain cluster health.

In summary, the master node in Elasticsearch is a crucial component responsible for overseeing cluster-wide operations, maintaining cluster health, and ensuring the proper coordination and functioning of the Elasticsearch cluster. It plays a vital role in maintaining the stability and reliability of the cluster's operations. Proper configuration and monitoring of master-eligible nodes are essential for cluster resilience and optimal performance.

1.4.2. Data Node:

In Elasticsearch, a "data node" is a type of node in the cluster that primarily stores and manages data. Data nodes are responsible for holding and indexing the actual data, and they play a crucial role in the distributed storage and retrieval of information. Here's a detailed explanation of data nodes in Elasticsearch:

<b>Role of Data Nodes:</b>

c. Search Operations:

- Data nodes participate in query and search operations. They perform the actual search across the indexed data and return the relevant results to the coordinating node (or directly to the client if there is no coordinating node involved).


a. Node Configuration:

- In Elasticsearch, you can configure a node to serve as a data node by adjusting its settings in the `elasticsearch.yml` configuration file. A typical configuration for a data node might look like this:

```yaml

node.master: false   # This node is not a master node
node.data: true      # This node is a data node
```

- The `node.master` and `node.data` settings determine whether a node can be elected as a master node or function as a data node.

b. Dedicated Data Nodes:

- In larger Elasticsearch clusters, it's common to have dedicated data nodes that exclusively handle data storage and retrieval. These nodes do not participate in cluster-wide management tasks.

<b>Data Distribution and Shards:</b>

1. Shard Allocation:

- Data nodes are responsible for holding primary shards and their corresponding replica shards. Elasticsearch's shard allocation process ensures that these shards are distributed across data nodes for fault tolerance and load balancing.

2. Data Redundancy:

- By having multiple copies of data (primary and replica shards), data nodes provide data redundancy. If a data node fails, replica shards can be promoted to primary, ensuring data availability.

<b>Scalability and Performance:</b>

1.4.3. Data Nodes vs. Master Nodes

1. Dedicated Master Nodes:

- In larger Elasticsearch clusters, it's common to have dedicated master nodes that do not store data but focus solely on cluster management and coordination.


1.5.1. Indexing:

1. Create an Index:

- Before you can index data, you need to create an Elasticsearch index to define the structure and settings for the data you want to store. You specify the number of primary shards and replica shards for the index.

2. Document:

- A document in Elasticsearch is the basic unit of information. It is usually represented in JSON format and can contain various fields with data.

3. Document ID:

- Each document in an index is uniquely identified by a document ID. You can provide your own custom document ID or let Elasticsearch automatically generate one.

4. Indexing a Document:

- When you want to store a document in Elasticsearch, you send an HTTP request to the Elasticsearch cluster with the JSON document data. Here's a simplified example using the `curl` command-line tool:

```bash

curl -X POST "http://localhost:9200/my-index/_doc/1" -H "Content-Type: application/json" -d '{

"field1": "value1", "field2": "value2" }'

```

- In this example, we are sending a JSON document to the "my-index" index with a custom document ID of "1."

5. Routing and Shards:

- Elasticsearch uses routing to determine which shard within the index will store the document. This routing is usually based on the document ID or a routing key.
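
As a sketch of custom routing (the index name `my-index` is reused from the example above, and the routing value `user-42` is purely illustrative), a routing key can be supplied when indexing, and documents that share the same key are placed on the same shard.

```bash
# Index a document with an explicit routing key; documents with the same
# routing value are stored on the same shard
curl -X PUT "http://localhost:9200/my-index/_doc/1?routing=user-42" -H "Content-Type: application/json" -d '{
  "field1": "value1",
  "field2": "value2"
}'
```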


6. Shard Storage:

- The document is stored in the appropriate primary shard. If replication is configured, a copy of the document is also stored in the replica shard for redundancy.

7. Data Distribution:

- Elasticsearch distributes documents across primary and replica shards, ensuring even data distribution across the cluster.

8. Retrieval and Searching:

- After indexing, you can search for documents within the index using search queries. Elasticsearch's powerful search capabilities allow you to retrieve documents based on various criteria.

9. Updates and Deletions:

- You can update existing documents or delete them by sending the appropriate HTTP requests with the document ID.
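
For illustration, the requests below (a sketch reusing the `my-index` example above) show a partial update and a deletion by document ID.

```bash
# Partially update document 1 (only the listed fields are changed)
curl -X POST "http://localhost:9200/my-index/_update/1" -H "Content-Type: application/json" -d '{
  "doc": { "field2": "new value" }
}'

# Delete document 1 by its ID
curl -X DELETE "http://localhost:9200/my-index/_doc/1"
```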

Indexing is a fundamental operation in Elasticsearch, enabling you to store, search, and retrieve data efficiently. The process is designed to be scalable, fault-tolerant, and suitable for a wide range of use cases, from simple document storage to complex data analysis and search applications.

1.5.2. Querying:

Querying in Elasticsearch involves searching for and retrieving documents from an Elasticsearch index based on specific criteria or search queries. Elasticsearch provides a powerful and flexible query system that allows you to perform full-text searches, structured queries, aggregations, and more. Here's an overview of how querying works in Elasticsearch:

a. Creating a Query:

- To perform a query, you construct a query request specifying the search criteria and conditions you want to apply to your data.

b. Query DSL:

- Elasticsearch uses a Query DSL (Domain Specific Language) that allows you to express complex queries. The DSL includes a wide range of query types and filter options to meet your specific needs.

c. Types of Queries:

- Elasticsearch offers various types of queries, including:


- Full-text Queries: These are used for searching full-text content within documents. Common options include match, match_phrase, and multi_match queries.

- Term-Level Queries: Used for exact matching of terms within documents. Examples include term, terms, and prefix queries.

- Compound Queries: These queries combine multiple subqueries to create complex conditions. Examples include bool and dis_max queries.

- Geo Queries: Used for location-based searches.

- Joining Queries: Useful for searching documents with parent-child relationships.

d. Executing the Query:

- You send the query as an HTTP request to the Elasticsearch cluster, specifying the index or indices you want to search within.

```bash

curl -X GET "http://localhost:9200/my-index/_search" -H "Content-Type: application/json" -d '{

"query": { "match": { "field1": "value" }

} }' ```

In this example, we're using a simple match query to search documents in the "my-index" index.

g. Pagination and Sorting:

- You can use pagination and sorting options to control the order of results and navigate through large result sets.

h. Aggregations:

- Elasticsearch allows you to perform aggregations to analyze and summarize the data in your search results. You can calculate statistics, create histograms, and generate various insights.
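
As a sketch (reusing the `my-index` example and assuming a dynamically mapped text field `field1` with its `.keyword` sub-field plus a numeric `price` field), the request below buckets documents by `field1` and computes an average price without returning any hits.

```bash
# Aggregations only: bucket documents by field1 and average the price field
curl -X GET "http://localhost:9200/my-index/_search" -H "Content-Type: application/json" -d '{
  "size": 0,
  "aggs": {
    "by_field1": { "terms": { "field": "field1.keyword" } },
    "avg_price": { "avg": { "field": "price" } }
  }
}'
```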

i. Filtering and Scoring:

- You can apply filters to your queries to narrow down results. Scoring mechanisms allow you to control the relevance of documents based on specific criteria.
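
A typical pattern is a `bool` query that combines a scored `match` clause with a non-scoring `filter`; the sketch below reuses the `my-index` example and assumes a numeric `price` field.

```bash
# The match clause contributes to the relevance score; the range filter only narrows results
curl -X GET "http://localhost:9200/my-index/_search" -H "Content-Type: application/json" -d '{
  "query": {
    "bool": {
      "must":   [ { "match": { "field1": "value" } } ],
      "filter": [ { "range": { "price": { "gte": 10, "lte": 100 } } } ]
    }
  }
}'
```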

k. Highlighting:

- Elasticsearch supports highlighting to display matching text fragments within documents, making it easier for users to identify why a document matched a query.
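
For illustration, the sketch below (again reusing the `my-index` example) asks for highlighted fragments of the matching field alongside the normal hits.

```bash
# Return matching fragments of field1, wrapped in <em> tags by default
curl -X GET "http://localhost:9200/my-index/_search" -H "Content-Type: application/json" -d '{
  "query": { "match": { "field1": "value" } },
  "highlight": { "fields": { "field1": {} } }
}'
```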


l. Real-time Updates:

- Elasticsearch provides near real-time search, which means that new or updated documents are searchable shortly after being indexed. The system continually refreshes indices to include newly indexed data.

m. Caching and Query Optimization:

- Elasticsearch uses caching and query optimization techniques to improve search performance and reduce the load on the cluster.

Querying is a fundamental feature of Elasticsearch, making it a powerful tool for searching, retrieving, and analyzing data. The flexibility and richness of Elasticsearch's query system allow you to build a wide range of search and analytics applications.

<b>Indexing and Searching in Elasticsearch</b>

<b>1. Document structure and mapping</b>

In Elasticsearch, a document is the basic unit of information that can be indexed and searched. It is represented as a JSON object and consists of key-value pairs. Each document is associated with an index and a type.

The structure of a document in Elasticsearch is flexible, allowing you to define the fields and their data types dynamically. This means that you don't need to define a schema upfront. Fields can contain various types of data, such as strings, numbers, dates, arrays, or even nested objects.

When indexing a document, Elasticsearch automatically maps the fields based on their data types. It analyzes the text fields to enable full-text search capabilities, and it indexes the numeric and date fields for efficient filtering and sorting.

You can also define explicit mappings in Elasticsearch to customize the field types, analyzers, or other settings. Mapping allows you to control how fields are indexed and queried, and it can improve search performance.
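
As a sketch of such an explicit mapping (the index name `articles`, the field names, and the choice of the built-in `english` analyzer are all illustrative), the request below fixes the field types and the analyzer up front instead of relying on dynamic mapping.

```bash
# Create an index with an explicit mapping: an analyzed text field,
# an exact-match keyword field, and a date field
curl -X PUT "http://localhost:9200/articles" -H "Content-Type: application/json" -d '{
  "mappings": {
    "properties": {
      "title":        { "type": "text", "analyzer": "english" },
      "status":       { "type": "keyword" },
      "published_at": { "type": "date" }
    }
  }
}'
```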

<b>2. Indexing data into Elasticsearch</b>

Indexing data into Elasticsearch is a crucial step in the process of utilizing Elasticsearch's powerful search capabilities. Elasticsearch is a distributed, open-source search and analytics engine designed to handle large volumes of data in real-time. It is commonly used for various applications such as logging, monitoring, and e-commerce search.


Indexing in Elasticsearch involves adding data to an index, which is a collection of documents that share similar characteristics or attributes. Before indexing data, it is important to define the structure of the documents and the fields that will be indexed. This is done through mapping, which defines the data types and properties of each field.

To index data into Elasticsearch, there are several methods available. One way is to use the Elasticsearch API, which provides a simple and flexible way to add, update, or delete documents. Another way is to use Logstash, a data processing pipeline that can ingest data from various sources and send it to Elasticsearch for indexing.

When indexing data, it is important to consider the performance implications.

Elasticsearch provides various settings and configurations to optimize indexing performance, such as bulk indexing and index sharding. Additionally, it is important to ensure data quality and consistency by validating and cleaning the data before indexing.
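
For illustration, bulk indexing batches many operations into a single request; the sketch below (the index name `my-index` is an example) indexes two documents with one `_bulk` call using newline-delimited JSON.

```bash
# Index two documents in one request; the bulk body is newline-delimited JSON
# and must end with a newline
curl -X POST "http://localhost:9200/my-index/_bulk" -H "Content-Type: application/x-ndjson" -d '
{ "index": { "_id": "1" } }
{ "field1": "value1" }
{ "index": { "_id": "2" } }
{ "field1": "value2" }
'
```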

In conclusion, indexing data into Elasticsearch is a crucial step in utilizing its powerful search capabilities. It involves defining the structure of documents through mapping and adding data to an index using various methods. It is important to consider performance implications and ensure data quality and consistency when indexing data.

<b>3. Query DSL for searching and filtering</b>

Query DSL is a powerful tool in Elasticsearch for searching and filtering data. It enables the construction of complex queries using a JSON-like syntax. The search query, consisting of clauses like match and bool, forms the core of Query DSL. It allows for combining multiple queries using boolean operators. Query DSL also offers filtering capabilities, such as term, range, and geo filters, to refine search results based on specific criteria. Additionally, Query DSL supports aggregations for grouping and analyzing search results, providing statistical information and data grouping. Overall, Query DSL is a flexible and essential tool for real-time data analysis and search applications in Elasticsearch.

<b>4. Highlighting and sorting results</b>

In Elasticsearch, highlighting and sorting features are crucial for improving the search experience and delivering relevant search results.

Highlighting enables users to quickly identify relevant information within a document by emphasizing matching terms in the search results. Elasticsearch provides highlighting capabilities through the highlight query or highlight field options. By specifying the fields to highlight and the highlighting settings, you can retrieve search results with highlighted snippets, making it easier for users to find the relevant content they are looking for.

Sorting, on the other hand, enables you to order the search results based on specific criteria. Elasticsearch offers various sorting options, including sorting by field values, relevance score, or custom scripts. You can specify the sorting criteria and parameters in the search request to control the order in which the search results are returned. Sorting can be performed in ascending or descending order, giving you flexibility in how you present the results.
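
As a sketch (reusing the hypothetical `my-index` with an assumed numeric `price` field), the request below sorts matching documents by price in descending order and uses the relevance score as a tie-breaker.

```bash
# Sort by price descending, then by relevance score
curl -X GET "http://localhost:9200/my-index/_search" -H "Content-Type: application/json" -d '{
  "query": { "match": { "field1": "value" } },
  "sort":  [ { "price": "desc" }, "_score" ]
}'
```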

Combining highlighting and sorting with other query clauses and filters can create more refined and informative search experiences. These features are particularly useful in
