higher nationals in computing unit 14 business intelligence assignment 1


<span class="text_page_counter">Trang 1</span><div class="page_container" data-page="1">

Higher Nationals in Computing Unit 14: Business Intelligence

Assignment 1

Learner’s name: Nguyễn Lê Quang Tuấn Anh ID: GCS200729

Class: GCS0905A Subject code: 1641

Assessor name: Nguyen Xuan Sam

</div><span class="text_page_counter">Trang 3</span><div class="page_container" data-page="3">

ASSIGNMENT 1 FRONT SHEET

Qualification BTEC Level 5 HND Diploma in Computing Unit number and title Unit 14: Business Intelligence

Submission date March 15, 2023 Date Received 1st submission Re-submission Date Date Received 2nd submission

Student declaration

I certify that the assignment submission is entirely my own work and I fully understand the consequences of plagiarism. I understand that making a false declaration is a form of malpractice.

Student’s signature Grading grid

</div><span class="text_page_counter">Trang 5</span><div class="page_container" data-page="5">

Summative Feedback: Resubmission Feedback:

Grade: Assessor Signature: Date: IV Signature:

</div><span class="text_page_counter">Trang 7</span><div class="page_container" data-page="7">

Assessment Brief

Student Name/ID

Unit Number and Title 14: Business Intelligence Academic Year 2019-2020

Unit Tutor Unit 14: Business Intelligence Assignment Number &

Title Assignment 1: Discover business process and BI technologies Issue Date

Submission Date March 4, 2023 IV Name & Date

Submission Format

The submission is in the form of a Microsoft® PowerPoint® style presentation to be presented to your colleagues. The presentation can include links to performance data with additional speaker notes and a bibliography using the Harvard referencing system. The presentation slides for the findings should be submitted with speaker notes as one copy. You are required to make effective use of headings, bullet points and subsections, as appropriate. Your research should be referenced using the Harvard referencing system. The recommended word limit is 500 words, including speaker notes, although you will not be penalised for exceeding the total word limit.

Unit Learning Outcomes

LO1 Discuss business processes and the mechanisms used to support business decision-making.

LO2 Compare the tools and technologies associated with business intelligence functionality.

Assignment Brief

</div><span class="text_page_counter">Trang 8</span><div class="page_container" data-page="8">

Your company has been operating in [Assumed Domain] for 2 years. For a new, young company, competition in the market is very high. Therefore, the Board of Directors has decided to apply Business Intelligence to improve the company's business processes by making better decisions.

The Board of Directors assigns a small group, including you, in the Research & Development Department to study business intelligence for application in the company in the coming years. You need to research the business processes and decision support processes in the company and identify the types of data (unstructured, semi-structured or structured) generated by these processes, with examples. You also need to research the software currently used in the business process or decision support process and evaluate its use (benefits and drawbacks).

Next, you need to understand the types of support for decision-making at different levels (operational, tactical and strategic) within the company and study which business intelligence features can help with those types of support. Study the information systems or technologies (of BI) that can be used in this case, and compare and contrast them to conclude which should be used. Your group needs to present the research results to the board in a 30-minute presentation.

</div><span class="text_page_counter">Trang 9</span><div class="page_container" data-page="9">

<small>Learning Outcomes and Assessment Criteria </small>

<small>LO1 Discuss business processes and the mechanisms used to support business decision-making </small>

<small>D1 Evaluate the benefits and drawbacks of using application software as a mechanism for business processing. </small>

<small>P1 Examine, using examples, the terms ‘Business Process’ and ‘Supporting Processes’. </small>

<small>M1 Differentiate between unstructured and semi-structured data within an organisation. </small>

<small>LO2 Compare the tools and technologies associated with business intelligence functionality</small>

<small>D2 Compare and contrast a range of information systems and technologies that can be used to support organisations at operational, tactical and strategic levels.</small>

<small>P2 Compare the types of support available for business decision-making at varying levels within an organisation.</small>

<small>M2 Justify, with specific examples, the key features of business intelligence functionality.</small>

</div><span class="text_page_counter">Trang 12</span><div class="page_container" data-page="12">

5.1 Conclusions... 23

5.2 Future works ... 23

References ... 24

Appendix ... 25

</div><span class="text_page_counter">Trang 13</span><div class="page_container" data-page="13">

Table of Figures

Figure 1: The factors impact on house price. ... 8

Figure 2 The summary of methodology ... 10

Figure 3 The raw dataset ... 11

Figure 4 Correlation ... 13

Figure 5 Two-variables Relationships ... 13

Figure 6 Correlation between Two Variables ... 14

Figure 7 Linear regression ... 14

Figure 8 Multiple regression ... 14

Figure 9 R-squares and Adjusted R-squares 1 ... 15

Figure 10 R-squares and Adjusted R-squares 2 ... 15

Figure 11 Model accuracy 1 ... 16

Figure 12 Model accuracy 2 ... 16

Figure 13 Model accuracy 3 ... 16

Figure 14 Step 1: Install basic packages for this work ... 17

Figure 15 Step 2: Install packages for data visualization ... 17

Figure 16 Step 3: Install packages for modeling 1 ... 17

Figure 17 Step 3: Install packages for modeling 2 ... 17

Figure 18 Import in Jupyter ... 18

Figure 19 Input Data ... 18

Figure 20 Statements to describe data information ... 19

Figure 21 Heatmap ... 19

Figure 22 ... 20

Figure 23 ... 20

Figure 24 explore data ... 20

Figure 25 Price versus Number of bathrooms ... 21

Figure 26 Price versus Grade ... 21

Figure 27 Price versus Square Feet of the houses excluding basement ... 21

</div><span class="text_page_counter">Trang 14</span><div class="page_container" data-page="14">


Figure 28 Price versus Square Feet of 15 closest neighbors’ houses ... 22

Figure 29 OLS Regression Result between grade and price ... 22

Figure 30 Model visualization of grade and price ... 22

Figure 31 Appendix 1... 25

Figure 32 Appendix 2... 26

</div><span class="text_page_counter">Trang 16</span><div class="page_container" data-page="16">

Researchers have already applied machine learning to a large number of data projects, and one of the most often used methods is random forest, a common supervised machine learning approach for classification and regression problems (Sruthi ER, ). The goal of such a model is to forecast future results in a variety of areas, including economics, business, and sport (Rachel, 2021). As a result, this approach is often used to develop models that use certain features to predict

</div><span class="text_page_counter">Trang 17</span><div class="page_container" data-page="17">

1.2 Motivations

The foundations of high levels of transparency in the real estate sector include strictly enforced laws and regulations, high-quality, easily accessible market information and performance benchmarks, clear and fair practices, and high professional standards. To fulfill its role and operate efficiently, the real estate sector needs to be highly transparent. These foundations enable governments to operate efficiently, bringing long-term benefits to local communities and the environment, while helping businesses and investors to make decisions with confidence (Jeremy, 2018).

When people decide to purchase a home, they search for one that fits all of their specifications and is affordable. With the aid of machine learning, we can estimate home prices and judge whether a particular home is a good purchase or could be resold at a higher price. In this work, I make housing price predictions for King County, Washington (WA). Predictive algorithms for home prices in regions like King County are complicated and difficult to apply. Real estate sales prices in King County may be affected by a number of independent factors, and some characteristics, such as size, location, and housing area, can influence the price significantly.

1.3 Objectives

There are a few key goals in this work that I am concentrating on:

</div><span class="text_page_counter">Trang 18</span><div class="page_container" data-page="18">

▪ What impact does the number of bathrooms (bathrooms) have on the price of a home?

▪ What effect does the grade of the house (grade) have on the price?

▪ How does the price of a house change depending on the square footage of the home excluding the basement (sqft_above)?

▪ How does the indoor living space of the 15 nearest neighboring homes (sqft_living15) affect home prices?

</div><span class="text_page_counter">Trang 19</span><div class="page_container" data-page="19">

To address the questions raised in this chapter, I will first present the dataset. Extracting information from raw data involves several stages, starting with data collection. Figure 2 below summarizes these stages.

<small>Figure 2 The summary of methodology </small>

1.4 Summary

I described my work and laid out the project's goals in the first chapter. The remaining components of this work are a dataset introduction, my approach and findings, and an application demo.

2 Related works and dataset

2.1 Related works

The researchers (Madhuri et al., 2019) used a variety of techniques, including multiple linear regression, ridge regression, LASSO regression, elastic net regression, and gradient boosting. The authors of that study set out to compare these methodologies and gauge how much model error is introduced by

</div><span class="text_page_counter">Trang 20</span><div class="page_container" data-page="20">

each. The findings demonstrate that multiple regression is one of the most effective models for forecasting home prices since it has a relatively low error statistic.

The authors of another study (Rahadi et al., 2015) group the elements that influence home pricing into three categories: physical condition, concept, and location. A home's physical qualities are those that are visible to the naked eye, such as its size, number of bedrooms, the presence of a kitchen and garage, the presence of a garden, the size of the lot and adjacent structures, and the age of the house. Conceptual characteristics, on the other hand, are ideas that developers use to attract purchasers, such as a minimalist home, a healthy and eco-friendly environment, or an upscale location. A house's price is greatly influenced by its location. This is because

</div><span class="text_page_counter">Trang 21</span><div class="page_container" data-page="21">

the location affects the current land price (Xiao-zhu and Ling-wei, 2013). Furthermore, the location determines how convenient it is to reach family-friendly entertainment options such as malls, food destinations, or places with scenic views. Access to public amenities like schools, campuses, hospitals, and health centers also depends on the location (Kisilevich et al., 2013). Research has shown that these characteristics have a significant impact on home prices.

In conclusion, a lot of research has been done on predicting home values with various machine learning techniques and models. For my project, I will develop models and make predictions using both linear regression and multiple regression. The project focuses on King County, Washington, United States. I will consider every feature in this dataset and evaluate whether it helps to build a strong model.

2.2 Dataset

2.2.1 Data collection

I obtained the data from Kaggle (Lemsalu, 2017). The dataset includes home sale prices in King County, Washington, from May 2014 to May 2015. The raw dataset has 21 columns and more than 21,000 entries. The price column is the dependent variable, and all other columns, aside from id and date, are independent features. The head of the raw dataset is shown in Figure 3.

</div><span class="text_page_counter">Trang 22</span><div class="page_container" data-page="22">

In Figure 3, price is the continuous dependent variable of this study, and the other factors are independent variables.

</div><span class="text_page_counter">Trang 23</span><div class="page_container" data-page="23">

2.2.2 Description dataset

• Id: the house's individual identification number

• Date: the date when the house was sold

• Price: the home's price

• Bedrooms: the number of bedrooms

• Bathrooms: the number of bathrooms

• Sqft_living: the home's square footage

• Sqft_lot: the lot's square footage

• Floors: the number of floors

• Waterfront: whether the house has a waterfront view

• View: whether the house has a view

• Condition: the home's overall condition on a scale of 1 to 5

• Grade: the dwelling unit's overall grade on a scale of 0 to 10

• Sqft_above: the living area of the home in square feet, excluding the basement

• Sqft_basement: the basement's living area in square feet

• Yr_built: the year the house was built

• Yr_renovated: the year the house was renovated

• Zipcode: the home's zip code

• Latitude: the latitude coordinate of the house

• Longitude: the longitude coordinate of the house

• Sqft_living15: the interior living space, in square feet, of the homes of the 15 nearest neighbors

• Sqft_lot15: the square footage of the land lots of the 15 nearest neighbors

2.3 Summary

In this section, I reviewed several related studies that employ different approaches, which helps in choosing the method that works best. I also stated which values in the data are dependent and independent, and how many rows and columns there are in total.

</div><span class="text_page_counter">Trang 24</span><div class="page_container" data-page="24">

I also defined the meaning of each column in the raw dataset.

3 Proposed model

3.1 Correlation

In essence, correlation measures the strength of the linear relationship between two variables (Hauke and Kossowski, 2011). The correlation coefficient formula (David Groebner, 2017) is shown in Figure 4.

</div><span class="text_page_counter">Trang 25</span><div class="page_container" data-page="25">

<small>Figure 4 Correlation </small>
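The figure itself is not reproduced in this copy. For reference, the standard textbook form of the sample correlation coefficient is given below. This is an assumption about what Figure 4 shows, with x and y as the two variables and bars denoting sample means:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```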

The formula above is known as the Pearson product-moment correlation. The pattern of a scatter plot, as illustrated in Figure 5 below, indicates whether two variables are correlated:

<small>Figure 5 Two-variables Relationships</small>

The correlation coefficient, or r, can be positive or negative, with a perfect

</div><span class="text_page_counter">Trang 26</span><div class="page_container" data-page="26">

positive correlation being +1.0 and a perfect negative correlation being −1.0. There is no linear correlation between the x and y variables if r = 0. The correlation is perfect when all the data points in the scatter plot fall along a straight line. The stronger the linear relationship between the two variables, the further r deviates from 0.0. The direction of the relationship is shown by the sign of the correlation coefficient (David Groebner, 2017).
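As a rough illustration of how r behaves, the sketch below computes the Pearson coefficient in plain Python. The function name `pearson_r` and the toy `grade` and `price` values are hypothetical and are not taken from the King County dataset.

```python
import math


def pearson_r(x, y):
    """Sample Pearson product-moment correlation of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-deviations and sums of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy values (assumption): price rises with grade.
grade = [5, 6, 7, 8, 9]
price = [200_000, 260_000, 310_000, 405_000, 480_000]
r = pearson_r(grade, price)
print(f"r = {r:.3f}")  # close to +1: strong positive linear relationship
```

Points that lie exactly on a rising line give r = +1, and points on a falling line give r = −1, which matches the description above.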

</div><span class="text_page_counter">Trang 27</span><div class="page_container" data-page="27">

<small>Figure 6 Correlation between Two Variables</small>

3.2 Linear regression

The fundamental equation for simple linear regression (David Groebner, 2017) describes the relationship as follows, where x is the independent variable and y is the dependent variable (the outcome):

<small>Figure 7 Linear regression </small>
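A minimal sketch of fitting such a line by least squares, assuming the usual form y = b0 + b1·x. The function name `fit_simple_linear_regression` and the toy square-footage and price values are hypothetical and only illustrate the mechanics.

```python
def fit_simple_linear_regression(x, y):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    b1 = sxy / sxx               # slope: change in y per unit change in x
    b0 = mean_y - b1 * mean_x    # intercept: the line passes through the means
    return b0, b1


# Hypothetical toy data (assumption): price = 50,000 + 100 * sqft exactly.
sqft = [1000, 1500, 2000, 2500]
price = [150_000, 200_000, 250_000, 300_000]
b0, b1 = fit_simple_linear_regression(sqft, price)
predicted = b0 + b1 * 1800  # predicted price for an 1,800 sq ft home
```

On real data the points will not lie exactly on a line, so the fitted coefficients describe the best linear approximation rather than an exact rule.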

3.3 Multiple regression

In this project, I use multiple regression to forecast house prices from several features of the dataset.

</div><span class="text_page_counter">Trang 28</span><div class="page_container" data-page="28">

Here is the equation for multiple regression (David Groebner, 2017):

<small>Figure 8 Multiple regression </small>
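The figure is not reproduced in this copy. The standard textbook form of the multiple regression model with k independent variables is given below. This is an assumption about what Figure 8 shows, with β denoting the population coefficients, ε the error term, and b the estimated coefficients:

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon
\qquad\text{estimated by}\qquad
\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k
```

With k = 1 this reduces to the simple linear regression equation of Section 3.2.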

3.4 R-squares and Adjusted R-squares

The coefficient of determination R^2 and the adjusted R^2 are probably the most frequently used statistics in regression to assess how well a model fits the data. These

</div><span class="text_page_counter">Trang 29</span><div class="page_container" data-page="29">

statistics indicate how much variation in the response is explained by the model (Akossou and Palm, 2013).

<small>Figure 9 R-squares and Adjusted R-squares 1 </small>

The multiple coefficient of determination is a statistical measure of how well the regression line represents the actual data points. R-squared, in other words, shows how closely the data match the regression model. R-squared values normally range from 0 to 1, that is, from 0% to 100%. A negative R-squared value indicates that the model performs worse than simply predicting the mean of the target (Chicco et al., 2021). As an illustration, if the R-squared value is equal to 0.8, then the independent variables account for 80% of the variation in the target variable. The better the model fits, the greater the R-squared score.

Adjusted R-squared measures the percentage of variance explained by only those independent variables that genuinely help explain the dependent variable. Unlike R-squared, which never decreases when a variable is added, adjusted R-squared rises only when the added independent variable has a real impact on the dependent variable.
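The two statistics can be sketched in a few lines of Python, assuming the common definitions R^2 = 1 − SSE/SST and adjusted R^2 = 1 − (1 − R^2)(n − 1)/(n − k − 1), where n is the number of observations and k the number of independent variables. The function names and the toy values below are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SSE / SST."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))  # unexplained variation
    sst = sum((a - mean_y) ** 2 for a in y_true)             # total variation
    if sst == 0:
        raise ValueError("R-squared is undefined when y_true is constant")
    return 1 - sse / sst


def adjusted_r_squared(r2, n_samples, n_features):
    """Penalise R-squared for the number of independent variables used."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need more samples than features plus one")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical toy values (assumption), not results from the report.
actual = [10.0, 12.0, 14.0, 16.0, 18.0]
predicted = [11.0, 11.0, 14.0, 17.0, 17.0]
r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n_samples=len(actual), n_features=1)
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

Because of the penalty term, adjusted R^2 is always at most R^2 for a model with at least one independent variable, and a model that predicts worse than the mean yields a negative R^2.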

</div>