A big data predictions system using similarity method
Weather forecasting is, from a scientific point of view, one of the most complex problems of atmospheric physics. This thesis gives a general characterization of a big data forecasting system based on the similarity method and discusses features of using big data for digitalization.
The world is on the threshold of the era of big data, and people encounter it daily. The relevance of this work stems from the fact that today's realities unfold in the context of Big Data, which opens up new opportunities for almost every sphere of public life. The problem is the complexity of processing such volumes of data; a large number of devices and programs have been created for this purpose, which in turn have given rise to entire industries.
Weather prediction is, from a scientific point of view, one of the most difficult tasks of atmospheric physics. Various forecasting methods exist, but so far none of them provides a fully accurate forecast. There are also no effective, officially adopted methods for long-range forecasting (for a year ahead, for example). Labor-intensive research is still required in this area.
This paper explores the main features of the Big Data technology area and modern weather forecasting cases. The result is an MVP of an algorithm that, as a first step, tries to predict the weather for the next seven days based on one feature (daily temperature) extracted from a complex, loosely structured dataset.
ACKNOWLEDGEMENT
I would like to particularly thank Alexandr Gorbunov, my supervisor at National Research University Higher School of Economics, for his continuous direction, many useful suggestions and constructive feedback, which have enabled me to complete this paper.
I would like to express my sincere thanks to the whole faculty for all the knowledge and opportunities provided to understand how to cope with complex tasks in a new direction.
1. Introduction
Society's long-standing economic interest in the weather, especially in its future state, is a necessity caused by the dependence of human activity on the external environment. This dependence increased as the world community developed, the population grew, and the territory in use expanded. As time went on, the weather became an increasingly "fierce" opponent of people's creative activity. Therefore, the development of this field is in constant motion.
Big Data is one of the key tools of digitalization. Its use in public administration and business began around 2010, but the relevance and possibilities of this technology only increase over time. The reason is that the amount of information generated by humanity is growing rapidly, and to use it effectively, an ever larger number of users have to be involved in the analysis and processing of big data. The technology also helps to develop various spheres of public life and affects the changing structures of almost all institutions of society. Thus, an increase in the number of data sources and in the total amount of information received allows us to understand the real picture of the world.
The need for big data in forecasting is evident from the history of both areas: as they developed, they advanced not only themselves but also everything around them. Over time, working methods have constantly changed and opened up new possibilities for using and processing the data obtained. To contribute to the development of both of these important areas, this work considers a weather forecasting approach based on the similarity method. With further development and integration into a larger and more complex system, it can increase the reliability of the forecast, and it can also support smaller projects as one of the factors used for additional analysis of the current meteorological situation.
1.1 Problem definition
Over the past decades, the world meteorological community has made significant advances in the development of numerical weather forecasting technologies. However, it is not possible to completely eliminate errors in weather forecasts. Today, automated predictive technologies are not able to predict certain weather events. This is due to the fact that many weather phenomena, including dangerous phenomena, have a local character and a complex nature of formation, which is currently difficult to describe formally in order to fully automate the forecast with an acceptable level of success.
However, the growing volume of stored weather data and improvements in its quality allow us to analyze causal relationships between weather events and conditions further back in history. At the same time, forecast quality is affected not only by the amount of data, but also by the complexity and breadth of the systems that perform the analysis and the subsequent forecast.
The variety of methods in use today together makes up complex forecasting systems. Some researchers have identified a link between weather conditions, in particular temperature, and historical data. This led to the formation of the similarity method, which states that if the pre-forecast period is similar to an analog found in existing data, then the forecast period will be similar to the week that followed that historical analog. The main problem addressed in this work is the implementation of the similarity method and the search for a suitable database for it.
1.2 Research objective
There is an existing approach that resembles classical time series calculation, but in the present interpretation there is no ready-made basis for computing a similarity coefficient for the required temperature period. The proposed method follows a fundamentally different logic, and research in previous years has shown its effectiveness. Similarity in time series forecasting is usually based on a long-term relationship that eventually leads to the predicted period; in this iteration, however, historical data is used as a parallel metric against which the most similar stretch must be found. The main goal is to offer a working algorithm based on the available data sets and methods of processing them, that is, a minimally working product grounded in mathematical justification. At the end of the work, further steps to improve this project are suggested.
1.4 Research significance
Theoretical value
1. In this study, we examine the applicability of various software for the collection and processing of big data. Two cases, one implemented in business and one in the field of weather forecasting, were investigated and analyzed.
2. The proposed method of weather forecasting by similarity can help to improve general time series forecasting systems as the number of characteristics is further increased.
Practical significance
1. Based on the results of this work, you can build a weather forecast for a short-term future period or improve current similarity-based forecast algorithms using the proposed methodology.
2. The method under study can potentially be useful for other areas of time series forecasting if a data set with the same properties and characteristics as in meteorological problems is available.
1.5 Thesis structure
The first chapter of the thesis is an introduction to the problem. It contains the problem statement, research goals, and the scientific significance for both theoretical and practical application.
The second chapter is an overview of cases that currently use big data and a review of modern software for data collection and analysis tasks.
Forecasting methods in general are discussed in the third chapter, which provides a comprehensive description of existing ways to collect meteorological data and gives an example of forecasting using SARIMA.
The fourth chapter explains the concept of the similarity method and how it works.
The method implementation is described in the fifth chapter.
The conclusion of the thesis is given in the sixth chapter.
2. Big data implementations in cases of modern reality
2.1 Variations of the definition of "big data"
The problem of defining, understanding, and tracing the history of the term "big data" directly affects the possibility of using the methods and tools offered by this broad field in humanities research. The paradox is that, despite belonging to the exact sciences, the digital environment and the IT sphere, the concept of "big data" does not have a clear definition. Many authors, organizations, and communities interpret the concept in different ways. Below are some variations of this term.
In June 2013, the Oxford English Dictionary (OED) added a definition of the term "big data": "Computing (also with capital initials) data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges; (also) the branch of computing involving such data."
Another interesting definition comes from www.lexico.com, a site which focuses on current, relevant word meanings and practical usage: "Extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions." In this interpretation the technological term is closely related to the socio-humanitarian component; that is, relative to the original definition, the concept has evolved towards humanization, and its content shifts from "serious difficulties" to practical opportunities for analysing human activity.
Researcher, entrepreneur and author of the blog whatsthebigdata.com Gil Press, in the publication "A Very Short History of Big Data" on Forbes.com, traces the big data problem back to the appearance of the term "information explosion" in the Oxford English Dictionary in 1941. This was followed by the first attempts to quantify the growth rate of data volumes. Gil Press also records the first meaningful use of the term "big data" in October 1999 in the ACM Digital Library, in an article by NASA researchers on the problems of information visualization.
In addition, in 2001, Doug Laney published a seminal study for Meta Group that identified three key parameters of big data: volume, velocity, and variety (the so-called three "V's"). Big data is characterized by its gigantic size (there is no exact threshold), by the high rate at which new data is generated and arrives, and by its heterogeneity and disorder. These three "V's" are recognized by practically all experts in one form or another, which points to an early appearance of the concept and to its subsequent evolution.
Everyone knows that the essence of Big Data technologies is working with huge amounts of data (which follows from the term itself). But volume alone does not yet define Big Data. Analysts have come up with a succinct formula: they believe that the definition of a Big Data project should include seven important characteristics, the "7 V's": Volume, Velocity, Variety, Veracity, Variability, Visualization, and Value. Each "V" is important for understanding the overall picture.
The first three V's are the least problematic: Volume, Velocity, and Variety. Indeed, who would argue that Big Data is primarily about volume? The volume of data is growing exponentially: for example, aircraft annually generate 2.5 billion TB of data from sensors installed in engines. At the same time, data is constantly updated and new data is generated, and the speed of this update (Velocity, the second "V") also matters for considering data "big". For example, every minute almost 2.5 million queries are made to the Google search engine worldwide. The challenge of Big Data projects is to cope with the tremendous speed at which data is created and to analyze it in real time.
The third "V" is Variety. This means that Big Data projects must handle data in a variety of formats: structured and unstructured data, text, graphics, corporate email or social media data, and even video. Each of these data types requires different types of analysis and appropriate tools. Social media can help brand owners analyze customer sentiment, and sensor data can provide information about how the product is most often used, so that this knowledge can be applied to improve it.
Until recently, three V's were enough. But everything changes, including approaches to the definition, so analysts added four more "V's" to avoid misunderstandings: Veracity, Variability, Visualization, and Value. Let's look at each of these points.
Veracity (reliability): this characteristic is extremely important, because any analysis will be completely useless if the data turns out to be unreliable. It is essential to make sure the data is sound, because inaccuracy can lead to incorrect decisions. The simplest example is contacts with false names and inaccurate contact information.
Variability: a newer trend in the field of Big Data. The point is that the meaning of the same data may differ depending on the context; for example, the same words on Twitter may have different meanings and reflect different moods, and all these nuances must be considered. To perform proper sentiment analysis, algorithms must be able to understand the context and decipher the exact meaning of a word in that context.
Visualization: a necessary part of analysis, because it is visualization that makes big data accessible to human perception. Visualizations of large volumes of complex data are much more efficient and understandable for a person than spreadsheets and reports full of numbers and formulas. Of course, visualization within Big Data does not mean building ordinary bar graphs or pie charts: complex charts may be built that include many variables, yet they should still remain clear and readable.
Value: here we are talking about getting the most out of the results of big data analysis. What matters is how you use this data and whether you can turn your organization into an advanced company that relies on insights derived from data analysis to make decisions.
However, even these seven "V's" are not enough to capture the essence of Big Data: all seven characteristics must be applied to a complex task, usually with several variables and a non-trivial condition. A small conclusion: Big Data could not be ignored, and the Prognoz Platform is now developing the technologies necessary for working with big data: support for Hadoop, integration with software and hardware complexes, and integration with SAP solutions [].
2.2 Analysis of modern methods of data processing
2.2.1 Implementation in the Philip Morris case
Philip Morris International (PMI) works in two directions: the sale of cigarettes and the sale of RRP products. Each of the areas within the company is still quite strongly separated in terms of personnel and data storage organization. The company has various channels for creating its products, a huge number of employees, and a variety of approaches to data processing, analysis, and storage.
Figure 1. Data processing in PMI
The primary data source in PMI company is Salesforce, through which all transactions are made with the customer. In this system, data is entered for all points of sale, customers, employees, SMS, transactions and other objects. Data from Salesforce flows into the data lake on Amazon S3, from where the requested information moves to the clusters on Redshift.
The first AWS product they use is Amazon S3, an object storage service that offers industry-leading scalability, data availability, security, and performance. S3 lets you organize data and configure finely-tuned access controls to meet specific business, organizational, and compliance requirements. PMI applies ETL processes to the S3 data using standard SQL expressions. In summary, the business process is as follows: data from Salesforce lands in S3, then the needed information is moved by ETL jobs or queries into Redshift, from where it is consumed via SQL queries by BI instruments or other products built for analysis.
Here is a schema of our clusters in Redshift.
Figure 2. Number of Clusters.
In this picture (Figure 2) we can see that the company's databases are divided into Production (PROD) and Development (DEV) clusters. There are also some differences between these clusters (Figure 3).
Figure 3. Technical features of the existing clusters.
The main differences are the number of nodes and their types. It is a good sign that PMI uses the newer node types (ds2, dc2). For example, PROD cluster 1 has ds2.xlarge nodes, which offer higher performance than the previous generation (ds1). The second cluster uses another type (dc2), which differs in its storage: DS2 node types are optimized for large data workloads and use hard disk drive (HDD) storage, while DC2 nodes are optimized for performance-intensive workloads. Because they use solid state drive (SSD) storage, DC nodes deliver much faster input/output than DS node types, but provide less storage space. The number of nodes depends on the required productivity of a given cluster. The large number of nodes on the second PROD cluster gives a very good working experience: many reports use data on the second cluster, and it is heavily loaded.
The third column represents the number of virtual CPUs for each node. Here I can say that our second production cluster is really important, and PMI is ready to pay for high performance. The first PROD cluster is usually used by different parts of the company to store primary, raw data. On the second cluster we put correctly and logically assembled tables. The DEV cluster is used for various development tasks, and I have not used it at all.
2.2.2 Implementation in The Weather Channel case
Weather sets the mood, which is a very important aspect of consumer behavior: 84% of customers make impulsive (spontaneous) purchases, in online stores (40%) and offline (60-80%). Therefore, marketers around the world are trying to identify patterns of change in sales levels depending on, among other things, external factors such as the weather.
The Weather Channel, an American pay-TV channel, tracks the impact of weather on the emotional state of its viewers in order to suggest to advertisers the most effective ways and moments for sending messages, taking into account trends and the geolocation of customers. As a result, more than 100 TB of data is collected every day, which allows the weather forecast to be updated every 15 minutes with a resolution of 500 square meters in some regions. The effectiveness of this approach was confirmed by a joint marketing campaign of the brands Pantene, Walgreens and The Weather Channel. Based on Big Data from The Weather Channel and its own predictive Machine Learning models, Pantene and Walgreens advertised a product for curly hair during periods when humidity reached its limit. Thanks to this strategy, the Walgreens pharmacy chain saw sales of Pantene products increase by 10% in July and August, while other hair care products increased by 4%. Below is the data processing flow when using Big Data for weather forecasting.
Figure 4. Data processing of meteomarketing (Pantene, Walgreens and The Weather Channel).
Technically, the architecture of a Big Data system based on collecting weather data from various sources (including mobile devices) and analytical processing of the received information using predictive Machine Learning models can be implemented as follows:
· Apache Kafka provides continuous data collection and aggregation from weather stations, mobile devices, and aircraft;
· Spark Streaming gets information from Kafka topics and builds predictive Machine Learning models based on this data using the Spark MLlib component;
· The results of the analytics are transmitted to BI systems and data marts (dashboards) for making management decisions, and are also sent to end users (passengers) in the form of marketing offers and recommendations for profitable hotel reservations in the event of a flight cancellation.
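As an illustration of this pipeline, here is a minimal sketch of how such a stream could be consumed in Python with Spark Structured Streaming (the DataFrame-based successor to the DStream API mentioned above). The broker address, topic name, and message schema are assumptions for the example, not details of The Weather Channel's actual system.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

# Requires the spark-sql-kafka package on the classpath.
spark = SparkSession.builder.appName("weather-streaming").getOrCreate()

# Hypothetical schema of a weather observation message.
schema = StructType([
    StructField("station_id", StringType()),
    StructField("observed_at", TimestampType()),
    StructField("temperature", DoubleType()),
    StructField("humidity", DoubleType()),
])

# Continuously read raw observations from a Kafka topic (names are assumptions).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "weather-observations")
       .load())

# Kafka delivers bytes; parse the JSON payload into typed columns.
observations = (raw
                .select(F.from_json(F.col("value").cast("string"), schema).alias("obs"))
                .select("obs.*"))

# Simple per-station aggregation over 15-minute windows; a fitted Spark MLlib
# model could be applied to this stream in the same way via model.transform().
windowed = (observations
            .withWatermark("observed_at", "30 minutes")
            .groupBy(F.window("observed_at", "15 minutes"), "station_id")
            .agg(F.avg("temperature").alias("avg_temp"),
                 F.avg("humidity").alias("avg_humidity")))

query = (windowed.writeStream
         .outputMode("append")
         .format("console")
         .start())
query.awaitTermination()
```

In a production setting the console sink would be replaced by a sink feeding the BI systems or marketing services described above.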
2.2.3 Cases software comparison
To understand the difference in the software used, it is necessary to take into account the type of activity in each of the cases and the purpose for which these programs are used.
Apache Spark is a data processing engine. It can:
· process batch and streaming workloads in real-time
· write applications in Java, Scala, Python and R
· use pre-built libraries for building those apps
Figure 5. Spark Architecture.
Amazon Redshift is an analytical database. With Redshift you can:
· build a central data warehouse unifying data from many sources
· run big, complex analytic queries against that data with SQL
· report and pass on the results to dashboards or other apps
Redshift is a managed service provided by Amazon. Raw data flows into Redshift (the "ETL" step), where it is processed and transformed at a regular cadence ("transformations" or "aggregations") or on an ad-hoc basis ("ad-hoc queries"). Loading and transforming data in this way is also referred to as "data pipelines".
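To make this loading-and-querying pattern concrete, here is a minimal Python sketch using psycopg2 against a Redshift cluster. The cluster endpoint, table, bucket, and IAM role names are placeholders for illustration, not PMI's actual objects.

```python
import psycopg2

# Placeholder connection details; a real cluster endpoint and credentials are required.
conn = psycopg2.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password="***",
)
conn.autocommit = True
cur = conn.cursor()

# "ETL": bulk-load raw CSV files from an S3 data lake into a staging table.
cur.execute("""
    COPY staging.sales_raw
    FROM 's3://example-data-lake/salesforce/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS CSV
    IGNOREHEADER 1;
""")

# "Transformation / aggregation": build a clean table for BI dashboards.
cur.execute("""
    CREATE TABLE IF NOT EXISTS marts.daily_sales AS
    SELECT pos_id, CAST(sold_at AS DATE) AS sale_date, SUM(amount) AS total_amount
    FROM staging.sales_raw
    GROUP BY pos_id, CAST(sold_at AS DATE);
""")

# "Ad-hoc query": an analyst asks a one-off question with plain SQL.
cur.execute("""
    SELECT sale_date, SUM(total_amount)
    FROM marts.daily_sales
    GROUP BY sale_date
    ORDER BY sale_date;
""")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```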
Figure 6. Amazon Redshift Architecture.
Figure 7. Comparison of Spark and AWS Redshift Software.
The difference between Spark and Redshift lies in the way they process data and in how much time it takes to produce a result.
· Spark: you can do real-time stream processing, i.e. you get a real-time response to events in your data streams.
· Redshift: you can do near-real time batch operations, i.e. you ingest small batches of events from data streams, to then run your analysis to get a response to events.
To sum up, Spark helps you see the situation at a given moment and, with ML models, can characterize the explored case in near real time. But if you want to have more signals for your decision, for better predictability, it is better to use Redshift.
The main difference between the chosen cases is the functions that are needed. In the weather case, the data first needs to go through ML processes and is then collected and analyzed; in PMI, the data is first collected and sorted, and only after this step can the structured data be used for ML processes in the company or for BI analysis. Therefore, each system was selected specifically for its task. If you need to process data quickly and get a direct response to the current situation, it is better to choose Spark. In a situation where the data covers a larger range of characteristics, you initially need methods for long-term and uninterrupted data storage, then ETL processes to sort and upload the key information to the cluster, and only at the last step is the data processed by the rest of the company's processes that have access to the data mart.
3. Chapter three: weather forecasting
3.1 Introduction
The atmospheric air of the Earth is in constant motion. We feel its movement in the wind, which carries heat from the equator to the poles and moisture from the sea to the land, where it falls in the form of life-giving rain.
The only source of energy that causes the movement of the atmosphere is the sun.
Uneven heating of the earth's surface, which in turn heats the air, creates a difference in atmospheric pressure. The cold air is denser, so it goes down and creates a high pressure area. Wind is the movement of air from high-pressure areas to low-pressure areas.
Thus, a certain state of the atmosphere, called the weather, is constantly being formed. In turn, people try to monitor weather processes by means of meteorology and to observe global climate change affecting local weather in the regions.
A weather forecast is a scientifically based assumption about the future state of the weather at a particular point or region for a certain period. It is compiled (developed) by public or commercial meteorological services on the basis of meteorological methods. Forecasts are classified by lead time as follows:
* extra-short-term -- up to 12 hours;
* short-term-- from 12 to 36 hours;
* medium-term-- from 36 hours to 10 days;
* long-term-- from 10 days to the season (3 months);
* extra-long-term -- more than 3 months (a year, several years).
Finally, having become acquainted with the basic knowledge on this topic, it is necessary to look at the methods on which weather forecasting is currently based, as well as at attempts by some enthusiasts to solve this problem.
3.2 Analysis of existing methods and their probability
In modern meteorology, three large groups of weather forecasting methods can be distinguished:
· Synoptic weather forecasting,
· Numerical methods, and
· Statistical methods.
Synoptic weather forecasting
"Synoptic" means that the observation of different weather elements refers to a specific time of observation. Thus, a weather map that depicts atmospheric conditions at a given time is a synoptic chart to a meteorologist. In order to have an overall view of the changing weather pattern, a modern meteorological center prepares a series of synoptic charts every day.
Numerical methods
The numerical method involves a lot of mathematics. Modern weather forecasting now uses the techniques of Numerical Weather Prediction (NWP). This method is based on the fact that the gases of the atmosphere follow a number of physical principles. A series of mathematical equations is used to develop theoretical models of the general circulation of the atmosphere. These equations are also used to specify changes in the atmosphere as time passes. However, even the most simplified mathematical models involve an incredibly large number of calculations. The use of such models became possible only after the advent of high-speed electronic computers.
Statistical methods:
Statistical methods are used along with the numerical weather prediction. This method often supplements the numerical method. Statistical methods use the past records of weather data on the assumption that future will be a repetition of the past weather.
The main purpose of studying the past weather data is to find out those aspects of the weather that are good indicators of the future events. After establishing these relationships, correct data can be safely used to predict the future conditions.
3.3 Analysis of current weather forecasting data sources and ways of getting data
In order to implement a short-term weather forecasting project, relevant data is required. There are many different sources that can provide a dataset in a form convenient for a given project. There are two main ways to organize data collection and processing: loading data via an API, or a one-time upload to a server followed by targeted requests for data in a specific format.
A more expensive way is to export a full dataset and then load it. One such resource is the site of the National Centers for Environmental Information [18].
Figure 8. Data Search Panel of NOAA
Figure 9. Example of data inferiority
There are several problems:
· Data is exported as a CSV file, which is not a convenient format for transmitting values;
· If a long period is selected, the export does not guarantee complete filling; gaps in the values were revealed;
· It is not possible to check the completeness of the file in advance, only once it has finally been received;
· When exporting, it is also impossible to select the unit of measure. By default it uses Kelvin, which is inconvenient when working with Celsius.
Another data source is DataClim [25].
Figure 10. Historical Climate Data Panel.
Figure 11. Example of one zip file.
The main problem with this dataset is the large amount of data per day from various geopositions and the use of multiple time marks. The current task requires daily temperatures, either measured once a day or calculated as an average value. The provided data set does not make it possible to accurately build a chain of consecutive single-day temperatures in a format convenient for processing.
Taking into consideration that the basis of this thesis is the development of a basic working algorithm, it is more convenient to use a site's application programming interface (API), which provides data in the desired format.
The API is a contract that the program provides [26].
An API is a program interface for interaction between systems that allows you to:
· Get access to the company's business services
· Exchange information between systems and applications
· Simplify interaction between companies, partners, developers, and customers
The most common in the world wide web are the so-called Web APIs, which are used as a platform for creating HTTP services. Among them are:
· RPC (Remote Procedure Call) - remote procedure call,
· SOAP (Simple Object Access Protocol) - simple object access Protocol,
· REST (Representational State Transfer) - transfer of the representation state.
The API can be divided by the type of service that has them: applications, websites, and operating systems. For example, most operating systems (Unix, Windows, MacOS, etc.) have an API that allows you to program services for these systems.
The API can also be divided by access type:
· Internal APIs (available to the company's internal developers and employees, used to optimize workflows and reduce costs),
· Partner APIs (available to business partners and consumers of the product or service, used to optimize processes and development),
· Public APIs (available to everyone, used to create new services and promote the existing direction).
There are many advantages to using the API:
· The main advantage of working with the API is saving time when developing your own services. The programmer gets ready-made solutions and does not need to spend time writing code for functionality that has been implemented for a long time.
· The API may take into account nuances that a third-party developer may not take into account or simply not know,
· The API gives applications a certain consistency and predictability - the same function can be implemented in different applications using the API in a way that is clear and familiar to all users.
· The API gives third-party developers access to closed services.
But there are also disadvantages:
· If changes and improvements are made to the main service, they may not be immediately available in the API,
· Ready-made solutions are available to the developer, they don't know how they are implemented or what the source code looks like,
· The API is primarily intended for general use, so it may not be suitable for creating any special functionality.
The most popular weather data source is "OpenWeather" [23].
Figure 12. OpenWeather Selection type of dataset Panel.
The site, which is quite popular among developers, provides data for a long period of time for various cities and locations.
OpenWeatherMap is an online service that provides a paid (there is a functionally limited free version) API for accessing data about the current weather, forecasts, and various applications. Archived data is only available on a commercial basis. Official weather services, data from airport weather stations, and data from private weather stations are used as data sources.
The information is processed by OpenWeatherMap, and then the weather forecast and weather maps are built based on the data. The main idea of the OWM service is to use private weather stations that help improve the accuracy of the original weather information and, as a result, the accuracy of weather forecasts [24].
Figure 13. OpenWeather Data Selection Panel.
Figure 14. Example of JSON file from OpenWeather.
The problem is that this service does not have sufficiently extensive data for Russia. In addition, further development of the project will require increasing the number of locations, which leads to an increase in costs: 10 dollars per location. Therefore, a search was conducted for a resource that freely provides daytime temperature data for various time periods and locations. "Meteostat" was chosen as a result.
Figure 15. Main page of Meteostat.
This is a free online service that provides weather and climate statistics [22]. The database combines current weather observations, historical statistics, and long-term climate data. All data available for a weather station relates to the exact location of the corresponding station and was measured on the spot [21].
Figure 16. List of Meteostat sources.
Figure 17. Russian institution of weather dataset.
Unlike many other weather apps, Meteostat aims to provide detailed statistics on historical weather data. Using its large database, Meteostat tries to answer questions about the occurrence of certain meteorological phenomena against a general trend towards higher temperatures and more extreme weather events. At the same time, the Meteostat platform tries to be a source for inexperienced users, so the interface is maximally adapted for beginners.
Figure 18. Main page of API Meteostat.
Figure 19. Content of JSON file.
After analyzing several sources, the choice fell on Meteostat. Its main advantages are ease of use and the breadth of the data provided [19]. The temperature measurement system also matches the one used in Russia.
3.4 Example of a prediction system using SARIMA
There are many different methods for predicting different metrics. The main and most general task is forecasting time series. Time series forecasting assumes that data obtained in the past helps explain values in the future. It is important to understand that in some cases we are dealing with details that are not reflected in the accumulated data. Different forecasting approaches are often combined to provide the most accurate forecasts, as is the case when forecasting weather conditions.
A time series is a collection of values obtained over a period of time, usually at regular intervals.
If you consider how the value changes from one period to another and how to forecast values, keep in mind that time series data has some important characteristics:
· The base level is usually defined as the average value of a time series. In some forecasting models, the baseline is usually defined as the initial value of the series data;
· A trend usually shows how time series change from one period to another;
· Seasonal fluctuations. Some values tend to increase or decrease depending on certain time periods, such as the day of the week or the month of the year;
· Some prediction models include a fourth characteristic, noise, or error, which refers to random fluctuations and uneven movements in the data. We will not consider noise here.
Autoregressive integrated moving average, or ARIMA, is one of the most widely used methods for single-factor forecasting of time series data. The problem with ARIMA is that it does not support seasonal data, that is, a time series with a repeating cycle.
ARIMA expects data that is either not seasonal or has had the seasonal component removed, for example seasonally adjusted using methods such as seasonal differencing.
There are three trend elements that need to be configured. They are the same as in the ARIMA model; in particular [17]:
· p: trend autoregression order.
· d: trend difference order.
· q: trend moving average order.
Seasonal elements:
There are four seasonal elements that are not part of ARIMA that must be configured; they are:
· P: seasonal autoregression order.
· D: seasonal difference order.
· Q: seasonal moving average order.
· m: the number of time steps in a single seasonal period.
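To make these parameter lists concrete, below is a minimal sketch of such a workflow in Python, assuming daily temperatures in a CSV file, pmdarima's auto_arima for order selection and statsmodels' SARIMAX for fitting. The file name, seasonal period m=7 and forecast horizon are illustrative assumptions, not the exact code behind Figures 20-22.

```python
import pandas as pd
import pmdarima as pm
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Assumed input: a CSV with 'date' and 'temperature' columns covering about two years.
series = (pd.read_csv("daily_temperature.csv", parse_dates=["date"], index_col="date")
            ["temperature"].asfreq("D").interpolate())

# Step 1: let auto_arima search for (p, d, q)(P, D, Q, m); m=7 assumes a weekly cycle.
auto_model = pm.auto_arima(series, seasonal=True, m=7,
                           stepwise=True, suppress_warnings=True, trace=True)
print(auto_model.order, auto_model.seasonal_order)

# Step 2: refit with the selected orders using statsmodels SARIMAX.
model = SARIMAX(series,
                order=auto_model.order,
                seasonal_order=auto_model.seasonal_order)
result = model.fit(disp=False)

# Step 3: forecast the next 30 days.
forecast = result.forecast(steps=30)
print(forecast.head())
```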
The figures below show the result of the auto_arima process.
Figure 20. Building the model with auto_arima, which automatically selects the model parameters.
Figure 21. Building the model with known parameters.
Figure 22. Final prediction for the next month by the model.
Thus, the forecast of a non-stationary series using the seasonal autoregression (SARIMA) method demonstrated another way of using the data loaded for two years.
4. Chapter four: similarity method
4.1 Definition of the required parameter from the data set
This is a kind of statistical forecasting method. The characteristics of any phenomenon within a group of similar ones can be obtained using similarity theory.
To facilitate the search for similarity in a huge array of weather data, it is advisable at the first stage to single out some individual features. To study the characteristics of the weather, the following hypothesis is accepted:
- deterministic component of weather conditions on average for the period;
- random nature of weather changes over a shorter period of time.
This allows the processing of statistical information using the mathematical apparatus of probability theory.
Statistical characteristics of a random process are sufficiently fully reflected by the following indicators:
· mathematical expectation (the average value of the weather parameter X_i over n observations):
M_x = (1/n) · Σ X_i;
· standard deviation (and the variance D_x = σ_x²):
σ_x = sqrt( (1/(n−1)) · Σ (X_i − M_x)² );
· autocorrelation function, reflecting the rate of change of the process (in this case, of the weather):
R_x(T) = (1/(n−T)) · Σ (X_N − M_x)(X_{N−T} − M_x);
· cross-correlation function (covariance), showing the relationship between phenomena (in this case, various weather parameters: night and daytime temperatures, atmospheric pressure, etc.):
R_xy(T) = (1/(n−T)) · Σ (X_N − M_x)(Y_{N−T} − M_y),
where Y_{N−T} is another weather parameter observed at the interval T.
Given these properties and the information available to the authors, in order to reduce the cost of finding analogues in a large amount of weather data, it is advisable to use one of the parameters of the first day of the weekly forecast.
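Purely as an illustration of the statistics listed above, here is a short NumPy sketch; the series values and the lag T are invented examples, and the normalization simply follows the formulas given in this section.

```python
import numpy as np

# Hypothetical daily temperature series (°C).
x = np.array([1.2, 0.5, -0.3, 0.8, 2.1, 1.7, 0.9, 1.4])        # e.g. daytime temperature
y = np.array([-2.0, -2.5, -3.1, -1.8, -0.9, -1.2, -1.5, -1.0])  # e.g. night temperature

# Mathematical expectation (mean) and standard deviation (unbiased, ddof=1).
m_x = x.mean()
sd_x = x.std(ddof=1)

def autocorrelation(series, lag):
    """Autocorrelation of one series at the given lag, per the formula above."""
    s = series - series.mean()
    return np.sum(s[:-lag] * s[lag:]) / (len(series) - lag)

def cross_correlation(a, b, lag):
    """Cross-correlation (covariance) between two weather parameters at the given lag."""
    da, db = a - a.mean(), b - b.mean()
    return np.sum(da[lag:] * db[:-lag]) / (len(a) - lag)

T = 1
print("mean:", m_x, "sd:", sd_x)
print("autocorrelation (T=1):", autocorrelation(x, T))
print("cross-correlation (T=1):", cross_correlation(x, y, T))
```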
4.1.1 The hypothesis about the weather as a random function
The main proposition of probability theory (the central limit theorem) is that the sum of a large number of causal factors gives a normal distribution.
The properties of a random variable are described by the integral distribution function F(x), which reflects the probability that the i-th observation will be less than a certain value X (Figure 23a), and more clearly by the differential function P(x), the distribution density characterizing the probability (frequency of occurrence) of values X_i (Figure 23b).
In measurements, the normal distribution law most often applies:
P(x) = (1 / (σ·sqrt(2π))) · exp(−(x − M_x)² / (2σ²)),
where:
· the expected value (mean) is M_x = (1/n) · Σ X_i;
· the variance is D = (1/(n−1)) · Σ (X_i − M_x)²;
· the standard deviation (SD) is σ = sqrt(D).
Root extraction is a non-linear procedure, and to eliminate the bias of the estimate, the result of (4.5) is multiplied by a correction factor that depends on n (for n = 3…7, K_n = 1.13…1.04).
Figure 23. Integral (a) and differential (b) functions distributions of a random variable.
By the central limit theorem, the distribution of random errors will be close to normal whenever the results of observations are formed under the influence of a large number of independent factors.
The result of estimating statistical characteristics depends on the number of measurements and differs from the hypothetical value for the population. The estimate of the SD itself has an error that also depends on the amount of statistical data (for n = 3…15 the error is about 50…15%). For a parameter under study with unknown mathematical expectation (ME) and standard deviation (SD), their estimates are made from the measurements performed and depend on the number of those measurements.
In practice, even with high requirements for the accuracy of estimates, 30…50 measurements are made. The result of multiple measurements can be interpreted as an increase in the confidence probability for a given error interval.
ME and SD fully characterize the random function with normal distribution law.
There are methods to test the hypothesis:
· on the normal law of distribution;
· uniformity of observations;
· outliers (misses) in the observations.
The probability that an observation result falls within the range of values X1…X2 is equal to the area bounded by the distribution curve over the specified interval:
P(X1 ≤ X ≤ X2) = ∫[X1, X2] P(x) dx.
4.1.2 The similarity of the processes
In assessing the similarity of technical systems, distinguish the so-called classical types of similarity:
· deterministic (physical) similarity;
· stochastic similarity.
The deterministic similarity criterion (a one-to-one correspondence between two objects) is based on the theory of the dimension of physical quantities and reflects the physical similarity of systems in functions, design parameters, manufacturing technology, materials used, failure processes, etc. Physical phenomena, processes or systems are similar if, at similar times and at similar points in space, the values of the variables characterizing the state of one system are proportional to the corresponding values of another system.
The estimate of overall similarity is given by the similarity constants П_i (within which the physical processes are not violated):
П_i = A_i / a_i = П_min … П_max = П = idem,
where A_1, A_2, …, A_n are the parameters of the base technical system;
a_1, a_2, …, a_n are the parameters of the newly created technical systems.
This is illustrated by Figure 24, where the similarity coefficients of some set of parameters have the same values within the limits in which the physical processes are preserved (П_i min … П_i max).
Figure 24. Deterministic similarity
Physical similarity is a generalization of the elementary and visual concept of geometric similarity of sizes Ai and ai, masses M1 and m2 (Figure 25).
Figure 25. An example of a geometric similarity.
The principles of similarity in the stochastic sense are based on the fact that the parameters included in the similarity criteria are random variables, and the similarity criteria themselves are functions of these random variables. The compared technical systems are similar if the products and physical processes occurring in them, as well as the parameters have identical distribution densities, and the similarity criteria, as a function of probability density distribution, are within the boundaries of the confidence interval:
П_i,lower ≤ П_i ≤ П_i,upper, where П_i,upper and П_i,lower are, respectively, the upper and lower bounds for П_i, defined with some confidence probability.
4.2 Similarity: definition and criteria
Characteristics of any phenomenon in a group of similar can be obtained by similarity theory.
Similarity is a one-to-one correspondence between two processes in which the function transforming the parameters characterizing one of them into the parameters of the other is known, and the mathematical description can be transformed into an identical one. This means that the phenomenon under investigation can be obtained from a similar given one by a transformation in which each of its magnitudes changes a certain number of times (a similarity transformation of phenomena).
Deterministic criteria reflect physical similarity in functions, parameters, processes.
Stochastic similarity criteria in a generalized way reflect the presence of random factors, the variation of parameters, etc.
Fuzzy similarity applies in the absence of accurate numerical estimates, when semantic evaluations of phenomena are used. The limit of similarity is the identity of phenomena, which can be estimated by simple comparison. However, to process large amounts of data in long-term forecasting, it is necessary to formalize the procedure for determining similarity.
4.3 The use of covariance and the algorithm of fuzzification
After preliminary selection of an analogue of weather Xi for a certain retrospective period, in order to estimate its proximity to the pre-forecast period Yi, we will use the normalized cross-correlation coefficient (covariance):
Figure 26. Pearson correlation coefficient.
Values with a CCS close to 1.0 indicate a strong correlation; smaller CCS values indicate a weak correlation (Figure 26). Between these extreme values, an estimate can be introduced in terms of the mathematical theory of fuzzy sets (in terms of membership functions) together with a linguistic estimate of fuzziness, that is, a fuzzification procedure. For example, for five terms of linguistic evaluation ("Very Small", "Small", "Medium", "Above Average", "Strong" relationship), the diagram of the rating distribution is shown in Figure 27.
However, the CCS establishes only a relative relationship between phenomena: the covariance values for dependencies 1-3 (Figure 26) are the same and do not reflect the unambiguous relationship 1 and the mutually opposite relationships 2 and 3. That is, to determine the degree of similarity, the absolute ratio of the compared values is also required. It can be judged by the angle of the straight line K_xy in the sector from φ_XY = 0 to π/2, relative to the average value π/4, which can also be represented by membership functions and terms of linguistic evaluation (for example, "Less", "Equal", "More").
Therefore, at the first stage, it is advisable to calculate the value of the covariance and, if it reaches an acceptable level ("Above Average", "Strong"), to estimate the angle of the straight line of the CCS. Next, these two estimates are combined (implication) by the logical connective "AND", which can be performed according to the values of the membership functions and the laws of fuzzy logic.
Figure 27. Fuzzification procedure
The logical conclusion can be made on the basis of a rule base, built from expert knowledge, as used in fuzzy logic. In the most general form, the rule base is presented as structured text:
RULES_Ri: IF (X = A) AND (Y = B) THEN (Z = C) * Fi
where (X = A), (Y = B) - input linguistic variables;
(Z = C) - output linguistic variable;
Fi = [0 ... 1] are the weighting coefficients of the corresponding rules.
For prediction, one of the rules might look like this:
RULE 1: IF K_xy is "Strong" AND φ_XY is "Average" THEN Similarity is "Full"
And we can assume that the weather for the forecast period with a high probability will be the same as it was after a similar period in the past.
Conditions (13) may have a more complex form and consist of many parts connected by AND and OR. When setting or forming a rule base, it is necessary to define the set of fuzzy production rules, the set of input linguistic variables, and the set of output linguistic variables. An input or output linguistic variable is considered to be given (defined) if its base term set with the corresponding membership functions of each term is defined. For weather forecasting there is plenty of historical data for compiling a rule base of any desired composition.
There are also other implication methods for calculating the overall membership function, with further defuzzification of the output results (by moving from membership functions and semantic evaluations to numerical ones).
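A minimal sketch of how RULE 1 could be evaluated in Python is shown below; the triangular membership functions, term boundaries, and rule weight are illustrative assumptions rather than values taken from the thesis.

```python
def triangular(x, a, b, c):
    """Triangular membership function with peak at b and support [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Assumed term for the correlation coefficient K_xy (CCS): "Strong" saturates towards 1.0.
def mu_kxy_strong(k):
    return triangular(k, 0.625, 1.0, 1.375)

# Assumed term for the angle of the K_xy line, in degrees (45° corresponds to "Equal").
def mu_angle_average(phi_deg):
    return triangular(phi_deg, 30.0, 45.0, 60.0)

def rule_full_similarity(k, phi_deg, weight=1.0):
    """RULE 1: IF K_xy is 'Strong' AND phi is 'Average' THEN Similarity is 'Full'.
    The fuzzy AND is taken as the minimum of the membership degrees."""
    return weight * min(mu_kxy_strong(k), mu_angle_average(phi_deg))

# Example evaluation with values of the kind obtained later in Chapter 5.
k, phi = 0.82, 47.7
print("degree of 'Full' similarity:", round(rule_full_similarity(k, phi), 3))
```

The minimum operator is only one common choice for the fuzzy AND; the product is another, and the weights F_i allow individual rules to be strengthened or weakened.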
5. Chapter five: realisation of the forecasting algorithm
5.1 Working process
In general, the process of implementing this method is not complicated. Initially, you need to get a key to connect to the Meteostat API and select the necessary criteria for requesting a JSON file.
Figure 28. Getting Data through the API.
The next step in processing the received data is to read the daily temperatures for the selected period and divide them into lists of 7 elements, one list per week.
Figure 29. Creation of Week list array.
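A minimal sketch of these two steps, corresponding to Figures 28-29, might look as follows; the endpoint, parameter names, field names and station identifier are assumptions about the Meteostat API of that time and may differ from the code actually used.

```python
import requests

API_KEY = "YOUR_METEOSTAT_KEY"   # obtained after registering with Meteostat
STATION = "27612"                # hypothetical station id

# Request daily observations as JSON (endpoint and parameters are assumptions).
response = requests.get(
    "https://api.meteostat.net/v2/stations/daily",
    params={"station": STATION, "start": "2018-01-01", "end": "2019-12-31"},
    headers={"x-api-key": API_KEY},
)
response.raise_for_status()
records = response.json()["data"]

# Keep only the average daily temperature (field name assumed to be 'tavg').
temps = [rec["tavg"] for rec in records if rec.get("tavg") is not None]

# Split the series into consecutive weeks of 7 values each.
weeks = [temps[i:i + 7] for i in range(0, len(temps) - 6, 7)]
print(len(weeks), "weeks; first week:", weeks[0])
```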
To begin with, a pre-forecast week was introduced and its mathematical expectation was calculated. The similarity coefficient described above is the Pearson correlation coefficient, which can be calculated using the pearsonr function.
Figure 30. Pearson correlation coefficient for each week.
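As a sketch of this step, continuing the previous snippet (the variable names are illustrative), the coefficient can be computed for every historical week against the pre-forecast week with SciPy:

```python
from scipy.stats import pearsonr

# 'weeks' is the list of 7-day temperature lists built in the previous sketch;
# the last of them is taken here as the pre-forecast week.
pre_week = weeks[-1]

# Pearson correlation of the pre-forecast week with every earlier week.
coefficients = []
for idx, week in enumerate(weeks[:-1]):
    r, _p_value = pearsonr(pre_week, week)
    coefficients.append((idx, r))

# The first fuzzification rule keeps only candidates with r > 0.625.
candidates = [(idx, r) for idx, r in coefficients if r > 0.625]
best_idx, best_r = max(candidates, key=lambda item: item[1])
print("best analog week:", best_idx, "K =", round(best_r, 2))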
After calculating the coefficient for each week, we conduct a comparative analysis. The first fuzzification rule requires the coefficient value to exceed 0.625. In the current selection, only one value fits this rule: K = 0.82.
Figure 31. Graph of Kxy correlation between pre-forecast week and its historical analog
By transferring the values of both weeks into two-dimensional coordinates and fitting the K_xy line, we can calculate the angle of inclination of this line. The slope of the line is 1.1, so the angle is φ = arctan(1.1) ≈ 47.7°. It then remains to apply the fuzzification rule in full.
Figure 32. Results of final fuzzification rules.
So, finally, we have: K_xy is "Strong" (0.82) AND φ_XY (47.2°) is "Average", THEREFORE Similarity is "Full".
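One way to obtain the slope and angle, continuing the earlier sketches (the line fit via np.polyfit and the term bounds are assumptions), is:

```python
import numpy as np

analog_week = weeks[best_idx]   # the historical analog found above

# Fit a straight line y = k*x + b through the (analog, pre-forecast) point pairs.
slope, intercept = np.polyfit(analog_week, pre_week, 1)

# The inclination angle of the K_xy line, in degrees.
phi = np.degrees(np.arctan(slope))
print("slope:", round(slope, 2), "angle:", round(phi, 1))

# Apply RULE 1: K_xy "Strong" AND phi "Average" -> Similarity "Full".
is_full_similarity = (best_r > 0.625) and (30.0 < phi < 60.0)   # term bounds are assumptions
print("Similarity 'Full':", is_full_similarity)
```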
In the course of these manipulations, we have identified the analog most similar to the pre-forecast period, which means that the week following it can be considered similar to the predicted week. Therefore, for an elementary calculation of approximate predicted values based only on a single measurement of air temperature, the element-wise difference of the two weeks was calculated and transferred to the week following the historical analog. As a result, the following values were obtained:
Figure 33. Predicted weather values for the searched week.
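One plausible reading of this final step, continuing the earlier sketches (the exact adjustment scheme by which the difference is "transferred" is an assumption), is:

```python
import numpy as np

week_after_analog = np.array(weeks[best_idx + 1])   # week that followed the historical analog
pre = np.array(pre_week)
analog = np.array(analog_week)

# Element-wise difference between the pre-forecast week and its analog,
# transferred onto the analog's following week to obtain the forecast.
predicted_week = week_after_analog + (pre - analog)
print("predicted daily temperatures:", np.round(predicted_week, 1))
```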
6. Chapter six: conclusion
6.1 Summary
In the course of the current research, we have studied and applied weather forecasting using the similarity method. We analyzed cases from two different areas of public life to understand the principles of choosing software in a particular case. Moreover, an example of one of the modern methods of forecasting time series, SARIMA, was given, and further steps to improve the proposed method were suggested.