MapReduce Design for Process Discovery using Passages

This study proposes a MapReduce design for a passage-based distributed process discovery algorithm, which discovers a process model by analyzing the event logs stored as the process progresses. The design can support the diagnosis, variation measurement, and improvement of processes by discovering the process model in a smart factory environment where processes change autonomously.


Introduction
The concept of smart factories is gaining significance all over the world to cope with intensifying global manufacturing competition and to maintain future competitiveness. In the future, the smart factory will evolve into a self-adaptive factory where all objects exchange information by connecting to each other. The physical world and the cyber world will organically converge and communicate, and objects by themselves will be able to recognize and determine situations (1)(2)(3).
When the future self-adaptive smart factory is implemented, manufacturing processes will change depending on the situation. In addition, it is expected that variations in the process will increase, and the traceability and process management of the manufacturing sites will become more and more difficult and complicated (4). In particular, measuring and understanding variations in the manufacturing process can be a great help in understanding the current performance level. However, if it is difficult to measure the variation, it is also difficult to identify the current performance level and to define the optimized process model.
A process mining technique can be effectively utilized in such an environment where various processes exist. Process mining is a series of activities that discover a process model from the logs accumulated in the operational database, check conformance between the predefined process model and the actually executed process logs, and enhance the process model by identifying bottlenecks or anomalies (5). By applying a process mining technique, we can analyze a process that has changed and has been executed depending on the situation, determine the optimal process from the overall perspective, and enhance self-adaptive rules.
A large amount of IoT data and manufacturing execution data will be generated and stored in big data storage in the smart factory environment. To discover processes from such massive data, existing process mining techniques must be implemented with big data parallel processing techniques such as MapReduce. With this in mind, this study proposes a method to discover a process model in parallel from large-scale process logs, assuming a smart factory with a big data system in place. Of the many process discovery algorithms, this study takes the passage-based distributed process discovery method proposed by Aalst et al. (6) and designs it with MapReduce, which is a parallel processing programming model.

Related Works
The trend of applying big data technology to process mining has been emphasized by many works (7,8). Reguieg et al. (9) suggested a two-step approach to event correlation discovery using a MapReduce framework, and developed a method that partitions event logs at the reducer to reduce the workload.
Evermann (10) applied the MapReduce framework to the alpha algorithm and the heuristics miner algorithm, and evaluated the scalability of the algorithms by calculating the total job exit time based on the performance of task nodes.
DOI: 10.12792/iciae2018.058

Process Discovery using Passage
Instead of discovering the process by analyzing the entire log at once, passage-based process discovery (6) divides the log into sublogs, extracts a net fragment from each sublog, and then combines all net fragments into a single overall system net. The following is an example proposed by Aalst et al.
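Independently of the example by Aalst et al., the decomposition idea can be illustrated with a minimal Python sketch. It assumes the usual definition of a passage as a pair of activity sets (X, Y) in which Y is exactly the set of direct successors of X and X is exactly the set of direct predecessors of Y, so that two direct-follow edges belong to the same minimal passage when they share a source or a target. The function name `minimal_passages` and the small graph in the usage note are hypothetical and are not taken from the original work.

```python
from collections import defaultdict


def minimal_passages(edges):
    """Partition direct-follow edges into minimal passages (X, Y).

    Assumption: two edges belong to the same passage when they share
    a source or share a target; passages are the resulting classes.
    """
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Each activity is split into an "out" side (as a source) and an
    # "in" side (as a target); every edge links one of each.
    for source, target in edges:
        union(("out", source), ("in", target))

    # Collect the sources X and targets Y of every connected class.
    groups = defaultdict(lambda: (set(), set()))
    for source, target in edges:
        sources, targets = groups[find(("out", source))]
        sources.add(source)
        targets.add(target)
    return sorted((sorted(x), sorted(y)) for x, y in groups.values())


# Hypothetical direct-follow graph: a splits into b and c, which join in d.
example_edges = [
    ("start", "a"), ("a", "b"), ("a", "c"),
    ("b", "d"), ("c", "d"), ("d", "end"),
]
passages = minimal_passages(example_edges)
```

For this graph the sketch yields four passages: ({start}, {a}), ({a}, {b, c}), ({b, c}, {d}), and ({d}, {end}). Each passage can then be mined on its own sublog, which is what makes the approach a natural fit for parallel processing.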

MapReduce for Passage-Based Process Discovery
This section describes how to design passage-based process discovery as a three-step MapReduce. It is assumed that the process log consists of objectID, readPoint, and timestamp, as shown in Table 1. These correspond to caseID, activity, and timestamp, respectively. Next, in Reduce1, start and end nodes are added to each case, and every pair of readPoints that stand in a direct-follow relationship is extracted. The key of the output is the readPoint pair: 'i' is appended to the readPoint name for the input node, and 'o' is appended to the readPoint name for the output node. The value is the log number, which serves only as the key for collecting all readPoint pairs in the next MapReduce and is otherwise not used in the calculation. This is summarized in Fig. 6.
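The first MapReduce step can be sketched in plain Python as follows. This is a single-machine illustration under stated assumptions, not the implementation used in this study: Map1 is taken to emit (readPoint, timestamp) keyed by objectID; the names `map1`, `reduce1`, `run_step1`, `START`, and `END` are hypothetical; timestamps are assumed to be ISO 8601 strings so that string order equals time order; and the first readPoint of each pair is assumed to be the input node (suffix 'i') and the second the output node (suffix 'o'). If the opposite convention is intended, only the line building the pairs changes.

```python
from collections import defaultdict

START, END = "start", "end"  # hypothetical names for the added nodes


def map1(record):
    """Emit (objectID, (readPoint, timestamp)) for one log record."""
    object_id, read_point, timestamp = record
    return object_id, (read_point, timestamp)


def reduce1(object_id, values, log_number):
    """Order one case by time, add start/end, emit direct-follow pairs."""
    trace = [read_point for read_point, _ in sorted(values, key=lambda v: v[1])]
    trace = [START] + trace + [END]
    # Assumption: first element is the input node ('i'), second the
    # output node ('o'). The value is the log number, used only as the
    # grouping key of the next MapReduce step.
    return [((a + "i", b + "o"), log_number) for a, b in zip(trace, trace[1:])]


def run_step1(records, log_number=1):
    """Simulate the shuffle phase between Map1 and Reduce1 in memory."""
    grouped = defaultdict(list)
    for record in records:
        key, value = map1(record)
        grouped[key].append(value)
    output = []
    for object_id in sorted(grouped):
        output.extend(reduce1(object_id, grouped[object_id], log_number))
    return output


# Hypothetical log rows: (objectID, readPoint, timestamp).
records = [
    ("obj1", "A", "2018-01-01T09:00"),
    ("obj2", "A", "2018-01-01T09:05"),
    ("obj1", "B", "2018-01-01T09:10"),
    ("obj2", "C", "2018-01-01T09:20"),
]
step1_output = run_step1(records)
```

Here obj1 produces the pairs (starti, Ao), (Ai, Bo), and (Bi, endo), each with the log number as its value. Because all pairs carry the same value, the next MapReduce step can gather them under a single key.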

Conclusions
This study proposed a MapReduce design for implementing a process mining technique that can be used in the self-adaptive smart factories expected in the near future. First, we reviewed a passage-based process discovery method and designed a three-step MapReduce function. In addition, we described the principle of the MapReduce design by showing an example of the output of each step. This study is significant in that it opens the possibility of improving computation performance by designing the algorithm with MapReduce, which is a parallel processing programming model.
Future research is required to verify the performance using Hadoop MapReduce, MapReduce of NoSQL databases, and Spark, which are technologies to implement the MapReduce design presented in this study.

Fig. 4, Fig. 5.
Fig. 4 shows the entire MapReduce flow consisting of three steps, and the input and output of Map and Reduce of each step. The following is a detailed description of each step. The pairs (readPoint, timestamp) are extracted using the objectID as key in Map1 of the first MapReduce. An

Table 1. Example of process log.