Power Saving Evaluation with Automatic Offloading

Heterogeneous hardware other than small-core CPUs, such as GPUs, FPGAs, and many-core CPUs, is increasingly being used. However, using heterogeneous hardware requires advanced technical skills, such as familiarity with CUDA. To overcome this barrier, I previously proposed environment-adaptive software that automatically converts and configures once-written code so that it runs with high performance and low power on the hardware where it is deployed. I also previously verified the performance improvement of automatic GPU and FPGA offloading. In this paper, I verify low-power operation with environment adaptation by evaluating power consumption after automatic offloading. I compare the Watt*seconds of existing applications after automatic offloading with those of CPU-only processing.


Introduction
As Moore's Law slows down, a central processing unit's (CPU's) transistor density cannot be expected to double every 1.5 years. To compensate for this, more systems are using heterogeneous hardware, such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and so on. For example, Microsoft's search engine Bing uses FPGAs (1), and Amazon Web Services (AWS) provides GPU and FPGA instances (2) using cloud technologies (3)- (10). Systems with Internet of Things (IoT) devices are also increasing (11)- (20).
However, to properly utilize devices other than small-core CPUs in these systems, programs and configurations must take device characteristics into account, using technologies such as Open Multi-Processing (OpenMP) (21), Open Computing Language (OpenCL) (22), and Compute Unified Device Architecture (CUDA) (23). In addition, programmers need to be sufficiently skilled in embedded software to precisely control IoT devices. Therefore, for most programmers, the skill barriers are high.
In short, the expectations for applications using heterogeneous hardware are becoming higher, but the skill hurdles for using them are currently high. To surmount these hurdles, application programmers should only need to write logics to be processed, and then software should adapt to the environments with heterogeneous hardware to make it easy to use such hardware.
Java (24), which appeared in 1995, caused a paradigm shift in environment adaptation by enabling software written once to run on a different CPU machine. However, no consideration was given to application performance and power consumption at the porting destination. Therefore, I previously proposed environment-adaptive software that runs once-written applications with high performance and low power by automatically converting and configuring code so that GPUs, FPGAs, many-core CPUs, and so on can be used appropriately in deployment environments. As an elemental technology for environment-adaptive software, I also proposed a method for automatically offloading loop statements and function blocks of applications to GPUs or FPGAs and improved performance (25)(26)(27). However, only the processing time after offloading has been evaluated so far; power consumption after offloading has not.
This paper proposes a method that takes power consumption into consideration when offloading a normal CPU program to a device such as an FPGA to improve its performance, and verifies the reduction of power consumption after offloading of existing applications. I also propose a method to automatically select an appropriate offload destination in consideration of the power consumption.
The rest of this paper is organized as follows. In Section 2, I review technologies on the market and my previous proposals. In Section 3, I propose an automatic offloading method for GPUs and FPGAs that takes power consumption into consideration. In Section 4, I evaluate the power consumption during automatic FPGA offload with an existing application. In Section 5, I conclude the paper.

Technologies on the market
Java is one example of environment-adaptive software. In Java, by using a virtual execution environment called the Java Virtual Machine, written software can run even on machines that use different operating systems (OSes) without recompiling (Write Once, Run Anywhere). However, whether the expected performance could be attained at the porting destination was not considered, and too much effort was involved in performance tuning and debugging at the porting destination (Write Once, Debug Everywhere).
CUDA is a major development environment for general-purpose GPUs (GPGPUs) (28) that use GPU computational power for more than just graphics processing. To control heterogeneous hardware such as FPGAs and GPUs uniformly, the OpenCL specification and its software development kit (SDK) are widely used (29). CUDA and OpenCL require not only C language extensions but also additional descriptions such as memory copies between devices and CPUs. Because of these programming difficulties, there are few CUDA and OpenCL programmers.
For easy heterogeneous hardware programming, there are technologies that specify parallel processing areas by specified directives, and compilers transform these specified parts into device-oriented codes on the basis of the specified directives. Open accelerators (OpenACC) (30) and OpenMP are examples of directive-based specifications, and the Portland Group Inc. (PGI) compiler (31) and gcc are examples of compilers that support these directives.
In this way, CUDA, OpenCL, OpenACC, OpenMP, and others support GPU, FPGA, or many-core CPU offload processing. Although processing on devices can be done, sufficient application performance and power reduction are difficult to attain. For example, when users use an automatic parallelization technology, such as the Intel compiler (32) for multi-core CPUs, possible areas of parallel processing such as "for" loop statements are extracted. However, naive parallel execution on devices does not achieve high performance because of the overhead of data transfer between CPU and device memory. To achieve high application performance with devices, code in CUDA, OpenCL, and so on needs to be tuned by highly skilled programmers, or an appropriate offloading area needs to be searched for by using the OpenACC compiler or other technologies.
Therefore, users without skills in using GPU, FPGA, or many-core CPU will have difficulty attaining high application performance and power reduction. Moreover, if users use automatic parallelization technologies to obtain high performance, much effort is needed to determine whether each loop statement is parallelized.

Previous proposals
On the basis of the above background, to adapt software to an environment, I previously proposed environment-adaptive software (26), the processing flow of which is shown in Figure 1. The environment-adaptive software is achieved with an environment-adaptation function, test-case database (DB), code-pattern DB, facility-resource DB, verification environment, and production environment.
Step 1: Code analysis
Step 2: Offloadable-part extraction
Step 3: Search for suitable offload parts
Step 4: Resource-amount adjustment
Step 5: Placement-location adjustment
Step 6: Execution-file placement and operation verification
Step 7: In-operation reconfiguration
Because most offloading to heterogeneous devices is currently done manually, I proposed the concept of environment-adaptive software and automatic offloading to heterogeneous devices. For automation, I have also proposed a method using evolutionary computation to search for appropriate parallel processing parts when offloading to a GPU. However, the previous paper only evaluated the shortening of the processing time and not the reduction of power consumption. Therefore, in this paper, I evaluate the reduction of power consumption when automatically offloading to an offload device such as an FPGA.

Automatic GPU and FPGA Offload Considering Power Consumption
To embody the concept of environment-adaptive software, I have so far proposed automatic GPU and FPGA offload of program loop statements, automatic offload of program functional blocks, and offload in multilingual and mixed environments. Based on these elemental technologies, in subsections 3.1 and 3.2, I propose automatic GPU and FPGA offload technology for loop statements that takes power consumption into consideration. In 3.3, I propose a technology for selecting an appropriate offload destination in an environment where several kinds of migration destinations are mixed.

Automatic GPU offload of loop statements
For automatic GPU offloading of loop statements, I proposed a method and evaluated processing time improvement (33).
First, as a basic problem, a compiler can detect that a given loop statement cannot be processed in parallel on the GPU, but it is difficult for it to determine whether a loop statement is suitable for parallel processing on the GPU. Loop statements with a large number of iterations are generally said to be more suitable, but it is difficult to predict the performance and power consumption of offloading to the GPU without actually measuring them. Therefore, the instruction to offload a loop to the GPU is often given manually and the performance is measured by trial and error. On the basis of that, (33) proposed automatically finding appropriate loop statements to offload to the GPU with a genetic algorithm (GA) (34), which is an evolutionary computation method. From a general-purpose program for normal CPUs, the proposed method first checks which loop statements are parallelizable. Then, for each parallelizable loop statement, it sets 1 for GPU execution and 0 for CPU execution. These values are encoded as a gene, and the performance verification trial is repeated in the verification environment to search for an appropriate area. Here, a pattern that is processed in a short time in the verification environment measurement is regarded as a gene with high goodness of fit. In this paper, the power consumption is also measured in the verification environment, and a new process is added that gives a high goodness of fit to patterns with low power consumption. For example, the goodness of fit is set to (Processing time)$^{-1/2}$*(Power consumption)$^{-1/2}$ so that it increases for short processing time and low power consumption (Figure 2).
(33) also proposed a method for transferring variables efficiently. For variables used in nested loop statements, when a loop statement is offloaded to the GPU, variables whose CPU-GPU transfer can safely be performed at an upper level of the nest are transferred together at that upper level. In addition, beyond nested loops, for variables defined across multiple files where GPU processing and CPU processing are not nested within each other and are clearly separated, the proposed method specifies that they be transferred in a batch.
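The search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of (33): the `measure` function is a hypothetical stand-in for a run in the verification environment, its per-loop numbers are invented, and the GA settings (roulette selection, one-point crossover, 10% mutation, elitism, and a CPU-only gene in the initial population) are choices made only for this sketch.

```python
import random

def fitness(time_s, power_w):
    """Goodness of fit: (processing time)^(-1/2) * (power consumption)^(-1/2)."""
    return (time_s ** -0.5) * (power_w ** -0.5)

def measure(gene):
    """Hypothetical stand-in for one verification-environment measurement.

    gene[i] == 1 means loop i runs on the GPU, 0 means it stays on the CPU.
    Returns (processing time in seconds, average power in Watts).
    All numbers below are invented for illustration only.
    """
    effect = [(4.0, 3.0), (-1.0, 5.0), (6.0, 2.0), (0.2, 6.0)]  # (time saved, extra power)
    time_s, power_w = 20.0, 100.0  # assumed CPU-only baseline
    for bit, (saved, extra) in zip(gene, effect):
        if bit:
            time_s -= saved
            power_w += extra
    return time_s, power_w

def score(gene):
    return fitness(*measure(gene))

def evolve(n_loops=4, pop_size=8, generations=10, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_loops)] for _ in range(pop_size)]
    pop[0] = [0] * n_loops  # keep the CPU-only pattern as a baseline individual
    best = max(pop, key=score)
    for _ in range(generations):
        weights = [score(g) for g in pop]
        nxt = [best[:]]  # elitism: the best pattern so far always survives
        while len(nxt) < pop_size:
            a, b = rng.choices(pop, weights=weights, k=2)  # roulette selection
            cut = rng.randint(1, n_loops - 1)              # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.1:                         # mutation: flip one bit
                i = rng.randrange(n_loops)
                child[i] ^= 1
            nxt.append(child)
        pop = nxt
        best = max(pop, key=score)
    return best, measure(best)

if __name__ == "__main__":
    gene, (t, p) = evolve()
    print(gene, t, p)
```

Because the CPU-only gene is in the initial population and the best gene is always carried over, the pattern returned in this sketch never has a lower goodness of fit than CPU-only execution.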
In summary, I propose an evolutionary computation method that includes power consumption in the goodness of fit and a reduction in CPU-GPU transfer. By using them, the speed is increased and the power is reduced automatically.

Automatic FPGA offload of loop statements
I have also proposed a method for offloading loop statements to FPGA to improve performance (26).
When trying to speed up a program by offloading a specific time-consuming loop statement to an FPGA, it is difficult to predict which loop should be offloaded. Therefore, as in the GPU offload case, I propose that the performance measurement be performed automatically in the verification environment. However, since it takes several hours or more to compile OpenCL for an FPGA and operate it on an actual machine, repeatedly measuring the performance with a GA, as in automatic GPU offload, would take a huge amount of processing time.
Fig. 3. Automatic FPGA offload method considering power consumption
Therefore, the performance measurement trial is performed after narrowing down the candidate loop statements to be offloaded to the FPGA. Specifically, from the loop statements found in the code, those with high arithmetic intensity are first extracted using an arithmetic intensity analysis tool such as the ROSE framework (35). Furthermore, loop statements with a large number of loop iterations are also extracted using a profiling tool such as gcov or gprof. OpenCL code is created for the candidate loop statements with high arithmetic intensity and a large number of iterations. At the time of OpenCL creation, the CPU processing program is divided into the kernel (FPGA) and the host (CPU) in accordance with the OpenCL syntax. For these offload candidates, our method precompiles the created OpenCL to find loop statements with high resource efficiency. Because the amounts of resources to be created, such as flip-flops and lookup tables, are known partway through compilation, the candidates are further narrowed down to loop statements that use a sufficiently small amount of resources. Our method then measures the performance and power consumption of the few candidate loop statements that remain.
Our method compiles each selected single-loop statement so that it works on the actual FPGA and measures it. For single-loop statements that achieved a speedup, a pattern combining them is also created and a second measurement is performed. Among the multiple patterns measured in the verification environment, a pattern with short processing time and low power is selected as the final solution. To favor short time and low power, our method uses the same kind of evaluation value as for the GPU (Figure 3).
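The narrowing steps above can be sketched as a three-stage filter. This is a rough illustration, not the actual tool chain: the `Loop` record, the cutoff parameters, the resource threshold of 0.5, and all sample values are assumptions, and the per-loop figures would in practice come from an arithmetic intensity analysis, a profiler such as gcov or gprof, and the OpenCL precompile report.

```python
from dataclasses import dataclass

@dataclass
class Loop:
    name: str
    arithmetic_intensity: float  # from an arithmetic intensity analysis (assumed value)
    iterations: int              # from a profiler such as gcov or gprof (assumed value)
    resource_ratio: float        # fraction of FPGA resources after precompile (assumed value)

def narrow_candidates(loops, n_intensity=4, n_iterations=3, max_resource=0.5):
    """Return the loops worth a full FPGA compile and measurement."""
    # Stage 1: keep the loops with the highest arithmetic intensity.
    stage1 = sorted(loops, key=lambda l: l.arithmetic_intensity, reverse=True)[:n_intensity]
    # Stage 2: among them, keep the loops with the most iterations.
    stage2 = sorted(stage1, key=lambda l: l.iterations, reverse=True)[:n_iterations]
    # Stage 3: drop loops whose precompile report shows too much resource use.
    return [l for l in stage2 if l.resource_ratio <= max_resource]

# Invented sample data for illustration only.
loops = [
    Loop("L1", 8.0, 1_000, 0.20),
    Loop("L2", 6.0, 500_000, 0.30),
    Loop("L3", 5.0, 200_000, 0.70),
    Loop("L4", 1.0, 900_000, 0.10),
    Loop("L5", 4.0, 300_000, 0.25),
]

if __name__ == "__main__":
    print([l.name for l in narrow_candidates(loops)])  # ['L2', 'L5']
```

Only the loops that survive all three stages go through the hours-long FPGA compilation, which is what keeps the number of measured patterns small.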
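The two-round measurement can be sketched as below. This is an illustration under assumptions: `measure` is a hypothetical hook that compiles and runs one offload pattern and returns its processing time and power, "sped up" is interpreted here as being faster than the CPU-only time, and the sample figures are invented.

```python
def evaluation(time_s, power_w):
    """Evaluation value: (processing time)^(-1/2) * (power consumption)^(-1/2)."""
    return (time_s * power_w) ** -0.5

def two_round_search(candidates, measure, cpu_time_s):
    """Measure single-loop patterns, then one combination of the faster ones.

    measure(pattern) -> (time_s, power_w) is a hypothetical measurement hook;
    pattern is a frozenset of loop names offloaded to the FPGA.
    """
    results = {}
    # Round 1: offload each candidate loop on its own.
    for loop in candidates:
        pattern = frozenset([loop])
        results[pattern] = measure(pattern)
    # Loops whose single-loop pattern beat the CPU-only processing time.
    faster = [l for l in candidates if results[frozenset([l])][0] < cpu_time_s]
    # Round 2: measure one pattern that combines those loops.
    if len(faster) > 1:
        combo = frozenset(faster)
        results[combo] = measure(combo)
    # Final solution: the pattern with the highest evaluation value.
    best = max(results, key=lambda p: evaluation(*results[p]))
    return best, results[best]

# Invented measurements for illustration only.
fake = {
    frozenset(["A"]): (6.0, 110.0),
    frozenset(["B"]): (9.0, 112.0),
    frozenset(["C"]): (15.0, 115.0),
    frozenset(["A", "B"]): (3.0, 114.0),
}

if __name__ == "__main__":
    best, (t, p) = two_round_search(["A", "B", "C"], fake.__getitem__, cpu_time_s=14.0)
    print(sorted(best), t, p)
```

In this sample, loop C is slower than CPU-only processing, so it is left out of the combined pattern and never measured in combination.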
In summary, after narrowing down the candidate loop statements using the arithmetic intensity, the number of loops, and the resource efficiency, the measurement is performed in the verification environment to increase the evaluation value of the low-power pattern. With them, the speed is increased and the power is reduced automatically.

Automatic offload to mixed environments
I also studied a technology to select an appropriate migration destination while GPU, FPGA, and many-core CPU are mixed as migration destinations.
I propose the following order of verification for the three offloads: many-core CPU loop statement offload, GPU loop statement offload, and FPGA loop statement offload. With automatic offload, the pattern search should finish as quickly as possible. Therefore, FPGA verification, which takes a long time, comes last, and if a pattern that sufficiently satisfies the user requirements is found in an earlier stage, FPGA verification is not performed. There is no big difference in price and verification time between a GPU and a many-core CPU, but a many-core CPU differs less from a normal CPU than a GPU does, since the GPU has separate memory and is a different device. Therefore, the verification starts with the many-core CPU, and if a pattern that sufficiently satisfies the user requirements is found with the many-core CPU, GPU verification is not performed.
Here, the previous method is to verify the three migration destinations and automatically select the high-speed migration destination. However, in this paper, not only the short processing time but also the low-power migration destination is automatically selected through the actual measurement in the verification environment. For example, (Processing time)$^{-1/2}$*(Power consumption)$^{-1/2}$ is set to increase evaluation value for short processing time and low power consumption.
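The staged verification with early termination can be sketched as follows. This is a minimal sketch under assumptions: expressing the user requirements as a threshold on the evaluation value is a simplification made here, `measure` is a hypothetical hook that returns the best pattern's time and power for one destination, and the sample figures are invented.

```python
def evaluation(time_s, power_w):
    """Evaluation value: (processing time)^(-1/2) * (power consumption)^(-1/2)."""
    return (time_s * power_w) ** -0.5

def select_destination(measure, required_value,
                       order=("many-core CPU", "GPU", "FPGA")):
    """Verify destinations in order and stop once the requirement is met.

    measure(destination) -> (time_s, power_w) is a hypothetical hook.
    Returns (best destination, its evaluation value, destinations tried).
    """
    best_dest, best_value, tried = None, 0.0, []
    for dest in order:
        tried.append(dest)
        value = evaluation(*measure(dest))
        if value > best_value:
            best_dest, best_value = dest, value
        if value >= required_value:
            break  # later, more time-consuming verifications are skipped
    return best_dest, best_value, tried

# Invented measurements for illustration only.
samples = {
    "many-core CPU": (8.0, 130.0),
    "GPU": (3.0, 150.0),
    "FPGA": (2.0, 111.0),
}

if __name__ == "__main__":
    dest, value, tried = select_destination(samples.__getitem__, required_value=0.04)
    print(dest, tried)  # GPU ['many-core CPU', 'GPU']
```

With a stricter requirement that no destination satisfies, all three destinations are verified and the one with the highest evaluation value is returned.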
As a typical data center cost structure, the initial cost, such as hardware and development, is 1/3 of the total cost; the operation cost, such as power and maintenance, is 1/3; and other costs, such as service orders, are 1/3. In this case, if offloading reduces the processing time to 1/5, for example, the initial cost is reduced when the number of servers can be halved, even though each server combines a CPU with a GPU. Halving the power consumption also leads to a reduction in operation cost. However, operation costs include many factors other than electric power, so halving the power consumption does not halve the operation cost. In addition, hardware prices vary by operator, for example through volume discounts that depend on the number of GPU and FPGA servers installed. Therefore, the evaluation formula needs to be set differently for each business operator.
In this way, in this paper, the appropriate offload destination is automatically selected in consideration of not only the processing time but also the power consumption.

Evaluation
The automatic offloading of loop statements to GPU and FPGA has already been evaluated (33). In this paper, I modify the implementation from the previous papers so that, when the evaluation value of a measured pattern is determined, patterns with lower power consumption receive a higher evaluation value, and I perform offloading with this modified implementation. I will show that both the processing time and the power consumption are reduced by offloading.

Evaluation Condition
(a) Evaluated application
The application evaluated for FPGA offloading is MRI-Q, which performs magnetic resonance imaging (MRI) image processing. This is used in many cases.
MRI-Q (36) computes a matrix Q, representing the scanner configuration for calibration, used in 3D MRI reconstruction algorithms in non-Cartesian space. In an IoT environment, image processing is often necessary for automatic monitoring from camera videos, and performance enhancements are requested in many cases. During application performance measurement, MRI-Q executes 3D MRI image processing to measure processing time using sample data of size 64*64*64. MRI-Q is an application written in C. It is processed on the CPU on the basis of the C logic and on the FPGA on the basis of OpenCL logic converted from the C code.
(b) Evaluation method
In the experiment, I enter the code of the evaluated application, and the implementation tries to offload to the FPGA the loop statements recognized by an analysis library such as Clang (37). For FPGA offloading, the arithmetic intensity and other values are used to narrow down the measurement patterns to four. In each trial, the processing time and power consumption are measured. For the finally determined offload pattern, the change in power consumption over time is acquired, and the improvement in power consumption is confirmed in comparison with the case where all processing is performed by the normal CPU without offload.
Number of processable loop statements: 16 for MRI-Q.
Evaluation value: (Processing time)$^{-1/2}$*(Power consumption)$^{-1/2}$. When processing time and power consumption become smaller, the evaluation value becomes larger. If the performance measurement does not complete in 3 minutes, a timeout is issued, and the processing time is set to 1,000 seconds to calculate the evaluation value.
(c) Experiment environment
I used physical machines with an Intel PAC with an Intel Arria10 GX FPGA for offloading verification and Intel Acceleration Stack Version 1.2 (38) for FPGA control. The power consumption of the whole server was acquired with ipmitool (39) through the Intelligent Platform Management Interface (IPMI) of the Dell PowerEdge R740. Figure 4 shows the experimental environment and specifications.
Figure 5 shows the power in Watts over time when MRI-Q was offloaded to the FPGA. From the figure, the processing time was shortened from 14 seconds, when all processing is performed by the CPU, to 2 seconds. The power also changed from about 121 Watt with the CPU only to 111 Watt with the CPU and FPGA. As a result, Watt*sec changed from 1,690 Watt*sec with CPU-only processing to 223 Watt*sec when offloaded to the FPGA.
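The relation between the figures above can be checked with a short calculation. This is a sketch under assumptions: the two traces are idealized as constant power over the reported durations (the measured trace in Figure 5 is not reproduced here), so the results are close to, but not identical with, the reported 1,690 and 223 Watt*sec. The `timed_out` flag illustrates the rule that a pattern exceeding the 3-minute limit is evaluated with a processing time of 1,000 seconds.

```python
PENALTY_TIME_S = 1000.0  # processing time assigned when the 3-minute limit is exceeded

def energy_ws(samples):
    """Trapezoidal integral of (time_s, power_w) samples, in Watt*sec."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += (t1 - t0) * (p0 + p1) / 2.0
    return total

def evaluation(time_s, power_w, timed_out=False):
    """Evaluation value: (processing time)^(-1/2) * (power consumption)^(-1/2)."""
    if timed_out:
        time_s = PENALTY_TIME_S
    return (time_s ** -0.5) * (power_w ** -0.5)

# Idealized constant-power traces built from the approximate figures in the text.
cpu_only = [(0.0, 121.0), (14.0, 121.0)]
fpga = [(0.0, 111.0), (2.0, 111.0)]

if __name__ == "__main__":
    print(energy_ws(cpu_only))  # 1694.0
    print(energy_ws(fpga))      # 222.0
    print(evaluation(14.0, 121.0) < evaluation(2.0, 111.0))  # True
```

Under this idealization, the offloaded pattern uses roughly one-eighth of the CPU-only Watt*sec, and a timed-out pattern always receives a lower evaluation value than either measured pattern at comparable power.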

Results
I confirmed the speedup and power reduction of MRI-Q, an image processing application expected to be used by many users. When offloaded to the FPGA, the power in Watts of the whole server with CPU and FPGA is reduced slightly; combined with the shortened processing time, this results in a significant reduction in power consumption in Watt*sec. The power consumption during GPU offload is not measured this time, but in a GPU and FPGA mixed environment, an appropriate offload destination is selected from the measured performance and power.

Conclusions
In this paper, I proposed and evaluated an offload method that considers power consumption as an element of environment-adaptive software for operating applications with high performance and low power.
During the measurements in the verification environment for automatic GPU and FPGA offloading trials, the power consumption is acquired along with the processing time, and patterns with short processing time and low power are given a high goodness of fit so that the automatic code conversion reduces power. When GPUs, FPGAs, and so on are mixed, automatic selection is performed by trying migration to each single migration destination and selecting the destination with low power consumption and high performance. Through the automatic FPGA offloading of MRI-Q, I demonstrated high performance and low power consumption and verified the effectiveness of the method.
In the future, I will verify the reduction of power consumption with more applications for both FPGA and GPU. In addition, the evaluation formulas for shortening the processing time and reducing the power consumption will be examined with reference to specific examples of the cost structure of the business operator.