A Fast CU Division Algorithm for Intra Prediction in HEVC

High Efficiency Video Coding (HEVC) is a new video coding standard following H.264/AVC. By introducing a flexible coding unit (CU), which can be recursively divided from 64×64 down to 8×8 blocks in a quadtree structure, HEVC achieves significantly higher coding efficiency than previous standards. With the flexible CU structure, HEVC can adapt effectively to complicated content with smaller CUs or to flat content with larger CUs, making it suitable for applications from mobile video to super high definition television (HDTV). On the other hand, CU division incurs a high computational cost. In this paper, we propose a simple and fast CU division algorithm that uses only a subset of pixels to decide whether a CU should be divided. Experimental results show that our algorithm achieves prediction quality close to that of the HEVC Test Model (HM) at a much lower computational cost.


Introduction
High Efficiency Video Coding (HEVC) is the most recent video coding standard after H.264/AVC. Similar to MPEG-2/4 and H.264/AVC, HEVC uses a hybrid coding method based on motion-compensated prediction and block-based DCT coding. In HEVC, however, the introduction of advanced techniques such as the block division structure and the intra prediction modes roughly doubles the compression ratio compared to H.264/AVC [1]-[5].
For intra and inter prediction, HEVC uses a variable block division scheme that allows a block to be divided into a set of sub-blocks of different sizes. This block division scheme consists of four units: Coding Tree Unit (CTU), Coding Unit (CU), Prediction Unit (PU), and Transform Unit (TU). A picture to be coded is first divided into a number of CTUs of 64×64 pixels, and each CTU is then divided into smaller CUs of variable sizes, but no smaller than 8×8 pixels. The CU division process follows a quadtree structure and can be repeated recursively.
Fig. 1 shows an example of the recursive division of a CU in Z-scan order. Initially, at depth 0, a CU has the same 64×64 size as the CTU. When the CU is divided, it is split into four sub-CUs of 32×32 pixels at depth 1. The division process can repeat until depth 3, where the divided CUs reach the minimal size of 8×8 pixels. When division of a CU is finished, PU partitioning is evaluated over the eight possible partitions for inter coding and the two possible partitions for intra coding. This CU division scheme is what gives HEVC its advantage in coding efficiency over the previous standards.
In HEVC, the optimal CU can be estimated using the Rate-Distortion (RD) cost, defined as follows:

$$\text{RD cost} = \mathrm{SSE} + \lambda \cdot R$$

where SSE denotes the sum of squared errors, $\lambda$ is a coefficient dependent on the quantization parameter (QP), and $R$ is the total number of bits for encoding. In the HEVC Test Model (HM), which is the standard implementation of HEVC, the RD cost must be computed for all CU divisions from depth 0 to depth 3 and for all PU partitions, resulting in a large number of calculations and significant complexity. To reduce this complexity, several algorithms have been proposed that target the reduction of RD calculation and CU division [6]-[10].
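To make the cost of the exhaustive search concrete, the following is a minimal Python sketch of the RD cost and of the split test it drives. The function names and the list-of-rows block representation are illustrative assumptions, and the split rule is a simplified reading of the exhaustive comparison (it ignores, for example, the bits spent on signalling the split); it is not HM code.

```python
def sse(original, reconstructed):
    """Sum of squared errors between two equally sized 2-D blocks."""
    return sum(
        (o - r) ** 2
        for row_o, row_r in zip(original, reconstructed)
        for o, r in zip(row_o, row_r)
    )


def rd_cost(original, reconstructed, lam, bits):
    """RD cost = SSE + lambda * R, with lam depending on QP and bits = R."""
    return sse(original, reconstructed) + lam * bits


def should_split(cost_unsplit, costs_of_sub_cus):
    """Simplified exhaustive rule: split when the four sub-CUs together cost less."""
    return sum(costs_of_sub_cus) < cost_unsplit


# Hypothetical 2x2 block and its reconstruction, for illustration only.
block = [[10, 12], [14, 16]]
recon = [[10, 11], [15, 16]]
cost = rd_cost(block, recon, lam=4.0, bits=3)
```

Because this cost has to be evaluated for every candidate CU at every depth, and again for every PU partition, a decision rule that avoids the evaluation removes a large share of the encoder's work.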
In this paper, we propose a simple and fast CU division algorithm in which, instead of the RD calculation, only a few pixels around the CU being processed are analyzed to decide whether the CU is to be divided. If there is large variability among those pixels, the CU is divided; otherwise division stops.

Algorithm
In the proposed method, the variability of pixel intensities around the current CU is estimated to decide whether the CU should be divided. This variability indicates the degree of complexity of a CU due to content and/or motion. A CU is divided only when the variability is high. To minimize computation when estimating the variability, we propose to check only a limited number of pixels around the CU instead of all of the pixels.
Fig. 2 shows four search directions from the current CU (gray block). In each search direction, only pixels within 64 pixels plus half the size of the CU from the center are checked. For example, if the current CU is 32×32 pixels, then 80 (64+32/2) pixels are checked. Searching all four directions thus requires 320 (80×4) pixels to be checked. Our algorithm for CU division is therefore as follows.

Algorithm
Step 1. Get a CTU from the image as an initial CU of 64×64 pixels, set depth = 0.
Step 3. Find an unprocessed CU in the current depth.
Step 4. For this CU, get the brightness values of the pixels in the four search directions and calculate their variances.
Step 5. For the four variance values, count those greater than a pre-decided threshold T and record the count in N.
Step 6. If N ≥ 2, the current CU is divided into four sub-CUs, depth = depth + 1, go back to Step 2.
Step 8. If depth = 3 or all the CUs in the current depth have been processed, depth = depth - 1, go back to Step 2.
Step 9. The sizes of all CUs have been determined.
Step 10. Return.
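The steps above can be sketched in Python as follows. This is a recursive reformulation under stated assumptions, not the authors' implementation: the four search directions are assumed to be right, left, down and up from the CU center (Fig. 2 is not reproduced here), samples falling outside the picture are skipped, the split test is assumed to be N ≥ 2 (exposed as `min_count`), and the thresholds are the per-depth values T0 to T2 reported in the threshold section. All function and parameter names are hypothetical.

```python
from statistics import pvariance

THRESHOLDS = (300, 800, 2800)   # T0, T1, T2 for depths 0, 1, 2
MIN_CU = 8                      # smallest CU size, reached at depth 3
# Assumed search directions (dx, dy): right, left, down, up.
DIRECTIONS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def direction_variances(img, x, y, size, reach=64):
    """Variance of the luma samples along each direction from the CU center."""
    h, w = len(img), len(img[0])
    cx, cy = x + size // 2, y + size // 2
    n = reach + size // 2       # e.g. 80 samples for a 32x32 CU
    variances = []
    for dx, dy in DIRECTIONS:
        samples = [
            img[cy + k * dy][cx + k * dx]
            for k in range(n)
            if 0 <= cx + k * dx < w and 0 <= cy + k * dy < h
        ]
        variances.append(pvariance(samples) if samples else 0.0)
    return variances


def divide_cu(img, x, y, size=64, depth=0, min_count=2):
    """Return the leaf CUs of the CU at (x, y) as (x, y, size) tuples."""
    if size <= MIN_CU:
        return [(x, y, size)]
    threshold = THRESHOLDS[depth]
    n = sum(v > threshold for v in direction_variances(img, x, y, size))
    if n < min_count:
        return [(x, y, size)]   # low variability: keep the CU whole
    half = size // 2
    leaves = []
    for oy in (0, half):        # Z-scan: top-left, top-right, bottom-left, bottom-right
        for ox in (0, half):
            leaves += divide_cu(img, x + ox, y + oy, half, depth + 1, min_count)
    return leaves


# A flat CTU stays whole; a one-pixel checkerboard is divided down to 8x8.
flat = [[128] * 64 for _ in range(64)]
checker = [[255 if (c + r) % 2 else 0 for c in range(64)] for r in range(64)]
flat_leaves = divide_cu(flat, 0, 0)
checker_leaves = divide_cu(checker, 0, 0)
```

The recursion replaces the explicit depth bookkeeping of the step list, but visits the CUs in the same Z-scan order and makes the same per-CU decision.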

Determination of Thresholds
In our algorithm there can be at most four depth levels (0 to 3), and a decision on whether to further divide a CU is needed in only three of them (0 to 2). We allow three different thresholds (T0 to T2) to be used at depths 0 to 2, respectively, and they are determined empirically using two types of standard test images, People and Landscape. Coding performance is evaluated with PSNR and coding time. PSNR is calculated as follows:

$$\mathrm{PSNR} = 10 \log_{10} \frac{\mathrm{MAXVAL}^2}{\mathrm{MSE}}$$

where MAXVAL is the maximal pixel value of the original image, and MSE is the mean square error calculated between the original image and the coded image.
Fig. 3 shows how PSNR and coding time change at depth 0 when we vary threshold T0 from 0 to 1000 in steps of 100. When T0 is increased from 0 to around 300, PSNR either decreases continuously (People) or stays relatively stable (Landscape). When T0 goes above 300, PSNR may change in different directions for the two types of images. Coding time, on the other hand, decreases with increasing threshold over the entire range for both images. Balancing PSNR against coding time, we consider threshold T0 = 300 acceptable for depth 0.
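As a small illustration of the quality metric, the sketch below computes MSE and PSNR for two images stored as lists of rows. The function names are hypothetical, and the default `maxval=255` assumes 8-bit samples; pass a different value if MAXVAL is taken from the image itself.

```python
import math


def mse(original, coded):
    """Mean square error between two equally sized 2-D images."""
    total = 0
    count = 0
    for row_o, row_c in zip(original, coded):
        for o, c in zip(row_o, row_c):
            total += (o - c) ** 2
            count += 1
    return total / count


def psnr(original, coded, maxval=255):
    """PSNR in dB; returns infinity when the two images are identical."""
    err = mse(original, coded)
    if err == 0:
        return math.inf
    return 10 * math.log10(maxval ** 2 / err)


# Hypothetical 2x2 images that differ by one gray level everywhere (MSE = 1).
orig_img = [[100, 100], [100, 100]]
coded_img = [[101, 101], [101, 101]]
quality = psnr(orig_img, coded_img)
```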
Similarly, thresholds T1 for depth 1 and T2 for depth 2 are selected to be 800 and 2800, respectively.

Simulation Results
In simulation experiments, our algorithm is evaluated in terms of PSNR and coding time on 6 standard test images in comparison with HM-16.7, where QP is set to 22, 27, 32 and 37 for both methods.
Table 1 and Table 2 show the coding time and PSNR, respectively, for HM, while Table 3 and Table 4 show the values for our algorithm. Comparing Table 1 with Table 3, our proposed algorithm is much faster than HM. On the other hand, Table 2 and Table 4 show that the PSNR of our algorithm is slightly lower than that of HM. For easy comparison, Table 5 and Table 6 show the differences in coding time (saving) and in PSNR between the two methods. Table 5 demonstrates that, over all the images, our algorithm saves between 44.5% (Sailboat, QP: 37) and 68.1% (Splash, QP: 22) in coding time, or about 58% on average. Table 6 shows that the PSNR of the proposed algorithm is close to that of HM, with an average difference of about 0.23% across all the QP selections.
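The time saving reported in Table 5 is presumably the relative reduction in coding time with respect to HM. The sketch below shows that calculation under this assumption; the function name and the timings are hypothetical and are not values taken from the tables.

```python
def time_saving_percent(t_hm, t_proposed):
    """Relative coding-time reduction with respect to HM, in percent."""
    return 100 * (t_hm - t_proposed) / t_hm


# Hypothetical timings in seconds, for illustration only.
saving = time_saving_percent(10.0, 4.0)
```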
To further understand the visual characteristics of the differences between the two coding methods, we compared the images coded by HM and by our algorithm. Fig. 4 shows the result for Splash with QP=37 using HM and our algorithm. For this case, Table 5 and Table 6 show that our algorithm achieves a time saving of 67% at the cost of a 0.30 dB decrease in PSNR compared to HM. When Splash is enlarged as shown in Fig. 5, we can see that the edges of the splash encoded with our algorithm are blurred (blue circle). However, for the same enlarged part encoded with QP=22, shown in Fig. 6, our algorithm gives the same clear edges (green circle) as HM. A similar trend in quality for different QP values is also observed in the House image shown in Fig. 7 to Fig. 9. The results suggest that, compared to HM, our algorithm loses some minor details in images encoded with larger QP, whereas for images encoded with smaller QP the visual differences are negligible. Overall, for the most commonly selected QP values, our algorithm achieves almost the same visual quality as HM.

Conclusions
In this paper we proposed a novel CU division algorithm that reduces the computational complexity of HM while keeping coding quality similar to that of HM. Our algorithm is simple and fast, and achieves visual quality similar to HM by using only a few pixels around the CU. The experimental results demonstrate that our algorithm is effective and can be used in HEVC intra prediction. As a future improvement, one could take into account the CUs for which division decisions have already been made, together with the fact that motion is often spatially clustered, to further reduce coding time.

Fig. 3 Determination of threshold T0 based on two test images

Table 3. Coding time of proposed algorithm (s)

Table 5. Time saving of proposed algorithm (%)

Table 6. Difference in PSNR between the two methods