This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Task modules Partitioning, Scheduling and Floorplanning for Partially Dynamically Reconfigurable Systems Based on Modern Heterogeneous FPGAs

Bo Ding Department of Electronic Science and TechnologyUniversity of Science and Technology of ChinaHefeiChina [email protected] Jinglei Huang Department of Electronic Science and TechnologyUniversity of Science and Technology of ChinaHefeiChina [email protected] Junpeng Wang Department of Electronic Science and TechnologyUniversity of Science and Technology of ChinaHefeiChina [email protected] Qi Xu Department of Electronic Science and TechnologyUniversity of Science and Technology of ChinaHefeiChina [email protected] Song Chen Department of Electronic Science and TechnologyUniversity of Science and Technology of ChinaInstitute of Artifcial Intelligence, Hefei Comprehensive National Science CenterHefeiChina [email protected]  and  Yi Kang Department of Electronic Science and TechnologyUniversity of Science and Technology of ChinaInstitute of Artifcial Intelligence, Hefei Comprehensive National Science CenterHefeiChina [email protected]
Abstract.

Modern field programmable gate array(FPGA) can be partially dynamically reconfigurable with heterogeneous resources distributed on the chip. And FPGA-based partially dynamically reconfigurable system(FPGA-PDRS) can be used to accelerate computing and improve computing flexibility. However, the traditional design of FPGA-PDRS is based on manual design. Implementing the automation of FPGA-PDRS needs to solve the problems of task modules partitioning, scheduling, and floorplanning on heterogeneous resources. Existing works only partly solve problems for the automation process of FPGA-PDRS or model homogeneous resource for FPGA-PDRS. To better solve the problems in the automation process of FPGA-PDRS and narrow the gap between algorithm and application, in this paper, we propose a complete workflow including three parts, pre-processing to generate the list of task modules candidate shapes according to the resources requirements, exploration process to search the solution of task modules partitioning, scheduling, and floorplanning, and post-optimization to improve the success rate of floorplan. Experimental results show that, compared with state-of-the-art work, the proposed complete workflow can improve performance by 18.7%, reduce communication cost by 8.6%, on average, with improving the resources reuse rate of the heterogeneous resources on the chip. And based on the solution generated by the exploration process, the post-optimization can improve the success rate of the floorplan by 14%.

Heterogeneous resources, Partition, Schedule, Floorplan, RRs, PDRS, ILP
ccs: Hardware Partitioning and floorplanningccs: Hardware Physical design (EDA)

1. Introduction

With the emergence of intelligent fields such as big data, cloud computing, edge computing, neural network, and others, the applications may be compute-intensive or memory-intensive and therefore put forward increasingly requirements on the ability of hardware platforms (liu2019survey, ). During so many hardware platforms, the platforms that can be reconfigurable have drawn much attention from industry and academia because of possessing lots of superiority in the design of accelerators(wang2017reconfigurable, ). The reconfigurable hardware platform FPGA has the characteristics of software definition and hardware realization, has high flexibility, and can significantly shorten the design cycle compared to ASIC (vipin2018fpga, ). Therefore, FPGA with reconfigurable technology can be the hardware platform to face the challenge that various computing applications may be compute-intensive or memory-intensive.

Traditionally, we design and implement circuits and then deploy these circuits on FPGA, which is a static way to speed up applications. When the FPGA is running, we have to power off the FPGA and configure it to new circuits if we need to perform new functions. With the development of FPGA technology, dynamic reconfiguration (DR) has been developed with increasing flexibility. DR brings higher-level automation to reconfigure FPGA by changing predetermined functions at run-time. However, the entire chip must be reconfigured to implement new circuits each time with DR technology. Thus, significant reconfiguration overhead is incurred because of loading the configuration file each time(hauck2010reconfigurable, ). Several techniques have emerged to reduce the impact of reconfiguration overhead, such as module reuse, configuration prefetching, and partial dynamic reconfiguration(PDR). During these technologies, PDR can optimize performance, reduce power consumption, and improve design flexibility because PDR technology can partition the chip into several reconfigurable regions and configure them into corresponding circuits. These regions can be executed in parallel without affecting each other(koch2012partial, ).

1.1. Related Works

The design automation of partially dynamically reconfigurable systems involves the problems of task modules partitioning, scheduling, and floorplanning on heterogeneous resources. The analysis of these problems is described in Section 2.3. Many previous works have been committed to designing algorithms to implement the automation of partially dynamically reconfigurable systems. However, these works are more or less impaired. And Table 1 lists part representative works.

Some works have focused on partition and floorplan of PDRS to decrease wire length and improve resources utilization. Bolchini et al. (bolchini2011automated, ) proposed a floorplanner to place reconfigure regions(RRs). This work is based on the SA algorithm, and the objective is to improve the utilization of heterogeneous resources of the PDRS with optimizing wire length. However, this work only considers the floorplan of reconfigurable regions and only solves part problems of the design automation of PDRS. P. Banerjee et al.(banerjee2010floorplanning, ) proposed a three-stages heuristic method to partition and floorplan task modules on heterogeneous resources of PDRS. This work limits that each reconfigurable region deploys one task module in each partition. However, on a modern FPGA chip with PDR technology, each reconfigurable region can deploy multiple task modules, thus reducing control and communication costs and avoiding congestion. Yao et al. (yao2022fast, ) proposed the fast search and efficient algorithm for the floorplan of task modules on modern heterogeneous FPGAs with increasing both floorplan speed and success rate. However, this work only considers the cost of resource utilization and ignores other costs, such as the cost of wire length. These works (bolchini2011automated, ; banerjee2010floorplanning, ; yao2022fast, ) do not consider the task modules scheduling that affects the performance of PDRS. Besides, some other excellent works also only consider the partition and floorplan in PDRS but don’t consider schedule. (rabozzi2015floorplanning, ; liu2013resource, ; montone2010placement, ; singhal2006multi, ).

And some other works have focused on partition and schedule. Ganesan et al. (ganesan2000integrated, ) proposed a reconfigurable processor system with two reconfigurable regions for performance acceleration. This work exploits the characteristics of PDR to hide the configuration overhead and improve the parallelism to speed up. However, modern FPGA chips with PDR technology can support multiple reconfigurable regions, and thus this problem is more complicated. Rana et al.(rana2009minimization, ) proposed a reconfigurable overhead minimization method based on the analysis of the communication graph. This work tries to group task modules that require simultaneous reconfguration into the same reconfigurable region. However, the designer must appoint the number of the reconfigurable regions manually. R. Cordone et al. (cordone2009partitioning, ) proposed an integer linear programming (ILP) model and a heuristic method for task graph partitioning and scheduling. They exploit the characteristics of task module reuse and configuration prefetching to minimize reconfiguration overhead. (jiang2007temporal, ; purgato2016resource, ; tang2020partitioning, ) also contribute to task module partitioning and scheduling in PDRS.

However, task modules partitioning affects schedule and floorplan, and these three factors significantly impact the performance, resources utilization, and energy consumption of PDRS. And some works have focused on solving the problems of task modules partitioning, scheduling, and floorplanning of PDRS integrated. P. Yuh et al. (yuh2004temporal, ; yuh2007temporal, ) modeled the task modules as three-dimensional (3D) boxes and proposed an SA-based 3D floorplanner to floorplan and schedule task modules. However, assuming task modules can be reconfigured at any time and in any region may not match practically reconfigurable architectures. E. A. Deiana et al.(deiana2015multiobjective, ) proposed a scheduler based on mixed-integer linear programming (MILP). The scheduler can schedule and deploy applications on PDRS based on FPGA. And if the scheduler cannot successfully perform, the scheduler will be re-executed until a feasible floorplan is determined. However, the time consumption of the MILP-based approach is impractical for large-scale applications. Chen et al. (chen2018integrated, ) proposed the partitoned sequence triple P-ST to express partition, schedule, and floorplan of task modules for PDRS and used an SA-based search engine with a perturbation method to achieve a higher success rate of floorplan with optimizing schedule time. However, this work models the process of PDRS by modeling homogeneous resource, CLB, and thus simplifies these problems.

These works mentioned above either solve the three problems sequentially or solve only two of the three problems in an integrated framework, or model the process of PDRS by modeling homogeneous resources, thus simplifying these problems.

Table 1. Part representative works related to PDRS
Works Object Partition Floorplan Schedule Method Model
(bolchini2011automated, ) RR \surd SA-based Hetero
(banerjee2010floorplanning, )(yao2022fast, ) Task \surd \surd Heuristic Hetero
(ganesan2000integrated, )(rana2009minimization, ) Task \surd \surd Heuristic Hetero
(cordone2009partitioning, ) Task \surd \surd Heur/ILP Hetero
(yuh2004temporal, )(yuh2007temporal, ) Task \surd \surd SA-based Homo
(deiana2015multiobjective, ) Task \surd \surd MILP Hetero
(chen2018integrated, ) Task \surd \surd \surd SA-based Homo
Our work Task \surd \surd \surd Enum/SA/ILP Hetero
  • RR/Task:Reconfigurable region/Task module.

  • Homo: Model only one kind resource, CLB.

  • Hetero: Model several kinds resources among CLB, BRAM, DSP.

1.2. Our Contributions

In this paper, we propose a complete workflow to find the optimal solution of task modules partitioning, scheduling, and floorplanning on heterogeneous resources based on FPGA-PDRS. This workflow includes pre-processing, the exploration of task modules partitioning, scheduling, floorplanning, and post-optimization. The main contributions of this paper are outlined as follows.

  1. 1)

    In this pre-processing process, we propose the strategy of generating the list of task modules candidate shapes according to resources requirements of each task module. This strategy is based on the enumeration of task module shapes.

  2. 2)

    We change our previous algorithm(chen2018integrated, ) by changing the process of exploring solution space and adding the cost of heterogeneous resource utilization to generate the solution of task modules partitioning, scheduling and floorplanning. Then we integrate the changed algorithm into the complete workflow.

  3. 3)

    After given the partition, schedule, and floorplan of task modules, we build the ILP model to reselect task modules shape from the candidate shapes list to improve the success rate of floorplan.

The experimental results demonstrate the efficiency and effectiveness of the proposed algorithms. The remainder of the paper is organized as follows. Section 2 not only describes the hardware architecture and the process of FPGA-PDRS but also describes the definition and analysis of problems. Section 3 discusses the overall design workflow of generating the partition, schedule, and floorplan of task modules for FPGA-PDRS. Section 4 describes the proposed algorithms of pre-processing and post-optimization. The experimental results and conclusions are shown in Section 5 and Section 6, respectively.

2. Problem Description

2.1. FPGA-based Partially Dynamically Reconfigurable Architecture

2.1.1. Heterogeneous resources on FPGA chip

In modern FPGAs, the mainstream products are Xilinx devices. And there are mainly three kinds of programmable resources on an FPGA chip, configurable logic block (CLB), block random access memory (BRAM), and digital signal processing (DSP) unit.

As shown in Fig. 1, these heterogeneous resources are organized into irregular columns and distributed on the chip. Fig. 1 use the chip XC7VX485TXC7VX485T as an example (vc707, ). XC7VX485TXC7VX485T is the chip of evaluation kit VC707, which belongs to the family of Xilinx Virtex-7 series. In XC7VX485TXC7VX485T, there exists 146 column heterogeneous resources, including 111 columns of CLB, 15 columns of BRAM and 20 columns of DSP. Each column resource of CLB has 350 rows of tiles, while each column resource of BRAM or DSP has only 140 rows of tiles.

(1) ColBRAM={5,11,23,29,37,48,66,77,88,99,110,118,124,136,142}\displaystyle Col_{BRAM}=\{5,11,23,29,37,48,66,77,88,99,110,118,124,136,142\}
(2) ColDSP={14,20,26,34,40,45,51,63,69,74,80,85,91,96,102,107,113,121,127,133}\displaystyle\begin{split}&Col_{DSP}=\{14,20,26,34,40,45,51,63,69,74,80,85,91,96,102,107,113,121,127,133\}\end{split}

To model the heterogeneous resources of an FPGA chip, we use ColBRAMCol_{BRAM} and ColDSPCol_{DSP} to record the column position of the resource BRAM and DSP, respectively. For XC7VX485TXC7VX485T, the value of ColBRAMCol_{BRAM} and ColDSPCol_{DSP} is shown by Equation (1) and Equation (2), respectively. Then we can define the coordinate of each tile. For example, the coordinate of tile filled with black is (1,1)(1,1) and (16,350)(16,350), respectively in Fig. 1.

Refer to caption
Figure 1. The column structure of heterogeneous resources on the FPGA chip (Eg.XC7VX485T).

2.1.2. The process of partially dynamically reconfiguration based on FPGA

On FPGAs with PDR technology, the device can be partitioned into several reconfigurable regions. Each reconfigurable region can be reconfigured into different circuits at different time intervals. Here regard these time intervals of a reconfigurable region as time layers. To partition each task module into the related reconfigurable region and determine the related configurable time is highly complex. However, there is no complete automation tool to settle this problem, and the design of a partially dynamically reconfigurable system still depends on the human experience.

As shown in Fig. 2, there are reconfigurable regions DRRADRR_{A}, DRRBDRR_{B} and so on. For reconfigurable regions, there are corresponding configuration files, respectively. For example, reconfigurable region DRRADRR_{A} has configuration files A1.bitA_{1}.bit, A2.bitA_{2}.bit and so on, while reconfigurable region DRRBDRR_{B} has configuration files B1.bitB_{1}.bit, B2.bitB_{2}.bit and so on. These configuration files are stored in external memory and will be transferred on chip at different times. When the commands of configuration control are running, these configuration files will be transferred on chip by the ICAP interface (xilinx_pr, ). Then corresponding dynamic reconfigurable regions will be configured into corresponding circuits. As for different reconfigurable regions with corresponding configuration circuits at different times, we have the following definitions:

Definition 1.

DRR={drri|1iNDRR}{DRR}=\{drr_{i}|1\leq i\leq N_{DRR}\} , NDRRN_{DRR} is the number of reconfigurable regions.

Definition 2.

TL={(tlij|1iNDRR,1jNitl}{TL}=\{(tl_{i}^{j}|1\leq i\leq N_{DRR},1\leq j\leq N_{i}^{tl}\}, NitlN_{i}^{tl} is the number of time layers in reconfigurable region drridrr_{i}.

Refer to caption
Figure 2. Partial dynamic reconfigurable process based on FPGA.

2.2. Problem definition

To deploy a large-scale computing application to a partially dynamically reconfigurable system based on FPGA, we need to divide the application into several task modules and then partition, schedule and floorplan these task modules. We define the set of nn task modules as M={mi|1in}M=\{m_{i}|1\leq i\leq n\}.

Implementing the reconfiguration process requires synthesizing task modules. The synthesizing results will provide the resources requirements of task modules, which mainly include the demand for Lookup Table(LUT), Flip-Flop(FF), BRAM, and DSP. While CLB is composed of LUT and FF, we can regard the synthesizing results as the demand for CLB, BRAM, and DSP. The synthesizing and simulation results will also provide the execution time of each task module. In addition, to model the deployment of task modules in FPGA-PDRS, we add configuration time and shape to the task module. These notations and meaning related to the factors of task module and the chip are shown in Table 2.

Table 2. Notations and meaning related to the factors of task module and chip
Notation Meaning
mim_{i} task module i
CLBiCLB_{i} task module mim_{i}’s resource requirement for CLB
BRAMiBRAM_{i} task module mim_{i}’s resource requirement for BRAM
DSPiDSP_{i} task module mim_{i}’s resource requirement for DSP
widthiwidth_{i} the width of task module mim_{i}
heightiheight_{i} the height of task module mim_{i}
execiexec_{i} the execution time of task module mim_{i}
conficonf_{i} the configuration time of task module mim_{i}
WidthWidth the column number of the chip
HeightHeight the row number of the chip

There are data dependencies among these task modules, which are defined by a task graph TG=(VTG,ETG)TG=(V_{TG},E_{TG}) VTGV_{TG} represents the set of task modules (=M)(=M). ETGE_{TG} represents data dependencies among task modules. The definition of ETGE_{TG} is as follows:

Definition 3.

ETG={edgei,jk=(mi,mj)k|1i,jn,ij,kNedge}E_{TG}=\{edge_{i,j}^{k}=(m_{i},m_{j})^{k}|1\leq i,j\leq n,i\neq j,k\leq N_{edge}\}. NedgeN_{edge} means the number of data dependencies among task modules; edgei,jkedge_{i,j}^{k} means data produced by mim_{i} will be used by mjm_{j} and is marked as the k-th data dependency.

To ensure that the floorplan on heterogeneous resources is feasible, we consider the following constraints:

  1. (1)

    The region occupied by the task module on the chip must meet the resources requirements of CLB, BRAM, and DSP.

  2. (2)

    The task modules in the same time layer must be nonoverlapped and placed within their corresponding reconfigurable region. And the reconfigurable regions must be placed without overlapping each other.

  3. (3)

    The whole part of the PDRS needs to meet the boundary constraints, the width and the height of PDRS cannot exceed the width and the height of the chip, respectively.

In summary, we define the input, output, and objective of this study as follows:

Input: The graph of data dependency among task modules and related information of task modules.

Output: The solution of task modules partitioning, scheduling, and floorplanning in FPGA-PDRS.

Objective: Maximize the performance with improving the utilization of heterogeneous resources and reducing the communication cost.

2.3. Analysis of problems

2.3.1. Analysis of task modules partitioning, scheduling and floorplanning

Partition involves determining which task modules are in the same time layer of a reconfigurable region. Schedule is to determine the configuration order of time layers with the task modules inside. Floorplan is to determine the coordinates of reconfigurable regions and task modules.

As shown in Fig. 3, these three problems are NP-hard(murata2003rectangle, )(hartmanis1982computers, ). They affect each other and affect the area, communication cost and performance of PDRS. Task modules partitioning, scheduling, and floorplanning need to search for a proper solution that makes a trade-off between the factors such as area, performance, and energy consumption.

Refer to caption
Figure 3. Partition, Schedule and Floorplan problems are NP-hard, affect each other and PDRS.

2.3.2. The shape of task module

As described in Section 2.1.1, the on-chip programmable resources of FPGA are the column structure of CLB, BRAM and DSP. And many previous works related to partition, schedule and floorplan are based on the given shape or the given area of task modules or reconfigurable regions. However, after synthesizing each task module by a logic synthesis tool such as Vivado  (Viavdo, ), the synthesizing results we obtain are the resources requirements of each task module. Moreover, the column structure of the FPGA chip is irregular. The shapes of task modules mean the utilization of heterogeneous resources and are related to the location and resources requirements. For example, assuming there is a task module m1m_{1}. Its resources requirements for CLB, BRAM, and DSP are 20, 4, and 0, respectively. As shown in Fig. 1, shape1shape_{1} and shape2shape_{2}, both can meet the resources requirements of task module m1m_{1}, are placed from coordinates (4,0)(4,0) and (6,0)(6,0), respectively. The shape of shape1shape_{1} is 5105*10 with the utilization of resources CLB, BRAM and DSP are 20/40,4/4,020/40,4/4,0, respectively. While the shape of shape2shape_{2} is 858*5 with the utilization of resources CLB, BRAM and DSP are 20/30,4/4,020/30,4/4,0, respectively. It is obvious that shape2shape_{2} is better than shape1shape_{1} if only in terms of resources utilization or area.

There is no direct relationship between resources requirements and the shape of task module, and resource utilization can only be determined after determining the shape and the position of task module. The difficulty of this problem is to find the appropriate shapes and position according to each task modules’ resources requirements to ensure heterogeneous resources utilization.

2.3.3. The relationship between the shape of task modules and the shape of PDRS

As described in the Section 2.1.2, on FPGA chip with PDR technology, the on-chip region is divided into one or several dynamically reconfgurable region(s). Moreover, the dynamically reconfigurable region is rectangular on FPGA chip, which is as shown in Fig. 4. Each dynamically reconfgurable region may have one or several time layer(s). Different time layers implement different functionalities and execute at different times. And each time layer may have one or several task module(s).

There is no direct relationship between shape of task modules and the shape of PDRS. In general, the shape of the task module affects the shape of its time layer, and the shape of this time layer affects the shape of the reconfigurable region that it belongs. Similarly, the shapes of the reconfigurable regions affect the shape of the whole PDRS. For example, as shown in Fig. 4, dynamically reconfigurable region drr3drr_{3} has two time layers. Time layer tl31tl_{3}^{1} has two task modules m1m_{1} and m2m_{2} while time layer tl32tl_{3}^{2} has three task modules m3m_{3}, m4m_{4} and m5m_{5}. The shapes of time layers tl31tl^{1}_{3} and tl32tl^{2}_{3} are affected by the shapes of task modules that belong to them, respectively. The shapes of time layers tl31tl^{1}_{3} and tl32tl^{2}_{3} also affect the shape of dynamically reconfigurable region drr3drr_{3} and thus affecting the shape of the whole PDRS. The difficulty of this problem is that there is no direct relationship between the shapes of task modules and the shape of PDRS.

Refer to caption
Figure 4. Different time layers in the same drr with different utilization of heterogeneous resources.

3. Overall Design Flow

Fig. 5 shows the complete workflow, and the colored parts are the contributions of this work. This workflow includes three parts, pre-processing, the exploration of task modules partitioning, scheduling, floorplanning, and post-optimization.

Refer to caption
Figure 5. The complete workflow of generating task modules partition, schedule, floorplan on heterogeneous resources based on FPGA-PDRS.

The pre-processing involves generating the corresponding shape of the task module according to its resource requirements for CLB, BRAM, and DSP. Due to the irregular column structure of heterogeneous resources on the chip, the generated shape needs to meet the resource requirements of the task module at each position on the chip. Different shapes mean different utilization of heterogeneous resources and affect the quality of the solution. Therefore, in this process, we generate a candidate shape list of each task module for the process of solution space exploration and post-optimization. Each shape in this shape list has a different width with corresponding height and can meet the resource requirements of the task module at each position on the chip. This is illustrated in Section 4.1 in detail. And then, we initialize a temporary shape for each task module from its candidate list by minimum area as the input of exploration process.

In the process of exploring solution space, we change our previous work (chen2018integrated, ), which proposed the partitioned sequence triple to express partition, schedule, and floorplan of task modules and generate the optimized solution of PDRS but modeling homogeneous resource with simplifying these problems. This work is SA-based, and only considers the area cost, scheduling cost, and communication cost.

The exploring process use the partitioned sequence triple P-ST(PS, QS, RS), which was proposed by (chen2018integrated, ) to represent the partition, schedule, and the floorplan of task modules in PDRS. In P-ST (PS,QS,RS), (PS,QS) is a hybrid nested sequence pair (HNSP) and represent the spatial and temporal partitions, as well as the floorplan, RS is the partitioned dynamic configuration order of the time layer with its task modules.

In the exploring process, we first generate an initial solution and then search the optimal solution through the SA-based search engine. In each iteration process of the SA-based algorithm, we delete a task module from P-ST and then find a new location for insertion to generate a new solution. In the process of finding the optimal insertion location, there are two parts: rough evaluation and accurate evaluation. The rough evaluation evaluates the cost of scheduling time and area in linear time for all feasible insertion points, and then accurate evaluation will evaluate the selected insert locations from rough evaluation by recalculating the cost to find the best solution for this disturbance.

In the rough evaluation phase, when we evaluate the area cost of each insert point, we evaluate all the shapes in the candidate shape list of this deleted task module for the current insert point. And we choose the shape of this deleted task module that minimizes the area of the whole design of PDRS to evaluate this insert point. In this way, we explore the possibility of more floorplan solutions. And because the task module shape may be changed, the area cost may be reduced, thus making the solution with greater probability tends to optimize the scheduling time and communication cost.

In the accurate evaluation phase, we calculate the resource usage of the reconfigurable regions and calculate the cost of heterogeneous resource utilization and add this cost to the total cost. The cost of heterogeneous resource utilization is calculated by Equation (4). In Equation (4), CLBhaveCLB_{have}, BRAMhaveBRAM_{have}, and DSPhaveDSP_{have} mean the resources that the chip has, CLBuseCLB_{use}, BRAMuseBRAM_{use}, and DSPuseDSP_{use} mean the resources that reconfigurable regions use. Then the total cost function is shown by Equation (3), and all cost values are normalized.

In Equation (3), the objection function CosttotalCost_{total} is defined as a linear combination of schedule length, area cost, communication cost, and the cost of heterogeneous resources utilization. The area cost costareacost_{area} includes the cost of exceeding the boundary of the chip. Decreasing this cost means trying to make the floorplan of PDRS can meet the boundary constraint. The cost costschedulecost_{schedule} is the schedule time of PDRS, which means the performance and is the critical path from the beginning of the configuration process to the end of the execution of all task modules. The cost costcommcost_{comm} is the communication cost of the PDRS, which means the cost of data transfer. And the cost costheterocost_{hetero} means the utilization of heterogeneous resources, decreasing this cost means trying use more heterogeneous resources on the chip and thus improving the performance and reducing the communication cost of PDRS.

(3) Costtotal=\displaystyle Cost_{total}= α×costarea+β×costschedule+γ×costcomm+λ×costhetero\displaystyle\alpha\times cost_{area}+\beta\times cost_{schedule}+\gamma\times cost_{comm}+\lambda\times cost_{hetero}
(4) costhetero=CLBhaveCLBuse+BRAMhaveBRAMuse+DSPhaveDSPuse\displaystyle cost_{hetero}=\frac{CLB_{have}}{CLB_{use}}+\frac{BRAM_{have}}{BRAM_{use}}+\frac{DSP_{have}}{DSP_{use}}

After several iterations of the SA-based search engine, the framework generates the solution of task modules partitioning, scheduling, and floorplanning for the PDRS. The solution means that task modules’ partition, schedule, and floorplan of PDRS are determined. As shown in Fig. 4, each task module’s time layer and each time layer’s reconfigurable region are determined. The relative position relationships among task modules and the relative position relationships among reconfigurable regions are also determined.

To better represent the results of partition, schedule, and floorplan of task modules, we have the following definitions.

Definition 4.

If task module mim_{i} is partitioned into reconfigurable region drrjdrr_{j}, then drr(mi)=drrjdrr(m_{i})=drr_{j} .

Definition 5.

If task module mim_{i} is partitioned into the k-th time layer of reconfigurable region drrjdrr_{j}, then tl(mi)=tljktl(m_{i})=tl_{j}^{k} .

And we can use the Definition. 4 and Definition. 5 to express the relationship of partition between task modules. If drr(mi)==drr(mj)drr(m_{i})==drr(m_{j}), task modules mim_{i} and mjm_{j} are partitioned into the same reconfigurable region. And in contrast, if drr(mi)drr(mj)drr(m_{i})\neq drr(m_{j}), the two task modules are in different reconfigurable regions. If tl(mi)==tl(mj)tl(m_{i})==tl(m_{j}), task modules mim_{i} and mjm_{j} are partitioned into the same time layer of a reconfigurable region. If drr(mi)==drr(mj)drr(m_{i})==drr(m_{j}) and tl(mi)tl(mj)tl(m_{i})\neq tl(m_{j}), task modules mim_{i} and mjm_{j} are partitioned into the same reconfigurable region but different time layers.

In the process of exploring the solution space, there are the costs of schedule length, area, communication, and the utilization of heterogeneous resources. The process of exploring the solution space will make a trade-off among these costs and may generate a solution that the width or height of PDRS exceeds the boundary of the chip, which means the solution is infeasible.

If this solution is not feasible due to boundary constraint, we can try to make it meet the solution by the process of post-optimization. Based on these information of task modules’ partition, schedule, and floorplan and the candidate shape list that are generated in the process of pre-processing, we can reselect the shapes for all task modules to make the floorplan of PDRS meet the boundary constraint and thus improving the success rate of floorplan. In the process of post-optimization, we build an ILP model to reselect the shapes of task modules to improve the floorplan success rate of PDRS. The ILP model constrains the shape of task modules and the position relationship among task modules under the given solution of task module partition, schedule, and floorplan. Then we add the boundary constraints to this model and set the objective function of the model. Finally, we use the solver Gurobi(gurobi, ) to solve this model. The ILP model is illustrated in Section 4.2 in detail.

In this complete workflow, the pre-processing process, solution space exploration process, and post-optimization process are three relatively independent processes. These three processes affect the utilization of heterogeneous resources, the quality of the solution, and the floorplan success rate of the solution. After these three processes, we get the shape of each task module and the optimized solution for task module partitioning, scheduling, and floorplanning of PDRS.

4. Proposed Algorithms

4.1. Generation of Task Modules Shape List

Each task module has the demands for heterogeneous resources CLB, BRAM, and DSP on the chip. For the task module mim_{i}, the demand is denoted as [CLBi,BRAMi,DSPi][CLB_{i},BRAM_{i},DSP_{i}]. The irregular column structure on the chip and undetermined position of task modules make generating the width and height of task module mim_{i}, which are denoted as widthiwidth_{i} and heightiheight_{i}, respectively, a tough problem.

The shape of the task module mim_{i} is larger will reduce heterogeneous resources utilization. In contrast, the smaller shape of the task module mim_{i} may not meet the resource requirements in a certain position, thus inducing infeasible floorplan. To make the shape of task module mim_{i} meet the resource requirements at each position on the chip, the shape needs to move along the x-axis and meet the resource requirements in each movement.

Algorithm 1 Generation of task modules shape List
Input:Module resources requirements, and the irregular column structure of chip.
Output:The candidate list with NN candidate shapes of each task module.
M \leftarrow task module set; L \leftarrow \emptyset. \triangleright L : The set of candidate shape list
for mim_{i} in M do
      widthinitwidth_{init} \leftarrow max{CLBi/HeightCLB,BRAMi/HeightBRAM,DSPi/HeightDSP}\max\{CLB_{i}/Height_{CLB},BRAM_{i}/Height_{BRAM},DSP_{i}/Height_{DSP}\}
     NshapeiN_{shape}^{i} \leftarrow Width - widthinitwidth_{init} \triangleright NshapeiN_{shape}^{i}: the shape number of mim_{i}; Width : The width of chip
     for j \leftarrow 0 to NshapeiN_{shape}^{i} do
         widthijwidth_{i}^{j} \leftarrow widthinitwidth_{init}
         heightijheight_{i}^{j} \leftarrow Height \triangleright Height : The height of chip
         FlagmoveFlag_{move} \leftarrow Check feasible in each position of the chip
         if FlagmoveFlag_{move} == 0 then
              Continue
         else
              while FlagmoveFlag_{move} == 1 do
                  heightijheight_{i}^{j} \leftarrow heightijheight_{i}^{j} - 1
                  FlagmoveFlag_{move} \leftarrow Check feasible in each position of the chip
                  if FlagmoveFlag_{move} == 0 then
                       heightijheight_{i}^{j} \leftarrow heightijheight_{i}^{j} + 1
                       Break
                  end if
              end while
         end if
         shapeijshape_{i}^{j} \leftarrow [widthij,heightij][width_{i}^{j},height_{i}^{j}]
         if aspect ratio of shapeijshape_{i}^{j} \leq γ\gamma then
              Li.append(shapeij)L_{i}.append(shape_{i}^{j})
         end if
         LiL_{i} \leftarrow remove shapes with the same height but larger width
         LiL_{i} \leftarrow Select top nn shapes from LiL_{i} according to area of shape
     end for
end for

Since the irregular column structure on the chip is continuous, we adopt an enumeration strategy. The pseudo code for generating the candidate shape list is shown in Algorithm. 1. We traverse all task modules and all the possible widths of each task module on the chip. And for each width of task module mim_{i}, we find a height that can let the task module meet the resource requirements at all positions on the chip. Specifically, under each width of task module mim_{i}, if the maximum height can meet the resource requirements at each position on the chip, we reduce the height one by one until it cannot meet this condition. Since we use the enumeration strategy of increasing the width and determining the minimum height under each width, some shapes with the same height but different widths may appear. Those shapes with smaller widths are better shapes. Thus, we remove those shapes with the same height but larger width from the list. Then if the aspect ratio of the shape is less than γ\gamma, we add this shape to the candidate shape list. Since (chen2008fixed, ) interpret intuitively and show by experiments that an aspect ratio close to one is beneficial to wirelength, we set γ\gamma to 1.5 for optimization of communication cost and ensure a certain solution space margin.

Finally, according to area, we choose the top nn shapes from this list as the final candidate shape list. The value of nn is related to the complexity of the exploration process and the process of post-optimization. A relatively larger value of nn means that the exploration process can explore more floorplan solutions and better optimize schedule time and communication cost. Also, the larger value of nn provides a larger solution space of reselecting task modules shapes for the process of post-optimization and can better improve the floorplan success rate of the solution. And in this work, we empirically set nn to 10.

After generating the candidate shape list for each task module, we initialize the shape of each task module from the corresponding list. In this paper, we use the shape with minimum area as the temporary shape of the task module. Then we use these initialized task modules shape as the input of the exploration process to generate the solution of task modules partitioning, scheduling, and floorplanning of PDRS.

4.2. Reselection of Task Modules Shape

The process of exploring the solution space reduces the total cost in each iteration. And to ensure the performance of PDRS, the floorplan of the generated solution may exceed the boundary, which is like the left figure of Fig. 6(b). As described in Section 2.3.3, there are no direct relationships between the task module shapes and the shape of the whole PDRS, and the shape of the task module affects the shape of its time layer and thus affects the shape of the reconfigurable region. To make the floorplan of this solution feasible, we can reselect the task modules’ shapes from the candidate shape lists, and then change the shape of reconfigurable regions thus generating the feasible floorplan like the right figure of Fig. 6(b).

As shown in Fig. 6(a), assuming there are mm task modules, and there are nn shapes in the list of each task module. Then the solution space size is nmn^{m}. Therefore we establish a ILP model to solve it and try to find feasible floorplan of the solution through the solver Gurobi. The ILP model involves three-part constraints, one is the constraint of task modules shape, one is the constraint of the position relationship among task modules, and one is the boundary constraint of the whole design of PDRS.

Refer to caption
((a))
Refer to caption
((b))
Figure 6. (a)The matrix of candidate shape list for task modules. (b) The example of reselect shape to meet boundary constraint.

We use a matrix to represent the shape selection of task modules, which likes Fig. 6(a) and denoted as MSMS. If task module mim_{i} use shape sjs_{j}, then we denote this by MS[i,j]=1MS[i,j]=1. A task module can only have one shape, so we have the following constraint for selecting the shape of each task module. This constraint is shown by Equation (5), and in Equation (5), mm is the number of task modules.

(5) miM,1im,sj,j=1nMS[i,j]=1\forall{m_{i}}\in M,1\leq i\leq m,\sum_{s_{j},j=1}^{n}{MS[i,j]}=1

Then, the width and height of task module mim_{i} can be expresented by linear expression Equation (6) and Equation (7), respectively.

(6) widthi=sj,j=1nMS[i,j]×widthijwidth_{i}=\sum_{s_{j},j=1}^{n}{MS[i,j]\times width_{i}^{j}}
(7) heighti=sj,j=1nMS[i,j]×heightijheight_{i}=\sum_{s_{j},j=1}^{n}{MS[i,j]\times height_{i}^{j}}

After adding constraints to the shape of task modules, we then add constraints to the positional relationship among task modules and finally get the boundary of the whole design of PDRS. In this work, we use P-ST that is proposed in our previous work (chen2018integrated, ) to represent the solution information. P-ST includes the Sequence Pair(SP)(murata1996vlsi, ) to represent floorplan, and the information of reconfigurable regions and time layers are added to SP to express partition.

Algorithm 2 Constraint of position relations in ILP model
iter0iter\leftarrow 0
for mim_{i} in msms do
     Constraint: Matx[iter][mi]Mat_{x}[iter][m_{i}] \geq Matx[iter][mi1]+widthiMat_{x}[iter][m_{i-1}]+width_{i} \triangleright Update linear expressions of horizontal position
     iteriter+1iter\leftarrow iter+1
     for mjm_{j} in ps[mi:]ps[m_{i}:] do
         if drr(mi)drr(m_{i}) != drr(mj)drr(m_{j}) or tl(mi)tl(m_{i}) == tl(mj)tl(m_{j}) then
              Constraint: Matx[iter][mj]Mat_{x}[iter][m_{j}] \geq Matx[iter][mi])Mat_{x}[iter][m_{i}]) \triangleright Constrained coordinate of horizontal position relation
         end if
     end for
     for mjm_{j} in psps do
         if drr(mi)drr(m_{i}) != drr(mj)drr(m_{j}) or tl(mi)tl(m_{i}) == tl(mj)tl(m_{j}) then
              Constraint: Matx[iter][mj]Mat_{x}[iter][m_{j}] \geq Matx[iter1][mj])Mat_{x}[iter-1][m_{j}]) \triangleright Store linear expressions to update
         end if
     end for
     Constraint: Maty[iter][mi]Mat_{y}[iter][m_{i}] \geq Maty[iter][mi+1]+heightiMat_{y}[iter][m_{i+1}]+height_{i} \triangleright Update linear expressions of vertical position
     for mjm_{j} in ps[:mi]ps[:m_{i}] do
         if drr(mi)drr(m_{i}) != drr(mj)drr(m_{j}) or tl(mi)tl(m_{i}) == tl(mj)tl(m_{j}) then
              Constraint: Maty[iter][mj]Mat_{y}[iter][m_{j}] \geq Maty[iter][mi])Mat_{y}[iter][m_{i}]) \triangleright Constrained coordinate of vertical position relation
         end if
     end for
     for mjm_{j} in psps do
         if drr(mi)drr(m_{i}) != drr(mj)drr(m_{j}) or tl(mi)tl(m_{i}) == tl(mj)tl(m_{j}) then
              Constraint: Maty[iter][mj]Mat_{y}[iter][m_{j}] \geq Maty[iter1][mj])Mat_{y}[iter-1][m_{j}]) \triangleright Store linear expressions to update
         end if
     end for
end for

All constraints of calculating coordinates are added to these two matrices. By traversing SP and adding constraints to each row of the two matrices, we can get the linear expression of the constraints coordinate of the task modules floorplan. And for the boundary of floorplan of PDRS, we have the following constraints Equation (8) and Equation (9) to limit XmaxX_{max} and YmaxY_{max}, respectively.

(8) valuexMatx[m],valuexXmax,\forall{value_{x}}\in Mat_{x}[m],value_{x}\leq X_{max},
(9) valueyMaty[m],valueyYmax,\forall{value_{y}}\in Mat_{y}[m],value_{y}\leq Y_{max},

The feasible floorplan means that the whole PDRS should be placed within the chip boundary, therefore we add the constraints to the boundary of PDRS. The constraints are shown by Equation (10), and in Equation (10), WidthWidth and HeightHeight are the boundaries of the chip.

(10) XmaxWidth,YmaxHeightX_{max}\leq Width,Y_{max}\leq Height

Finally, we set the objective function of this model as to maximize the value of subtracting between chip boundary and the boundary of the whole design PDRS. This objective is shown by Equation (11).

(11) max{WidthXmax+HeightYmax}\max\{Width-X_{max}+Height-Y_{max}\}

5. Experiments and Results

5.1. Experimental Setup

In this study, all proposed methods were implemented in C/C++ language on a 64-bit Linux workstation (Intel(R) Xeon(R) Gold 6254 CPU @ 3.10 GHz, 125 GB RAM). Since there are no standard benchmarks for the discussed problems, we constructed the benchmarks by using the tool Task Graphs For Free (TGFF)  (1998TGFF, ) to generate a directed acyclic graph(DAG) to represent the task modules and the data dependencies among task modules. We denote this graph as task graph(TG). The information of nodes and edges, which represent task modules and data dependencies, respectively, in TG, is shown in Table 3.

We also generate two benchmarks from two popular convolutional neural network models, AlexNet(krizhevsky2017imagenet, ) and VGG(simonyan2014very, ). The two benchmarks respectively include all the convolution layers, pooling layers, and full connection layers of the two neural networks. The convolution layer and full connection layer mainly include the multiply-add computing unit, and the pooling layer mainly includes the comparator. Therefore, we synthesize the heterogeneous resources usage of the multiply-add computing unit and the comparator, respectively. We use several multiply-add computing units as a convolution kernel and several comparators as a pooling kernel. For the convolution layer and full connection layer with larger computation, we use several convolution cores to perform convolution operation over these feature maps by channel parallel, and for the pooling layer, we use a specified pooling core to perform pooling operations over all the output feature maps. The execution times of task modules are estimated based on a frequency of 200MHz.

Table 3. Benchmark Information
Bench. imp. TG
#V #E VWR (ms)(ms) EWR CPT (ms)(ms) CLB BRAM DSP
t10 1 10 8 (40,55) (20,30) 151.1 (2000,3000) (0,80) (0,80)
2 10 (40,55) (20,30) 259.2 (2500,3500) (20,100) (20,100)
3 12 (40,55) (20,30) 359.1 (3000,4000) (40,120) (40,120)
t30 1 30 71 (40,60) (20,30) 727.6 (2000,3000) (0,80) (0,80)
2 51 (30,350) (60,610) 1450.8 (2500,3500) (20,100) (20,100)
3 72 (40,60) (20,30) 786.3 (3000,4000) (40,120) (40,120)
t50 1 50 78 (40,60) (20,30) 595.9 (2000,3000) (0,80) (0,80)
2 33 (40,60) (20,30) 244.0 (2500,3500) (20,100) (20,100)
3 51 (20,180) (50,350) 464.9 (3000,4000) (40,120) (40,120)
t100 1 100 110 (20,180) (50,350) 703.1 (2000,3000) (0,80) (0,80)
2 134 (20,180) (50,350) 1265.6 (2500,3500) (20,100) (20,100)
3 147 (20,180) (50,350) 580.5 (3000,4000) (40,120) (40,120)
t200 1 200 403 (10,390) (30,770) 1482 (2000,3000) (0,80) (0,80)
2 312 (10,390) (30,770) 3556.2 (2500,3500) (20,100) (20,100)
3 327 (40,60) (20,30) 436.4 (3000,4000) (40,120) (40,120)
AlexNet 17 26 (0.7,7.5) (4,189.1) 34.4 (400,1800) (5,10) (0,8)
VGG 47 108 (5.1,61.7) (4,784) 498.2 (400,1800) (5,10) (0,8)

In Table 3, each benchmark has three different implementations and these implementations has the same number of task modules but different task dependencies. And #V\#V and #E\#E denote the number of nodes and edges in TG, respectively. The column VMRVMR represents the range of random values of execution time for task modules, the column EWREWR means the range of random values of communication among task modules, and the column CPTCPT represents the critical paths of the task graphs, in which the task modules are weighted by execution time. The column CLBCLB, BRAMBRAM, and DSPDSP represent the range of random values of task modules’ resources requirements for CLB, BRAM, and DSP. And in this work, we use one of the most widely used Xilinx Virtex-7 series FPGA chips, XC7VX485T, as the target chip.

5.2. Experimental results and analysis

In this section, we show and analyze the experimental results of the comparison. The complete workflow in this paper includes three parts, pre-processing, the exploration of task modules partitioning, scheduling, floorplanning, and post-optimization. To demonstrate the efficiency and effectiveness of the three processes, we accomplish three comparative experiments.

Firstly, we compare the effect of our method of generating task modules shapes on heterogeneous resources usage with (banerjee2009fast, ). Secondly, we compare the solution generated by our complete workflow with the solution generated by the method Int_PSFInt\_PSF in previous work(chen2018integrated, ) on schedule time, communication cost, floorplan success rate, and heterogeneous resources reuse rate. Thirdly, We compare the improvement of the floorplan success rate through the ILP model in post-optimization with the solution generated by the exploration process.

5.2.1. Comparison of the generated task module shapes

The generation of task modules shape involves the area of task modules and the utilization of heterogeneous resources. P. B et al.(banerjee2009fast, ) found the smallest width on the chip that includes CLB, BRAM and DSP resources and then generated the height of task modules based on this width. However, this method is not conducive to the utilization of resources. To demonstrate the efficiency of our initialization strategy of task module shapes and show the impact on task module area and resource utilization, we have compared these results with (banerjee2009fast, ) and show the experiment results in Table 4. The columns CLBCLB, BRAMBRAM, and DSPDSP in the column NeedResourcesNeed\ Resources mean the number of resource requirements of CLB, BRAM, and DSP for each task module in the corresponding benchmark, respectively, on average. In the columns [34][34] and OurmethodOur\ method, the CLBCLB, BRAMBRAM, DSPDSP, mean the number of corresponding resources to be used on the chip of each task module, respectively, on average. And the column AreaArea means means the total area of the task module shapes in the corresponding benchmark.

The experimental results show that, compared with (banerjee2009fast, ), our task module shapes generation method can reduce the occupied resources while meeting the resource requirements of each position on the chip. And as the average occupied area is decreased by 6%, the occupied resources of CLB, BRAM and DSP are decreased by 6.6%, 13.2%, and 4.7%, respectively, on average.

Table 4. Comparison of the generated task module shapes
Bench. Imp. Need resources (banerjee2009fast, ) Our method
CLB BRAM DSP CLB BRAM DSP Area CLB BRAM DSP Area
t10 1 2388 33 36 2622 139 214 35321 2484 125 206 33507
2 2977 56 62 3282 174 269 44213 3080 154 258 41538
3 3437 76 86 3852 204 315 51889 3555 175 296 47966
t30 1 2454 39 39 2713 144 222 109649 2550 128 212 103181
2 2981 63 56 3268 173 268 132069 3088 154 258 124949
3 3518 84 82 3927 208 321 158707 3637 179 304 147233
t50 1 2569 36 42 2807 149 230 189088 2670 133 222 180042
2 2995 60 58 3291 175 269 221692 3108 155 259 209584
3 3540 79 76 3908 207 320 263226 3660 180 305 246903
t100 1 2455 39 38 2703 144 221 364078 2550 127 212 343912
2 3006 60 60 3316 176 272 446709 3118 155 260 420341
3 3482 75 76 3837 204 314 516952 3600 177 300 485864
t200 1 2491 41 41 2743 146 225 739157 2588 129 215 698020
2 2962 61 60 3284 174 269 884697 3073 153 257 828717
3 3511 83 81 3917 208 321 1055412 3636 179 303 981057
AlexNet (V:17) 1573 5 5 1714 91 141 3925 1650 83 139 37828
VGG       (V:47) 1639 5 5 1787 95 147 113107 1719 86 144 108956
Cmp. 2984 59 60 3298 175 270 347524 3093 154 258 326188
1 1 1 1 93.4% 86.8% 95.3% 94.0%

5.2.2. Comparison of generating the solution of task modules partition, schedule and floorplan

The method Int_PSFInt\_PSF in (chen2018integrated, ) can be used to explore the solution space to search the optimal solution of task modules partitioning, scheduling, and floorplanning on the PDRS with modeling the chip with homogeneous resource, CLB. Our complete workflow explores more possibility of floorplan solutions, thus making the solution with greater probability tends to optimize the scheduling time and communication cost. The complete workflow makes the design use more resources on the chip with trying to let the whole design of PDRS satisfy the boundary constraints, thus reducing the scheduling time of the entire design and decreasing the communication cost. Therefore, we compare the solutions generated by these two exploration methods. In this comparative experiment, we run the two exploration methods 10 times independently to make a comparison and show the experiment results in Table 5.

Table 5. Comparison of generating the solution of task modules partition, schedule and floorplan
Imp. Int_PSF(chen2018integrated, ) Our complete workflow
SchBSchB SchASchA CCCC RRTRRT succsucc RTRT SchASchA SchMSchM CCCC RRTRRT succsucc RTRT
t10-1 160.1 173.9 25147 [28.9%, 27.1%,30.2%] 100% 2.4 155.1 155.1 21057 [55.7%, 53.7%,58.6%] 100% 2.53
t10-2 265.4 267 36153 [28.6%, 28.2%,29.5%] 100% 2.3 265.4 265.4 23646 [62.0%, 61.7%,63.1%] 100% 1.98
t10-3 365.1 368.4 60296 [24.7%, 24.3%,23.1%] 100% 1.9 365.1 365.1 54662 [67.1%, 66.9%,67.3%] 100% 2.04
t30-1 731.7 732.5 464762 [27.5%, 26.4%,29.0%] 100% 12.3 731.7 731.7 473692 [53.0%, 51.8%,57.7%] 100% 11.27
t30-2 1456.3 1456.3 5471700 [42.5%, 42.0%,45.9%] 100% 9.7 1456.3 1456.3 5378078 [70.5%, 70.5%,72.0%] 100% 8.97
t30-3 797 810.3 530415 [32.8%, 31.2%,30.3%] 100% 9.4 792.2 792.2 515464 [54.7%, 54.8%,55.6%] 90% 8.92
t50-1 601.1 601.1 421508 [43.4%, 42.7%,46.5%] 100% 25.6 601.1 601.1 376342 [54.1%, 54.3%,54.5%] 100% 20.55
t50-2 326.9 333.8 187369 [73.5%, 73.8%,79.2%] 100% 14.9 326.3 330.3 172231 [74.9%, 75.1%,75.1%] 100% 20.4
t50-3 1080.8 1093.7 3973633 [44.9%, 43.9%,42.0%] 100% 19.7 598.6 684.2 2949261 [83.1%, 83.3%,83.3%] 70% 19.43
t100-1 993.5 1069.4 6532062 [68.0%, 67.7%,73.5%] 100% 70.9 897.5 975.7 5278860 [82.3%, 82.8%,82.8%] 100% 77.66
t100-2 1440 1512.4 7665218 [60.2%, 60.1%,64.6%] 100% 69 1282.3 1341.9 7160849 [72.9%, 73.7%,73.7%] 100% 77.21
t100-3 2309.2 2317.2 29033640 [47.0%, 46.2%,43.7%] 100% 69 1192.4 1302.3 19075710 [90.4%, 90.5%,90.6%] 100% 72.7
t200-1 3450 3404.6 1.96E+08 [82.2%, 79.5%,88.0%] 100% 219.6 3051.5 3759.2 1.87E+08 [82.3%, 82.3%,82.4%] 100% 564.7
t200-2 5377.1 5245.6 99734250 [61.5%, 61.1%,65.5%] 100% 267.5 4658.1 5067.4 93725770 [73.6%, 73.4%,74.7%] 100% 610.4
t200-3 2809.7 2788.2 10030710 [46.6%, 46.1%,43.2%] 100% 258.2 1546.3 1743.1 7174263 [88.1%, 88.3%,88.6%] 100% 523.9
AlexNet 50.1 50.3 72789 [33.3%, 33.0%,33.8%] 100% 5.8 50.1 50.1 70066 [59.0%, 58.0%,62.8%] 100% 5.2
VGG 503.5 505.1 3066295 [35.2%, 34.5%,36.3%] 100% 20.9 503.5 504.4 2737673 [69.4%, 69.2%,72.5%] 100% 25.4
Cmp. 1336.3 1337.0 21361003 [45.9%, 45.2%,47.3%] 100% - 1086.7 1183.9 19527196 [70.2%, 70.0%,71.5%] 98% -
1 1 1 - - - 81.3% 88.5% 91.4% +24.3%, +24.8%, +24.2% - -

In Table 5, the column SchBSchB and SchASchA is the best and the average value of scheduling time, respectively. And their unit is msms. The column CCCC is the average communication cost and is calculated by the method in (chen2018integrated, ). The column RRTRRT means the resources reuse rate and is calculated by Equation (12). In Equation (12), ScheduletimeSchedule_{time} is the schedule time of the PDRS, ResourceichipResource_{i}^{chip} means the number of the corresponding resource that the chip has, the ResourceidrrjResource_{i}^{drr_{j}} means the number of the corresponding resource that the reconfigurable region drrjdrr_{j} has, the columns confjconf_{j} and execjexec_{j} mean the total configuration time and total execution time of the reconfigurable region drrjdrr_{j}, respectively, and the column RTRT means the run time of the exploration process. These value are calculated from the experiment results that has feasible floorplan in the 10 independent experiment results. And the column succsucc means the success rate of floorplan.

(12) RRTi,i=CLB,BRAM,DSP=jNDRRResourceidrrj×(confdrrj+execdrrj)ScheduletimeResourceichipRRT_{i,i=CLB,BRAM,DSP}=\frac{\sum_{j}^{N_{DRR}}{Resource_{i}^{drr_{j}}}\times(conf_{drr_{j}}+exec_{drr_{j}})}{Schedule_{time}*Resource_{i}^{chip}}

As shown in Table 5, in these experiments that have feasible floorplan, our complete workflow can fully utilize the on-chip resources, improve the resource reuse rate, and thus find a smaller scheduling time and communication cost. Compared to the method Int_PSFInt\_PSF in (chen2018integrated, ), the best schedule time and the average schedule time can be 18.7% and 11.5% better, respectively, the communication cost can be 8.6% better, the resources reuse rate of CLB, BRAM, and DSP is 24.3%, 24.8%, and 24.2% better, respectively.

5.2.3. Comparison of reselecting task modules shape

The floorplan of the solution generated by the exploration process may be infeasible due to the boundary constraint. We use the ILP model to constrain the position relationship among task modules that the partition, schedule, and floorplan have been determined, and reselect the shape of task modules so that improve the success rate of the floorplan. It is worth noting that, because some infeasible solutions can be feasible by post-optimization, it is possible to find a smaller schedule time just like the benchmark t2002t200-2. The experimental results are shown as Table 6, by solving the ILP model, the floorplan success rate can be improved to 98%, compared with the solutions generated by the exploration process.

Table 6. Comparison of reselecting task modules shape
Bench. Imp. Solution After ILP
SchBSchB succ SchBSchB succ RT
t10 1 155.1 100% 155.1 100% 2.5
2 265.4 100% 265.4 100% 1.9
3 365.1 100% 365.1 100% 2
t30 1 731.7 100% 731.7 100% 11.1
2 1456.3 50% 1456.3 100% 8.6
3 792.2 90% 792.2 90% 8.8
t50 1 601.1 80% 601.1 100% 19.4
2 326.3 60% 326.3 100% 16.7
3 598.6 70% 598.6 70% 18.7
t100 1 897.5 70% 897.5 100% 64.5
2 1282.3 100% 1282.3 100% 64.3
3 1192.4 100% 1192.4 100% 63.4
t200 1 3051.5 30% 3051.5 100% 160.4
2 4672.8 80% *4658.1 100% 202.4
3 1546.3 100% 1546.3 100% 237.6
AlexNet 50.1 100% 50.1 100% <0.1
VGG 503.5 100% 503.5 100% 1
Cmp. - 84% - 98% -

6. Conclusions

In this paper, we propose the complete workflow to solve the problem of task modules partitioning, scheduling, and floorplanning on heterogeneous resources based on FPGA-PDRS. This complete workflow includes three parts, pre-processing, the exploration of task modules partitioning, scheduling, floorplanning, and post-optimization. Firstly, in this pre-processing process, we propose the strategy of generating the list of task modules candidate shapes according to resources requirements of each task module. Then we change the algorithm in our previous work by changing the process of exploring solution space and adding the cost of heterogeneous resource utilization to generate the solution of task modules partitioning, scheduling and floorplanning and integrate this algorithm into the complete workflow. Finally, based on the solution generated by the exploration process, we build the ILP model to reselect task modules shape from the candidate shapes list to improve the success rate of floorplan. Experimental results show that, compared with state-of-the-art work, the proposed complete workflow can improve performance by 18.7%, reduce communication cost by 8.6%. And based on the solution generated by the exploration process, the post-optimization can optimize the floorplan success rate to 98% with a 14% improvement.

7. ACKNOWLEDGMENTS

This work was supported in part by the National Key R&D Program of China under grant No. 2019YFB2204800, in part by National Natural Science Foundation of China (NSFC) under grant Nos. 61874102, U19A2074, 61931008, and 62141415, in part by CAS Project for Young Scientists in Basic Research under grant No. YSBR-029k, in part by the Strategic Priority Research Program of Chinese Academy of Sciences, Grant No. XDB44000000. The authors would like to thank Information Science Laboratory Center of USTC for the hardware & software services.

References

  • (1) L. Liu, J. Zhu, Z. Li, Y. Lu, Y. Deng, J. Han, S. Yin, and S. Wei, “A survey of coarse-grained reconfigurable architecture and design: Taxonomy, challenges, and applications,” ACM Computing Surveys (CSUR), vol. 52, no. 6, pp. 1–39, 2019.
  • (2) C. Wang, W. Lou, L. Gong, L. Jin, L. Tan, Y. Hu, X. Li, and X. Zhou, “Reconfigurable hardware accelerators: Opportunities, trends, and challenges,” arXiv preprint arXiv:1712.04771, 2017.
  • (3) K. Vipin and S. A. Fahmy, “Fpga dynamic and partial reconfiguration: A survey of architectures, methods, and applications,” ACM Computing Surveys (CSUR), vol. 51, no. 4, pp. 1–39, 2018.
  • (4) S. Hauck and A. DeHon, Reconfigurable computing: the theory and practice of FPGA-based computation.   Elsevier, 2010.
  • (5) D. Koch, Partial reconfiguration on FPGAs: architectures, tools and applications.   Springer Science & Business Media, 2012, vol. 153.
  • (6) C. Bolchini, A. Miele, and C. Sandionigi, “Automated resource-aware floorplanning of reconfigurable areas in partially-reconfigurable fpga systems,” in 2011 21st International Conference on Field Programmable Logic and Applications.   IEEE, 2011, pp. 532–538.
  • (7) P. Banerjee, M. Sangtani, and S. Sur-Kolay, “Floorplanning for partially reconfigurable fpgas,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 30, no. 1, pp. 8–17, 2010.
  • (8) R. Yao, Y. Zhao, Y. Yu, Y. Zhao, and X. Zhong, “Fast search and efficient placement algorithm for reconfigurable tasks on modern heterogeneous fpgas,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 30, no. 4, pp. 474–487, 2022.
  • (9) M. Rabozzi, A. Miele, and M. D. Santambrogio, “Floorplanning for partially-reconfigurable fpgas via feasible placements detection,” in 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines.   IEEE, 2015, pp. 252–255.
  • (10) N. Liu, S. Chen, and T. Yoshimura, “Resource-aware multi-layer floorplanning for partially reconfigurable fpgas,” IEICE transactions on electronics, vol. 96, no. 4, pp. 501–510, 2013.
  • (11) A. Montone, M. D. Santambrogio, D. Sciuto, and S. O. Memik, “Placement and floorplanning in dynamically reconfigurable fpgas,” ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 3, no. 4, pp. 1–34, 2010.
  • (12) L. Singhal and E. Bozorgzadeh, “Multi-layer floorplanning on a sequence of reconfigurable designs,” in 2006 International Conference on Field Programmable Logic and Applications.   IEEE, 2006, pp. 1–8.
  • (13) S. Ganesan and R. Vemuri, “An integrated temporal partitioning and partial reconfiguration technique for design latency improvement,” in Proceedings Design, Automation and Test in Europe Conference and Exhibition 2000 (Cat. No. PR00537).   IEEE, 2000, pp. 320–325.
  • (14) V. Rana, S. Murali, D. Atienza, M. D. Santambrogio, L. Benini, and D. Sciuto, “Minimization of the reconfiguration latency for the mapping of applications on fpga-based systems,” in Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign and system synthesis, 2009, pp. 325–334.
  • (15) R. Cordone, F. Redaelli, M. A. Redaelli, M. D. Santambrogio, and D. Sciuto, “Partitioning and scheduling of task graphs on partially dynamically reconfigurable fpgas,” IEEE transactions on computer-aided design of integrated circuits and systems, vol. 28, no. 5, pp. 662–675, 2009.
  • (16) Y.-C. Jiang and J.-F. Wang, “Temporal partitioning data flow graphs for dynamically reconfigurable computing,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 15, no. 12, pp. 1351–1361, 2007.
  • (17) A. Purgato, D. Tantillo, M. Rabozzi, D. Sciuto, and M. D. Santambrogio, “Resource-efficient scheduling for partially-reconfigurable fpga-based systems,” in 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).   IEEE, 2016, pp. 189–197.
  • (18) Q. Tang, Z. Wang, B. Guo, L.-H. Zhu, and J.-B. Wei, “Partitioning and scheduling with module merging on dynamic partial reconfigurable fpgas,” ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 13, no. 3, pp. 1–24, 2020.
  • (19) P.-H. Yuh, C.-L. Yang, and Y.-W. Chang, “Temporal floorplanning using the t-tree formulation,” in IEEE/ACM International Conference on Computer Aided Design, 2004. ICCAD-2004.   IEEE, 2004, pp. 300–305.
  • (20) ——, “Temporal floorplanning using the three-dimensional transitive closure subgraph,” ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 12, no. 4, pp. 37–es, 2007.
  • (21) E. A. Deiana, M. Rabozzi, R. Cattaneo, and M. D. Santambrogio, “A multiobjective reconfiguration-aware scheduler for fpga-based heterogeneous architectures,” in 2015 International Conference on ReConFigurable Computing and FPGAs (ReConFig).   IEEE, 2015, pp. 1–6.
  • (22) S. Chen, J. Huang, X. Xu, B. Ding, and Q. Xu, “Integrated optimization of partitioning, scheduling, and floorplanning for partially dynamically reconfigurable systems,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 1, pp. 199–212, 2018.
  • (23) I. Amd Xilinx. (2019) Vc707 evaluation board:user guide. [Online]. Available: https://docs.xilinx.com/v/u/en-US/ug885_VC707_Eval_Bd
  • (24) ——. (2022) Vivado design suite user guide: Dynamic function exchange (ug909). [Online]. Available: https://docs.xilinx.com/r/en-US/ug909-vivado-partial-reconfiguration
  • (25) H. Murata, K. Fujiyoshi, S. Nakatake, and Y. Kajitani, “Rectangle-packing-based module placement,” in The Best of ICCAD.   Springer, 2003, pp. 535–548.
  • (26) J. Hartmanis, “Computers and intractability: a guide to the theory of np-completeness (michael r. garey and david s. johnson),” Siam Review, vol. 24, no. 1, p. 90, 1982.
  • (27) I. AMD Xilinx. (2022) Vivado design suite user guide: Synthesis. http://www.xilinx.com.
  • (28) I. Gurobi Optimization. (2016) Gurobi optimizer reference manual. [Online]. Available: http://www.gurobi.com
  • (29) S. Chen and T. Yoshimura, “Fixed-outline floorplanning: Block-position enumeration and a new method for calculating area costs,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 5, pp. 858–871, 2008.
  • (30) H. Murata, K. Fujiyoshi, S. Nakatake, and Y. Kajitani, “Vlsi module placement based on rectangle-packing by the sequence-pair,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 15, no. 12, pp. 1518–1524, 1996.
  • (31) R. P. Dick, D. L. Rhodes, and W. Wolf, “Tgff: task graphs for free,” 1998.
  • (32) A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
  • (33) K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • (34) P. Banerjee, S. Sur-Kolay, and A. Bishnu, “Fast unified floorplan topology generation and sizing on heterogeneous fpgas,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 28, no. 5, pp. 651–661, 2009.