Hi, in this class we will present the loop pipelining optimization, which is essential for achieving highly efficient usage of the FPGA hardware resources. Let's come back to our vector sum example once more, and in particular let's focus on the loop performing the actual vector sum. As we have seen in the previous lesson, this naive implementation of the loop requires 10240 cycles to complete: each iteration is performed in 10 cycles and we have a total of 1024 iterations. By looking at the Vivado HLS schedule from the analysis report, we can see where our cycles are spent: 2 cycles for the read operations, 7 cycles for the floating-point addition and 1 cycle for the write operation. The floating-point adder, as well as the logic performing the read operations, is internally divided into stages. Every stage computes a partial result of the operation and forwards its data to the next stage. Hence, if we think about how this loop is executed in hardware, we can clearly see that we are under-utilizing our resources. Indeed, a given stage of the floating-point adder is executed once every 10 cycles, meaning that it is used only 10% of the time! In order to improve performance as well as resource utilization, we can pipeline the loop so that each loop iteration starts as soon as possible instead of waiting 10 cycles. In our example, since each loop iteration works on different data elements and there are no loop-carried dependencies, in principle we are able to start one loop iteration at every clock cycle.

The overall idea is shown in this figure. With loop pipelining, we switch from a sequential execution of the loop iterations, shown on top, to a pipelined execution in which the loop iterations are overlapped in time. The number of clock cycles between the start of two subsequent iterations of a pipelined loop is referred to as the Initiation Interval, or simply II. The minimum possible Initiation Interval that can be achieved for a pipelined loop is 1, meaning that a new loop iteration can start at every cycle. However, depending on the loop being pipelined, it might not be possible to achieve the ideal Initiation Interval of 1 cycle. When we do achieve an initiation interval of 1, it also means that, after the initial time needed to fill the pipeline, all the stages of the operators within the loop are fully utilized at every clock cycle.

In order to compare the performance of the sequential and the pipelined loop implementations, we can work out some math. First, when all the iterations are performed in sequence, the overall latency of the loop can be computed by multiplying the Iteration Latency, referred to as IL, by the number of iterations, or trip count, N of the loop. In our example this yields 10 × 1024 = 10240 cycles. On the other hand, the latency of the pipelined loop can be derived as follows: we need II × (N − 1) cycles to start the first N − 1 loop iterations, plus the time needed to complete the last iteration, which takes 10 cycles, the iteration latency of the loop. Overall, this yields a loop latency of 1 × 1023 + 10 = 1033 cycles, which is about 10x better than the original loop latency! Notice that, compared to unrolling, loop pipelining does not significantly increase the resource consumption of our design: with pipelining we are simply making better use of under-utilized hardware resources. With Vivado HLS, we can use the HLS PIPELINE pragma within the loop that we wish to pipeline.
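As a concrete reference, here is a minimal sketch of what the pipelined vector-sum kernel could look like. The function signature, the local buffer names (local_a, local_b, local_c) and the copy loops are assumptions based on the lesson; the point of the example is the HLS PIPELINE pragma placed inside sum_loop.

```c
#define N 1024

// Minimal sketch (names and structure assumed from the lesson): the inputs are
// first copied into local BRAM buffers, then sum_loop is pipelined so that a
// new iteration can start every clock cycle (II = 1).
void vector_sum(const float a[N], const float b[N], float c[N]) {
    float local_a[N], local_b[N], local_c[N];

read_loop:
    for (int i = 0; i < N; i++) {
        local_a[i] = a[i];
        local_b[i] = b[i];
    }

sum_loop:
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        // Per iteration: two reads, one floating-point add, one write.
        local_c[i] = local_a[i] + local_b[i];
    }

write_loop:
    for (int i = 0; i < N; i++) {
        c[i] = local_c[i];
    }
}
```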
As we can see from the latency report, the sum_loop is flagged as pipelined: it achieves an initiation interval of 1 clock cycle with an iteration latency of 10 cycles. Overall, the loop latency reported by Vivado HLS is 1032 cycles. The 1-cycle difference compared to the previous formula is simply due to the fact that Vivado HLS accounts for that cycle within the function body instead of the loop itself.

Up to now we have applied either the pipelining or the unrolling optimization to a single loop, but nothing prevents us from applying both optimizations to the very same loop! Let's now try to apply an unrolling factor of 2 to our pipelined loop. As we can see from the latency report, by applying both optimizations we managed to further optimize our loop and reduce its latency from 1032 cycles to 520 cycles. Vivado HLS managed to keep an initiation interval of 1 clock cycle and, since we unrolled the loop by a factor of 2, this means that we are actually starting two iterations of the original loop at every clock cycle in a pipelined fashion. Up to now we have always managed to achieve an initiation interval of 1 clock cycle, and we achieved fairly good performance by combining unrolling and pipelining. Nevertheless, as already discussed for the unrolling optimization, the performance of loop optimizations can be limited by, first, constraints on the number of available memory ports and hardware resources and, second, loop-carried dependencies.

Let's now try to push the loop optimizations further by unrolling our pipelined loop by a factor of 4. The idea is to start 4 loop iterations at every cycle in order to roughly halve the loop latency compared to our last result. However, looking at the Vivado HLS latency report, we see that the loop latency has not changed at all and is still 520 cycles! We can see that, even though the trip count has been reduced from 512 iterations to 256 iterations, Vivado HLS achieves an initiation interval of 2 cycles instead of 1. Overall, this means that 4 iterations of the original loop are executed every 2 cycles, which, in terms of performance, is equivalent to our previous configuration in which we executed 2 iterations of the loop every cycle. But why is Vivado HLS not able to achieve an Initiation Interval of 1 cycle? The synthesis logs shed some light on the issue: Vivado HLS is not able to schedule a load operation for the array a and is forced to increase the initiation interval. Recall that the data we use for our vector sum comes from BRAM memories, which provide up to 2 memory ports. Hence, we cannot achieve an initiation interval of 1 cycle with 4 parallel iterations of the loop, since this would require performing 4 load operations per cycle on each of the local_a and local_b arrays. With the optimizations discussed so far, we cannot do better than starting 2 iterations in parallel at every cycle, since we are limited by the number of memory ports of our local arrays. We will overcome this limitation when discussing the array partitioning optimization. Nevertheless, it is worth mentioning that our initial unoptimized implementation of the sum_loop required more than 10 thousand cycles and we have now reduced it to 520 cycles! This represents approximately a 20x latency improvement. We have seen how memory port restrictions reduce the available parallelism that can be exploited by loop pipelining; however, another potential issue is represented by loop-carried dependencies. Let's consider again our scalar product example.
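As a reminder, the scalar product loop could look like the following minimal sketch; the function and buffer names are assumptions that mirror the vector-sum example, and only the accumulation pattern matters for the discussion that follows.

```c
#define N 1024

// Minimal sketch of the scalar product loop, not yet optimized. The
// accumulation on "product" makes every iteration depend on the result
// of the previous one (a loop-carried dependency).
float scalar_product(const float local_a[N], const float local_b[N]) {
    float product = 0.0f;

product_loop:
    for (int i = 0; i < N; i++) {
        product += local_a[i] * local_b[i];
    }
    return product;
}
```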
In this code we have a loop-carried dependency due to the accumulation on the "product" variable. This aspect is better shown in the annotated analysis report: the loop-carried dependency path consists of the FADD and MUX operations within the loop. Indeed, the FADD result is fed to the MUX, which provides the result to the FADD at the next loop iteration. Overall, the unoptimized loop implementation executes 1024 iterations and each iteration requires 13 cycles.

Let's now try to optimize this loop by applying the loop pipelining optimization. What is the actual initiation interval that you would expect to achieve? Vivado HLS manages to start one iteration of the loop every 7 cycles. The reason is that, in order to start a new iteration of the loop, we need to have computed the sum from the previous iteration. Since, in our settings, a floating-point addition requires 7 cycles, the best we can do is to start one iteration of the loop every 7 cycles. Notice that Vivado HLS is also able to fit within those 7 cycles the logic needed by the MUX operation, which is part of the loop-carried dependency path. Even though we did not achieve an initiation interval of 1, we can see that by applying pipelining we reduced the loop latency from 13312 cycles to 7173 cycles: almost a 2x improvement! The performance improvement comes from the fact that we can start a loop iteration every 7 cycles instead of 13.
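For completeness, here is the same sketched loop with the pipelining pragma added (names are the same assumptions as in the previous sketch); the comment summarizes why the achieved II ends up being 7 rather than the requested 1.

```c
#define N 1024

// Same sketched scalar product as above, now with the pipelining pragma.
float scalar_product_pipelined(const float local_a[N], const float local_b[N]) {
    float product = 0.0f;

product_loop:
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        // The requested II is 1, but the loop-carried dependency on "product"
        // forces Vivado HLS to settle for II = 7: a new iteration can only
        // start once the 7-cycle floating-point addition from the previous
        // iteration has produced its result.
        product += local_a[i] * local_b[i];
    }
    return product;
}
```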