The Xilinx SDAccel Development Environment lets the user express kernels in OpenCL C, C++, and RTL (for example SystemVerilog, Verilog, or VHDL) to run on Xilinx programmable platforms. The programmable platform is composed of the SDAccel Xilinx Open Code Compiler (XOCC), a Device Support Archive (DSA) that describes the hardware platform, a software platform, an accelerator board, and, last but not least, the SDAccel OpenCL runtime. The Xilinx Open Code Compiler takes the user source code and runs it through the Xilinx implementation tools to generate the bitstream and the other files needed to program the FPGA-based accelerator boards. The SDAccel environment also provides a debugger, a profiler, and libraries to efficiently develop both the host and the kernel code. Finally, SDAccel relies on PCIe technology to make the host and the kernel on the FPGA communicate properly, and if you have ever worked with an FPGA over PCIe, you know how useful this is!

Here you can have a look at the SDAccel design flow. The user provides SDAccel with both a host file and a kernel file, which can be written, as we know, in C/C++, OpenCL, or RTL. At this point, the user can optimize the kernel and perform both software and hardware emulation to verify the correctness of the design. Moreover, SDAccel produces reports containing information about circuit latency, resource usage, and so on. By looking at this information, the user can identify the bottlenecks of the current design and optimize them. Once satisfied, the user can start the system build: SDAccel takes care of all the system-level design steps and produces the FPGA bitstream file, as well as the host executable for the OpenCL runtime management. At the end of this process, we are ready to run our design on the FPGA!

To better understand how SDAccel works, considering that it is based on the OpenCL memory and computational model, let us now focus on the OpenCL platform. OpenCL is an open, cross-platform parallel programming language for heterogeneous architectures; indeed, it allows the user to write portable code for multiple architectures, like CPUs, GPUs, FPGAs, and so on. Of course, the fact that OpenCL code is portable across multiple architectures does not imply that its performance is portable! Therefore, the user has to carefully design OpenCL applications according to the target architecture. I will now give you a formal description of the OpenCL platform, and then provide an explanatory example.

The two main components of the OpenCL platform are the host and the device. Let us first describe the host. The host has many different responsibilities. First of all, it manages the operating system and enables the drivers for all devices. This means that the host is aware of the number of devices connected to it and can decide which device to use for the computation. Of course, the host is also in charge of executing the application host program. Within the program, the host creates and manages the memory buffers. Such buffers are then copied to the device memory and, after the computation, copied back to the host memory. Finally, the host launches a monitor for the kernel execution on the OpenCL device. In this way, the host can gather information about the kernel execution, like the execution time, and check when the computation is done.
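Before moving on to the device, here is a minimal sketch of what these host-side responsibilities look like in code, using the standard OpenCL C API. The kernel name vadd, the binary file name vadd.xclbin, the buffer sizes, and the choice of the first accelerator device found are illustrative assumptions, not something prescribed by SDAccel itself; error checking is omitted for brevity.

```c
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024

int main(void) {
    /* The host enumerates the attached devices and decides which one to use. */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue =
        clCreateCommandQueue(ctx, device, CL_QUEUE_PROFILING_ENABLE, NULL);

    /* In the SDAccel flow the kernel is compiled offline; the host loads the
     * resulting binary (here a hypothetical vadd.xclbin). */
    FILE *f = fopen("vadd.xclbin", "rb");
    fseek(f, 0, SEEK_END);
    size_t bin_size = (size_t)ftell(f);
    rewind(f);
    unsigned char *bin = malloc(bin_size);
    fread(bin, 1, bin_size, f);
    fclose(f);
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &device, &bin_size,
                                                (const unsigned char **)&bin,
                                                NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel krn = clCreateKernel(prog, "vadd", NULL);

    /* The host creates and manages the memory buffers... */
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }
    cl_mem d_a = clCreateBuffer(ctx, CL_MEM_READ_ONLY, sizeof(a), NULL, NULL);
    cl_mem d_b = clCreateBuffer(ctx, CL_MEM_READ_ONLY, sizeof(b), NULL, NULL);
    cl_mem d_c = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(c), NULL, NULL);

    /* ...copies them to the device memory... */
    clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0, sizeof(a), a, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, d_b, CL_TRUE, 0, sizeof(b), b, 0, NULL, NULL);

    clSetKernelArg(krn, 0, sizeof(cl_mem), &d_a);
    clSetKernelArg(krn, 1, sizeof(cl_mem), &d_b);
    clSetKernelArg(krn, 2, sizeof(cl_mem), &d_c);

    /* ...launches the kernel and monitors it through an event, from which it
     * can read the completion status and the execution time... */
    size_t global = N;
    cl_event ev;
    clEnqueueNDRangeKernel(queue, krn, 1, NULL, &global, NULL, 0, NULL, &ev);
    clWaitForEvents(1, &ev);
    cl_ulong t0, t1;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
    printf("Kernel time: %llu ns\n", (unsigned long long)(t1 - t0));

    /* ...and copies the results back to the host memory. */
    clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0, sizeof(c), c, 0, NULL, NULL);
    free(bin);
    return 0;
}
```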
On the other hand, the OpenCL device is in charge of executing the kernel. In the case of an OpenCL device powered by an FPGA, the device is partially reconfigured at runtime to implement the required functionality. According to OpenCL terminology, the device is divided into multiple compute units, each compute unit executes a work-group, and, finally, each work-group contains multiple work-items. Right now we do not need a precise definition of work-item and work-group, but don't worry, it will come later! Furthermore, we can further divide a compute unit into processing elements (PEs), which are responsible for the execution of the work-items.

Within this context, as you may have already understood, the OpenCL computational model is built around the logical abstractions of work-item and work-group, but what are work-items and work-groups? A work-item is the basic unit of work within an OpenCL device, while a work-group is a group of work-items. Within an OpenCL code, the OpenCL computational model requires the user to specify both the global size, that is, the N-dimensional size of the total number of work-items, and the local size, that is, the N-dimensional work-group size. In other words, the user has to specify the total number of work-items allocated on the OpenCL device, as well as how such work-items are distributed among the work-groups. The global and local sizes can be 1D, 2D, or 3D, according to the dimensionality of the problem to process. This means that OpenCL can process, at most, 3D problems. The sketch below illustrates these concepts on a simple one-dimensional example.
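The following is a small illustrative sketch, not taken from the SDAccel documentation: an OpenCL C kernel named vadd (a hypothetical name) in which each work-item adds one element of two vectors, followed by a host-side fragment, continuing the host sketch above, that launches it with a global size of 1024 work-items split into work-groups of 64. Both sizes are assumptions chosen only for the example.

```c
/* Device side (OpenCL C): each work-item handles one element. */
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *c)
{
    size_t gid = get_global_id(0);  /* index in the global work-item space   */
    size_t lid = get_local_id(0);   /* index inside the current work-group   */
    size_t wg  = get_group_id(0);   /* which work-group this item belongs to */
    c[gid] = a[gid] + b[gid];
    (void)lid; (void)wg;            /* shown only to illustrate the hierarchy */
}

/* Host side (fragment, continuing the earlier sketch): a 1D problem with a
 * global size of 1024 work-items in total and a local size of 64 work-items
 * per work-group, i.e., 16 work-groups. */
size_t global_size[1] = {1024};
size_t local_size[1]  = {64};
clEnqueueNDRangeKernel(queue, krn, 1, NULL, global_size, local_size,
                       0, NULL, NULL);
```

Passing NULL as the local size, as in the earlier host sketch, is also legal: in that case the OpenCL runtime picks the work-group size itself.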