By John Kuhns, Design Consultant, Synopsys Professional Services
Are you prototyping your SuperSpeed USB (a.k.a. USB 3.0) system in a HAPS FPGA-based prototyping environment? If so, read on for tips and tricks on addressing the hardware challenges of your prototype. You will also see how to get your prototype up and running quickly by using a simple USB device driver written in C and a WINUSB-based host driver to facilitate testing.
There are several hardware challenges that can cause difficulty in getting your prototype up and running. These can include configuring the interface, clocking, and timing closure.
1. Configuring the Interface for Reliable Operation
For this project, Synopsys used the USB 3.0 high-speed PIPE interface, implemented on Synopsys’ HAPS USB 3.0 daughter card and operating at 250 MHz with a 16-bit-wide data bus. A few tips make this interface more reliable from one FPGA build to the next. Without them, FPGA builds can fail in hardware or take a long time to complete place and route. Placing the I/O registers in the I/O pads is required for stable interface timing, and is done with the Synplify attribute ‘syn_useioff’ or the Xilinx mapper switch.
However, this technique will not work for the PhyStatus interface signal. PhyStatus is used both as a digital control input and as a causal signal that drives a clock input. This causes the mapper to instantiate one global buffer and connect the global buffer’s output to both inputs. The buffer adds a large delay to PhyStatus relative to the other USB 3.0 input signals, so the signals no longer appear together on the same clock. One solution is to hand-instantiate a global buffer that feeds only the clock input, and to connect the PhyStatus signal directly to the data input of the receiving flip-flop, with no global buffer in between. The top level of the DesignWare USB 3.0 core provides inputs labeled ‘PhyStatus’ for the data input and ‘PhyStatus_async’ for the clock input (illustrated in Figure 1).
Figure 1: PhyStatus Problem and Solution
It may be necessary to use a placement or location constraint to make sure the correct flip-flop is placed in the I/O pad. The hierarchical name of this flip-flop will vary with your design, and perhaps with the configuration of your USB 3.0 core. However, here is an example of a Xilinx placement command:
INST "your_top/usb3_top_wrap_inst/usb3_top_inst/usb3_core/U_DWC_usb3_noclkrst/U_DWC_usb3_pwrm/inst_port[0].U_DWC_usb3_pwrm_prt/U_DWC_usb3_pwrm_u3piu/phy_pipe3_rx_stage1[5]" LOC = ILOGIC_X0Y196;
Another item associated with the interface is the clock phase, which affects both the USB 3.0 PIPE interface and the USB 2.0 ULPI. Placement and routing variations change the timing from build to build, so different FPGA builds can require a different clock phase. The phase is adjusted by using the Xilinx Mixed-Mode Clock Manager (MMCM) to phase shift the clock so that it samples the input correctly. This can require some experimentation to set correctly, but once the correct setting is found, it typically stays constant for subsequent builds. The phase shift can be applied with the Xilinx FPGA_EDITOR after the build is placed and routed, so experimental attempts can be made quickly.
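When experimenting with phase settings, it can help to translate a phase shift in degrees into a delay at the interface clock rate. The following is a minimal sketch of that arithmetic only, not part of the Synopsys flow. The function names are hypothetical. The 250 MHz figure is the PIPE clock quoted above, and 60 MHz is the standard ULPI clock.

```c
#include <assert.h>
#include <stdint.h>

/* Clock period in picoseconds for a frequency given in kHz.
 * 250000 kHz (250 MHz) gives 4000 ps. The result is truncated
 * to a whole number of picoseconds. */
uint32_t clock_period_ps(uint32_t freq_khz)
{
    assert(freq_khz > 0);
    return (uint32_t)(1000000000ULL / freq_khz);
}

/* Delay in picoseconds equivalent to a phase shift of phase_deg
 * degrees (0..359) at the given clock frequency. */
uint32_t phase_to_delay_ps(uint32_t freq_khz, uint32_t phase_deg)
{
    assert(phase_deg < 360);
    return (uint32_t)(((uint64_t)clock_period_ps(freq_khz) * phase_deg) / 360);
}
```

For example, a 90-degree shift of the 250 MHz PIPE clock corresponds to 1000 ps, a quarter of the 4000 ps period. The same 90 degrees on the 60 MHz ULPI clock is a much larger delay, so the two interfaces are worth tuning separately.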
2. USB Clocking Solutions with Xilinx FPGAs
Clock multiplexing is common in the DesignWare USB 3.0 controller to accommodate the different operational modes. It is implemented with the global buffer multiplexers (BUFGCTRL) provided in the Xilinx FPGA. However, if the Xilinx tool is allowed to select the BUFGs, the clock insertion delays between domains can become excessive. The tool then has to fix large hold-time violations, which can lead to long run times and sometimes degrade the quality of results. Virtex 6 offers ‘fast track’ connections between the BUFGs, which help alleviate large insertion delays. To use the ‘fast tracks’, place the BUFGs adjacent to each other. An example of this is shown in Figure 2.
Figure 2: Using 'fast track' in Virtex 6
3. Overall Timing Closure
It is well known that ASICs offer higher performance than FPGAs, and that ASICs can meet worst-case timing without difficulty. FPGAs have much more trouble with this, which is why the USB controller will not meet worst-case timing on Xilinx Virtex 5 or Virtex 6 FPGAs. Fortunately, since most FPGAs are significantly better than worst case, especially in laboratory conditions where voltage and temperature are controlled, you can still create FPGA bit files that are functional. Design changes will often require recreating bit files, and the timing will vary from one bit file to the next. It is advisable to use SmartXplorer, provided by Xilinx, to try various place and route switches in parallel until one run produces a bit file that functions in the laboratory. Running many place and route jobs in parallel minimizes the time needed to reach a working bit file. The time to create a bit file varies significantly with the design and with how highly utilized the FPGA is. Experience has shown that it can take from 4 to 10 hours, and significantly longer for challenging designs with higher density. Creating multiple bit files in parallel and picking the best one to test in the lab is essential to a reasonable turnaround time. The recommended number of bit files to run in parallel depends on how long a single bit file takes and on your available computing resources.
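The trade-off between parallel runs and turnaround time can be sketched in a few lines of C. This is an illustrative sketch only. The struct, field names, and functions are hypothetical and are not SmartXplorer output or API. The sketch assumes every run takes the same time and that the "best" run is the completed run with the largest worst-case slack.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of one place-and-route run. */
struct par_run {
    const char *name;    /* label for the run, e.g. a strategy name */
    int worst_slack_ps;  /* worst setup slack; negative means timing failed */
    int completed;       /* nonzero if the run produced a bit file */
};

/* Index of the completed run with the largest worst-case slack,
 * or -1 if no run completed. */
int pick_best_run(const struct par_run *runs, size_t count)
{
    int best = -1;

    assert(runs != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (!runs[i].completed)
            continue;
        if (best < 0 || runs[i].worst_slack_ps > runs[best].worst_slack_ps)
            best = (int)i;
    }
    return best;
}

/* Wall-clock hours to finish `runs` builds when `slots` of them can
 * execute at once and each takes `hours_per_run` (assumed equal). */
unsigned turnaround_hours(unsigned runs, unsigned slots, unsigned hours_per_run)
{
    assert(slots > 0);
    unsigned batches = (runs + slots - 1) / slots; /* round up */
    return batches * hours_per_run;
}
```

With 10-hour builds, eight candidate runs on four machines take 20 hours under these assumptions, while eight machines bring that down to 10 hours. Since even the best run may miss worst-case timing, a negative slack here selects a candidate for lab testing and does not rule it out.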
There can be a delay between when the prototype is ready to be tested and when the software and software environment are ready for integration. This condition existed on a recent Synopsys project. The solution was to take the Synopsys Linux device driver and modify it so that it could operate standalone, without an operating system. The modified driver supported early testing of both control and bulk transfers. It was compiled and run on a customer system incorporating the DesignWare ARC processor.
Modifying the device driver required the following:
A WINUSB-based application ran on a host PC to facilitate this testing. Bulk transfers were sent from the host, received by the device, and then sent back to the host for checking.
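The host-side check described above (send a bulk buffer, read it back, compare) can be sketched as follows. This is a minimal sketch, not the application used on the project. The `loopback_fn` hook and all function names are hypothetical. On a Windows host the hook might wrap the WinUSB calls `WinUsb_WritePipe` and `WinUsb_ReadPipe`. Here an in-memory stub stands in for the device so that the logic can be exercised without hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical transport hook: send len bytes from tx, read the echoed
 * data into rx, and return the number of bytes received. */
typedef size_t (*loopback_fn)(const uint8_t *tx, uint8_t *rx, size_t len);

/* Fill a buffer with an incrementing byte pattern starting at seed. */
void fill_pattern(uint8_t *buf, size_t len, uint8_t seed)
{
    for (size_t i = 0; i < len; i++)
        buf[i] = (uint8_t)(seed + i);
}

/* Run one loopback transfer. Returns -1 if every byte came back intact,
 * otherwise the index of the first missing or mismatched byte. */
long loopback_check(loopback_fn xfer, uint8_t *tx, uint8_t *rx,
                    size_t len, uint8_t seed)
{
    assert(xfer != NULL && tx != NULL && rx != NULL);
    fill_pattern(tx, len, seed);
    memset(rx, 0, len);

    size_t got = xfer(tx, rx, len);
    for (size_t i = 0; i < len; i++) {
        if (i >= got || rx[i] != tx[i])
            return (long)i;
    }
    return -1;
}

/* In-memory stand-in for a device that echoes every byte. */
size_t echo_stub(const uint8_t *tx, uint8_t *rx, size_t len)
{
    memcpy(rx, tx, len);
    return len;
}
```

Changing the seed on each transfer helps catch a device that returns a stale buffer from an earlier transfer, which a fixed pattern would not reveal.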
The simple device driver along with the WINUSB host driver performs the following operations:
Memory management for the device DMA running in the prototype was complicated in this project because only one section of memory could handle the 64-bit transfers required for this project’s configuration of the USB 3.0 core. To jump-start the testing, a ‘brute force’ approach was taken: the various Transfer Request Blocks (TRBs) and buffer locations were hard coded. Going forward, it would be desirable to replace this approach with the memory management facilities available in C and in the OS used for a specific project.
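A hard-coded layout of this kind might look like the sketch below. Every value in it is a hypothetical placeholder, including the region base, the region size, the 16-byte TRB size, and the buffer count and size. None of them are taken from the project. The point is only to show fixed TRB and buffer addresses carved out of a single DMA-capable region, with a check that the layout fits.

```c
#include <assert.h>
#include <stdint.h>

/* All values below are hypothetical placeholders. */
#define DMA_REGION_BASE 0x80000000u /* assumed start of the 64-bit-capable memory */
#define DMA_REGION_SIZE 0x00100000u /* assumed size of that region (1 MiB) */
#define TRB_SIZE_BYTES  16u         /* assumed size of one TRB */
#define TRB_COUNT       8u          /* one TRB per buffer in this sketch */
#define BUF_SIZE_BYTES  1024u       /* assumed bulk buffer size */

/* TRB table sits at the start of the region, buffers follow it. */
#define TRB_TABLE_BASE  DMA_REGION_BASE
#define BUF_POOL_BASE   (TRB_TABLE_BASE + TRB_COUNT * TRB_SIZE_BYTES)

/* Fixed bus address of TRB number `index`. */
uint32_t trb_addr(unsigned index)
{
    assert(index < TRB_COUNT);
    return TRB_TABLE_BASE + index * TRB_SIZE_BYTES;
}

/* Fixed bus address of the data buffer paired with TRB `index`. */
uint32_t buf_addr(unsigned index)
{
    assert(index < TRB_COUNT);
    return BUF_POOL_BASE + index * BUF_SIZE_BYTES;
}

/* Nonzero if the last buffer ends inside the DMA-capable region. */
int layout_fits(void)
{
    uint32_t end = buf_addr(TRB_COUNT - 1) + BUF_SIZE_BYTES;
    return end <= DMA_REGION_BASE + DMA_REGION_SIZE;
}
```

Keeping the addresses behind small accessor functions, even when they are constants, makes it easier to swap in a real allocator later without touching the code that builds the transfers.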
Prototyping your system is invaluable to the software development and verification of your system. To deliver its full value, the prototype must be integrated and running in a timely manner. With these tips, your hardware will be working hard for you and helping you meet your development schedule. Synopsys Professional Services has consultants with experience in SoC design, integration, verification and prototyping, not only with USB but with all Synopsys IP.
For copies of the software and scripts discussed in the article please contact Synopsys.