ComplexMultiplier Implementation for Resource Flexible
Radix22 Algorithms has been chosen to implement and the different blocks in the architecture. After implementing the blocks, the complex multiplier implementation has been done in four different approaches starting from the basic complex multiplication.
Рубрика  Программирование, компьютеры и кибернетика 
Вид  дипломная работа 
Язык  английский 
Дата добавления  01.04.2010 
Размер файла  4,3 M 
Отправить свою хорошую работу в базу знаний просто. Используйте форму, расположенную ниже
Студенты, аспиранты, молодые ученые, использующие базу знаний в своей учебе и работе, будут вам очень благодарны.
60
39
ComplexMultiplier Implementation for Resource Flexible
Pipelined FFTs in FPGAs
Master thesis in Electronic Systems at Linkoping Institute of Technology by
Praneeth Kumar Thangella & Aravind Reddy Gundla
LiTHISYEX09/4155SE
Supervisor & Examiner: Oscar Gustafsson Division of Electronic Systems, Dept. of Electrical Engineering Linkoping, 27 January 2009
Presentation Date 27012009
Publication Title
Complex Multiplier Implementation for Resource Flexible Pipelined FFTs in FPGAs
Authors
Praneeth Kumar Thangella & Aravind Reddy Gundla
Abstract
Different approaches for implementing a complex multiplier in pipelined FFT are considered and implemented to find an efficient one in this project. The implemented design is synthesized on Cyclone II and Stratix III to know the performance. The design is implemented with a focus of reducing the resources used. Some approaches resulted in the reduced number of DSP blocks and others resulted in reduced number of LUTs. Analysis of Synthesis results is performed for different widths (bit lengths) of complex multiplier approaches.
Keywords
VHDL, FFT, FPGAs, complexmultiplier, LUT, DSP block, utilization and twiddle factors
Acknowledgement
Our sincere thanks to our examiner and supervisor Oscar Gustafsson for giving us such an interesting project, guiding and helping us whenever required from the start to the end of the project.
And we thank Kent Palmkvist for helping us during Synthesis and VHDL
Notations
DFT Discrete Fourier Transform. FFT  Fast Fourier Transform. FPGA  Field Programmable Gate Array. R2MDC  Radix2 Multipath Delay Commutator. R2SDF Radix2 Singlepath Delay Feedback. R4SDF Radix4 Singlepath Delay Feedback. R4MDC  Radix4 Multipath Delay Commutator. R4SDC  Radix4 Singlepath Delay Commutator. R2^{2}SDF  Radix2^{2} Singlepath Delay Feedback.
VHDL Very High Speed Integrated Circuits Hardware Description Language.
DSP  Digital Signal Processing.
LUT  Look up Tables.
N  npoint DFT.
W width of input data.
Table of Contents
Abstract
Acknowledgement
Notations
List of figures in the report
List of Tables in the report
List of Graphs in the report
1. Introduction
DFTs, FFTs, its advantages and applications in FPGAs
DFT Algorithm
Theme of the Report
Content of the document
2. Basic Pipeline Architectures
Introduction
Pipeline Architectures
R2MDC (Radix2 Multipath Delay Commutator):
R2SDF (Radix2 Singlepath Delay Feedback)
R4SDF (Radix4 Singlepath Delay Feedback)
R4MDC (Radix4 Multipath Delay Commutator)
R4SDC (Radix4 Singlepath Delay Commutator)
2.3 Comparison of Different architectures
3. Radix 2^{2} FFT architecture
Introduction
Working of R2^{2}FFT architecture
Working of Butterfly Structures:
Calculation of twiddle factors in MATLAB:
4FFT Design
Introduction
Information on the FPGAs used in synthesis
Complete flow of the project
Implementation of Complex Multiplier block and Stage
First approach (Normal Complex Multiplier) of implementing complex multiplier
Second approach of implementing complex multiplier
Third approach of implementing complex multiplier
Fourth Approach of implementing complex multiplier
Tools and languages used for the implementation of the project
Testing
Analysis of the Result
Analysis of synthesis results (LUTs & DSP blocks consumed by different complex multiplier approaches with FPGA Cyclone II)
Analysis of synthesis results (LUTs & DSP blocks consumed by different Multiplier approaches with FPGA Stratix III)
Analysis of synthesis results (before and after pipelining)
4.7 Conclusion of the project
5. Problems faced during the course of the project
Problem in understanding FFT Architecture
Problem in testing
Finding Twiddle factor coefficient
. Future Work
Summary
Bibliography
List of Tables in the report
Table 2.1: Comparison of Different FFT Pipeline Algorithms
Table 4.1: Shows the stage name and number along with the size of the DFT
Table 4.2: Table represents the consumed Data Arrival Time, LUTS and DSP blocks for different stages for First Approach (Normal Complex Multiplication) with FPGA Cyclone II
Table 4.3: Table represents the consumed Data Arrival Time, LUTS and DSP blocks for different stages for First Approach (Normal Complex Multiplication) with FPGA Stratix III
Table 4.4: Table represents the data consumed for Data Arrival Time, LUTS and DSP blocks for different stages with second approach with FPGA Cyclone II
Table 4.5: Table represents the data consumed for Data Arrival Time, LUTS and DSP blocks for different stages with second approach with FPGA Stratix III
Table 4.6: Table represents the data consumed for Data Arrival Time, LUTS and DSP blocks for different stages with third approach with FPGA Cyclone II
Table 4.7: Table represents the data consumed for Data Arrival Time, LUTS and DSP blocks for different stages with third approach with FPGA Stratix III
Table 4.8: Table represents the data consumed for Data Arrival Time, LUTS and DSP blocks for different stages with fourth approach with FPGA Cyclone II
Table 4.9: Table represents the data consumed for Data Arrival Time, LUTS and DSP blocks for different stages with fourth approach with FPGA Stratix III
Table 4.10: Shows the number of Data Arrival Time (ns*10^{1}), LUTS, DSP Blocks with respect to the bit widths of the First Approach of Complex Multiplier Implementation with FPGA Cyclone II
Table 4.11: Shows the number of Data Arrival Time (ns*10^{2}), LUTS, DSP Blocks with respect to the bit widths of the first Approach of Complex Multiplier Implementation with FPGA Stratix III
Table 4.12: Shows the number of Data Arrival Time (ns*10^{1}), LUTS and DSP Blocks consumed with respect to Bit Widths of the Second Approach of Complex Multiplier Implementation with FPGA Cyclone II
Table 4.13: Shows the number of Data Arrival Time (ns*10^{1}), LUTS and DSP Blocks consumed with respect to Bit Widths of the Second Approach of Complex Multiplier Implementation with FPGA Stratix I II
Table 4.14: Shows the number of Data Arrival Time (ns*10^{1}), LUTS and DSP Blocks consumed with respect to Bit Widths of the Third Approach of Complex Multiplier Implementation with FPGA Cyclone
Table 4.15: Shows the number of Data Arrival Time (ns*10^{1}), LUTS and DSP Blocks consumed with respect to Bit Widths of the Third Approach of Complex Multiplier Implementation with FPGA Stratix
Table 4.16: Shows the number of Data Arrival Time (ns*10^{1}), LUTS and DSP Blocks consumed with respect to Bit Widths of the Fourth Approach of Complex Multiplier Implementation with FPGA Cyclone
Table 4.17: Shows the number of Data Arrival Time (ns*10^{1}), LUTS and DSP Blocks consumed with respect to Bit Widths of the Fourth Approach of Complex Multiplier Implementation with FPGA Stratix
Table 4.18: Number of I/Os, LUTs, DSP blocks, Registers, Memory Bits and Data Arrival Time of a 64  point Radix 2^{2} FFT before and after pipelining using registers
Table 4.19: Values which are different from table 4.13
Table 4.20: Corresponding Values from table 4.13
List of Graphs in the report
Graph 4.1: Number of LUTs consumed for different stages with different Complex Multiplication approaches with FPGA Cyclone II
Graph 4.2: Number of LUTs consumed for different stages with different Complex Multiplication approaches with FPGA Stratix III
Graph 4.3: The Data Arrival Time (in ns) required for different stages with different Complex Multiplication approaches with FPGA Cyclone II
Graph 4.4: The Data Arrival Time (in ns) required for different stages with different Complex Multiplication approaches with FPGA Stratix III
Graph 4.5: Plots the Data Arrival Time (ns*10^{1}), LUTs and DSP blocks of First Approach of Complex Multiplier versus Bit Widths (Xaxis) with FPGA Cyclone II
Graph 4.6: Plots the Data Arrival Time (ns*10^{2}), LUTs and DSP blocks of First Approach of Complex Multiplier versus Bit Widths (Xaxis) with FPGA Stratix III
Graph 4.7: Plots the Data Arrival Time (ns*10^{1}), LUTs and DSP blocks of Second Approach of Complex Multiplication versus Bit Widths (Xaxis) with FPGA Cyclone II
Graph 4.8: Plots the Data Arrival Time (ns*10^{1}), LUTs and DSP blocks of Second Approach of Complex Multiplication versus Bit Widths (Xaxis) with FPGA Stratix III
Graph 4.9: Plots the Data Arrival Time (ns*10^{1}), LUTs and DSP blocks of Third Approach of Complex Multiplication versus Bit Widths (Xaxis) with FPGA Cyclone II
Graph 4.10: Plots the Data Arrival Time (ns*10^{1}), LUTs and DSP blocks of Third Approach of Complex Multiplication versus Bit Widths (Xaxis) with FPGA Stratix III
Graph 4.11: Plots the Data Arrival Time (ns*10^{1}), LUTs and DSP blocks of Fourth Approach of Complex Multiplication versus Bit Widths (Xaxis) with FPGA Cyclone II
Graph 4.12: Plots the Data Arrival Time (ns*10^{1}), LUTs and DSP blocks of Fourth Approach of Complex Multiplication versus Bit Widths (Xaxis) with FPGA Stratix III
1. Introduction
DFTs, FFTs and FPGAs
FFT is an algorithm to compute Discrete Fourier Transforms (DFT) and its inverse applications. FFT has many applications, in fact any field of physical science that uses sinusoidal signals, such as engineering, physics, applied mathematics, and chemistry. FFT computations play a vital role in the computations of FPGAs. FPGAs are faster to manufacture and time to market is less http://www.xilinx.com/company/gettingstarted/fpgavsasic.htm^{}. FPGAs have simpler design cycle due to the software that handles most of the routing, placement and timing. With increase in the logic density and other features such as DSP blocks, Clocking and highspeed serial at lower price points, FPGAs are suitable for any type of design. A new bit stream can be uploaded remotely into the FPGAs. As there is need for the FPGAs to run faster and efficient, our project deals with improving the efficiency of the FPGAs by improving the FFT. The main applications of this project can be in OFDM Transceivers and Spectrometers.
DFT Algorithm
DFT has played a key role in the development of Digital Signal Processing concepts.
DFT transform can be defined as
X(k) = In=ox(n)I/K_{n}^{nk} where k e {0,N1}
Where l/l/_{w}=e"^{2nj/N} is the N^{th} root of unity.
The inverse DFT can be defined as
X(k) =I^=o http://www.xilinx.com/company/gettingstarted/fpgavsasic.htm^{}x(n)W_{n}^{nk} n = {0, N1}
Where X(k) and x(n) are complex sequences of length N. The computation of DFT and IDFT requires N^{2} complex arithmetic operations.
Radix2 DIF and DIT FFT algorithm definitions:
i) The decimationintime (DIT) radix2 FFT recursively partitions a DFT into two halflength DFTS of evenindexed and oddindexed time samples. The outputs of these shorter FFTs are reused to compute many outputs, thus greatly reducing the total computational cost.
1.3 Theme of the Report
After considering all the Pipeline FFT architectures we had concluded to implement Radix2^{2} DIF FFT algorithm, as it is one of the best and efficient algorithm to compute DFT. We have implemented complex multiplier in different ways and after synthesis we have analyzed the variation on the consumption of the DSP blocks, LUTS and Registers.
Content of the document
This part describes the contents of the chapters in this report.
Chapter 1 Introduction describes the general introduction of the DFTs and FFTs .It also includes the main goal of this project and describes the contents in each chapter in this report.
Chapter 2 Basic Pipeline Architectures describes the different pipeline architectures that are considered for choosing the Architecture for this project. And it also gives the comparison of different architectures. This chapter also covers why Radix2^{2} FFT architecture is chosen.
Chapter 3 Radix 2^{2} FFT Architecture describes the architecture and working of Radix2^{2} FFT architecture. It also contains the calculation of twiddle factor values using the MATLAB.
Chapter 4 FFT Design describes the way in which Radix2^{2} architecture is implemented in this project. This chapter discussion includes how the project is carried out from the starting phase to the end of the project. This chapter also includes the workflow, tools used, methods adopted for the project and the results obtained.
Chapter 5 Problems faced describes the problems faced during the project.
Chapter 6 Future Work describes the work that can be done in extension to this project and other things that can be considered to be implemented in the future.
Chapter 7 Summary describes the summary of the complete thesis.
Chapter 8 Bibliography describes the references that have been used in writing this document.
2. Basic Pipeline Architectures
Introduction
This chapter introduces few pipeline architectures and comparison between them.
Pipeline Architectures
2.2.1 R2MDC (Radix2 Multipath Delay Commutator):
The input sequence Sousheng He, and Mats Torkelsson. "A New Approach to Pipeline FFT Processor". Department of Applied Electronics ,Lund University, SWEDEN.^{} is divided into two streams and then given as inputs to butterflies. This architecture^{1} requires log2N2 multipliers, log2N radix 2 butterflies and 1.5N2 registers. The butterflies and multipliers have 50% utilization. In each stage half the data is delayed through memory and the other half data processes through butterflies. Thereby multipliers are utilized to 50%.The delay in each stage is N/2,N/2,N/4,N/8,,2,respectively http://www.es.isy.liu.se/publications/papers_and_reports/1999/weidongl_ICSPAT99.pdf^{}.
+ 8 > 
> 
> 4 > 
> 
> 

C2 
> 
BF2 
*<8>> 4 > 
C2 
> 
BF2 
Kgy>_2}* 
C2 
> 
BF2 
j 
C2 
 
 
BF2 

Figure 2.1: R2MDC (Radix2 Multipath Delay Commutator) for N= 16
2.2.2 R2SDF (Radix2 Singlepath Delay Feedback):
The basic operation of Radix2 Singlepath Delay Feedback E.H. Wold and A.M. Despain. Pipeline and parallelpipeline FFT processors for VLSI implementation. IEEE Trans. Comput.,C33(5):414426,May 1984.^{} architecture is , the first inputs up to K_{i} are shifted to the shift registers with the length of 2^{n1}. Later the incoming data will be combined with the data from the shift register using the 2 point DFT.
^{L}i = ^{K}i + ^{K}i+N/2
Li+N/2 = Ki  Ki+N/2where i = 0,,N/21
The values Li are sent to the rotor to apply the twiddle factors, whereas the values L_{i}+_{N}/_{2} sent back to shift registers. After all the N/2 point DFTs are computed the values L_{i}+_{N}/_{2} are sent out of the shift registers to rotor ,while the first half of the next transform is loaded in to the shift
Figure 2.2: R2SDF (Radix2 Singlepath Delay Feedback) for N= 16
2.2.3 R4SDF (Radix4 Singlepath Delay Feedback):
Radix4 Singlepath Delay Feedback method is the same as R2SDF, but with three delay lines per butterfly instead of one. In this architecture 3 out of 4 radix 4 butterfly outputs are stored; thereby the utilization of
multipliers increases to 75% and the adders only utilized to 25%. This needs^{1} log4N 1 complex
Multipliers, 8log4N complex adders and N1 shift registers. The delay for each stage is 3N/4, 3N/16,,3 respectively^{2}.
Figure 2.3: R4SDF (Radix4 Singlepath Delay Feedback) for N=256
2.2.4 R4MDC (Radix4 Multipath Delay Commutator):
Radix4 Multipath Delay Commutator^{1} architecture has utilization of 25% of all the components. It requires^{1} 3 log_{4} N multipliers, log_{4} N full radix4 butterflies and 5/2N  4 registers. The first stage needs a 3N/2 word memory and the rest of stage need 3N/4,3N/16,,12 words^{2}
Figure 2.4: R4MDC (Radix4 Multipath Delay Commutator) for N=256
2.2.5 R4SDC (Radix4 Singlepath Delay Commutator):
it is implemented with log4N stages with each stage having a multiplier, Commutator and a butterfly element .Each commutator in this method has 6 shift registers and 3 multiplexers G. Bi and E. V. Jones. A pipelined FFT processor for wordsequential data. IEEE Trans. Acoust., Speech, Signal Processing, 37(12):19821985, Dec. 1989.^{} .This also reduces the requirement on memory.
R4SDC contains^{1} log4N1 multipliers, 3log4N adders and 2N data memories thereby having 100 % efficiency in adders and 75 % efficiency in multipliers^{4}. This architecture can be used for mixed and uniform radix multiplications. Significant savings have been achieved in this model with provided controllable adder/subtractor within the butterfly element and a few associated
control signals. The number of words which are necessary to be stored is 3N/2,3N/8,6
Figure 2.5: R4SDC (Radix4 Singlepath Delay Commutator) for N= 256
2.3 Comparison of Different architectures
When comparing different pipeline architectures, considering the number of Multipliers, Adders, Memory Size and controlling of the algorithms, R2^{2} FFT architecture is the best with less number of Multipliers, Adders and Memory Size and is easy to control. So after considering all the advantages of this architecture, in our project we decided to implement this architecture.
3. Radix 2^{2} FFT architecture
3.1 Introduction
This chapter discusses the working of the Radix 2^{2} FFT architecture and includes a discussion on calculation of twiddle factor values in MATLAB.
Working of R2^{2}FFT architecture
As FFT is efficient algorithm to calculate DFT and its inverse.
The basic equation of DFT with size N can be defined as
X(k) = In=ox(n)I/K_{w}^{nk}where 0 < k < N (3.1)
Here W_{N} represents the primitive root of unity. And considering the first 2 steps of decomposition in the radix2 DiF FFT
R2^{2} FFT^{1} algorithm can be derived in the following way:
Applying the 3dimensional linear index map
n= <N/2n_{1}+N/4n_{2}+n_{3}> N
k=<k_{1} + 2k_{2} + 4k_{3}>N (3.2)
By using the common factor algorithm (CFA) to the basic DFT equation gives X (k_{1} + 2k_{2} + 4k_{3})
(N/2n_{1}+N/4n_{2}+n_{3})(k_{1}+2k_{2}+4k_{3})= Y^{4} ^{1} I' __{o}In_{3}=ox(N/2n_{1} + N/4n_{2} + n_{3})^_{v}
n_{1}=0 ^{2}
^{N} 1
_{V}1 rro k1_{w},... (N/4n2+n3)(k1) (N/4n2+n3)( 2k2+4k3) l_{n}=_{o}((Bn/2 )(N/4n2+n3) Wn ^{2} ^{3} ^{1}) Wn ^{2} ^{3} ^{2} ^{3 }'3=0 ^{2}
And the butterfly structure has the form
B_{N}/_{2}^{K1}(N/4n_{2}+n_{3}) = X(N/4n_{2}+n_{3}) + (1)^{K1}X(N/4n_{2}+n_{3} +N/2)
Decomposing the composite twiddle factor in equation (c) gives
(N/4n2+n3). (Nn_{2}k_{3})... N/4n_{2}( k!+2k_{2}l n_{3}( ki+2k_{2}) 4n_{3}k_{3}_{ }= Wn WnWn Wn
n_{2}( k,+2k_{2}) n_{3}( k,+2k_{2}) 4n_{3}k_{3 }= (j) ^{2} ^{1} ^{2} Wn ^{31} ^{1} ^{2} Wn ^{3} ^{3} (3.4)
Substituting the equations (d) in (c) and expanding the summation. After simplifying the equations we have a set of 4 DFTS of length N/4.
X (k1 + 2k2 + 4k3) = U__{0}[H(k_{1},k_{2},n_{3})Wn^{n3(} ^{k}^{1}+^{2k}^{2}^{)} ]Wn/4^{4n3k3} (3.5)
Where
_{k(k} _{+2k} _{)} _{k}
H (k1,k2,n3) = [x(n3)+(1) ^{1}x(n3+N/2)] + ^{1} ^{2}'[x(n3+N/4)+(1) ^{1}x(n3+3/4N)] (3.6)
x(n_{3})+(1) ^{1}x(n_{3}+N/2)] = BFI (3.7)
x(n_{3}+N/4)+(1) ^{1}x(n_{3}+3/4N) = BFI (3.8)
_{k(k} _{+2k} _{)} _{k}
x(n_{3})+(1) ^{1}x(n_{3}+N/2)] + (j) ^{1} ^{2} [x(n_{3}+N/4)+(1) ^{1}x(n_{3}+3/4N) = BFII (3.9)
Equation (f) represents the first 2 stages of the butterflies.
After these stages of BFI and BFII, multipliers are required to calculate the decomposed Twiddle factors (W_{N}^{n3(} ^{k}^{1}+^{2k}^{2}^{)} ) the equation 3.5. Applying the Common Factor Algorithm to the remaining DFTs of length N/4 , complete Radix 2^{2} FFT algorithm can be obtained .
3.3 Working of Butterfly Structures:
On the first N/2 clock cycles the 2to1 multiplexers in the first butterfly switch to position zero and the input data from the left is filled into the shift registers until they are filled and the butterfly will be in idle stage. On the next N/2 clock cycles the multiplexers turn to position 1 and the first butterfly calculates the 2point DFT to the input data and the data coming from the shift registers.
Z1(n) = x(n)+x(n+N/2)
Z1(n+N/2)=x(n)x(n+N/2)0 < n < N/2
The output from the first butterfly (Z1(n)) is sent to apply the twiddle factor and Z1(n+N/2) is sent back to the shift registers that is multiplied in still next N/2 cycles when the first half of the next frame of time sequences is loaded in. Second Butterfly is similar to that of the first one except the twiddle factor implementation and distance of butterfly input sequence are N/4.
j(W_{N}^{N/4}) is used to do the twiddle factor multiplication. Twiddle factor multiplication has been implemented to do realimaginary swapping which can be done by a commutator and controlled add/subtract operations. Synchronizing counter is used to control the add/subtract operations and the logic gate. Further processing repeats this pattern with the distance of the input data decreases by half at each consecutive butterfly stages. After N1 clock cycles th DFT transform comes from the output in bit reversed order. Due to the pipelined processing of each stage the next frame of transform can be computed without any pause.
X(0)
X(8)
X(4)
X(12)
X(2)
X(10)
X(6)
X(14)
X(1) X(9) X5)
X(13) X(3 )
X(1 1)
X(7)
X(15)
3.4 Calculation of twiddle factors in MATLAB:
Twiddle factor coefficients can be calculated using
nk j21Jnk/N W_{N} =e
Where, W_{N} are the twiddle factors
N is the size of the DFT For example, e ^{jx} =cosxjsinx Here in this case x=2TTnk/N;
Twiddle factor coefficients are calculated from MATLAB.0*area 1*area 2*areaN/4_{}1* area
From the figure 3.4, the twiddle factor values are W , W , W W
They repeat 4 times and area =0 for first N/4 cycles, area=2 for second N/4 values, area=1 for third N/4 values and area=3 for last N/4 values.
From the figure 3.3 if we consider for 64Point there are 2 stages that need twiddle factor values. Here in the equations 3.13, 3.14 we have used variable "s" which represents the stage.
Index is the value of nk from equation 3.10
The value of index was calculated by using the values of area, stage and mul. mul is the variable that is used to calculate the index
mul= N/g (3.12) Variable g is used to calculate the index.
g=N/ (2* (2*s)) (3.13) Finally index is calculated by equation
index =(mod(i,(g/4))) * mul * area. (3.14) Real values of the twiddle factors are calculated by using the cosine function. Real values = cos (2*TT*index/N). (3.15) Imaginary values are calculated by using the sine function.
Imaginary values = =sin (2*TT*index/N). (3.16) The real and imaginary values are written into a file using MATLAB.
The future work can be finding the above mentioned variables in VHDL to get twiddle factors instead of calculating them in MATLAB. Figure 3.5 shows how this implementation looks like.
In the equation 3.16 assuming k = 2*TT*index/N.
k 
sin(k) or cos(k) 

w 
w 
__ w 

Figure 3.5: Future work that can be implemented
4. FFT Design
Introduction
This chapter describes the implementation of FFT and discusses the different complex multiplier approaches along with the analysis and presentation of the results.
Information on the FPGAs used in synthesis
FPGA's are finding an extensive application in the field of Digital signal processing. They consist of DSP blocks, LUT's, Register's and I/O's in order to implement a logic function. In this project synthesis is performed on two devices with very different features. Cyclone II devices are suited for multiplier intensive low cost DSP applications and Stratix III devices http://www.altera.com/products/devices/stratixfpgas/stratix/stratix/features/stxdsp.html^{} are used for architecturally advanced, high performance and low power FPGAs. EP2C35F672C from Cyclone II and EP3SEE50F780C from Stratix III family are selected for implementing the FFT in the project.
From the Synthesis results obtained, EP2C35F672C from Cyclone II family consists of 475 IO's, 33216 LE's and 473 Kbit RAM and 70 DSP blocks (9bit elements). LE is implemented with a LUT and a flipflop (Register).
DSP blocks in this device have embedded multipliers for enhancing performance of the FPGA thereby reducing the need for other resources. In this Project, they are used for implementing multipliers.
LUT's along with Registers are the basic blocks in cyclone II FPGA. LUTs are used during the implementation of addition operation in the FPGA.
Register is a group of flip flop used to latch and store data. In this project they are used for introducing pipelining and also for implementing shift registers.
From the synthesis results obtained, EP3SEE50F780C from Stratix III family consists of 488 IO's, 38000 LE's and 5328 Kbit RAM and 384 DSP blocks (18bit elements or 18x18 multipliers). LE is implemented with a LUT and a flipflop (Register).
DSP blocks http://www.altera.com/literature/hb/stx3/stx3_siii51005.pdf^{} of Stratix III devices consists of dedicated elements for performing multiplication, addition, subtraction, accumulation, summation, and dynamic shift operations and are ideally suited for complex systems which require a large number of mathematical computations. "Stratix DSP blocks http://www.altera.com/literature/wp/wpstxvrtxII.pdf^{} are more than simple multipliers; each DSP has configuration capabilities to
LUTs in this device are used during the implementation of addition operation in the FPGA.
Registers in this device are used for introducing pipelining and also for implementing shift registers.
4.3 Complete flow of the project
We have started with implementing all the blocks in VHDL .Each individual blocks have been combined together and working of the complete architecture block is tested .After implementing all the individual blocks, pipelining is done by introducing registers to the architecture, to reduce the critical path and to increase the performance of the FFT. Complex Multiplier is implemented in different ways to figure out the number of DSP blocks and LUTs that are consumed for each approach.
4.3.1 Implementation of Complex Multiplier block and Stage
4.3.1.1 Implementation of the Normal Complex multiplier
Considering two complex numbers
A=ar+j*ai and B=br+j*bi
The product of these two complex numbers can be given as
A*B= (ar*br  bi*ai) + j*(ar*bi + ai*br)
Figure 4.1 shows the general complex multiplication of two complex numbers. Here in the figure, A and B are two complex numbers and the output is given separately for real and imaginary parts.
4.3.1.2 Implementation of a Stage
BF2_re BF2_im
Multiplier
A
C>out_re
out_im
tw_re
tw_im
Figure 4.2: Block diagram representing the blocks in a stage
The above figure 4.2 shows the components in each stage. Where, BF2_re (real) and BF2_im (imaginary) in each stage are the inputs from the butterfly2. Tw_re (real) and tw_im (imaginary) are inputs from twiddle factors which are calculated from MATLAB and stored in the memory.
The table 4.1 shows the name of the stage along with the size of the DFT
4.3.2 First approach (Normal Complex Multiplier) of implementing complex multiplier
The multiplication A*B can be implemented in a straight forward way by using four multipliers and two adders.
The table below shows the number of LUTS, DSP Blocks and data arrival time that are consumed for each stage for different number of points (size of DFT). Here we had considered for 1024, 256 and 64 points.
The below given synthesis results in tables 4.2 and 4.3 are with FPGAs Cyclone II, Stratix III respectively with first approach of complex multiplication.
4.3.3 Second approach of implementing complex multiplier
DSP is multiplication intensive technology. So it is worthy to save number of embedded multipliers or DSP blocks.
The multiplication A*B can also be implemented by using 3multipliers and 5 adders as per the below equation. The only difference from normal computation is the utilization of sum of the twiddle factors (real and imaginary values) for the computation which reduces the number of adders.
Pr = ar * (br+bi)  (ar+ai) * bi Pi = ar * (br+bi) + (aiar) * br
A=ar+jai
B=br+jbi
re im
re im
ar
br
bi
The below given synthesis results in tables 4.4 and 4.5 are with FPGAs Cyclone II, Stratix III respectively with Second approach of complex multiplication.
4.3.4 Third approach of implementing complex multiplier
Here the inputs coming from the ROMs are added before it is given as input to the multiplier and is given as third input along with other two inputs. This reduces number of adders and the total implementation takes four adders and three multipliers.
60
39
Pr = ar*(br+bi)(ar+ai)*bi
60
39
4.3.5 Fourth Approach of implementing complex multiplier
Pi = ar*(br+bi)+(aiar)*brIn this approach Error analysis and complexity optimization for the multiplierless FFTlike transformation (MLFFT) Tsui, K.M. Chan, S.C. Tse, K.W. Dept. of Electr. & Electron. Eng., Hong Kong Univ., China.^{} the coefficient values tan Z/2 and Sin Z are calculated from the MATLAB. This approach still reduces the number of LUTs consumed and is implemented with 3 multipliers and 3 adders.
The below given synthesis results in tables 4.8 and 4.9 are with FPGAs Cyclone II, Stratix III respectively with fourth complex multiplier approach.The below given graphs in 4.1 and 4.2 shows the number of LUTs consumed for each stage with all the four complex multiplier approaches implemented on FPGAs Cyclone II, Stratix III respectively.
Graph 4.1: Number of LUTs consumed for different stages with different Complex multiplication approaches implemented on FPGA Cyclone II.
Graph 4.2 Number of LUTs consumed for different stages with different Complex multiplication approaches implemented on FPGA Stratix III.
Graph 4.3: The Data Arrival Time (in ns) required for different stages with different Complex multiplication approaches implemented on FPGA Cyclone II
Complex Multiplier(FirstApproach)
Complex Multiplier(SecondApproach)
Complex Multiplier(ThirdApproach)
Complex Multiplier(FourthApproach)
60
39
Graph 4.4: The Data Arrival Time (in ns) required for different stages with different Complex multiplication approaches implemented on FPGA Stratix III
From the graphs 4.1 to 4.4 it can be concluded that the fourth approach is the least complex with longest delay.
The below given table 4.10 shows the synthesis results with number of LUTS , DSP blocks and Data Arrival Time consumed for different bit widths with first complex multiplier approach with FPGA Cyclone II.
Width(in bits) 
Data Arrival Time(ns*10^{1}) 
LUTS 
DSP Blocks 

2 
53.82 
26 
0 

3 
73.27 
14 
4 

5 
76.51 
22 
4 

7 
78.55 
30 
4 

The below given graph 4.5 shows the synthesis results with number of LUTS , DSP blocks and Data Arrival Time consumed for different bit widths with first complex multiplier approach with FPGA Cyclone II.
Graph 4.5: Plots the Data Arrival Time (ns*10^{1}), LUTs and DSP blocks of First Approach of Complex multiplier versus Bit Widths (Xaxis) implemented on FPGA Cyclone II
The below given table 4.12 shows the synthesis results with number of LUTS , DSP blocks and Data Arrival Time consumed for different bit widths with second complex multiplier approach implemented on FPGA Cyclone II.
Width(in bits) 
Data Arrival Time((ns*10^{1})) 
LUTS 
DSP Blocks 

1 
42.74 
1 
0 

2 
61.34 
29 
0 

3 
86.63 
26 
3 

5 
84.59 
40 
3 

7 
88.85 
54 
3 

9 
93.11 
68 
3 

The below given graph 4.7 shows the synthesis results with number of LUTS , DSP blocks and Data Arrival Time consumed for different bit widths with second complex multiplier approach implemented on FPGA Cyclone II.
Graph 4.7: Plots the Data Arrival Time (ns*10^{1}), LUTs and DSP blocks of Second Approach of Complex multiplication versus Bit Widths (Xaxis)
Implemented on FPGA Cyclone IIThe below given graph 4.8 shows the synthesis results with number of LUTS , DSP blocks and Data Arrival Time consumed for different bit widths with second complex multiplier approach implemented on FPGA Stratix III.
The below given table 4.14 shows the synthesis results with number of LUTS , DSP blocks and Data Arrival Time consumed for different bit widths with third complex multiplier approach implemented on FPGA Cyclone II.
Width(in bits) 
Data Arrival Time(ns*10^{1}) 
LUTS 
DSP Blocks 

1 
42.74 
1 
0 

2 
65.24 
28 
0 

3 
86.30 
22 
3 

The below given graph 4.9 shows the synthesis results with number of LUTS , DSP blocks and Data Arrival Time consumed for different bit widths with third complex multiplier approach implemented on FPGA Cyclone II.
Graph 4.9: Plots the Data Arrival Time (ns*10^{1}), LUTs and DSP blocks of Third Approach Complex Multiplier implemented on versus Bit Widths (Xaxis) with FPGA Cyclone II
The Below given table 4.16 shows the synthesis results with number of LUTS , DSP blocks and Data Arrival Time consumed for different bit widths with fourth complex multiplier approach implemented on FPGA Cyclone II.
Width(in bits) 
Data Arrival Time (ns*10^{1}) 
LUTs 
DSP Blocks 

2 
77.19 
9 
0 

3 
161.28 
16 
3 

5 
149.31 
20 
3 

7 
152.37 
28 
3 

9 
155.43 
36 
3 

The below given graph 4.11 shows the synthesis results with number of LUTS , DSP blocks and Data Arrival Time consumed for different bit widths with fourth complex multiplier approach implemented on FPGA Cyclone II.
Graph 4.11: Plots the Data Arrival Time (ns*10^{1}), LUTs and DSP blocks of Fourth Approach Complex Multiplier Implementation versus Bit Widths (Xaxis) implemented on FPGA Cyclone II
Graph 4.12: Plots the Data Arrival Time (ns*10^{1}), LUTs and DSP blocks of Fourth Approach Complex Multiplier Implementation versus bit Widths (Xaxis) with FPGA Stratix III
The below given table 4.18 shows the Synthesis results of a complete 64 point Radix 2^{2} FFT architecture implementation before and after pipelining with first multiplier approach using FPGA Cyclone II.
Tools and languages used for the implementation of the project
The project is implemented in VHDL. HDL Designer, MODELSIM and Precision RTL Synthesis are used for design creation, simulation and synthesis of the design. MATLAB is mainly used in computing the values of the twiddle factors.
Testing
Each individual block is tested separately with different input values. After all the individual blocks were tested successfully, they were implemented together. And values obtained from the implemented FFT architecture are compared with the values obtained from the MATLAB for a given input.
Analysis of the Result
Based on the results from tables 4.2 to 4.17 the following analysis is done.
Analysis of synthesis results (LUTs & DSP blocks consumed by different complex multiplier approaches with FPGA Cyclone II)
Total number of Lookuptables (LUTs) used is (2W+1)2 for first approach 1. This equation is valid for (W= [3, 17]). Total number of Lookuptables (LUTs) used is 7W+5 for second approach. This equation is valid for (W= [3, 18]). Total number of Lookuptables (LUTs) used is 6W+4 for the third approach. This equation is valid for (W= [3, 18]). They are used during the addition operations. The number of (LUTs) is 4W for fourth approach and this equation is valid for (W= [5, 18]). There is a deviation from these equations above the specified widths. For the higher values of W the unusual increase in LUT consumption still remains mystery.
The number of multiplications is 4 in normal multiplier approach and 3 in other three approaches. That means each multiplication operator is implemented by 1 DSP block for W= [3, 9] and 2 DSP blocks for W= [10, 18]. For the higher values of W the unusual increase in DSP block consumption still remains unknown.
Analysis of synthesis results (LUTs & DSP blocks consumed by different Multiplier approaches with FPGA Stratix III)
Total number of Lookuptables (LUTs) used in first approach is (2W+1)2 for W=[19,36]; in second approach, it is 7W+7 for W=[5,7] and 7W+5 for W= [8, 36]; for the third approach, it is not given by a particular function as the FPGA is implementing some of the adder functionality in an extra DSP block and in fourth approach it is 3W for W=[5,36]. The LUTs are used during the addition operations.
The number of multiplications is 4 in normal multiplier approach and 3 in other three approaches.
As mentioned in the section 4.2, In Stratix III, DSP blocks not only do multiplications but they also do multiplication and accumulation operations. In the first approach, 1 DSP block is used for each multiplier implementation for W= [1, 17], 2 DSP blocks for W=18 and 4 DSP blocks for W= [19, 36]. In the second approach, 1 DSP block is used to implement each multiplier for W=[5,9], 2 DSP blocks are used to implement each multiplier for W= [17, 18] and 4 DSP blocks for W= [19, 36]. In the third approach, 4 DSP blocks are used to implement three multipliers for W= [3, 17], 8 DSP blocks for W=18 and 12 DSP blocks for W= [19, 34]. Here, DSP blocks are involved in implementing addition operations along with multiplications. In the fourth approach, 1 DSP block is used for each multiplier implementation for W= [5, 17], 2 DSP blocks for W=18 and 4 DSP blocks for W= [19, 36].
The synthesis results obtained were different from the values in the table 4.13 on few occasions with second multiplier approach using FPGA Stratix III. The different values obtained are shown below in table 4.19.This shows that the implementation is not same in all occasions and some functionality is moved from one resource (Either DSP block or LUTs) to other.
4.6.3 Analysis of synthesis results (before and after pipelining)
On synthesis of the complete FFT design for 64 point with 2 stages the values in Table 4.18 are obtained. By introducing pipelining the critical path is shortened thereby decreasing the data arrival time.
4.7 Conclusion of the project
Complex Multiplier block in the Radix 2^{2} FFT architecture has been implemented in different ways. And as stated above the fourth approach (Multiplier using tan values) uses fewer number of LUTs, DSP blocks when compared to other approaches. Our main goal in this project is to reduce the consumption of LUTs, DSP blocks and Registers by the architecture.
5. Problems faced during the course of the project
Problem in understanding FFT Architecture:
It took more time for us to understand the architecture. The problem includes the implementation and generation of the twiddle factors in between the stages. To get it worked it took long time to understand it properly.
Problem in testing:
As we used MODELSIM it was little not worthy to give inputs each and every time when we are testing individual blocks. But when we were testing for the complete system we have given input from a file, this has reduced a lot of work.
Finding Twiddle factor coefficient:
We had problem in finding the twiddle factor coefficients.
Here, twiddle factor coefficients can be calculated using
W_{N}^{nk}=e"^{j2TTnk/N}
Where, W_{N} are the twiddle factors
N is the size of the DFT In this project we had used MATLAB to find the coefficients.
e^{jx} = cosx  jsinx
Where x=j2TTnk/N;
The twiddle factor values were provided to the multiplier as two separate inputs real (cosx) and imaginary (sinx) values. When computing in the multiplier by using these values (real and imaginary values) for calculating the value of e, we haven't considered the negative sign as above expression
We have analyzed all the values coming from each stage and the solution of how to find these values is given in section 3.4.
6. Future Work
Twiddle factors can be implemented using VHDL to reduce the number of DSP blocks, LUTS and Registers instead of taking values directly from the MATLAB, the solution and the equations related to this work are discussed in section 3.4.
The Synthesis can be performed for different FPGAs.
This architecture can be implemented for higher frequencies.
A VHDL generator can be developed for the complete architecture.
More detailed analysis of synthesis results should be done. For example sometimes in table it has implemented with different LUTs and DSP blocks for the same multiplier approach with same FPGA.
Summary
Radix 2^{2} Algorithms has been chosen to implement and the different blocks in the architecture are analyzed and implemented. After implementing the blocks, the complex multiplier implementation has been done in four different approaches starting from the basic complex multiplication. Synthesis has been performed for each stage (Multiplier with Twiddle Factors) and the values of LUTs and DSP Blocks are taken accordingly. Analysis of the synthesis results has been done and presented in the report.
Bibliography
http://www.xilinx.com/company/gettingstarted/fpgavsasic.htm
2) Sousheng He, and Mats Torkelsson. "A New Approach to Pipeline FFT Processor". Department of Applied Electronics, Lund University, SWEDEN.
http://www.es.isy.liu.se/publications/papers_and_reports/1999/weidongl_ICSPAT99.pdf
E.H. Wold and A.M. Despain. Pipeline and parallelpipeline FFT processors for VLSI implementation. IEEE Trans. Comput.,C33(5):414426,May 1984.
G. Bi and E.V. Jones. A pipelined FFT processor for word sequential data. IEEE Trans. Acoust., Speech, Signal Processing, 37(12):19821985, Dec. 1989.
http://www.altera.com/products/devices/stratixfpgas/stratix/stratix/features/stxdsp.html
http://www.altera.com/literature/hb/stx3/stx3_siii51005.pdf
http://www.altera.com/literature/wp/wpstxvrtxII.pdf
Error analysis and complexity optimization for the multiplierless FFTlike transformation (MLFFT) Tsui, K.M. Chan, S.C. Tse, K.W. Dept. of Electr. & Electron. Eng., Hong Kong Univ., China.
Подобные документы
Строение класса complex. Примеры использования класса complex. Результат выполнения программы. Цикл возведения первого числа во второе. Операции с комплексными числами. Конструкторы и операции присваивания для типа complex. Неявные преобразования типов.
курсовая работа [1,5 M], добавлен 18.05.2011Микропроцессоры с архитектурой Complex Instruction Set Computers. Развитие архитектуры IA32 в семействе Pentium. Параллельные конвейеры обработки. Усовершенствованный блок вычисления с плавающей точкой. Технология динамического предсказания ветвлений.
презентация [220,4 K], добавлен 11.12.2013Изучение методов и этапов создания класса Complex, позволяющего работать с комплексными числами и производить с ними следующие операции: сложение, вычитание, умножение, деление двух комплексных чисел. Написание кода для ввода и вывода исходных данных.
курсовая работа [628,4 K], добавлен 11.09.2010Overview history of company and structure of organization. Characterization of complex tasks and necessity of automation. Database specifications and system security. The calculation of economic efficiency of the project. Safety measures during work.
дипломная работа [1009,6 K], добавлен 09.03.2015Сущность понятия "комплексное число". Умножение, деление и сложение. Класс Number и два последующих Rational Number Operation и Complex Number. Схема наследования классов исключений. Тестирование программы. Схема работы приложения. Интерфейс программы.
курсовая работа [607,3 K], добавлен 17.05.2015Program of Audio recorder on visual basic. Text of source code for program functions. This code can be used as freeware. View of interface in action, starting position for play and recording files. Setting format in milliseconds and finding position.
лабораторная работа [87,3 K], добавлен 05.07.2009Consideration of a systematic approach to the identification of the organization's processes for improving management efficiency. Approaches to the identification of business processes. Architecture of an Integrated Information Systems methodology.
реферат [195,5 K], добавлен 12.02.2016Technical methods of supporting. Analysis of airplane accidents. Growth in air traffic. Drop in aircraft accident rates. Causes of accidents. Dispatcher action scripts for emergency situations. Practical implementation of the interface training program.
курсовая работа [334,7 K], добавлен 19.04.2016Рождение и развитие Basic. Краткое описание Visual Basic for Applications. Новые возможности Visual Basic 5.0. Пример взаимодействия Excel и Visual Basic. Программирование табличных функций. Встраивание, применение функций. Формы, средства управления OLE.
реферат [20,7 K], добавлен 11.03.2010Язык программирования Visual Basic: краткая история возникновения, значение и общая характеристика. Изучение основных свойств Visual Basic, синтаксис языка. Обзор ключевых операторов Visual Basic, пользовательские процедуры и функции данного языка.
контрольная работа [36,4 K], добавлен 23.07.2014