Complex-Multiplier Implementation for Resource Flexible

Radix-22 Algorithms has been chosen to implement and the different blocks in the architecture. After implementing the blocks, the complex multiplier implementation has been done in four different approaches starting from the basic complex multiplication.

Рубрика Программирование, компьютеры и кибернетика
Вид дипломная работа
Язык английский
Дата добавления 01.04.2010
Размер файла 4,3 M

Отправить свою хорошую работу в базу знаний просто. Используйте форму, расположенную ниже

Студенты, аспиранты, молодые ученые, использующие базу знаний в своей учебе и работе, будут вам очень благодарны.

60

39

Complex-Multiplier Implementation for Resource Flexible

Pipelined FFTs in FPGAs

Master thesis in Electronic Systems at Linkoping Institute of Technology by

Praneeth Kumar Thangella & Aravind Reddy Gundla

LiTH-ISY-EX-09/4155-SE

Supervisor & Examiner: Oscar Gustafsson Division of Electronic Systems, Dept. of Electrical Engineering Linkoping, 27 January 2009

Presentation Date 27-01-2009

Publication Title

Complex Multiplier Implementation for Resource Flexible Pipelined FFTs in FPGAs

Authors

Praneeth Kumar Thangella & Aravind Reddy Gundla

Abstract

Different approaches for implementing a complex multiplier in pipelined FFT are considered and implemented to find an efficient one in this project. The implemented design is synthesized on Cyclone II and Stratix III to know the performance. The design is implemented with a focus of reducing the resources used. Some approaches resulted in the reduced number of DSP blocks and others resulted in reduced number of LUTs. Analysis of Synthesis results is performed for different widths (bit lengths) of complex multiplier approaches.

Keywords

VHDL, FFT, FPGAs, complexmultiplier, LUT, DSP block, utilization and twiddle factors

Acknowledgement

Our sincere thanks to our examiner and supervisor Oscar Gustafsson for giving us such an interesting project, guiding and helping us whenever required from the start to the end of the project.

And we thank Kent Palmkvist for helping us during Synthesis and VHDL

Notations

DFT- Discrete Fourier Transform. FFT - Fast Fourier Transform. FPGA - Field Programmable Gate Array. R2MDC - Radix-2 Multi-path Delay Commutator. R2SDF -Radix-2 Single-path Delay Feedback. R4SDF- Radix-4 Single-path Delay Feedback. R4MDC - Radix-4 Multi-path Delay Commutator. R4SDC - Radix-4 Single-path Delay Commutator. R22SDF - Radix-22 Single-path Delay Feedback.

VHDL- Very High Speed Integrated Circuits Hardware Description Language.

DSP - Digital Signal Processing.

LUT - Look up Tables.

N - n-point DFT.

W- width of input data.

Table of Contents

Abstract

Acknowledgement

Notations

List of figures in the report

List of Tables in the report

List of Graphs in the report

1. Introduction

DFTs, FFTs, its advantages and applications in FPGAs

DFT Algorithm

Theme of the Report

Content of the document

2. Basic Pipeline Architectures

Introduction

Pipeline Architectures

R2MDC (Radix-2 Multi-path Delay Commutator):

R2SDF (Radix-2 Single-path Delay Feedback)

R4SDF (Radix-4 Single-path Delay Feedback)

R4MDC (Radix-4 Multi-path Delay Commutator)

R4SDC (Radix-4 Single-path Delay Commutator)

2.3 Comparison of Different architectures

3. Radix 22 FFT architecture

Introduction

Working of R22FFT architecture

Working of Butterfly Structures:

Calculation of twiddle factors in MATLAB:

4FFT Design

Introduction

Information on the FPGAs used in synthesis

Complete flow of the project

Implementation of Complex Multiplier block and Stage

First approach (Normal Complex Multiplier) of implementing complex multiplier

Second approach of implementing complex multiplier

Third approach of implementing complex multiplier

Fourth Approach of implementing complex multiplier

Tools and languages used for the implementation of the project

Testing

Analysis of the Result

Analysis of synthesis results (LUTs & DSP blocks consumed by different complex multiplier approaches with FPGA Cyclone II)

Analysis of synthesis results (LUTs & DSP blocks consumed by different Multiplier approaches with FPGA Stratix III)

Analysis of synthesis results (before and after pipelining)

4.7 Conclusion of the project

5. Problems faced during the course of the project

Problem in understanding FFT Architecture

Problem in testing

Finding Twiddle factor coefficient

. Future Work

Summary

Bibliography

List of Tables in the report

Table 2.1: Comparison of Different FFT Pipeline Algorithms

Table 4.1: Shows the stage name and number along with the size of the DFT

Table 4.2: Table represents the consumed Data Arrival Time, LUTS and DSP blocks for different stages for First Approach (Normal Complex Multiplication) with FPGA Cyclone II

Table 4.3: Table represents the consumed Data Arrival Time, LUTS and DSP blocks for different stages for First Approach (Normal Complex Multiplication) with FPGA Stratix III

Table 4.4: Table represents the data consumed for Data Arrival Time, LUTS and DSP blocks for different stages with second approach with FPGA Cyclone II

Table 4.5: Table represents the data consumed for Data Arrival Time, LUTS and DSP blocks for different stages with second approach with FPGA Stratix III

Table 4.6: Table represents the data consumed for Data Arrival Time, LUTS and DSP blocks for different stages with third approach with FPGA Cyclone II

Table 4.7: Table represents the data consumed for Data Arrival Time, LUTS and DSP blocks for different stages with third approach with FPGA Stratix III

Table 4.8: Table represents the data consumed for Data Arrival Time, LUTS and DSP blocks for different stages with fourth approach with FPGA Cyclone II

Table 4.9: Table represents the data consumed for Data Arrival Time, LUTS and DSP blocks for different stages with fourth approach with FPGA Stratix III

Table 4.10: Shows the number of Data Arrival Time (ns*10-1), LUTS, DSP Blocks with respect to the bit widths of the First Approach of Complex Multiplier Implementation with FPGA Cyclone II

Table 4.11: Shows the number of Data Arrival Time (ns*10-2), LUTS, DSP Blocks with respect to the bit widths of the first Approach of Complex Multiplier Implementation with FPGA Stratix III

Table 4.12: Shows the number of Data Arrival Time (ns*10-1), LUTS and DSP Blocks consumed with respect to Bit Widths of the Second Approach of Complex Multiplier Implementation with FPGA Cyclone II

Table 4.13: Shows the number of Data Arrival Time (ns*10-1), LUTS and DSP Blocks consumed with respect to Bit Widths of the Second Approach of Complex Multiplier Implementation with FPGA Stratix I II

Table 4.14: Shows the number of Data Arrival Time (ns*10-1), LUTS and DSP Blocks consumed with respect to Bit Widths of the Third Approach of Complex Multiplier Implementation with FPGA Cyclone

Table 4.15: Shows the number of Data Arrival Time (ns*10-1), LUTS and DSP Blocks consumed with respect to Bit Widths of the Third Approach of Complex Multiplier Implementation with FPGA Stratix

Table 4.16: Shows the number of Data Arrival Time (ns*10-1), LUTS and DSP Blocks consumed with respect to Bit Widths of the Fourth Approach of Complex Multiplier Implementation with FPGA Cyclone

Table 4.17: Shows the number of Data Arrival Time (ns*10-1), LUTS and DSP Blocks consumed with respect to Bit Widths of the Fourth Approach of Complex Multiplier Implementation with FPGA Stratix

Table 4.18: Number of I/Os, LUTs, DSP blocks, Registers, Memory Bits and Data Arrival Time of a 64 - point Radix -22 FFT before and after pipelining using registers

Table 4.19: Values which are different from table 4.13

Table 4.20: Corresponding Values from table 4.13

List of Graphs in the report

Graph 4.1: Number of LUTs consumed for different stages with different Complex Multiplication approaches with FPGA Cyclone II

Graph 4.2: Number of LUTs consumed for different stages with different Complex Multiplication approaches with FPGA Stratix III

Graph 4.3: The Data Arrival Time (in ns) required for different stages with different Complex Multiplication approaches with FPGA Cyclone II

Graph 4.4: The Data Arrival Time (in ns) required for different stages with different Complex Multiplication approaches with FPGA Stratix III

Graph 4.5: Plots the Data Arrival Time (ns*10-1), LUTs and DSP blocks of First Approach of Complex Multiplier versus Bit Widths (X-axis) with FPGA Cyclone II

Graph 4.6: Plots the Data Arrival Time (ns*10-2), LUTs and DSP blocks of First Approach of Complex Multiplier versus Bit Widths (X-axis) with FPGA Stratix III

Graph 4.7: Plots the Data Arrival Time (ns*10-1), LUTs and DSP blocks of Second Approach of Complex Multiplication versus Bit Widths (X-axis) with FPGA Cyclone II

Graph 4.8: Plots the Data Arrival Time (ns*10-1), LUTs and DSP blocks of Second Approach of Complex Multiplication versus Bit Widths (X-axis) with FPGA Stratix III

Graph 4.9: Plots the Data Arrival Time (ns*10-1), LUTs and DSP blocks of Third Approach of Complex Multiplication versus Bit Widths (X-axis) with FPGA Cyclone II

Graph 4.10: Plots the Data Arrival Time (ns*10-1), LUTs and DSP blocks of Third Approach of Complex Multiplication versus Bit Widths (X-axis) with FPGA Stratix III

Graph 4.11: Plots the Data Arrival Time (ns*10-1), LUTs and DSP blocks of Fourth Approach of Complex Multiplication versus Bit Widths (X-axis) with FPGA Cyclone II

Graph 4.12: Plots the Data Arrival Time (ns*10-1), LUTs and DSP blocks of Fourth Approach of Complex Multiplication versus Bit Widths (X-axis) with FPGA Stratix III

1. Introduction

DFTs, FFTs and FPGAs

FFT is an algorithm to compute Discrete Fourier Transforms (DFT) and its inverse applications. FFT has many applications, in fact any field of physical science that uses sinusoidal signals, such as engineering, physics, applied mathematics, and chemistry. FFT computations play a vital role in the computations of FPGAs. FPGAs are faster to manufacture and time to market is less http://www.xilinx.com/company/gettingstarted/fpgavsasic.htm. FPGAs have simpler design cycle due to the software that handles most of the routing, placement and timing. With increase in the logic density and other features such as DSP blocks, Clocking and high-speed serial at lower price points, FPGAs are suitable for any type of design. A new bit stream can be uploaded remotely into the FPGAs. As there is need for the FPGAs to run faster and efficient, our project deals with improving the efficiency of the FPGAs by improving the FFT. The main applications of this project can be in OFDM Transceivers and Spectrometers.

DFT Algorithm

DFT has played a key role in the development of Digital Signal Processing concepts.

DFT transform can be defined as

X(k) = In=ox(n)I/Knnk where k e {0,N-1}

Where l/l/w=e"2nj/N is the Nth root of unity.

The inverse DFT can be defined as

X(k) =I^=o http://www.xilinx.com/company/gettingstarted/fpgavsasic.htmx(n)Wn-nk n = {0, N-1}

Where X(k) and x(n) are complex sequences of length N. The computation of DFT and IDFT requires N2 complex arithmetic operations.

Radix-2 DIF and DIT FFT algorithm definitions:

i) The decimation-in-time (DIT) radix-2 FFT recursively partitions a DFT into two half-length DFTS of even-indexed and odd-indexed time samples. The outputs of these shorter FFTs are reused to compute many outputs, thus greatly reducing the total computational cost.

1.3 Theme of the Report

After considering all the Pipeline FFT architectures we had concluded to implement Radix-22 DIF FFT algorithm, as it is one of the best and efficient algorithm to compute DFT. We have implemented complex multiplier in different ways and after synthesis we have analyzed the variation on the consumption of the DSP blocks, LUTS and Registers.

Content of the document

This part describes the contents of the chapters in this report.

Chapter 1 Introduction describes the general introduction of the DFTs and FFTs .It also includes the main goal of this project and describes the contents in each chapter in this report.

Chapter 2 Basic Pipeline Architectures describes the different pipeline architectures that are considered for choosing the Architecture for this project. And it also gives the comparison of different architectures. This chapter also covers why Radix-22 FFT architecture is chosen.

Chapter 3 Radix -22 FFT Architecture describes the architecture and working of Radix-22 FFT architecture. It also contains the calculation of twiddle factor values using the MATLAB.

Chapter 4 FFT Design describes the way in which Radix-22 architecture is implemented in this project. This chapter discussion includes how the project is carried out from the starting phase to the end of the project. This chapter also includes the workflow, tools used, methods adopted for the project and the results obtained.

Chapter 5 Problems faced describes the problems faced during the project.

Chapter 6 Future Work describes the work that can be done in extension to this project and other things that can be considered to be implemented in the future.

Chapter 7 Summary describes the summary of the complete thesis.

Chapter 8 Bibliography describes the references that have been used in writing this document.

2. Basic Pipeline Architectures

Introduction

This chapter introduces few pipeline architectures and comparison between them.

Pipeline Architectures

2.2.1 R2MDC (Radix-2 Multi-path Delay Commutator):

The input sequence Sousheng He, and Mats Torkelsson. "A New Approach to Pipeline FFT Processor". Department of Applied Electronics ,Lund University, SWEDEN. is divided into two streams and then given as inputs to butterflies. This architecture1 requires log2N-2 multipliers, log2N radix 2 butterflies and 1.5N-2 registers. The butterflies and multipliers have 50% utilization. In each stage half the data is delayed through memory and the other half data processes through butterflies. Thereby multipliers are utilized to 50%.The delay in each stage is N/2,N/2,N/4,N/8,,2,respectively http://www.es.isy.liu.se/publications/papers_and_reports/1999/weidongl_ICSPAT99.pdf.

+ 8 >

>

> 4 >

>

>

C2

>

BF2

*<8>> 4 >

C2

>

BF2

-Kgy>_2}*

C2

-->

BF2

j

C2

--

--

BF2

Figure 2.1: R2MDC (Radix-2 Multi-path Delay Commutator) for N= 16

2.2.2 R2SDF (Radix-2 Single-path Delay Feedback):

The basic operation of Radix-2 Single-path Delay Feedback E.H. Wold and A.M. Despain. Pipeline and parallel-pipeline FFT processors for VLSI implementation. IEEE Trans. Comput.,C-33(5):414-426,May 1984. architecture is , the first inputs up to Ki are shifted to the shift registers with the length of 2n-1. Later the incoming data will be combined with the data from the shift register using the 2 -point DFT.

Li = Ki + Ki+N/2

Li+N/2 = Ki - Ki+N/2where i = 0,,N/2-1

The values Li are sent to the rotor to apply the twiddle factors, whereas the values Li+N/2 sent back to shift registers. After all the N/2 point DFTs are computed the values Li+N/2 are sent out of the shift registers to rotor ,while the first half of the next transform is loaded in to the shift

Figure 2.2: R2SDF (Radix-2 Single-path Delay Feedback) for N= 16

2.2.3 R4SDF (Radix-4 Single-path Delay Feedback):

Radix-4 Single-path Delay Feedback method is the same as R2SDF, but with three delay lines per butterfly instead of one. In this architecture 3 out of 4 radix 4 butterfly outputs are stored; thereby the utilization of

multipliers increases to 75% and the adders only utilized to 25%. This needs1 log4N -1 complex

Multipliers, 8log4N complex adders and N-1 shift registers. The delay for each stage is 3N/4, 3N/16,,3 respectively2.

Figure 2.3: R4SDF (Radix-4 Single-path Delay Feedback) for N=256

2.2.4 R4MDC (Radix-4 Multi-path Delay Commutator):

Radix-4 Multi-path Delay Commutator1 architecture has utilization of 25% of all the components. It requires1 3 log4 N multipliers, log4 N full radix-4 butterflies and 5/2N - 4 registers. The first stage needs a 3N/2 word memory and the rest of stage need 3N/4,3N/16,,12 words2

Figure 2.4: R4MDC (Radix-4 Multi-path Delay Commutator) for N=256

2.2.5 R4SDC (Radix-4 Single-path Delay Commutator):

it is implemented with log4N stages with each stage having a multiplier, Commutator and a butterfly element .Each commutator in this method has 6 shift registers and 3 multiplexers G. Bi and E. V. Jones. A pipelined FFT processor for wordsequential data. IEEE Trans. Acoust., Speech, Signal Processing, 37(12):1982-1985, Dec. 1989. .This also reduces the requirement on memory.

R4SDC contains1 log4N-1 multipliers, 3log4N adders and 2N data memories thereby having 100 % efficiency in adders and 75 % efficiency in multipliers4. This architecture can be used for mixed and uniform radix multiplications. Significant savings have been achieved in this model with provided controllable adder/subtractor within the butterfly element and a few associated

control signals. The number of words which are necessary to be stored is 3N/2,3N/8,6

Figure 2.5: R4SDC (Radix-4 Single-path Delay Commutator) for N= 256

2.3 Comparison of Different architectures

When comparing different pipeline architectures, considering the number of Multipliers, Adders, Memory Size and controlling of the algorithms, R22 FFT architecture is the best with less number of Multipliers, Adders and Memory Size and is easy to control. So after considering all the advantages of this architecture, in our project we decided to implement this architecture.

3. Radix 22 FFT architecture

3.1 Introduction

This chapter discusses the working of the Radix 22 FFT architecture and includes a discussion on calculation of twiddle factor values in MATLAB.

Working of R22FFT architecture

As FFT is efficient algorithm to calculate DFT and its inverse.

The basic equation of DFT with size N can be defined as

X(k) = In=ox(n)I/Kwnkwhere 0 < k < N (3.1)

Here WN represents the primitive root of unity. And considering the first 2 steps of decomposition in the radix-2 DiF FFT

R22 FFT1 algorithm can be derived in the following way:

Applying the 3-dimensional linear index map

n= <N/2n1+N/4n2+n3> N

k=<k1 + 2k2 + 4k3>N (3.2)

By using the common factor algorithm (CFA) to the basic DFT equation gives X (k1 + 2k2 + 4k3)

(N/2n1+N/4n2+n3)(k1+2k2+4k3)= Y4 1 I' _oIn3=ox(N/2n1 + N/4n2 + n3)^v

n1=0 2

N 1

V1 rro k1w,... (N/4n2+n3)(k1) (N/4n2+n3)( 2k2+4k3) ln=o((Bn/2 )(N/4n2+n3) Wn 2 3 1) Wn 2 3 2 3 '3=0 2

And the butterfly structure has the form

BN/2K1(N/4n2+n3) = X(N/4n2+n3) + (-1)K1X(N/4n2+n3 +N/2)

Decomposing the composite twiddle factor in equation (c) gives

(N/4n2+n3). (Nn2k3)... N/4n2( k!+2k2l n3( ki+2k2) 4n3k3 = Wn WnWn Wn

n2( k,+2k2) n3( k,+2k2) 4n3k3 = (-j) 2 1 2 Wn 31 1 2 Wn 3 3 (3.4)

Substituting the equations (d) in (c) and expanding the summation. After simplifying the equations we have a set of 4 DFTS of length N/4.

X (k1 + 2k2 + 4k3) = U_0[H(k1,k2,n3)Wnn3( k1+2k2) ]Wn/44n3k3 (3.5)

Where

k(k +2k ) k

H (k1,k2,n3) = [x(n3)+(-1) 1x(n3+N/2)] + 1 2'[x(n3+N/4)+(-1) 1x(n3+3/4N)] (3.6)

x(n3)+(-1) 1x(n3+N/2)] = BFI (3.7)

x(n3+N/4)+(-1) 1x(n3+3/4N) = BFI (3.8)

k(k +2k ) k

x(n3)+(-1) 1x(n3+N/2)] + (-j) 1 2 [x(n3+N/4)+(-1) 1x(n3+3/4N) = BFII (3.9)

Equation (f) represents the first 2 stages of the butterflies.

After these stages of BFI and BFII, multipliers are required to calculate the decomposed Twiddle factors (WNn3( k1+2k2) ) the equation 3.5. Applying the Common Factor Algorithm to the remaining DFTs of length N/4 , complete Radix 22 FFT algorithm can be obtained .

3.3 Working of Butterfly Structures:

On the first N/2 clock cycles the 2-to-1 multiplexers in the first butterfly switch to position zero and the input data from the left is filled into the shift registers until they are filled and the butterfly will be in idle stage. On the next N/2 clock cycles the multiplexers turn to position 1 and the first butterfly calculates the 2-point DFT to the input data and the data coming from the shift registers.

Z1(n) = x(n)+x(n+N/2)

Z1(n+N/2)=x(n)-x(n+N/2)0 < n < N/2

The output from the first butterfly (Z1(n)) is sent to apply the twiddle factor and Z1(n+N/2) is sent back to the shift registers that is multiplied in still next N/2 cycles when the first half of the next frame of time sequences is loaded in. Second Butterfly is similar to that of the first one except the twiddle factor implementation and distance of butterfly input sequence are N/4.

j(WNN/4) is used to do the twiddle factor multiplication. Twiddle factor multiplication has been implemented to do real-imaginary swapping which can be done by a commutator and controlled add/subtract operations. Synchronizing counter is used to control the add/subtract operations and the logic gate. Further processing repeats this pattern with the distance of the input data decreases by half at each consecutive butterfly stages. After N-1 clock cycles th DFT transform comes from the output in bit reversed order. Due to the pipelined processing of each stage the next frame of transform can be computed without any pause.

X(0)

X(8)

X(4)

X(12)

X(2)

X(10)

X(6)

X(14)

X(1) X(9) X5)

X(13) X(3 )

X(1 1)

X(7)

X(15)

3.4 Calculation of twiddle factors in MATLAB:

Twiddle factor coefficients can be calculated using

nk -j21Jnk/N WN =e

Where, WN are the twiddle factors

N is the size of the DFT For example, e jx =cosx-jsinx Here in this case x=-2TTnk/N;

Twiddle factor coefficients are calculated from MATLAB.0*area 1*area 2*areaN/4-1* area

From the figure 3.4, the twiddle factor values are W , W , W W

They repeat 4 times and area =0 for first N/4 cycles, area=2 for second N/4 values, area=1 for third N/4 values and area=3 for last N/4 values.

From the figure 3.3 if we consider for 64-Point there are 2 stages that need twiddle factor values. Here in the equations 3.13, 3.14 we have used variable "s" which represents the stage.

Index is the value of nk from equation 3.10

The value of index was calculated by using the values of area, stage and mul. mul is the variable that is used to calculate the index

mul= N/g (3.12) Variable g is used to calculate the index.

g=N/ (2* (2*s)) (3.13) Finally index is calculated by equation

index =(mod(i,(g/4))) * mul * area. (3.14) Real values of the twiddle factors are calculated by using the cosine function. Real values = cos (2*TT*index/N). (3.15) Imaginary values are calculated by using the -sine function.

Imaginary values = =-sin (2*TT*index/N). (3.16) The real and imaginary values are written into a file using MATLAB.

The future work can be finding the above mentioned variables in VHDL to get twiddle factors instead of calculating them in MATLAB. Figure 3.5 shows how this implementation looks like.

In the equation 3.16 assuming k = 2*TT*index/N.

k

sin(k) or cos(k)

w

w

__ w

Figure 3.5: Future work that can be implemented

4. FFT Design

Introduction

This chapter describes the implementation of FFT and discusses the different complex multiplier approaches along with the analysis and presentation of the results.

Information on the FPGAs used in synthesis

FPGA's are finding an extensive application in the field of Digital signal processing. They consist of DSP blocks, LUT's, Register's and I/O's in order to implement a logic function. In this project synthesis is performed on two devices with very different features. Cyclone II devices are suited for multiplier intensive low cost DSP applications and Stratix III devices http://www.altera.com/products/devices/stratix-fpgas/stratix/stratix/features/stx-dsp.html are used for architecturally advanced, high performance and low power FPGAs. EP2C35F672C from Cyclone II and EP3SEE50F780C from Stratix III family are selected for implementing the FFT in the project.

From the Synthesis results obtained, EP2C35F672C from Cyclone II family consists of 475 IO's, 33216 LE's and 473 Kbit RAM and 70 DSP blocks (9-bit elements). LE is implemented with a LUT and a flip-flop (Register).

DSP blocks in this device have embedded multipliers for enhancing performance of the FPGA thereby reducing the need for other resources. In this Project, they are used for implementing multipliers.

LUT's along with Registers are the basic blocks in cyclone II FPGA. LUTs are used during the implementation of addition operation in the FPGA.

Register is a group of flip flop used to latch and store data. In this project they are used for introducing pipelining and also for implementing shift registers.

From the synthesis results obtained, EP3SEE50F780C from Stratix III family consists of 488 IO's, 38000 LE's and 5328 Kbit RAM and 384 DSP blocks (18-bit elements or 18x18 multipliers). LE is implemented with a LUT and a flip-flop (Register).

DSP blocks http://www.altera.com/literature/hb/stx3/stx3_siii51005.pdf of Stratix III devices consists of dedicated elements for performing multiplication, addition, subtraction, accumulation, summation, and dynamic shift operations and are ideally suited for complex systems which require a large number of mathematical computations. "Stratix DSP blocks http://www.altera.com/literature/wp/wpstxvrtxII.pdf are more than simple multipliers; each DSP has configuration capabilities to

LUTs in this device are used during the implementation of addition operation in the FPGA.

Registers in this device are used for introducing pipelining and also for implementing shift registers.

4.3 Complete flow of the project

We have started with implementing all the blocks in VHDL .Each individual blocks have been combined together and working of the complete architecture block is tested .After implementing all the individual blocks, pipelining is done by introducing registers to the architecture, to reduce the critical path and to increase the performance of the FFT. Complex Multiplier is implemented in different ways to figure out the number of DSP blocks and LUTs that are consumed for each approach.

4.3.1 Implementation of Complex Multiplier block and Stage

4.3.1.1 Implementation of the Normal Complex multiplier

Considering two complex numbers

A=ar+j*ai and B=br+j*bi

The product of these two complex numbers can be given as

A*B= (ar*br - bi*ai) + j*(ar*bi + ai*br)

Figure 4.1 shows the general complex multiplication of two complex numbers. Here in the figure, A and B are two complex numbers and the output is given separately for real and imaginary parts.

4.3.1.2 Implementation of a Stage

BF2_re BF2_im

Multiplier

A

C>out_re

out_im

tw_re

tw_im

Figure 4.2: Block diagram representing the blocks in a stage

The above figure 4.2 shows the components in each stage. Where, BF2_re (real) and BF2_im (imaginary) in each stage are the inputs from the butterfly2. Tw_re (real) and tw_im (imaginary) are inputs from twiddle factors which are calculated from MATLAB and stored in the memory.

The table 4.1 shows the name of the stage along with the size of the DFT

4.3.2 First approach (Normal Complex Multiplier) of implementing complex multiplier

The multiplication A*B can be implemented in a straight forward way by using four multipliers and two adders.

The table below shows the number of LUTS, DSP Blocks and data arrival time that are consumed for each stage for different number of points (size of DFT). Here we had considered for 1024, 256 and 64 points.

The below given synthesis results in tables 4.2 and 4.3 are with FPGAs Cyclone II, Stratix III respectively with first approach of complex multiplication.

4.3.3 Second approach of implementing complex multiplier

DSP is multiplication intensive technology. So it is worthy to save number of embedded multipliers or DSP blocks.

The multiplication A*B can also be implemented by using 3multipliers and 5 adders as per the below equation. The only difference from normal computation is the utilization of sum of the twiddle factors (real and imaginary values) for the computation which reduces the number of adders.

Pr = ar * (br+bi) - (ar+ai) * bi Pi = ar * (br+bi) + (ai-ar) * br

A=ar+jai

B=br+jbi

re im

re im

ar

br

bi

The below given synthesis results in tables 4.4 and 4.5 are with FPGAs Cyclone II, Stratix III respectively with Second approach of complex multiplication.

4.3.4 Third approach of implementing complex multiplier

Here the inputs coming from the ROMs are added before it is given as input to the multiplier and is given as third input along with other two inputs. This reduces number of adders and the total implementation takes four adders and three multipliers.

60

39

Pr = ar*(br+bi)-(ar+ai)*bi

60

39

4.3.5 Fourth Approach of implementing complex multiplier

Pi = ar*(br+bi)+(ai-ar)*brIn this approach Error analysis and complexity optimization for the multiplier-less FFT-like transformation (ML-FFT) Tsui, K.M. Chan, S.C. Tse, K.W. Dept. of Electr. & Electron. Eng., Hong Kong Univ., China. the coefficient values tan Z/2 and Sin Z are calculated from the MATLAB. This approach still reduces the number of LUTs consumed and is implemented with 3 multipliers and 3 adders.

The below given synthesis results in tables 4.8 and 4.9 are with FPGAs Cyclone II, Stratix III respectively with fourth complex multiplier approach.The below given graphs in 4.1 and 4.2 shows the number of LUTs consumed for each stage with all the four complex multiplier approaches implemented on FPGAs Cyclone II, Stratix III respectively.

Graph 4.1: Number of LUTs consumed for different stages with different Complex multiplication approaches implemented on FPGA Cyclone II.

Graph 4.2 Number of LUTs consumed for different stages with different Complex multiplication approaches implemented on FPGA Stratix III.

Graph 4.3: The Data Arrival Time (in ns) required for different stages with different Complex multiplication approaches implemented on FPGA Cyclone II

Complex Multiplier(FirstApproach)

Complex Multiplier(SecondApproach)

Complex Multiplier(ThirdApproach)

Complex Multiplier(FourthApproach)

60

39

Graph 4.4: The Data Arrival Time (in ns) required for different stages with different Complex multiplication approaches implemented on FPGA Stratix III

From the graphs 4.1 to 4.4 it can be concluded that the fourth approach is the least complex with longest delay.

The below given table 4.10 shows the synthesis results with number of LUTS , DSP blocks and Data Arrival Time consumed for different bit widths with first complex multiplier approach with FPGA Cyclone II.

Width(in bits)

Data Arrival Time(ns*10-1)

LUTS

DSP Blocks

2

53.82

26

0

3

73.27

14

4

5

76.51

22

4

7

78.55

30

4

The below given graph 4.5 shows the synthesis results with number of LUTS , DSP blocks and Data Arrival Time consumed for different bit widths with first complex multiplier approach with FPGA Cyclone II.

Graph 4.5: Plots the Data Arrival Time (ns*10-1), LUTs and DSP blocks of First Approach of Complex multiplier versus Bit Widths (X-axis) implemented on FPGA Cyclone II

The below given table 4.12 shows the synthesis results with number of LUTS , DSP blocks and Data Arrival Time consumed for different bit widths with second complex multiplier approach implemented on FPGA Cyclone II.

Width(in bits)

Data Arrival Time((ns*10-1))

LUTS

DSP Blocks

1

42.74

1

0

2

61.34

29

0

3

86.63

26

3

5

84.59

40

3

7

88.85

54

3

9

93.11

68

3

The below given graph 4.7 shows the synthesis results with number of LUTS , DSP blocks and Data Arrival Time consumed for different bit widths with second complex multiplier approach implemented on FPGA Cyclone II.

Graph 4.7: Plots the Data Arrival Time (ns*10-1), LUTs and DSP blocks of Second Approach of Complex multiplication versus Bit Widths (X-axis)

Implemented on FPGA Cyclone IIThe below given graph 4.8 shows the synthesis results with number of LUTS , DSP blocks and Data Arrival Time consumed for different bit widths with second complex multiplier approach implemented on FPGA Stratix III.

The below given table 4.14 shows the synthesis results with number of LUTS , DSP blocks and Data Arrival Time consumed for different bit widths with third complex multiplier approach implemented on FPGA Cyclone II.

Width(in bits)

Data Arrival Time(ns*101)

LUTS

DSP Blocks

1

42.74

1

0

2

65.24

28

0

3

86.30

22

3

The below given graph 4.9 shows the synthesis results with number of LUTS , DSP blocks and Data Arrival Time consumed for different bit widths with third complex multiplier approach implemented on FPGA Cyclone II.

Graph 4.9: Plots the Data Arrival Time (ns*10-1), LUTs and DSP blocks of Third Approach Complex Multiplier implemented on versus Bit Widths (X-axis) with FPGA Cyclone II

The Below given table 4.16 shows the synthesis results with number of LUTS , DSP blocks and Data Arrival Time consumed for different bit widths with fourth complex multiplier approach implemented on FPGA Cyclone II.

Width(in bits)

Data Arrival Time (ns*10-1)

LUTs

DSP Blocks

2

77.19

9

0

3

161.28

16

3

5

149.31

20

3

7

152.37

28

3

9

155.43

36

3

The below given graph 4.11 shows the synthesis results with number of LUTS , DSP blocks and Data Arrival Time consumed for different bit widths with fourth complex multiplier approach implemented on FPGA Cyclone II.

Graph 4.11: Plots the Data Arrival Time (ns*10-1), LUTs and DSP blocks of Fourth Approach Complex Multiplier Implementation versus Bit Widths (X-axis) implemented on FPGA Cyclone II

Graph 4.12: Plots the Data Arrival Time (ns*10-1), LUTs and DSP blocks of Fourth Approach Complex Multiplier Implementation versus bit Widths (X-axis) with FPGA Stratix III

The below given table 4.18 shows the Synthesis results of a complete 64 point Radix 22 FFT architecture implementation before and after pipelining with first multiplier approach using FPGA Cyclone II.

Tools and languages used for the implementation of the project

The project is implemented in VHDL. HDL Designer, MODELSIM and Precision RTL Synthesis are used for design creation, simulation and synthesis of the design. MATLAB is mainly used in computing the values of the twiddle factors.

Testing

Each individual block is tested separately with different input values. After all the individual blocks were tested successfully, they were implemented together. And values obtained from the implemented FFT architecture are compared with the values obtained from the MATLAB for a given input.

Analysis of the Result

Based on the results from tables 4.2 to 4.17 the following analysis is done.

Analysis of synthesis results (LUTs & DSP blocks consumed by different complex multiplier approaches with FPGA Cyclone II)

Total number of Look-up-tables (LUTs) used is (2W+1)2 for first approach 1. This equation is valid for (W= [3, 17]). Total number of Look-up-tables (LUTs) used is 7W+5 for second approach. This equation is valid for (W= [3, 18]). Total number of Look-up-tables (LUTs) used is 6W+4 for the third approach. This equation is valid for (W= [3, 18]). They are used during the addition operations. The number of (LUTs) is 4W for fourth approach and this equation is valid for (W= [5, 18]). There is a deviation from these equations above the specified widths. For the higher values of W the unusual increase in LUT consumption still remains mystery.

The number of multiplications is 4 in normal multiplier approach and 3 in other three approaches. That means each multiplication operator is implemented by 1 DSP block for W= [3, 9] and 2 DSP blocks for W= [10, 18]. For the higher values of W the unusual increase in DSP block consumption still remains unknown.

Analysis of synthesis results (LUTs & DSP blocks consumed by different Multiplier approaches with FPGA Stratix III)

Total number of Look-up-tables (LUTs) used in first approach is (2W+1)2 for W=[19,36]; in second approach, it is 7W+7 for W=[5,7] and 7W+5 for W= [8, 36]; for the third approach, it is not given by a particular function as the FPGA is implementing some of the adder functionality in an extra DSP block and in fourth approach it is 3W for W=[5,36]. The LUTs are used during the addition operations.

The number of multiplications is 4 in normal multiplier approach and 3 in other three approaches.

As mentioned in the section 4.2, In Stratix III, DSP blocks not only do multiplications but they also do multiplication and accumulation operations. In the first approach, 1 DSP block is used for each multiplier implementation for W= [1, 17], 2 DSP blocks for W=18 and 4 DSP blocks for W= [19, 36]. In the second approach, 1 DSP block is used to implement each multiplier for W=[5,9], 2 DSP blocks are used to implement each multiplier for W= [17, 18] and 4 DSP blocks for W= [19, 36]. In the third approach, 4 DSP blocks are used to implement three multipliers for W= [3, 17], 8 DSP blocks for W=18 and 12 DSP blocks for W= [19, 34]. Here, DSP blocks are involved in implementing addition operations along with multiplications. In the fourth approach, 1 DSP block is used for each multiplier implementation for W= [5, 17], 2 DSP blocks for W=18 and 4 DSP blocks for W= [19, 36].

The synthesis results obtained were different from the values in the table 4.13 on few occasions with second multiplier approach using FPGA Stratix III. The different values obtained are shown below in table 4.19.This shows that the implementation is not same in all occasions and some functionality is moved from one resource (Either DSP block or LUTs) to other.

4.6.3 Analysis of synthesis results (before and after pipelining)

On synthesis of the complete FFT design for 64 point with 2 stages the values in Table 4.18 are obtained. By introducing pipelining the critical path is shortened thereby decreasing the data arrival time.

4.7 Conclusion of the project

Complex Multiplier block in the Radix 22 FFT architecture has been implemented in different ways. And as stated above the fourth approach (Multiplier using tan values) uses fewer number of LUTs, DSP blocks when compared to other approaches. Our main goal in this project is to reduce the consumption of LUTs, DSP blocks and Registers by the architecture.

5. Problems faced during the course of the project

Problem in understanding FFT Architecture:

It took more time for us to understand the architecture. The problem includes the implementation and generation of the twiddle factors in between the stages. To get it worked it took long time to understand it properly.

Problem in testing:

As we used MODELSIM it was little not worthy to give inputs each and every time when we are testing individual blocks. But when we were testing for the complete system we have given input from a file, this has reduced a lot of work.

Finding Twiddle factor coefficient:

We had problem in finding the twiddle factor coefficients.

Here, twiddle factor coefficients can be calculated using

WNnk=e"j2TTnk/N

Where, WN are the twiddle factors

N is the size of the DFT In this project we had used MATLAB to find the coefficients.

e-jx = cosx - jsinx

Where x=-j2TTnk/N;

The twiddle factor values were provided to the multiplier as two separate inputs real (cosx) and imaginary (sinx) values. When computing in the multiplier by using these values (real and imaginary values) for calculating the value of e, we haven't considered the negative sign as above expression

We have analyzed all the values coming from each stage and the solution of how to find these values is given in section 3.4.

6. Future Work

Twiddle factors can be implemented using VHDL to reduce the number of DSP blocks, LUTS and Registers instead of taking values directly from the MATLAB, the solution and the equations related to this work are discussed in section 3.4.

The Synthesis can be performed for different FPGAs.

This architecture can be implemented for higher frequencies.

A VHDL generator can be developed for the complete architecture.

More detailed analysis of synthesis results should be done. For example sometimes in table it has implemented with different LUTs and DSP blocks for the same multiplier approach with same FPGA.

Summary

Radix -22 Algorithms has been chosen to implement and the different blocks in the architecture are analyzed and implemented. After implementing the blocks, the complex multiplier implementation has been done in four different approaches starting from the basic complex multiplication. Synthesis has been performed for each stage (Multiplier with Twiddle Factors) and the values of LUTs and DSP Blocks are taken accordingly. Analysis of the synthesis results has been done and presented in the report.

Bibliography

http://www.xilinx.com/company/gettingstarted/fpgavsasic.htm

2) Sousheng He, and Mats Torkelsson. "A New Approach to Pipeline FFT Processor". Department of Applied Electronics, Lund University, SWEDEN.

http://www.es.isy.liu.se/publications/papers_and_reports/1999/weidongl_ICSPAT99.pdf

E.H. Wold and A.M. Despain. Pipeline and parallel-pipeline FFT processors for VLSI implementation. IEEE Trans. Comput.,C-33(5):414-426,May 1984.

G. Bi and E.V. Jones. A pipelined FFT processor for word sequential data. IEEE Trans. Acoust., Speech, Signal Processing, 37(12):1982-1985, Dec. 1989.

http://www.altera.com/products/devices/stratix-fpgas/stratix/stratix/features/stx-dsp.html

http://www.altera.com/literature/hb/stx3/stx3_siii51005.pdf

http://www.altera.com/literature/wp/wpstxvrtxII.pdf

Error analysis and complexity optimization for the multiplier-less FFT-like transformation (ML-FFT) Tsui, K.M. Chan, S.C. Tse, K.W. Dept. of Electr. & Electron. Eng., Hong Kong Univ., China.


Подобные документы

  • Строение класса complex. Примеры использования класса complex. Результат выполнения программы. Цикл возведения первого числа во второе. Операции с комплексными числами. Конструкторы и операции присваивания для типа complex. Неявные преобразования типов.

    курсовая работа [1,5 M], добавлен 18.05.2011

  • Микропроцессоры с архитектурой Complex Instruction Set Computers. Развитие архитектуры IA-32 в семействе Pentium. Параллельные конвейеры обработки. Усовершенствованный блок вычисления с плавающей точкой. Технология динамического предсказания ветвлений.

    презентация [220,4 K], добавлен 11.12.2013

  • Изучение методов и этапов создания класса Complex, позволяющего работать с комплексными числами и производить с ними следующие операции: сложение, вычитание, умножение, деление двух комплексных чисел. Написание кода для ввода и вывода исходных данных.

    курсовая работа [628,4 K], добавлен 11.09.2010

  • Overview history of company and structure of organization. Characterization of complex tasks and necessity of automation. Database specifications and system security. The calculation of economic efficiency of the project. Safety measures during work.

    дипломная работа [1009,6 K], добавлен 09.03.2015

  • Анализ аналогов информационно-справочной системы Laboratory of complex and atypical prosthetics. Требования к составу и содержанию работ по подготовке объекта автоматизации к вводу системы в действие. Автоматическое обновление каталогов продукции.

    курсовая работа [4,0 M], добавлен 09.07.2023

  • Сущность понятия "комплексное число". Умножение, деление и сложение. Класс Number и два последующих Rational Number Operation и Complex Number. Схема наследования классов исключений. Тестирование программы. Схема работы приложения. Интерфейс программы.

    курсовая работа [607,3 K], добавлен 17.05.2015

  • Program of Audio recorder on visual basic. Text of source code for program functions. This code can be used as freeware. View of interface in action, starting position for play and recording files. Setting format in milliseconds and finding position.

    лабораторная работа [87,3 K], добавлен 05.07.2009

  • Consideration of a systematic approach to the identification of the organization's processes for improving management efficiency. Approaches to the identification of business processes. Architecture of an Integrated Information Systems methodology.

    реферат [195,5 K], добавлен 12.02.2016

  • Technical methods of supporting. Analysis of airplane accidents. Growth in air traffic. Drop in aircraft accident rates. Causes of accidents. Dispatcher action scripts for emergency situations. Practical implementation of the interface training program.

    курсовая работа [334,7 K], добавлен 19.04.2016

  • Рождение и развитие Basic. Краткое описание Visual Basic for Applications. Новые возможности Visual Basic 5.0. Пример взаимодействия Excel и Visual Basic. Программирование табличных функций. Встраивание, применение функций. Формы, средства управления OLE.

    реферат [20,7 K], добавлен 11.03.2010

Работы в архивах красиво оформлены согласно требованиям ВУЗов и содержат рисунки, диаграммы, формулы и т.д.
PPT, PPTX и PDF-файлы представлены только в архивах.
Рекомендуем скачать работу.