Student Reports


Introduction to Cloud Computing Written by: Jonathan Parri PDF
Abstract

Cloud computing serves many purposes, but its main selling point has been addressing business needs. Business concerns come in the form of cost savings and return on investment (ROI), simplified IT infrastructure management, and direct business process integration. Before considering cloud computing as a viable option, the main question anyone should ask is, "can our company even use a cloud environment, and will the company benefit from such a transition?"

Cloud computing is still in its infancy, and throwing a company's IT needs directly into the fire has clear implications that must be addressed. Issues such as security and standardization are at the forefront of cloud computing's next round of development. Current market trends forecast that cloud computing will continue its growth pattern, so it is only a matter of time before we see a large paradigm shift in how companies provide and use technology services. If cloud computing becomes so pervasive that the physical hardware computation platform becomes a thing of the past, then cloud computing will have succeeded in its ultimate goal of becoming the computer.


Embedded multiprocessing in data network applications ELG7199 Final Project Report Written by: Josip Popovic PDF
Abstract

In this report, we look at reasons why ASIC data network solutions are gradually being replaced with more suitable technologies. An overview of network processor (NP) architecture is given, followed by a chapter on Tensilica processors and how they can be used as NP building blocks. Tensilica processors, with their TIE queues, are well suited to producer/consumer-based designs.


GPU Architecture and Programming ELG7199 Final Project Report Written by: Bardia Bandali PDF
Abstract

With the discovery that graphics processors can solve complicated mathematical problems and execute thousands of threads in parallel, the Graphics Processing Unit (GPU) became an interesting area of research. The rapid growth of the 3D game market, performance improvements of GPU chips, and their increasing programmability over the last decade boosted research activity in this field and revealed more potential applications for GPU programming.

This report provides a survey of the emergence of GPUs, their architectural details, and their programming, introducing step by step the key elements and modules of modern GPUs. Chapter one introduces the history of graphics adapters and their evolution, and then explains the cause of the scientific popularity of GPUs: general-purpose computation on GPUs (GPGPU). Chapter two reviews the graphics structure, including its elements, functions, and pipeline. Chapter three then discusses data parallelism and the methods, models, and architecture required to handle it. Chapter four describes modern GPU architecture, its memory, and Cayman, a new architecture from AMD. Programming of GPUs is also investigated in chapter four, by exploring and classifying the different layers of the AMD software stack: from OpenCL at the highest level, through the Compute Abstraction Language (CAL) and the Intermediate Language (IL), down to the Instruction Set Architecture (ISA). Finally, chapter five presents some limitations and potential research opportunities for the future.


Automatic Parallelization for Java ELG7199 Final Project Report Written by: Mathieu Thibault-Marois PDF
Abstract

The approaches covered in this report have shown that parallelizing Java programs is possible, and viable, at both compile time and run time for both CPUs and GPUs, providing speed-up with minimal, if any, intervention from the programmer. The proof of concept has been done, and future work on parallelizing Java for these two hardware platforms will most likely take the form of making the process more efficient, more accurate, more adaptive, and more integrated. The number of cores in systems continues to increase with no end in sight; as such, the benefits gained from parallelization will only grow for the foreseeable future.


Survey of Approaches for Successful Design and Programming of Multiprocessor Systems on Chip Written by: Krste Mitric, Miodrag Bolic, Voicu Groza PDF
Abstract

The Multiprocessor System on Chip (MPSoC) has become the solution of choice for many applications with high processing requirements over the last few years. Multimedia, network processing, gaming, and automotive are just some examples where single-processor solutions hit a performance and power-consumption wall, despite the performance gains of newly emerging processors. Strict requirements for high processing power and low power consumption in some of these applications can only be met with a heterogeneous MPSoC, with different types of processors targeting the appropriate segments of the application. A correct interconnection topology with proper communication, including data exchange and synchronization, appears to be the key to meeting application requirements. Therefore, the interconnection among processing and memory elements carries the largest weight in the MPSoC design process. In this paper we discuss MPSoC design with emphasis on processor interconnection as the key to successful mapping of parallelized application software onto multiple processing elements. We focus on the issues, challenges, and solutions proposed by leading experts from industry and academia in the area of MPSoC design methodology and automation. In addition to interconnection, we discuss design space exploration, the types of selected processors that determine the system architecture, power consumption, parallelizing sequential software, and mapping the software onto the MPSoC - issues that system architects and designers face when creating an MPSoC.


Signal Processing and General Purpose Computing on GPU Written by: Muge Guher PDF
Abstract

Graphics processing units (GPUs) have been growing in popularity due to their impressive processing capabilities, and with general-purpose programming languages such as NVIDIA's CUDA interface, they are becoming the platform of choice in the scientific computing community. Today the research community successfully uses GPUs to solve a broad range of computationally demanding, complex problems. This effort in general-purpose computing on the GPU, also known as GPU computing, has positioned the GPU as a compelling alternative to traditional microprocessors in high-performance computer systems and DSP applications. GPUs have evolved a general-purpose programmable architecture and supporting ecosystem that make them usable in a wide range of non-graphics tasks, including many applications in signal processing. Commercial, high-performance signal processing applications that use GPUs as accelerators for general-purpose tasks are still emerging, but many aspects of GPU architecture and their wide availability make them interesting options for implementing and deploying such applications, even though memory bottleneck challenges have to be overcome in real-time processing. This report describes the brief history, hardware architecture, and programming model of GPU computing, as well as the modern tools and methods in recent use. The application space of GPGPU is explored with examples of signal processing applications and with results of recently conducted evaluations and benchmarks.


Fuzzy Logic Co-processor for Leon 3 ISELeon3 Final Report. Written by: Daniel Shapiro, Vishal Thareja, Saurabh Ratti, Srivatsan Vijayakumar, Muran Yang. Supervisor: Dr. Miodrag Bolic
Abstract

The goal of our project is to automate the implementation of an ASIP control system, specifically a fuzzy logic control system. Our tool will accept a source program as input, and will export processor and co-processor VHDL to be programmed into an FPGA, along with assembly language for that processor to run.
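The control system generated by the tool is a fuzzy logic controller. As a rough illustration of what such a controller computes (fuzzification, rule evaluation, defuzzification), here is a minimal software sketch; the membership functions, rule base, and values are entirely hypothetical and are not taken from the project's generated VHDL.

```python
# Minimal fuzzy logic controller sketch (hypothetical example; not the
# generated hardware described above). Single input "error", single
# output "drive", triangular membership functions, weighted-average
# (centroid-of-singletons) defuzzification.

def tri(x, a, b, c):
    """Triangular membership function peaking at b, zero outside (a, c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzy_control(error):
    # Fuzzification: degree of membership in each linguistic term.
    neg  = tri(error, -2.0, -1.0, 0.0)
    zero = tri(error, -1.0,  0.0, 1.0)
    pos  = tri(error,  0.0,  1.0, 2.0)
    # Rule base with singleton outputs:
    # IF error IS neg THEN drive = -1, etc.
    rules = [(neg, -1.0), (zero, 0.0), (pos, 1.0)]
    num = sum(w * out for w, out in rules)
    den = sum(w for w, _ in rules)
    return num / den if den else 0.0
```

A hardware implementation would typically replace the floating-point arithmetic with fixed-point lookup tables, which is part of what makes fuzzy control attractive as an ASIP co-processor.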


Selective Automation of GNU Binutils Assembler and Linker for Custom SPARC-v8 Instruction Sets ELG7199 Final Project Report Written by: Jonathan Parri PDF
Abstract

This paper looks to automate the available tools through a Java interface while allowing for the customization of instructions. This customization is of particular importance to embedded system designers, where application-specific processors can be used. Embedded systems often have varied memory architectures, which must be supported as well. This customization becomes imperative when an OS employing virtual memory is not used.


Assembly Code to Memory Initialization File Generator for the SHIRA Tool Chain Focus on the Loader Sub-Project ELG 7199 Final Project Report. Written by: Michael Montcalm PDF
Abstract

This paper presents the steps taken to design and implement an assembly-code-to-memory-initialization-file converter. Specifically, it details the conversion of the Executable and Linkable Format (ELF) to the memory initialization file format.


FPGA Implementation of Data Stream Processing System Written By: Daniel Shapiro PDF
Abstract

A configurable data stream processing system was implemented on an FPGA. This design, called QUAY-FPGA, was based on the QUAY system. The design consists of configurable hardware written in the Handel-C language and compiled using the DK design suite. It will be demonstrated that query and data throughput were improved by managing memory access patterns and increasing execution parallelism. QUAY was implemented as a multithreaded application executed on a Pentium II; in contrast, QUAY-FPGA was implemented on an FPGA connected to a block of RAM.


An Optimized Architecture for 2D Discrete Wavelet Transform on FPGAs Using Distributed Arithmetic Written By: Patrick Longa PDF
Abstract

In the last few years, there has been a growing trend to implement DSP functions in Field Programmable Gate Arrays (FPGAs), which offer a balanced solution in comparison with traditional devices. Although ASICs and DSP chips have been the traditional solution for high-performance applications, the technology and the market are now imposing new rules. On one hand, the high development costs and time-to-market factors associated with ASICs can be prohibitive for certain applications; on the other hand, programmable DSP processors can be unable to reach the desired performance due to their sequential-execution architecture. In this context, FPGAs offer a very attractive solution that balances flexibility, time-to-market, cost, and performance.

Following this trend, the research community has focused on evaluating DSP functions on FPGAs to take advantage of the high level of parallelism that can be reached with these devices. In this vein, the Discrete Wavelet Transform (DWT), a relatively new compute-intensive signal transform, has been explored, and several architectures have been proposed to achieve a performance level that is difficult to reach with traditional PDSP devices.

Although wavelets have been around for a while in the area of mathematics, it is only recently that they were formally formulated and began to be used extensively. In 1986, a joint effort between Mallat and Meyer gave birth to multiresolution analysis with wavelets, and in 1988, building on this work, Daubechies discovered the most widely used and best-known family of wavelets, named after her: the Daubechies wavelets. Nowadays, the wavelet transform is used intensively in speech, image, and video processing, and in signal processing in general, because of its attractive ability to represent non-stationary signals in both the frequency and time domains. Researchers have switched from the Short-Time Fourier Transform (STFT) to the DWT because the former uses a fixed resolution at all times, while the latter provides a variable resolution that follows the pattern observed in most applications: high-frequency components of signals have a short duration, while low-frequency components have a long duration. Furthermore, the DWT has been adopted by recent still-image and video coding standards, JPEG2000 and MPEG-4, given its high performance for image and video compression, showing superior results compared with the traditionally used Discrete Cosine Transform (DCT).
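The multiresolution idea above can be sketched in a few lines: one DWT level passes the signal through a low-pass and a high-pass filter and downsamples each branch by two. The sketch below uses the Haar filterbank, chosen only because its coefficients are exact; the project itself uses 8-tap Daubechies filters, whose coefficients are not reproduced here.

```python
# One level of a 1-D discrete wavelet transform, sketched with the Haar
# filterbank. Repeating the function on the approximation branch yields
# the multiresolution decomposition described in the text.

import math

def haar_dwt_level(signal):
    """Split an even-length signal into approximation (low-pass) and
    detail (high-pass) coefficients, each downsampled by 2."""
    s = 1.0 / math.sqrt(2.0)
    approx = [s * (signal[i] + signal[i + 1]) for i in range(0, len(signal), 2)]
    detail = [s * (signal[i] - signal[i + 1]) for i in range(0, len(signal), 2)]
    return approx, detail
```

Note how the downsampling is implicit in the stride-2 loop; the architecture described below achieves the same effect in hardware by realizing the downsampling stage inside the filterbank.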

For the present project, we have developed an innovative two-dimensional Discrete Wavelet Transform (2D-DWT) architecture for image compression applications based on the well-known Distributed Arithmetic (DA) technique, which exploits the LUT-based FPGA structure to build a multiplier-less filterbank, the main component in a DWT structure. With the help of a special memory arrangement, the fully parallel DA-based 2D-DWT has been time-multiplexed in such a way that the downsampling stage is realized seamlessly inside the filterbank structure. With this scheme, our implementation achieves twice the speed of traditional proposals because a resultant sample is output every clock cycle. Furthermore, this is achieved with the input samples running at the same clock rate as the rest of the circuit.

The proposed DA-based architecture for the 2D-DWT has been implemented using 8-tap Daubechies filterbanks in Simulink (DSP Builder) and Altera Quartus II. Simulations with several images have been performed in Matlab to evaluate performance. This new architecture is intended for image compression applications, specifically for JPEG2000, the new image-coding standard that has adopted the DWT as its main component.


Instruction Set Extension Identification Under Local, Global, and Solver Execution Time Constraints Written By: Daniel Shapiro, Saurabh Ratti, Miodrag Bolic PDF
Abstract

The instruction set extension (ISE) identification problem is the search for a set of custom instructions that can be added to a base processor. A sub-problem is to complete the ISE identification without searching for too long. One aspect of this area of research that needs improvement is the execution time of the state-of-the-art algorithms used to solve this problem. Thus far, solutions to large problems are exact but return results too slowly. We propose a non-exact solution in the form of an integer linear program (ILP) and proceed to describe our toolchain. We show that heuristics can improve the execution time of the solver by finding an incumbent for each iteration of the ILP. We show the need for a mix of small instructions that are repeated often and large instructions that, although rare, provide a large speedup. We propose a method for reducing the execution time of the solver by allowing the user to define a cutoff execution time between improvements in the solution and a maximum execution time for the solver. Furthermore, we define the objective function as the number of cycles saved due to the selection of custom instructions, because this measurement is additive. When identifying ISEs, we explain how merging identical ones takes advantage of common patterns in the code, and we show how to sum the speedup values now that they are additive. We include in the ISE identification problem some local and global constraints on the code, defined by the user.
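The additive cycles-saved objective described above can be illustrated with a simple selection heuristic: each candidate instruction contributes an (additive) cycles-saved value and an area cost, and candidates are chosen under an area budget. This greedy sketch is only a stand-in for the ILP formulation in the paper, and the candidate data is purely hypothetical.

```python
# Greedy stand-in for ILP-based custom-instruction selection: rank
# candidates by cycles saved per unit area, then pick while the area
# budget allows. Because the objective is additive, the total speedup
# is simply the sum of the chosen candidates' cycles-saved values.

def select_ises(candidates, area_budget):
    """candidates: list of (name, cycles_saved, area). Returns chosen names."""
    chosen, used = [], 0
    for name, saved, area in sorted(
            candidates, key=lambda c: c[1] / c[2], reverse=True):
        if used + area <= area_budget:
            chosen.append(name)
            used += area
    return chosen
```

A greedy ratio heuristic like this can supply the ILP solver with an incumbent solution, which is one way heuristics reduce solver execution time as described in the abstract.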


Design Space Exploration: Comparative Study of Simulated Annealing and Particle Swarm Optimization Written By: Vishal Thareja PDF
Abstract

As the complexity of designs implemented in embedded systems increases, some exploration of hardware configurations is needed to find the best design that will meet the design constraints. Design Space Exploration is a technique used by many to find the best hardware configuration that fits the design constraints. In this project, a popular algorithm, Simulated Annealing, and a relatively new algorithm, Particle Swarm Optimization, are analyzed and implemented to perform design space exploration on a Fuzzy Logic Coprocessor. Results from both algorithms are observed and compared to suggest the more efficient algorithm. The Fuzzy Logic Coprocessor is a small design with very low complexity. This makes the analysis of Simulated Annealing and Particle Swarm Optimization much easier to understand; with that strong foundation in place, the problem complexity can be increased and more results gathered from the algorithms.
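For readers unfamiliar with one of the two algorithms compared here, the core simulated-annealing loop is short enough to sketch. The toy 1-D cost function and all parameter values below are illustrative only; the actual design points and cost model for the coprocessor are not reproduced.

```python
# Minimal simulated-annealing loop of the kind compared in this report,
# shown minimizing a toy 1-D cost function. Worse moves are accepted
# with probability exp(-delta/T), and T decays geometrically, so the
# search explores early and settles later.

import math, random

def anneal(cost, x0, step=0.5, t0=1.0, cooling=0.95, iters=200, seed=1):
    random.seed(seed)
    x, t = x0, t0
    best, best_cost = x0, cost(x0)
    for _ in range(iters):
        cand = x + random.uniform(-step, step)
        delta = cost(cand) - cost(x)
        # Always accept improvements; accept worse moves probabilistically.
        if delta < 0 or random.random() < math.exp(-delta / t):
            x = cand
        if cost(x) < best_cost:
            best, best_cost = x, cost(x)
        t *= cooling  # geometric cooling schedule
    return best
```

Particle Swarm Optimization replaces the single annealed point with a population of particles that share their best-known positions, which is the essential contrast the report's comparison explores.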


Implementation of an Oversampled Subband Acoustic Echo Canceler Written By: Brady Laska PDF
Abstract

Acoustic echoes in a telephone system occur whenever the signal radiated by a telephone’s speaker is picked up by its microphone. If left uncanceled, the far-end talker hears a delayed version of their own voice. These echoes disturb the talker and reduce the naturalness of conversations. In extreme cases, if there is acoustic coupling at both ends of the connection, howling instability can result. Acoustic echoes frequently occur in hands-free cellular and teleconferencing scenarios, where the acoustic path between the speaker and the microphone is relatively unobstructed. An adaptive echo canceler is used to remove the echo from the microphone signal before it is sent to the far-end talker. The objective of this project is to implement a subband acoustic echo canceler running in real-time on a programmable DSP processor.
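The adaptive cancellation principle described above can be sketched with a normalized LMS (NLMS) filter: the adaptive filter estimates the echo path from the far-end signal, and the estimated echo is subtracted from the microphone signal before transmission. This fullband floating-point loop only illustrates the adaptation idea; the report implements a subband canceler running in real time on a DSP.

```python
# NLMS adaptive echo canceler sketch. `far_end` is the loudspeaker
# signal, `mic` the microphone signal containing its echo; the returned
# error signal is what would be sent back to the far-end talker.

def nlms_cancel(far_end, mic, taps=4, mu=0.5, eps=1e-8):
    """Return the echo-cancelled (error) signal."""
    w = [0.0] * taps           # adaptive estimate of the echo path
    out = []
    for n in range(len(mic)):
        # Most recent `taps` far-end samples (zero-padded at the start).
        x = [far_end[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        y = sum(wk * xk for wk, xk in zip(w, x))   # estimated echo
        e = mic[n] - y                             # residual after cancellation
        # Normalized LMS update: step size scaled by input energy.
        norm = sum(xk * xk for xk in x) + eps
        w = [wk + mu * e * xk / norm for wk, xk in zip(w, x)]
        out.append(e)
    return out
```

A subband canceler applies this same adaptation independently in each frequency band, which shortens the filters and speeds convergence for long acoustic echo paths.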


Implementation of Algorithms for QRS Detection from ECG Signals Using TMS320C6713 Processor Platform Written By: Geoffrey Green PDF
Abstract

The electrocardiogram (ECG) provides a physician with a view of the heart’s activity through electrical signals generated during the cardiac cycle, and measured with external electrodes. Its clinical importance in cardiology is well established, being used for example to determine heart rate, investigate abnormal heart rhythms, and causes of chest pain.

Because the QRS complex is the major feature of an ECG, a great deal of clinical information can be derived from its features. Identification of this feature in an ECG is known in the literature as QRS detection, and it is a vital task in automated ECG analysis, portable arrhythmia monitoring, and many other applications. Though trivial in an “ideal” ECG, the range in quality of real-world ECG signals obtained from a variety of subjects under different measurement conditions makes this task much more difficult.
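To make the detection task concrete, here is a deliberately naive QRS detector: squaring the signal emphasizes the large QRS deflection, a threshold marks candidate beats, and a refractory period ensures one detection per beat. This is only a toy sketch; the algorithms evaluated in the report (in the Pan-Tompkins tradition common to this literature) add bandpass filtering, derivative stages, and adaptive thresholds precisely because real-world ECGs defeat a fixed threshold.

```python
# Toy QRS detector: threshold on the squared signal with a refractory
# period. The threshold and refractory values below are arbitrary
# illustrative numbers, not clinically derived parameters.

def detect_qrs(ecg, threshold=0.5, refractory=5):
    """Return sample indices of detected QRS peaks."""
    peaks, last = [], -refractory
    for n, v in enumerate(ecg):
        if v * v > threshold and n - last >= refractory:
            peaks.append(n)
            last = n
    return peaks
```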


A Purely Fixed-Point Implementation of the FastICA Algorithm Written By: Nikola Rank PDF
Abstract

The FastICA algorithm was introduced in 1997 by Aapo Hyvarinen and Erkki Oja of the Helsinki University of Technology. The basic concept is to take a neural network learning rule and convert it into a fixed-point iteration. This yields "an algorithm that is very simple, does not depend on any user-defined parameters, and is fast to converge to the most accurate solution allowed by the data."

Two important applications of ICA are blind source separation, and feature extraction. This includes some very focused, interesting applications such as analysis of fMRI data, and fetal heart monitoring. ICA is still in its infancy and its applications are still growing.

This particular algorithm can be used in batch mode (processing all the data at once) or in a semi-adaptive manner, working with subsets of the data at a time. In one experiment comparing it to a neural network algorithm, FastICA required 10% of the floating-point operations used by the neural algorithm. In the same paper, the convergence of the FastICA algorithm is proven to be cubic, much faster than other similar algorithms. Another important aspect is that FastICA can be used to extract only the desired components, instead of having to extract them all at once (though this requires proper initialization of the unmixing matrix).

With these desirable features, FastICA is a good candidate for porting to a fixed-point implementation. This report will first focus on the original floating-point method and provide a set of tests and results for it. Then a modified version of the algorithm for fixed-point computation will be tested.
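The fixed-point iteration mentioned above can be sketched for a single component. Assuming whitened data and the tanh nonlinearity, the textbook one-unit update is w ← E{x g(wᵀx)} − E{g′(wᵀx)} w, renormalized each step; this plain-Python sketch is the floating-point baseline, not the fixed-point port the report develops.

```python
# One-unit FastICA fixed-point iteration with g = tanh, g' = 1 - tanh^2.
# Input rows are assumed to be whitened (zero-mean, unit-covariance)
# observations; the returned row w extracts one independent component.

import math, random

def fastica_one_unit(X, iters=50, seed=0):
    """X: list of whitened observation vectors. Returns one unmixing row."""
    random.seed(seed)
    d = len(X[0])
    w = [random.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        new = [0.0] * d
        gp_mean = 0.0
        for x in X:
            u = sum(wi * xi for wi, xi in zip(w, x))
            g, gp = math.tanh(u), 1.0 - math.tanh(u) ** 2
            for i in range(d):
                new[i] += x[i] * g        # accumulate E{x g(w.x)}
            gp_mean += gp                 # accumulate E{g'(w.x)}
        n = len(X)
        w = [new[i] / n - (gp_mean / n) * w[i] for i in range(d)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        w = [wi / norm for wi in w]       # renormalize to the unit sphere
    return w
```

Note the absence of a user-tuned step size: the iteration is parameter-free, which is exactly the property the abstract quotes, and the renormalization step is one of the operations that must be handled carefully in a fixed-point port.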


Multiprocessor System on Chip Architecture for Kalman Filter based Speech Enhancement Algorithms Written By: Peter Farkas PDF
Abstract

The purpose of this project was to find a multiprocessor architecture that is well suited to implementing a subset of speech enhancement algorithms. Once found, it was implemented on an FPGA using Nios II processors, in addition to other components including memory blocks. The implemented architecture and the algorithm running on it are described in the following sections.


Real-Time Acoustic Echo Cancellation Written By: Trevor Burton PDF
Abstract

This report discusses two acoustic echo cancellation algorithms that were implemented in real-time on a PDSP and compares them with previous implementations found in the literature. The algorithms are compared in terms of acoustic echo cancellation performance and computational complexity. The target PDSP that was used is the Texas Instruments TMS320C6713 floating-point device.


Programmable System-on-Chip Technology from Cypress Semiconductor Written By: Abdallah Ismail PDF
Abstract

A typical embedded system application makes use of a small processor that coordinates the execution and processing of data between peripherals. Program code is usually stored in on-chip flash memory, while data is stored in and retrieved from on-chip RAM. A System-on-Chip (SoC)-based embedded system is one that uses configurable hardware surrounding a soft or hard processor core. The purpose of this report is to study the architecture of the Programmable System-on-Chip (PSoC) from Cypress Semiconductor and compare it to the more conventional FPGA-based SoC architecture.


Parallel Implementation of Modified Rao-Blackwellised Particle Filter Written By: Suruz Miah, Miodrag Bolic PDF
Abstract

The particle filter (PF) algorithm has posed an open and challenging problem over the last decades with respect to its computational complexity. Despite significant advances in this field, researchers have yet to reach a satisfactory level of computational overhead. Particle filtering methods are by nature computationally expensive when the dimension of the state vector is large, as may be the case in speech processing and tracking applications, to name a few. In such cases, a large number of particles is typically required to achieve a performance that would justify the use of particle filtering. Rao-Blackwellised Particle Filters (RBPF) are an alternative to the PF algorithm that requires far fewer particles to achieve similar performance but, in exchange, requires more computations per particle. A further improvement of the RBPF (the modified RBPF) differs from the generic RBPF in the way the particles are propagated from one iteration to the next.

The main goal of this paper is to design and implement the modified RBPF algorithm on a multiprocessing system in order to reduce the computational complexity and to find the optimal number of Processing Elements (PEs).
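To show where the per-particle cost discussed above comes from, here is one step of a generic bootstrap (SIR) particle filter for a 1-D random-walk state: every particle is propagated, weighted against the measurement, and resampled. This is the plain PF baseline, not the modified RBPF of the paper, and the noise parameters are illustrative only.

```python
# Generic bootstrap (SIR) particle filter step. Each of the three
# phases touches every particle, which is why the particle count
# dominates the computational cost and why parallelizing across
# processing elements is attractive.

import math, random

def pf_step(particles, z, proc_std=0.1, meas_std=0.5):
    """One predict-weight-resample cycle; returns the new particle set."""
    # Predict: propagate each particle through the random-walk model.
    pred = [p + random.gauss(0.0, proc_std) for p in particles]
    # Weight: Gaussian likelihood of measurement z given each particle.
    w = [math.exp(-0.5 * ((z - p) / meas_std) ** 2) for p in pred]
    total = sum(w)
    w = [wi / total for wi in w]
    # Resample: draw a new set proportional to the weights.
    return random.choices(pred, weights=w, k=len(pred))
```

The state estimate at each step is simply the mean of the particle set; the Rao-Blackwellised variant reduces the particle count by handling the linear-Gaussian part of the state analytically with a Kalman filter per particle.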


Grid Computing using DE2 and NIOS II Written By: Jason Stoddart and Jean-Francois Bibeau. Reviewed by: Daniel Shapiro and Zahia Aidoud PDF
Abstract

Grid computing is the way of the future for solving complex problems that would be impossible for a single organization without access to a supercomputer. Grid computing allows a large problem to be divided into different sections of work that can be performed independently. A central server divides the problem and assigns a portion to each client that connects to it. When a client has finished its task, it returns the result to the server and receives another portion of the problem. This approach allows many "low-budget" computers to be used the same way as a supercomputer. Another advantage is that it can make good use of unused CPU cycles that would otherwise be wasted.

We have decided to perform prime number calculations as our grid computing task. This problem was chosen because the next calculation does not require the result of the previous one. It therefore lends itself very well to dividing the problem into multiple sections and allowing different processors to work on each section independently.

We have decided to compare grid computing both to a multi-core computer and to a single-core computer. These platforms will use the Altera DE2 board and run the NIOS II SOPC. The multicore FPGA will represent a "supercomputer" and will consist of three cores: one for synchronization and the other two for prime number calculations. The single core represents a "regular PC", which will be used for both the sequential and grid computing approaches. The hope of this project is to illustrate the idea of grid computing: that having access to many regular PCs can give more processing power than having access to one supercomputer.
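The server-side work division described above can be sketched in a few lines: the problem (here, counting primes below a limit) is split into independent ranges, each range is a client's work unit, and the server merges the partial results. In this sketch the "clients" are plain function calls standing in for the NIOS II cores; the networking and synchronization of the real system are omitted.

```python
# Sketch of grid-style work division for prime counting. Each range is
# independent, so work units can be handed to any client in any order.

def count_primes_in_range(lo, hi):
    """Trial-division prime count on [lo, hi) -- one client's work unit."""
    def is_prime(n):
        if n < 2:
            return False
        d = 2
        while d * d <= n:
            if n % d == 0:
                return False
            d += 1
        return True
    return sum(1 for n in range(lo, hi) if is_prime(n))

def grid_count_primes(limit, chunk=100):
    """Server: divide [2, limit) into chunks and merge client results."""
    work = [(lo, min(lo + chunk, limit)) for lo in range(2, limit, chunk)]
    return sum(count_primes_in_range(lo, hi) for lo, hi in work)
```

Because the final answer is just the sum of the partial counts, a client's result can be merged whenever it arrives, which is what lets slow and fast clients coexist on the same grid.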


uC/OS-II vs uClinux Written By: Peter McLean and Hala Ayoub. Reviewed by: Daniel Shapiro and Miodrag Bolic PDF
Abstract

The uClinux port is a derivative of the Linux kernel intended for microcontrollers without memory management units (MMUs). It provides a single shared address space for all processes. uC/OS-II, on the other hand, is a portable, ROMable, scalable, preemptive, real-time deterministic multitasking kernel for microprocessors, microcontrollers, and DSPs. In this paper, we implement the uC/OS-II and uClinux kernels on the same NIOS-II platform and compare their performance.