StreamFIR — A Streaming FIR Filter Hardware Accelerator

Motivation

Many signal processing workloads are fundamentally streaming problems.
A CPU processes signals sequentially: every sample costs instruction fetches, memory loads, and branches. That overhead adds latency, burns power, and limits throughput.

Filtering, however, is not a control problem; it is the same arithmetic repeated on every sample.

The core question of this ASIC hackathon is:

Where does hardware beat a CPU?

We chose digital filtering as a minimal but clear demonstration of hardware acceleration.


What the Project Does

StreamFIR is a 4-tap real-time FIR (Finite Impulse Response) filter implemented entirely in synthesizable Verilog.

The module continuously receives 8-bit samples and produces one filtered output sample every clock cycle.

Supported operating modes:

  • Bypass — direct signal output
  • Moving Average — smoothing filter
  • Weighted Low-Pass — noise reduction
  • High-Pass / Edge Detection — transition detection
  • User-Programmable Coefficients — custom filter behavior

The filter operates as a streaming hardware accelerator rather than a program.
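
The operating modes listed above can be pictured as different coefficient sets fed to the same four taps. The following is a minimal Python sketch of that idea. The coefficient values, the right-shift used for scaling, and the names `PRESETS`, `select_coeffs`, and `filter_block` are illustrative assumptions, not the values or signals used in the StreamFIR RTL.

```python
# Hypothetical presets: (coefficients, right shift applied to the sum).
# These numbers are assumptions chosen for illustration only.
PRESETS = {
    "bypass": ([1, 0, 0, 0], 0),            # y[n] = x[n]
    "moving_average": ([1, 1, 1, 1], 2),    # sum of 4 samples, then >> 2
    "weighted_lowpass": ([1, 3, 3, 1], 3),  # taps sum to 8, then >> 3
    "high_pass": ([1, -1, 0, 0], 0),        # first difference
}


def select_coeffs(mode, user_coeffs=None, user_shift=0):
    """Return (coeffs, shift) for a preset mode or for user coefficients."""
    if mode == "user":
        if user_coeffs is None or len(user_coeffs) != 4:
            raise ValueError("user mode needs exactly four coefficients")
        return list(user_coeffs), user_shift
    coeffs, shift = PRESETS[mode]
    return list(coeffs), shift


def filter_block(samples, coeffs, shift):
    """Run a list of samples through a 4-tap FIR, one sample per step."""
    history = [0, 0, 0, 0]  # x[n], x[n-1], x[n-2], x[n-3]
    out = []
    for x in samples:
        history = [x] + history[:3]
        acc = sum(c * h for c, h in zip(coeffs, history))
        out.append(acc >> shift)  # arithmetic shift, also for negatives
    return out


step = [8, 8, 8, 8, 8]
averaged = filter_block(step, *select_coeffs("moving_average"))
edges = filter_block([0, 0, 8, 8, 0], *select_coeffs("high_pass"))
passed = filter_block(step, *select_coeffs("bypass"))
```

With these assumed presets, the moving average ramps up over four samples on a step input, while the high-pass preset responds only where the input changes.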


Hardware Architecture

The design is a fully pipelined streaming datapath:

1. Delay Line

Stores the last four input samples and shifts every clock cycle.

2. Parallel Multiply-Accumulate (MAC)

Each of the four stored samples is multiplied by its own coefficient at the same time, and the four products are summed:

\[ y[n] = c_0\,x[n] + c_1\,x[n-1] + c_2\,x[n-2] + c_3\,x[n-3] \]

All multiplications occur in parallel in hardware.

3. Mode Controller

Selects preset filter behaviors or user-defined coefficients.

4. Register Interface

External logic can configure the filter at run time by writing its registers, so changing the filter does not require re-synthesizing the design.

The pipeline produces one output sample per clock cycle.
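
The delay line and parallel MAC described above can be modeled in a few lines of Python. This is a behavioral sketch, not the project's RTL: the class name `StreamFIRModel` is hypothetical, the model ignores bit widths, and it returns each output in the same step as its input, whereas a pipelined hardware implementation may add a fixed number of cycles of latency.

```python
from collections import deque


class StreamFIRModel:
    """Behavioral sketch of a 4-tap FIR: delay line plus multiply-accumulate."""

    def __init__(self, coeffs):
        if len(coeffs) != 4:
            raise ValueError("this sketch models exactly four taps")
        self.coeffs = list(coeffs)
        # Delay line holds x[n], x[n-1], x[n-2], x[n-3]; it starts cleared.
        self.delay = deque([0, 0, 0, 0], maxlen=4)

    def step(self, sample):
        """Shift in one sample and return y[n]; one call models one clock."""
        # appendleft on a full deque drops the oldest sample from the right.
        self.delay.appendleft(sample)
        # Hardware forms the four products in parallel; here they are summed.
        return sum(c * x for c, x in zip(self.coeffs, self.delay))


fir = StreamFIRModel([1, 2, 3, 4])
outputs = [fir.step(x) for x in [10, 0, 0, 0, 5]]
print(outputs)  # [10, 20, 30, 40, 5]
```

The single sample of value 10 appears at the output scaled by each coefficient in turn as it moves down the delay line, which is the behavior the equation for y[n] describes.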


Why Hardware Beats a CPU

CPU Implementation

  • Instruction fetch
  • Memory access
  • Sequential multiply operations
  • Scheduling overhead

StreamFIR Hardware

  • All multiplications happen simultaneously
  • No instruction overhead
  • Deterministic latency
  • Continuous processing

Result:

Metric            | CPU                | StreamFIR
------------------|--------------------|------------------
Throughput        | Limited            | 1 sample / cycle
Latency           | Variable           | Constant
Power Efficiency  | Lower              | Higher
Timing            | Non-deterministic  | Deterministic

This project demonstrates that streaming DSP workloads map naturally to hardware pipelines.
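
To make the throughput comparison concrete, here is a rough back-of-the-envelope model in Python. It rests on loudly simplifying assumptions: the sequential processor issues exactly one multiply or add per cycle and pays nothing for loads, loops, or branches, and the 50 MHz clock is a hypothetical figure, not a measured StreamFIR result. Real CPUs with SIMD or multiple issue will do better than this model, and real loop overhead will make them do worse.

```python
def ops_per_sample(taps):
    """Arithmetic operations needed for one output of an N-tap FIR."""
    return {"multiplies": taps, "adds": taps - 1}


def sequential_samples_per_second(clock_hz, taps):
    """Upper bound when one multiply or add is issued per clock cycle."""
    ops = ops_per_sample(taps)
    cycles_per_sample = ops["multiplies"] + ops["adds"]
    return clock_hz // cycles_per_sample


def streaming_samples_per_second(clock_hz):
    """A pipeline that emits one sample per cycle runs at the clock rate."""
    return clock_hz


CLOCK_HZ = 50_000_000  # hypothetical clock, chosen only for illustration
seq = sequential_samples_per_second(CLOCK_HZ, taps=4)
stream = streaming_samples_per_second(CLOCK_HZ)
print(seq, stream)  # 7142857 50000000
```

Under these assumptions a 4-tap filter needs 4 multiplies and 3 adds per output, so the one-operation-per-cycle model is about seven times slower than a one-sample-per-cycle pipeline at the same clock.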


Verification

We implemented a full ASIC-style verification flow:

  • Cocotb Python testbench
  • Directed tests for each mode
  • Random input testing
  • Impulse response testing
  • Mode-switching validation
  • Gate-level simulation

Waveforms were analyzed using GTKWave to confirm functional correctness.
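
The impulse-response and random-input checks listed above follow a simple pattern that can be sketched in plain Python. This is not the project's cocotb testbench: `fir_stream` stands in for the device under test, `fir_direct` is a hypothetical reference model, and the coefficients are arbitrary example values. In a cocotb flow the first list would come from the simulated design instead.

```python
import random


def fir_stream(samples, coeffs):
    """Streaming model: a shift register updated once per sample."""
    taps = [0] * len(coeffs)
    out = []
    for x in samples:
        taps = [x] + taps[:-1]
        out.append(sum(c * t for c, t in zip(coeffs, taps)))
    return out


def fir_direct(samples, coeffs):
    """Direct evaluation of y[n] = sum_k c_k * x[n-k], with x[n] = 0 for n < 0."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]
        out.append(acc)
    return out


coeffs = [3, -1, 4, 2]  # arbitrary example coefficients

# Impulse test: a single 1 followed by zeros should replay the coefficients.
impulse_response = fir_stream([1, 0, 0, 0, 0, 0], coeffs)

# Random test: seeded signed 8-bit stimulus compared against the reference.
rng = random.Random(0)
stimulus = [rng.randint(-128, 127) for _ in range(200)]
random_match = fir_stream(stimulus, coeffs) == fir_direct(stimulus, coeffs)
```

The impulse test is useful because its expected output is the coefficient list itself followed by zeros, so a swapped or missing tap shows up immediately.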


Challenges

  • Handling signed arithmetic in RTL
  • Aligning pipeline latency with expected outputs
  • Debugging gate-level vs behavioral mismatches

What We Learned

  • Hardware parallelism vs CPU sequential execution
  • Streaming datapath design
  • FIR filter implementation in RTL
  • ASIC verification workflow
  • Why DSP is a classic hardware accelerator domain

Future Work

  • More taps (8-tap / 16-tap filters)
  • Audio input interface (I2S / ADC)
  • Equalizer implementation
  • Cascaded FIR/IIR filters
  • Real-time audio processing

Built With

  • Verilog
  • TinyTapeout SKY130 PDK
  • Icarus Verilog
  • Cocotb
  • Python
  • GTKWave
