StreamFIR — A Streaming FIR Filter Hardware Accelerator

Motivation

Many signal processing workloads are fundamentally streaming problems.
A CPU handles each sample as a sequence of instructions, with memory loads and branches along the way. That per-sample overhead adds latency, costs power, and limits throughput.

Filtering, however, is not a control problem — it is a repeated mathematical operation.

The core question of this ASIC hackathon is:

Where does hardware beat a CPU?

We chose digital filtering as a minimal but clear demonstration of hardware acceleration.

What the Project Does

StreamFIR is a 4-tap real-time FIR (Finite Impulse Response) filter implemented entirely in synthesizable Verilog.

The module accepts one 8-bit sample per clock cycle and produces one filtered output sample per clock cycle.

Supported operating modes:

Bypass — direct signal output
Moving Average — smoothing filter
Weighted Low-Pass — noise reduction
High-Pass / Edge Detection — transition detection
User-Programmable Coefficients — custom filter behavior

The filter operates as a streaming hardware accelerator rather than a program.

Hardware Architecture

The design is a fully pipelined streaming datapath:

1. Delay Line

Stores the last four input samples and shifts every clock cycle.
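As a rough illustration of the delay line, here is a minimal behavioral model in Python. The class name, the reset-to-zero behavior, and the tap ordering are assumptions made for this sketch, not details taken from the RTL.

```python
class DelayLine:
    """Behavioral model of a 4-deep shift register (illustrative names)."""

    def __init__(self, depth=4):
        # Assumed reset state: all taps zero.
        self.taps = [0] * depth  # taps[0] = x[n], taps[1] = x[n-1], ...

    def shift(self, sample):
        # One call models one clock edge: the newest sample enters at
        # index 0 and the oldest sample falls off the end.
        self.taps = [sample] + self.taps[:-1]
        return list(self.taps)


line = DelayLine()
history = [line.shift(s) for s in (10, 20, 30, 40, 50)]
print(history[-1])  # [50, 40, 30, 20]
```

After five shifts the first sample (10) has left the window, which is the behavior the four-sample delay line describes.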

2. Parallel Multiply-Accumulate (MAC)

Each of the four stored samples is multiplied by its own coefficient in the same clock cycle, and the four products are summed:

\[ y[n] = c_0\,x[n] + c_1\,x[n-1] + c_2\,x[n-2] + c_3\,x[n-3] \]

All multiplications occur in parallel in hardware.
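The same sum can be written as a small software reference model. This is a sketch under two assumptions: the taps start at zero, and no output scaling or truncation is applied (the write-up does not say how the hardware sizes its output).

```python
def fir4(samples, coeffs):
    """Reference model of y[n] = c0*x[n] + c1*x[n-1] + c2*x[n-2] + c3*x[n-3]."""
    if len(coeffs) != 4:
        raise ValueError("expected exactly four coefficients")
    taps = [0, 0, 0, 0]  # assumed reset state
    out = []
    for x in samples:
        taps = [x] + taps[:-1]
        # Hardware forms the four products in the same cycle; the model
        # simply sums them in one expression.
        out.append(sum(c * t for c, t in zip(coeffs, taps)))
    return out


y = fir4([1, 2, 3, 4, 5], coeffs=[1, 1, 1, 1])
print(y)  # [1, 3, 6, 10, 14]
```

With all coefficients set to 1, each output is the sum of the current sample and the three before it, so the last output is 5 + 4 + 3 + 2 = 14.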

3. Mode Controller

Selects preset filter behaviors or user-defined coefficients.
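One way to picture the mode controller is as a lookup from mode to coefficient set. The mode names and every coefficient value below are hypothetical, chosen only to match the described behaviors; the write-up does not list the presets StreamFIR actually uses.

```python
# Hypothetical presets; not the values used in the StreamFIR RTL.
PRESETS = {
    "bypass":         [1, 0, 0, 0],   # y[n] = x[n]
    "moving_average": [1, 1, 1, 1],   # sum of 4 samples (divide by 4 to normalize)
    "weighted_lp":    [1, 3, 3, 1],   # heavier weight on the middle taps
    "high_pass":      [1, -1, 0, 0],  # first difference, responds to transitions
}


def select_coeffs(mode, user_coeffs=None):
    """Return the coefficient set for a preset mode, or the user's own set."""
    if mode == "user":
        if user_coeffs is None or len(user_coeffs) != 4:
            raise ValueError("user mode needs exactly four coefficients")
        return list(user_coeffs)
    return list(PRESETS[mode])


print(select_coeffs("high_pass"))           # [1, -1, 0, 0]
print(select_coeffs("user", [2, 0, 0, -2])) # [2, 0, 0, -2]
```

In this sketch the datapath never changes between modes; only the four coefficients feeding the multipliers do.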

4. Register Interface

External logic can reconfigure the filter at run time through its registers, without re-synthesizing the design.
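A toy model of such a register interface might look like the following. The address map (addresses 0 to 3 holding the coefficients), the signed 8-bit coefficient format, and the bypass reset value are all assumptions for illustration; the real register layout is not described here.

```python
class CoeffRegs:
    """Toy coefficient register file (hypothetical address map)."""

    def __init__(self):
        self.regs = [1, 0, 0, 0]  # assumed reset value: bypass

    def write(self, addr, byte):
        if not 0 <= addr < 4:
            raise ValueError("address out of range")
        byte &= 0xFF
        # Interpret the written byte as a signed two's-complement value.
        self.regs[addr] = byte - 256 if byte & 0x80 else byte

    def coeffs(self):
        return list(self.regs)


regs = CoeffRegs()
regs.write(1, 0xFF)   # 0xFF read as signed 8-bit is -1
print(regs.coeffs())  # [1, -1, 0, 0]
```

A single register write turns the assumed bypass default into a first-difference filter, which is the kind of run-time change the register interface is meant to allow.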

The pipeline produces one output sample per clock cycle.

Why Hardware Beats a CPU

CPU Implementation

Instruction fetch
Memory access
Sequential multiply operations
Scheduling overhead

StreamFIR Hardware

All multiplications happen simultaneously
No instruction overhead
Deterministic latency
Continuous processing

Result:

Metric	CPU	StreamFIR
Throughput	Limited by per-sample instruction overhead	1 sample / cycle
Latency	Variable	Constant
Power Efficiency	Lower	Higher
Timing	Non-deterministic	Deterministic

This project demonstrates that streaming DSP workloads map naturally to hardware pipelines.
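To make the comparison concrete, the sketch below counts work only; it is not a performance measurement. It assumes a scalar software loop that performs one multiply-accumulate at a time, and the pipeline latency value is a placeholder rather than StreamFIR's actual latency.

```python
TAPS = 4


def sequential_mac_steps(num_samples, taps=TAPS):
    # A scalar software loop performs the multiply-accumulates one
    # after another: `taps` steps for every sample.
    return num_samples * taps


def streaming_cycles(num_samples, latency_cycles=2):
    # The pipeline accepts one sample per cycle; the last output appears
    # a fixed number of cycles after the last input. latency_cycles is a
    # placeholder value, not a figure from the design.
    return num_samples + latency_cycles


n = 1000
print(sequential_mac_steps(n))  # 4000
print(streaming_cycles(n))      # 1002
```

The software count grows with the number of taps, while the streaming count does not, which is the structural advantage the table summarizes. Real CPU cycle counts would also include loads, stores, and loop overhead.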

Verification

We implemented a full ASIC-style verification flow:

Cocotb Python testbench
Directed tests for each mode
Random input testing
Impulse response testing
Mode-switching validation
Gate-level simulation

Waveforms were analyzed using GTKWave to confirm functional correctness.
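The impulse and random tests can be illustrated with a pure-Python golden model of the kind a cocotb testbench could compare the DUT against. This sketch does not use cocotb itself, and it assumes signed 8-bit samples, a zero reset state, and arbitrary example coefficients.

```python
import random


def fir4(samples, coeffs):
    """Streaming reference model with a 4-deep delay line."""
    taps = [0, 0, 0, 0]
    out = []
    for x in samples:
        taps = [x] + taps[:-1]
        out.append(sum(c * t for c, t in zip(coeffs, taps)))
    return out


def direct_formula(samples, coeffs, n):
    # Evaluate the FIR sum directly, treating samples before n = 0 as zero.
    return sum(c * samples[n - k] for k, c in enumerate(coeffs) if n - k >= 0)


# Impulse test: a single 1 followed by zeros reads back the coefficients.
coeffs = [3, -1, 4, 2]  # arbitrary example values
impulse_response = fir4([1, 0, 0, 0, 0, 0], coeffs)
print(impulse_response)  # [3, -1, 4, 2, 0, 0]

# Random test: compare the streaming model against the direct formula.
rng = random.Random(0)  # fixed seed so a failure can be reproduced
stimulus = [rng.randint(-128, 127) for _ in range(200)]
streamed = fir4(stimulus, coeffs)
mismatches = [n for n in range(len(stimulus))
              if streamed[n] != direct_formula(stimulus, coeffs, n)]
print(len(mismatches))  # 0
```

The impulse test is a quick check of tap ordering: if the coefficients come back reversed or shifted, the delay line and the coefficient wiring disagree.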

Challenges

Handling signed arithmetic in RTL
Aligning pipeline latency with expected outputs
Debugging gate-level vs behavioral mismatches
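The signed-arithmetic issue largely comes down to interpreting bit patterns correctly and sizing the accumulator. The sketch below works through the widths, assuming signed 8-bit samples and signed 8-bit coefficients (the write-up states 8-bit samples but does not give the coefficient width).

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def signed_bits_needed(lo, hi):
    """Smallest two's-complement width that holds every value in [lo, hi]."""
    bits = 1
    while not (-(1 << (bits - 1)) <= lo and hi <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits


# Extremes of one signed 8-bit by signed 8-bit product.
products = [a * b for a in (-128, 127) for b in (-128, 127)]
prod_lo, prod_hi = min(products), max(products)  # -16256 and 16384

print(to_signed(0xFF, 8))                            # -1
print(signed_bits_needed(prod_lo, prod_hi))          # 16
print(signed_bits_needed(4 * prod_lo, 4 * prod_hi))  # 18
```

Under these assumptions each product needs 16 bits and the sum of four products needs 18 bits, so an accumulator narrower than that can overflow unless the coefficients are constrained or the result is saturated.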

What We Learned

Hardware parallelism vs CPU sequential execution
Streaming datapath design
FIR filter implementation in RTL
ASIC verification workflow
Why DSP is a classic hardware accelerator domain

Future Work

More taps (8-tap / 16-tap filters)
Audio input interface (I2S / ADC)
Equalizer implementation
Cascaded FIR/IIR filters
Real-time audio processing

Built With

Verilog
TinyTapeout SKY130 PDK
Icarus Verilog
Cocotb
Python
GTKWave

Updates

Jinning Liu started this project — Feb 22, 2026 06:18 AM EST
