$ less ~/workspace/profile/config.yaml
FOCUS: "learning so slow my blog can’t keep up"
KNOWS: "anything bw sand nd the entity thinking back at us rn"
BUILDING: "hw accel arch for img classif; it can sǝǝ john cena"
CLUB: "feeding robots my gpa @rignitc"
COMPS: ["e-yantra MB", "openpower hw"]
MAJOR: "electronics nd communication engg"
COLLEGE: "national institute of technology calicut"
CONTACT: "carrier pigeons :p"
HOBBIES: ["speedcubing", "movies", "4k day dreaming"]
FUNFACT: "i'm discount batman"
SECRET: "i pay the moon to follow me around"
WHATIF: "if i eat myself, do I become twice as big or disappear?"
$ ls ~/projects -filter=feat
Digital/Physical Design/Verification & Compute Architectures
INT8 Fixed-Point CNN Hardware Accelerator and Image-Processing Suite
• Designed a synthesizable shallow Res-CNN for CIFAR-10, Pareto-optimal among 8 CNNs for parameter memory, accuracy & FLOPs
• Built systolic-array PEs with 8-bit CSA–MBE MACs, FSM-based control, 2-cycle ready/valid handshake, and verified TB operation
• Performed PTQ/QAT (Q1.31→Q1.3) analysis; Q1.7 PTQ retained ∼84% accuracy (<1% loss) with 4x smaller (∼52kB) memory footprint
• Auto-generated 14 coeff & 3 RGB ROMs via TCL/Py automation; validated TF/FP32–RTL consistency and automated inference execution
• Implemented AXI-Stream DIP toolkit (edge, denoise, filter, enhance) with pipelined RTL & FIFO backpressure handling
• MLP classifier on (E)MNIST (>75% acc.) with GUI viz; Automated preprocessing & inference with TCL/Perl
3-Stage Pipelined Systolic Array-Based MAC Microarchitecture
• Benchmarked six 8-bit signed adders-multipliers via identical RTL2GDS Sky130 flow to isolate arithmetic-level post-route PPA trade-offs
• 3-stage pipelined systolic MAC (CSA-MBE), achieving ↓66.3% delay; ↑3.1× area efficiency; ↓82.2% typical power vs naïve conv3 baseline
• Used a 2D PE-grid structure for convolution (verified 0/same padding modes) and optimized GEMM (reducing power by 44.6%; N = 3)
• Added a 648-bit scan chain across all pipeline/control registers, enabling full DFT/ATPG testability with only +14.5% cell overhead
Design of High-Performance Q1.7 Fixed-Point Quantized CNN Hardware Accelerator with Microarchitecture Optimization of 3-Stage Pipelined Systolic MAC Arrays for Lightweight Inference
A fully synthesizable INT8 CNN accelerator for CIFAR-10, built around a 3-stage pipelined systolic-array MAC microarchitecture optimized from a 6-design PPA benchmark.
Includes complete quantization workflow (PTQ/QAT), 2-cycle ready/valid protocol, ROM automation, FP32↔RTL accuracy checks, and a hardware image-processing suite
“I tried to ImProVe, but NeVer really did - so I MOVe-d on ¯\_(ツ)_/¯”
Verilog | Basic Architecture | Digital Electronics
Designed a fully synthesizable shallow Res-CNN for CIFAR-10, evaluated against eight reference CNNs, and achieved a Pareto-optimal trade-off across throughput, latency, and accuracy.
Implemented systolic-array processing elements using 8-bit CSA–MBE MAC units with FSM-driven control logic and a 2-cycle ready/valid handshake.
Verified end-to-end datapath behavior through structured testbenches.
Performed detailed post-training quantization (PTQ) and quantization-aware training (QAT) studies.
Quantizing from Q1.31 to Q1.3 allowed exploration of precision vs. accuracy trends, and a Q1.7 PTQ model retained ~84% accuracy (less than 1% drop) while reducing the model memory footprint by 4× (≈52 kB).
Developed automated scripts (TCL + Python) to generate 14 coefficient ROMs and 3 RGB input ROMs, enabling seamless hardware ingestion of model parameters and images.
Verified TensorFlow / FP32 ↔ RTL output consistency and automated full-pipeline inference execution.
Built a compact digital image-processing toolkit (edge detection, denoising, filtering, enhancement) and an MLP classifier for (E)MNIST datasets.
Achieved >75% accuracy with real-time GUI visualization for interactive experimentation.
Technical Summary
Designed a fully synthesizable INT8 CNN accelerator (Q1.7 PTQ) for CIFAR-10, optimized for throughput, latency determinism, and precision efficiency. Implemented a 2-cycle ready/valid handshake for all inter-module transactions and FSM-based control sequencing for deterministic pipeline timing. Trained 8 CNNs (TensorFlow, identical augmentation & LR scheduling w/ vanilla Adam optimizer) and performed architecture-level DSE via Pareto analysis, selecting 2 optimal variants including a ResNet-style residual CNN.
PTQ/QAT comparisons were conducted across Q1.31, Q1.15, Q1.7, and Q1.3. Q1.7 PTQ (1 integer bit, 7 fractional bits; step 2⁻⁷ ≈ 0.0078) gave the best accuracy–memory trade-off { Q1.31 ~84% ~210 kB | Q1.7 ~83% ~52 kB | Q1.3 ~78% ~26 kB }, achieving ~84% top-1 accuracy, <1% loss, and ≈52 kB total (≈17×3 kB RGB inputs).
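A minimal sketch of what the Q1.7 format implies numerically, assuming round-to-nearest with saturation (the rounding mode used in the actual flow is not stated here). The function names are illustrative only.

```python
def quantize_q1_7(x):
    """Map a real value to a signed 8-bit Q1.7 code (assumed: round-to-nearest, saturating)."""
    code = round(x * 128)               # 7 fractional bits -> scale by 2**7
    return max(-128, min(127, code))    # clamp to the int8 range [-1.0, 127/128]


def dequantize_q1_7(code):
    """Convert a Q1.7 code back to a real value; one step is 1/128 = 0.0078125."""
    return code / 128.0


weights = [0.5, -0.25, 0.1, 0.999, -1.2]
codes = [quantize_q1_7(w) for w in weights]
restored = [dequantize_q1_7(c) for c in codes]
```

Values outside [-1, 127/128] saturate, which is why weight scaling before PTQ matters for this format.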
The 3-stage pipelined systolic-array convolution core employs Processing Elements (PEs) built around MAC units composed of 8-bit signed Carry-Save Adders (CSA) and Modified Booth-Encoded (MBE) multipliers, arranged in a 2D grid for high spatial reuse and single-cycle accumulation. All 14 coefficient ROMs and 3 RGB input ROMs were auto-generated via a Python/TCL automation flow handling coefficient quantization, packing, and open-source EDA simulation. Verified bit-accurate correlation between TensorFlow FP32 and RTL fixed-point inference layer-wise; an IEEE-754 single-precision CNN variant validated numeric consistency.
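As a rough illustration of the Modified Booth encoding idea behind the multiplier, the sketch below recodes an 8-bit signed operand into four radix-4 digits in {-2, -1, 0, 1, 2} and sums the resulting partial products. It is a behavioural Python model under that assumption, not the RTL; the carry-save reduction of partial products is not modelled, and the function names are hypothetical.

```python
def booth_radix4_digits(y, bits=8):
    """Recode a signed `bits`-wide integer into radix-4 Booth digits, least significant first."""
    yu = y & ((1 << bits) - 1)          # two's-complement bit pattern
    digits, prev = [], 0                # prev is y[-1], defined as 0
    for i in range(0, bits, 2):
        b0 = (yu >> i) & 1
        b1 = (yu >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)   # digit = y[2k] + y[2k-1] - 2*y[2k+1]
        prev = b1
    return digits


def booth_multiply(x, y, bits=8):
    """Multiply by summing Booth partial products: digit_k * x * 4**k."""
    return sum(d * x * (4 ** k) for k, d in enumerate(booth_radix4_digits(y, bits)))
```

An 8-bit multiplier therefore needs only four partial products instead of eight, which is the main reason the encoding pairs well with a carry-save adder tree.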
Integrated image-processing modules (edge detection, denoising, filtering, contrast enhancement) form a Verilog-based hardware preprocessing pipeline, feeding an MLP classifier evaluated on the (E)MNIST (52+)10 ByClass datasets. The MLP shares the preprocessing and automation flow, with an additional IEEE-754 64-bit FP variant for precision benchmarking. A Tkinter GUI enables interactive character input, and preprocessing visualization via Matplotlib
High-Speed 3-Stage Pipelined Systolic Array-Based MAC Architectures
Digital Logic Design | Synthesis
Benchmarked six different 8-bit signed adder–multiplier architectures using PPA (power, performance, area) and latency/throughput metrics on the Sky130 PDK, and analyzed their architectural trade-offs in the context of low-power convolution workloads.
Designed a 3-stage pipelined systolic MAC based on a CSA–MBE multiplier structure, achieving substantial improvements over a naïve conv3 baseline:
66.3% lower delay, 3.1× higher area efficiency, and 82.2% lower typical power.
Implemented a 2D systolic PE-grid supporting general convolution and GEMM operations, with verified behavior under zero-padding and same-padding configurations.
GEMM optimization reduced power consumption by 44.6% for matrix dimension (N = 3).
Integrated a 648-bit scan chain spanning all pipeline and control registers, enabling full DFT/ATPG support with only 14.5% cell-area overhead, ensuring manufacturability and high test coverage.
Technical Summary
Benchmarked six 8-bit signed adder and multiplier architectures for systolic-array MACs targeting CNN/GEMM workloads using a fully open-source ASIC flow (Yosys + OpenROAD/OpenLane) on the Google-SkyWater 130nm PDK (Sky130HS PDK @25°C_1.8V). Evaluated PPA (Power, Performance, Area) and latency/throughput/area metrics under a constant synthesis and layout environment with fixed constraints and floorplan parameters (FP_CORE_UTIL = 30 %, PL_TARGET_DENSITY = 0.36, 10 ns clock, CTS/LVS/DRC/Antenna enabled)
Final MAC integrates an 8-bit signed CSA adder and 8-bit signed MBE multiplier in a 3×3 convolution/GEMM core using a 3-stage pipelined systolic array (sampling → truncation/flipping → MAC accumulation). Verified via RTL testbench and post-synthesis timing across zero/same-padding modes. Automated GDS/DEF generation and PPA reporting for all architectures ensured fully reproducible, environment-consistent results
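A software golden model is a convenient way to check such a core. The sketch below is one possible reference for a 3×3 convolution with kernel flipping and the two padding modes, assuming "zero padding" means no border (valid output) and "same" means a one-pixel zero border; truncation is not modelled and the function name is illustrative.

```python
def conv3x3(image, kernel, same=False):
    """Reference 3x3 convolution on nested lists; the kernel is flipped by 180 degrees."""
    k = 3
    flipped = [row[::-1] for row in kernel[::-1]]
    if same:
        width = len(image[0]) + 2
        padded = [[0] * width] + [[0] + list(r) + [0] for r in image] + [[0] * width]
    else:
        padded = [list(r) for r in image]
    out_h = len(padded) - k + 1
    out_w = len(padded[0]) - k + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            for i in range(k):
                for j in range(k):
                    acc += padded[r + i][c + j] * flipped[i][j]   # one MAC per tap
            row.append(acc)
        out.append(row)
    return out


ones5 = [[1] * 5 for _ in range(5)]
ones3 = [[1] * 3 for _ in range(3)]
valid_out = conv3x3(ones5, ones3)             # 3x3 output
same_out = conv3x3(ones5, ones3, same=True)   # 5x5 output
```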
Comparative study (Small Scale Ops) :
CSA-MBE pair Systolic Array Conv vs Naïve Conv (3x3 Kernel on 5x5 Image @CLK_PERIOD_20_ns)
Latency ↓ 66.3% Throughput ↑ 196.6% Speed ↑ 196.6% Area ↓ 67.8% Power ↓ 82.2%
Single MAC re-use vs Systolic 4-PE Grid (2x2 matrix multiplication)
Latency ↓ 0.9% Throughput ≈ same Speed ≈ same Area ≈ same Power ↓ 15% Energy/op ↓ 14.6%
Single MAC re-use vs Systolic 9-PE Grid (3x3 matrix multiplication)
Latency ↓ 1% Throughput ≈ same Speed ≈ same Area ≈ same Power ↓ 44.6% Energy/op ↓ 44%
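The throughput and area-efficiency figures for the convolution comparison follow from the reported reductions if one assumes throughput is the reciprocal of latency and area efficiency is the reciprocal of area for the same work. A small check under that assumption (small differences from the quoted 196.6% come from rounding of the inputs):

```python
def gain_from_reduction(reduction_pct):
    """Convert an 'X% lower' figure into the equivalent multiplicative gain."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


throughput_gain = gain_from_reduction(66.3)   # latency down 66.3%
area_efficiency = gain_from_reduction(67.8)   # area down 67.8%
throughput_increase_pct = (throughput_gain - 1.0) * 100.0
```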
DFT (Scan Chain) Add-On
Integrated a 648-bit full-scan chain across all pipeline, control, and output registers in the 3×3 systolic MAC/convolution datapath.
Every state element is replaced with a single-bit scan DFF (SE/SI/SO), enabling serial load/unload of internal state.
Flattened register groups (kernel flip buffer, px16/px8 slices, ker8 slices, prod_s2 pipeline, row/col counters, valid pipe, output registers, done flag) into one contiguous chain, ensuring deterministic bit ordering and simple scan stitching.
Verified scan behavior through shift–capture–shift TB patterns and ensured functional transparency:
scan mode (SE=1) freezes datapath updates, while normal mode (SE=0) preserves original functionality.
Total overhead from scan insertion is ~14.5% in standard cells, with no change to functional timing or systolic pipeline throughput at the target clock.
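The shift–capture–shift pattern can be illustrated with a small behavioural model. This is a sketch with an 8-bit chain rather than the 648-bit one, and the class and helper names are hypothetical.

```python
class ScanChain:
    """Behavioural model of a chain of scan flip-flops (SE/SI/SO)."""

    def __init__(self, length):
        self.bits = [0] * length

    def clock(self, se, si=0, d=None):
        """One clock edge: shift when se=1, capture functional data d when se=0."""
        so = self.bits[-1]                      # SO is the last flop in the chain
        if se:
            self.bits = [si] + self.bits[:-1]   # serial shift toward SO
        elif d is not None:
            self.bits = list(d)                 # normal-mode capture
        return so


def shift_in(chain, pattern):
    for bit in pattern:
        chain.clock(1, si=bit)


def shift_out(chain):
    return [chain.clock(1) for _ in range(len(chain.bits))]


chain = ScanChain(8)
pattern = [1, 0, 1, 1, 0, 0, 1, 0]
shift_in(chain, pattern)
unloaded = shift_out(chain)                     # same order as shifted in
chain.clock(0, d=[1, 1, 0, 0, 1, 0, 1, 0])      # capture cycle
captured = shift_out(chain)                     # last flop comes out first
```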
Repositories
ViSiON – Verilog for Image Processing and Simulation-based Inference Of Neural Networks
This repo includes all related projects as submodules in one place
ImProVe – IMage PROcessing using VErilog: A collection of image processing algorithms implemented in Verilog, including geometric transformations, color space conversions, and other foundational operations.
NeVer – NEural NEtwork on VERilog: A hardware-implemented MLP in Verilog for character recognition on (E)MNIST, alongside a lightweight CNN for CIFAR-10 image classification
MOVe – Math Ops in VErilog
CORDIC Algorithm – Implements Coordinate Rotation Digital Computer (CORDIC) algorithms in Verilog for efficient hardware-based calculation of sine, cosine, tangent, square root, magnitude, and more.
Systolic Array Matrix Multiplication – Verilog implementation of matrix multiplication using systolic arrays to enable parallel computation and hardware-level performance optimization. Each processing element leverages a Multiply-Accumulate (MAC) unit for core operations.
Hardware Multiply-Accumulate Unit – Implements and compares 8-bit multipliers and 8-bit adders in synthesizable Verilog, analyzing their area, timing, and power characteristics in MAC datapath architectures.
Posit Arithmetic (Python) – Currently using fixed-point arithmetic; considering Posit as an alternative to IEEE 754 for better precision and dynamic range. Still working through the trade-off.
Storage and Buffer Modules
RAM1KB – A 1KB (1024 x 8-bit) memory module in Verilog with write-once locking for even addresses. Includes a randomized testbench. Also forms the base for a ROM3KB variant to store 32×32 RGB CIFAR-10 image data.
FIFO Buffer – Synchronous FIFO with parameterizable depth, single clock domain, and standard full/empty flag logic.
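One common way to derive full/empty flags is to let the read and write pointers count over twice the depth, so that equal pointers mean empty and a difference of exactly the depth means full. The sketch below models that scheme; it is an assumption about the flag logic, not a transcription of the Verilog module.

```python
class SyncFifo:
    """Behavioural synchronous FIFO with wrap-bit style full/empty detection."""

    def __init__(self, depth=4):
        self.depth = depth
        self.mem = [0] * depth
        self.wr = 0     # write pointer, counts modulo 2*depth
        self.rd = 0     # read pointer, counts modulo 2*depth

    @property
    def empty(self):
        return self.wr == self.rd

    @property
    def full(self):
        return (self.wr - self.rd) % (2 * self.depth) == self.depth

    def push(self, value):
        if self.full:
            return False                        # write ignored when full
        self.mem[self.wr % self.depth] = value
        self.wr = (self.wr + 1) % (2 * self.depth)
        return True

    def pop(self):
        if self.empty:
            return None                         # nothing to read
        value = self.mem[self.rd % self.depth]
        self.rd = (self.rd + 1) % (2 * self.depth)
        return value
```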
RISC-V & MIPS Microarchitectures - SC / MC / Pipelined / Dual-Issue Superscalar
• Built a two-wide in-order superscalar RISC processor with parallel IF–ID–EX–MEM–WB lanes and independent pipeline registers per lane
• Designed dual 16-bit instruction fetch per cycle with inter-lane dependency checks, hazard suppression, and load-use stall handling
• Implemented 4R2W register file, RAW/WAW detection, branch squashing, & multi-port memory for concurrent fetch and data access
• Evaluated SC/MC/5-stage pipelined designs via directed programs (assembly with RISCV GCC Toolchain & QEMU reference), analyzing CPI (1/3.8/1.6), cycle counts, and hazard overhead
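The inter-lane dependency check in a two-wide in-order core can be sketched as below. The instruction tuple is a hypothetical simplification; only RAW and WAW between the two instructions fetched together are considered, and load-use stalls and structural hazards are left out.

```python
from collections import namedtuple

# Hypothetical decoded-instruction record: destination and two source registers.
Instr = namedtuple("Instr", "op rd rs1 rs2")


def can_dual_issue(older, younger):
    """Return True if the younger instruction may issue in the same cycle as the older one."""
    writes = older.rd is not None
    raw = writes and older.rd in (younger.rs1, younger.rs2)   # younger reads older's result
    waw = writes and older.rd == younger.rd                   # both write the same register
    return not (raw or waw)


add = Instr("add", rd=3, rs1=1, rs2=2)
sub_dep = Instr("sub", rd=4, rs1=3, rs2=5)      # reads r3 -> RAW
sub_waw = Instr("sub", rd=3, rs1=6, rs2=7)      # writes r3 -> WAW
sub_free = Instr("sub", rd=4, rs1=5, rs2=6)     # independent
```

When the check fails, the younger instruction would be held back and issued alone in the next cycle.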
Tools: Verilog | Icarus Verilog | ModelSim | Quartus Prime
Designed a 32-bit single-cycle RV32I core with modular datapath: ALU, control, immediate generator, PC logic, instruction & data memories.
Implemented all 38 base instructions including full load/store support: LB, LBU, LH, LHU, LW, SB, SH, SW, with correct zero/sign extension and RMW correctness.
Verified correctness using self-checking ModelSim testbenches and synthesized the core on Quartus Prime.
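The zero/sign-extension behaviour of the load instructions can be summarised in a few lines. This is a reference sketch for checking expected values, with an illustrative function name, not the core's datapath.

```python
def extend_load(raw, width, signed):
    """Extend the low `width` bits of `raw` to 32 bits (sign-extend for LB/LH, zero-extend for LBU/LHU)."""
    value = raw & ((1 << width) - 1)
    if signed and value & (1 << (width - 1)):
        value -= 1 << width                 # interpret as negative two's complement
    return value & 0xFFFFFFFF               # 32-bit register image


lb = extend_load(0x80, 8, signed=True)
lbu = extend_load(0x80, 8, signed=False)
lh = extend_load(0x8001, 16, signed=True)
lhu = extend_load(0x8001, 16, signed=False)
```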
MIPS Microarchitectures - SC / MC / 5-Stage Pipeline
• Designed non-pipelined/pipelined/scan-enabled 4-stage ALU; pipeline FFs replaced with scan FFs for scan-in/capture/scan-out
• Gate-level timing analysis in Yosys/OpenSTA (Sky130) with clock uncertainty, I/O delays, & input slew; ∼1.7x fmax gain with pipelining
• RTL2GDS flow: scan vs no-scan, single vs dual scan, CTS skew tightening, util/density & floorplan stress; closed timing throughout
• Analyzed IO-driven routing effects; worst-case pinning increased clock wire length by >2x despite CTS/placement optimization
• Recovered clock routing via pin-arch optimization (>50% clk WL ↓); PDN stress at signoff showed +21% total & +58% switching power
A small arithmetic and comparison unit used to quantitatively study the effects of pipelining, scan-chain insertion, and physical design constraints on area, timing, routing, clocking, and power.
The project compares RTL-only variants using Yosys and OpenSTA, and extends the scan-pipelined design through full RTL-to-GDS physical design using OpenLane (Sky130).
Design Variants
Non-pipelined – Fully combinational datapath
Pipelined – Four pipeline stages with register boundaries
Scan-pipelined – Same pipeline depth, all registers replaced with scan-enabled FFs
All variants implement identical arithmetic and comparison logic; only register, scan, and physical constraints differ.
Synthesis & Area Results (Yosys)
| Metric | Non-Pipe | Pipe | Scan-Pipe |
| --- | --- | --- | --- |
| Total cells | 37 | 50 | 89 |
| Flip-flops | 0 | 13 | 13 |
| Chip area (µm²) | 538.66 | 891.91 | 1162.04 |
Area overhead
Pipelining: +65.6%
Scan (on top of pipelining): +30.3%
Baseline → scan-pipelined: +116%
Scan insertion adds ~3 logic gates per flip-flop, visible as increased NAND/OR cell counts.
Timing Results (OpenSTA, baseline constraints)
| Metric | Non-Pipe | Pipe | Scan-Pipe |
| --- | --- | --- | --- |
| Critical delay (ns) | 1.67 | 0.99 | 1.14 |
| Slack (ns) | 0.33 | 0.88 | 0.74 |
| fmax (MHz) | 598 | 1010 | 877 |
Pipelining reduces critical path by 40.7%
fmax improves by ~1.7×
Scan adds ~150 ps to register-to-register paths (~13% fmax loss)
Hold timing is clean in all variants; scan logic increases minimum delay and improves hold margin.
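The derived figures above can be reproduced from the critical delays alone, assuming fmax is simply the reciprocal of the critical delay:

```python
delays_ns = {"non_pipe": 1.67, "pipe": 0.99, "scan_pipe": 1.14}

# fmax in MHz is 1000 / delay in ns
fmax_mhz = {name: 1000.0 / d for name, d in delays_ns.items()}

critical_path_reduction = 1.0 - delays_ns["pipe"] / delays_ns["non_pipe"]
pipelining_speedup = fmax_mhz["pipe"] / fmax_mhz["non_pipe"]
scan_fmax_loss = 1.0 - fmax_mhz["scan_pipe"] / fmax_mhz["pipe"]
scan_delay_ps = (delays_ns["scan_pipe"] - delays_ns["pipe"]) * 1000.0
```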
Pessimistic STA (Interface-Aware Constraints)
Constraints added:
50 ps clock uncertainty
1.0 ns input/output delays
Non-zero input slew
| Design | Critical Path | Arrival (ns) | Slack (ns) |
| --- | --- | --- | --- |
| Non-pipe | Input → Output | 2.83 | −1.88 |
| Pipe | Input → Stage-1 FF | 1.54 | +0.27 |
| Scan-pipe | Input → Scan FF | 1.69 | +0.14 |
Non-pipelined design fails timing under realistic constraints
Pipelining absorbs interface delay
Scan reduces slack by ~48%, with no change in pipeline depth
The scan-pipelined variant was progressively stressed and optimized through ten controlled PD experiments (E1–E10).
Each experiment modified one dominant physical knob, while maintaining a 4.0 ns clock and clean signoff unless explicitly stated.
E1–E10 PD Experiment Summary
| Exp | Primary Change | Quantitative Outcome |
| --- | --- | --- |
| E1 | Scan baseline, 60% util, CTS skew 0.1 ns | Post-CTS WNS +1.27 ns, total power ~8.99e-04 W |
| E2 | Scan removed (control) | Cells 82 → 52, synth WNS +2.22 ns |
| E3 | CTS skew tightened to 0.05 ns | Timing unchanged, power ~2× increase |
| E4 | FP_CORE_UTIL = 80% | Wirelength 26601 → 26771, WNS ~1.9 ns |
| E5 | Dual scan chains | Clock latency 0.68 → 0.63 ns, power 1.03e-03 W |
| E6 | PL_TARGET_DENSITY = 0.85 | GPL WNS 1.61 → 1.48 ns, DPL recovered |
| E7 | Channelized floorplan | Clock net ~745 → ~788 µm, −70 ps slack |
| E8 | Worst-case IO pin placement | Clock net ~1525 µm (≈2×) |
| E9 | Pin-architecture fix (clock isolated) | Clock net ~703 µm (−54%) |
| E10 | PDN tightening (IR stress) | Total power +21%, switching +58% |
Routing Geometry Across PD Stages
| Metric | E6 | E7 | E8 | E9 | E10 |
| --- | --- | --- | --- | --- | --- |
| Clock net length (µm) | ~745 | ~788 | 1525 | 703 | ~703 |
| Longest net (µm) | ~1075 | ~875 | ~1235 | ~1168 | ~1168 |
| scan_out length (µm) | ~128 | ~329 | ~258 | ~205 | ~205 |
Timing Closure (Post-Route / Signoff)
| Metric | E6 | E7 | E8 | E9 | E10 |
| --- | --- | --- | --- | --- | --- |
| Setup WNS (ns) | ~1.45 | ~1.38 | ↓ | ~1.12 | Closed |
| Hold WNS (ns) | ~0.22 | ~0.22 | ~0.22 | ~0.36 | Closed |
| TNS / WNS | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0 |
Power Impact (Signoff, RCX)
| Metric | E9 | E10 |
| --- | --- | --- |
| Total power (W) | 7.48e-04 | 9.06e-04 |
| Switching power (W) | 2.69e-04 | 4.25e-04 |
| Power delta | — | +21% total, +58% switching |
CTS-stage power remained unchanged; PDN stress manifested only at signoff.
Extended Key Takeaways (RTL + PD)
Pipelining trades ~66% area for ~70% frequency improvement
Scan insertion adds ~30% area and ~150 ps register delay
CTS skew tightening impacts power, not timing
Utilization and density knobs are ineffective below ~10% real placement density
Dual scan reduces clock latency but increases routing power
Floorplan topology affects routing more than timing
IO pin placement can cause >2× clock routing inflation
PDN tightening increases signoff switching power by ~58% with timing intact
Repository
AHB–APB Bridge with Self-Checking Verification
• Designed a parameterizable AHB-Lite to APB bridge with FSM-based control supporting single & burst read/write transactions
• Implemented address/data latching, write buffering, read return, and burst sequencing, handling pipelined and non-pipelined accesses
• Built a self-checking SV testbench with macro-controlled test modes (single/burst R/W) and assertion-based data validation
• Verified protocol correctness across all transaction types; additionally designed & verified standalone (I2C/SPI/UART) peripheral controllers
A parameterizable AHB-Lite to APB bridge implemented in SystemVerilog, with a self-checking verification environment.
The design translates AHB transactions into APB protocol sequences using FSM-based control and supports both single and burst transfers.
AHB–APB Bridge – RTL Design & Verification
Tools: SystemVerilog | Icarus Verilog | GTKWave
Designed a configurable AHB-Lite → APB bridge supporting single and burst read/write transactions.
Implemented FSM-based control logic handling address/data latching, write buffering, read-data return, and burst sequencing.
Supports pipelined and non-pipelined AHB accesses, with correct handling of transaction type and transfer continuation.
Built a self-checking SystemVerilog testbench with compile-time macro selection for:
single read
single write
burst read
burst write
Integrated assertion-based data validation, protocol correctness checks, and automatic PASS/FAIL reporting.
Verified functionality using Icarus Verilog, with per-test VCD waveform generation for debug and inspection.
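For orientation, the APB side of such a bridge follows the IDLE → SETUP → ACCESS operating states defined by the AMBA APB protocol. The sketch below models only that generic state diagram in Python; the bridge's actual FSM encoding, state names, and AHB-side handling are not shown on this page, so everything here is illustrative.

```python
def apb_next_state(state, transfer, pready):
    """Next APB operating state given a pending transfer request and the slave's PREADY."""
    if state == "IDLE":
        return "SETUP" if transfer else "IDLE"
    if state == "SETUP":
        return "ACCESS"                         # SETUP always lasts one cycle
    if state == "ACCESS":
        if not pready:
            return "ACCESS"                     # slave inserts wait states
        return "SETUP" if transfer else "IDLE"  # back-to-back transfer or idle
    raise ValueError(f"unknown state: {state}")


def apb_outputs(state):
    """PSEL is high in SETUP and ACCESS; PENABLE is high only in ACCESS."""
    return {"PSEL": state != "IDLE", "PENABLE": state == "ACCESS"}


# One transfer with a single wait state, followed by a back-to-back transfer.
trace, state = [], "IDLE"
for transfer, pready in [(1, 0), (1, 0), (1, 0), (1, 1), (0, 0), (0, 1)]:
    state = apb_next_state(state, transfer, pready)
    trace.append(state)
```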
Repository
Design & Formal Verification of Parameterizable Fixed-Point CORDIC IP
• Implemented a shift-add datapath covering all 6 modes (rotation/vectoring × circular/linear/hyperbolic); swept data width, iteration count, angle fractional bits, and output width/shift scaling across configs
• Built trig/mag/atan2/mul/div/exp wrappers; observed ∼e-5 RMS (@32b, 16iter) baseline vs double-precision references
• Proved handshake, deadlock-free bounded liveness, range safety, symmetry & monotonicity via SystemVerilog assertions (SymbiYosys/Yices2)
• Auto-generated atan tables & param files via Python; FuseSoC-packaged core with documented sensitivity, error trends & failure regions
• Built drop-in core variants (pipelined/SIMD/multi-issue); implemented a QAM16 demodulator using the CORDIC core
A fully synthesizable, parameterizable fixed-point CORDIC soft IP supporting rotation and vectoring modes, with dedicated wrappers for trigonometric and vector operations. The project emphasizes formal verification of control, protocol, and mathematical structure, complemented by simulation-based accuracy characterization and parameter sensitivity analysis.
Designed a single iterative datapath reused across all modes, with:
deterministic latency (ITER + 1 cycles)
clean start / busy / valid handshake
mode-selectable update equations
explicit preconditioning for vectoring modes to ensure convergence and sign correctness
Fully parameterized design:
internal data width
iteration count
angle / argument precision
output scaling and truncation
enabling consistent trade-off exploration between accuracy, area, and latency.
Built standalone wrapper modules for common functions that apply:
gain compensation (mode-dependent)
output scaling and truncation
format conversion
while preserving a uniform control interface across all operations.
Extended functionality beyond trigonometric primitives to include:
exp, ln
mult, div
sinh, cosh, tanh
achieving ~1e-5 absolute precision under typical configurations.
Auto-generated lookup tables and parameter headers using Python for:
circular arctangent
hyperbolic atanh
iteration scheduling (including repeated iterations where required)
ensuring reproducibility across parameter sweeps.
Packaged the core and wrappers using FuseSoC, with documented:
configuration sensitivity
convergence behavior per mode
known numerical edge cases and failure regions
Accuracy Characterization & Parameter Sensitivity
Circular rotation mode (sin/cos) converges exponentially with iteration count until limited by fixed-point quantization. With WIDTH = 32, ITER = 16, and 30-bit angle precision:
RMS error: ≈ 3.9 × 10⁻⁵
Max error: ≈ 7 × 10⁻⁵
Circular vectoring mode (magnitude) shows similar convergence:
RMS magnitude error: ≈ 2.6 × 10⁻⁵ at WIDTH = 32, ITER = 16
Linear mode (mult/div) reaches target precision rapidly:
error dominated by final truncation rather than iteration depth
stable convergence across full dynamic range when properly prescaled
Hyperbolic mode (exp, sinh, cosh, tanh, ln):
requires scheduled repeated iterations for convergence
achieves ~1e-5 absolute error under typical fixed-point settings
accuracy primarily limited by argument range reduction and output scaling
Tangent and atan2 remain numerically ill-conditioned:
tan(x) error dominated by 1 / cos(x) near ±π/2
atan2(y,x) sensitive near (x,y) → (0,0)
accuracy does not improve meaningfully beyond modest iteration counts
Identified and documented deterministic numerical failure regions:
mismatched output shifts → sign inversion or clipping
Formal Verification
The core iterative datapath and control logic have been formally verified:
convergence direction selection
iteration sequencing
handshake correctness
quadrant handling in vectoring modes
Formal analysis is focused on the base CORDIC iteration engine and control FSM. Extended mathematical functions (linear and hyperbolic modes, transcendental wrappers) are designed and validated through simulation and numerical analysis, and are intentionally out of scope for formal proof.
Drop-In CORDIC Cores
| Family | Keywords |
| --- | --- |
| A | iterative, control, low-area |
| B | feed-forward, streaming, throughput |
| C | SIMD, spatial, vector |
| D | multi-issue, interleaved, shared datapath |
Family A — Control-Oriented / Iterative
| Core ID | What it is | What it does |
| --- | --- | --- |
| A1 | Iterative, rolled datapath | Minimum-area baseline core |
| A2 | Iterative, unrolled | Reduced latency, no pipelining |
| A3 | Iterative + micro-pipeline | Shortened critical path (2-stage) |
| A4 | Iterative + deeper micro-pipeline | Higher Fmax (3-stage) |
Family B — Throughput-Oriented / Feed-Forward
| Core ID | What it is | What it does |
| --- | --- | --- |
| B1 | Fully unrolled, single-stage | One result per cycle, minimal depth |
| B2 | Unrolled, fixed-cut pipeline | Simple 2-stage throughput pipeline |
| B3 | Unrolled, balanced pipeline | Evenly distributed 2-stage pipeline |
| B4 | Unrolled, deep pipeline | 3-stage, high-frequency design |
Family C — SIMD / Spatial / Vectorized
| Core ID | What it is | What it does |
| --- | --- | --- |
| CS | Scalar unrolled | Single-lane reference |
| CR2 | Replicated ×2 | Two parallel lanes |
| CR4 | Replicated ×4 | Four parallel lanes |
| CSR4 | Replicated ×4 + shared ROM | Area-reduced quad lane |
| CVEC4 | Vector-packed SIMD | Packed SIMD datapath |
| CVEC5 | Refined SIMD | Shared resources, optimized |
Family D — Multi-Issue / Time-Interleaved
| Core ID | What it is | What it does |
| --- | --- | --- |
| D2X2S | 2 issues × 2 lanes | Simple dual-issue core |
| D4X2S | 2 issues × 4 lanes | Higher SIMD width |
| D4X2P | 2 issues × 4 lanes + pipeline | Improved Fmax |
| D4X4S | 4 issues × 4 lanes | High concurrency |
Technical Summary
The CORDIC core implements an iterative micro-rotation algorithm using only shifts and additions. Internal state updates follow the standard recurrence:
Rotation mode
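$$x_{i+1} = x_i - d_i\,(y_i \gg i)$$
$$y_{i+1} = y_i + d_i\,(x_i \gg i)$$
$$z_{i+1} = z_i - d_i\,\arctan(2^{-i})$$
$$d_i = \begin{cases} +1 & \text{if } z_i \ge 0 \\ -1 & \text{if } z_i < 0 \end{cases}$$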
Vectoring mode
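$$x_{i+1} = x_i + d_i\,(y_i \gg i)$$
$$y_{i+1} = y_i - d_i\,(x_i \gg i)$$
$$z_{i+1} = z_i + d_i\,\arctan(2^{-i})$$
$$d_i = \begin{cases} +1 & \text{if } y_i \ge 0 \\ -1 & \text{if } y_i < 0 \end{cases}$$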
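The rotation-mode recurrence can be exercised with a short integer model. This is a sketch assuming 16 iterations, 30 fractional bits, and an initial x of 1/K so that no post-scaling is needed; those choices and the function name are illustrative and are not taken from the RTL.

```python
import math


def cordic_rotate(angle, iters=16, frac_bits=30):
    """Circular rotation-mode CORDIC on integers; returns (cos, sin) as floats.

    Valid for |angle| within the CORDIC convergence range (about 1.74 rad).
    """
    scale = 1 << frac_bits
    atan_tab = [round(math.atan(2.0 ** -i) * scale) for i in range(iters)]
    gain = 1.0
    for i in range(iters):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))    # CORDIC gain K
    x = round(scale / gain)                         # gain-compensated start vector
    y = 0
    z = round(angle * scale)
    for i in range(iters):
        d = 1 if z >= 0 else -1
        # Python's >> on negative ints is an arithmetic shift, like signed RTL
        x, y, z = x - d * (y >> i), y + d * (x >> i), z - d * atan_tab[i]
    return x / scale, y / scale


cos_half, sin_half = cordic_rotate(0.5)
```

With 16 iterations the residual angle is bounded by roughly arctan(2⁻¹⁵), which is consistent in order of magnitude with the ~1e-5 errors quoted for the core.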
The datapath is fully parameterized by:
internal width (WIDTH)
iteration depth (ITER)
angle fractional bits
output width and scaling shifts
Wrappers & Scaling Semantics
Wrappers initialize the core with gain-compensated vectors and extract results via deterministic right shifts:
sin / cos: final y / x outputs
tan: extended-precision ratio output
mag: final x magnitude
atan2: accumulated angle output
Correct scaling is mandatory; mismatched shifts reproducibly cause saturation, amplitude loss, or numerical collapse.
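The vectoring-mode outputs (mag and atan2) can be sketched the same way, using the sign convention of the vectoring equations above. The model below assumes a right-half-plane input (x > 0), divides out the CORDIC gain after the iterations, and uses 24 fractional bits; all of these are illustrative assumptions rather than the wrapper's actual scaling.

```python
import math


def cordic_vector(x0, y0, iters=16, frac_bits=24):
    """Circular vectoring-mode CORDIC; returns (magnitude, angle) for x0 > 0."""
    scale = 1 << frac_bits
    atan_tab = [round(math.atan(2.0 ** -i) * scale) for i in range(iters)]
    gain = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(iters))
    x, y, z = round(x0 * scale), round(y0 * scale), 0
    for i in range(iters):
        d = 1 if y >= 0 else -1                 # drive y toward zero
        x, y, z = x + d * (y >> i), y - d * (x >> i), z + d * atan_tab[i]
    return (x / scale) / gain, z / scale        # x holds K * magnitude


mag, ang = cordic_vector(3.0, 4.0)
```

Inputs in the left half-plane would need the quadrant preconditioning mentioned earlier before entering this loop.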
Formal Verification Strategy
Formal verification focuses on structural correctness, not numerical accuracy.
Common properties (all modules):
single-cycle start and valid
no overlapping transactions
bounded progress (≤ ITER + 1 cycles)
deadlock-free operation
input stability while busy
output range safety
Rotation mode assertions:
odd/even symmetry (sin(-x), cos(-x))
local monotonicity near zero
sign consistency across sin/cos/tan
Vectoring mode assertions:
magnitude symmetry across quadrants
correct zero-vector handling
atan2 odd symmetry and π-rotation consistency
All properties are written in purely parametric SVA, proven using SymbiYosys + Yices2, and scale automatically with configuration.
Numerical accuracy is intentionally validated via simulation and reference comparison, not formal proof.
Repository
Peripheral Serial Communication Protocols - I2C / SPI / UART-TX
Implemented a collection of peripheral serial communication interfaces in Verilog, focusing on synthesizable, parameterizable controllers suitable for FPGA/ASIC integration. Each protocol includes a cleanly modularized interface, configurable timing parameters, and testbench-driven validation with waveform inspection.
I2C Master Controller – Implements a single-master, multi-slave I²C bus supporting standard-mode timings, programmable SCL low/high periods, ACK/NACK handling, and clock stretching detection. Features deterministic START/STOP generation, byte-wise transfers, and address+data framing logic.
SPI Master (Modes 0–3) – Supports CPOL/CPHA mode selection, 8-bit full-duplex transfers, configurable SCLK division, selectable slave-select behavior, and MSB-first shifting. Designed with a compact FSM and separate TX/RX shift registers for predictable cycle behavior.
UART TX Soft-Core IP – Lightweight serial transmitter with baud-rate generator, start/stop framing, data-valid gating, and FIFO-less single-byte serialization. Fully synthesizable and intended as a drop-in peripheral for SoCs, teaching cores, or FPGA peripheral sets.
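As a reference for the UART framing, the sketch below builds the bit sequence for one byte assuming the common 8N1 format with LSB-first data. The frame format of the actual IP is not spelled out above, so treat this as an assumption.

```python
def uart_frame(byte):
    """Return the line levels for one 8N1 frame: start bit, 8 data bits LSB first, stop bit."""
    data_bits = [(byte >> i) & 1 for i in range(8)]
    return [0] + data_bits + [1]    # line idles high, start = 0, stop = 1


frame = uart_frame(0x55)
```

Each entry is held on the TX line for one baud period, so a frame occupies ten bit times.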
Repositories
Basic Python Tool for ISCAS’85/’89 Benchmark Analysis & Fault-Modeling
A work-in-progress open-source Python package implementing foundational DFT/Fault-modeling utilities for ISCAS’85 and ISCAS’89 benchmark circuits. Includes automatic Verilog netlist generation, random testbench creation, serial/parallel fault simulation, fault collapsing, SCOAP metric computation, and initial ATPG experimentation. PODEM implementation is under development.
Generates structural Verilog (.v) files and matching randomized testbenches from ISCAS netlists; injects stuck-at faults using ID-indexed assign overrides appended to the generated modules.
Implements serial fault simulation and parallel bit-packed fault simulation, enabling coverage estimation under random vector sets.
Automated reporting: coverage tables, detected/undetected fault lists, fault dictionaries, and comparison across vector batches.
Supports fault collapsing using dominance & equivalence relations; identifies FFR-based partitions and reduces fault sets prior to simulation.
Computes SCOAP controllability (CC0/CC1) and observability (CO) metrics for every internal node, enabling analysis of hard-to-control or hard-to-observe circuit regions.
Initial ATPG experiments performed using SCOAP-guided heuristics; the current PODEM implementation is incomplete, with recursion-termination issues still being debugged.
Planning to add scan-chain insertion workflows for ISCAS’89 sequential circuits, enabling full-scan ATPG comparisons with random simulation.
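Serial stuck-at fault simulation can be illustrated on a tiny circuit. The sketch below uses the six-NAND c17 netlist as it is commonly published (transcribed here from memory, so treat it as illustrative) and injects one stem fault at a time; it is not the package's API, and the function names are hypothetical.

```python
from itertools import product

# c17: each gate is (output, input_a, input_b), all 2-input NANDs, in topological order
GATES = [("10", "1", "3"), ("11", "3", "6"), ("16", "2", "11"),
         ("19", "11", "7"), ("22", "10", "16"), ("23", "16", "19")]
INPUTS = ["1", "2", "3", "6", "7"]
OUTPUTS = ["22", "23"]


def simulate(vector, fault=None):
    """Evaluate the circuit; fault is (net, stuck_value) or None for the good machine."""
    nets = {}

    def drive(name, value):
        if fault is not None and fault[0] == name:
            value = fault[1]                    # override the net with the stuck value
        nets[name] = value

    for name, value in zip(INPUTS, vector):
        drive(name, value)
    for out, a, b in GATES:
        drive(out, 1 - (nets[a] & nets[b]))
    return tuple(nets[o] for o in OUTPUTS)


def fault_coverage(vectors):
    """Serial fault simulation: one faulty machine per fault, compared against the good outputs."""
    faults = [(n, v) for n in INPUTS + [g[0] for g in GATES] for v in (0, 1)]
    detected = {f for f in faults
                if any(simulate(vec, f) != simulate(vec) for vec in vectors)}
    return faults, detected


all_vectors = list(product((0, 1), repeat=len(INPUTS)))
faults, detected = fault_coverage(all_vectors)
coverage = len(detected) / len(faults)
```

A parallel bit-packed simulator would evaluate many vectors per gate operation by packing them into the bits of one integer; the serial form above is the simpler baseline.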
Repositories
Predictive ML EDA Framework for Routability Congestion Prediction
WORK IN PROGRESS
Developed a deep learning–based EDA framework to predict routing congestion heatmaps prior to detailed routing, enabling early-stage routability awareness during global placement.
Trained an encoder–decoder convolutional neural network using the CircuitNet-N14 dataset based on 14 nm FinFET technology, covering diverse designs including RISC-V cores, GPUs, and ML accelerators.
Modeled congestion behavior using placement-stage features such as macro region density and RUDY-based routing demand representations.
Implemented an end-to-end PyTorch training and data ingestion pipeline for learning pixel-level congestion patterns from backend design data.
Generated spatial congestion maps highlighting routing hotspots to support placement optimization and backend design decision-making.
Repository
FIR Accelerator for Microwatt | OpenPOWER Hardware Hackathon
A parameterizable FIR filter acceleration block designed for Microwatt/OpenFrame systems.
Implements a sequential multiply accumulate datapath, programmable coefficients, and a clean Wishbone-Lite register interface for CPU control.
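A sequential multiply-accumulate FIR can be modelled in a few lines. The sketch below assumes Q1.15 coefficients, a wide accumulator, and truncation by right shift at the output; the actual word widths and rounding of the block are not stated here, and the function name is illustrative.

```python
def fir_sequential(samples, coeffs_q15):
    """Fixed-point FIR computed one tap per 'cycle' with a single shared MAC."""
    taps = len(coeffs_q15)
    delay_line = [0] * taps
    outputs = []
    for sample in samples:
        delay_line = [sample] + delay_line[:-1]     # shift the new sample in
        acc = 0                                     # wide accumulator
        for k in range(taps):                       # sequential MAC loop
            acc += delay_line[k] * coeffs_q15[k]
        outputs.append(acc >> 15)                   # drop the 15 fractional bits
    return outputs


half = 1 << 14                                      # 0.5 in Q1.15
averaged = fir_sequential([100, 200, 300], [half, half])
```

With N taps the sequential form needs N cycles per output sample, trading throughput for a single multiplier.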
The design was accepted for potential fabrication during the hackathon review. The final taped-out submission was not completed due to timing constraints.
FIR Accelerator - Sequential Fixed-Point Filtering Unit
A collection of beginner-friendly ASIC design experiments where I am still learning the full RTL to GDS flow.
I automated synthesis steps, generated schematic views, pushed complete flows through OpenLane and OpenROAD, produced layout snapshots, and verified the final GDS outputs.
Both designs behave as compact test vehicles for understanding physical design stages and polishing my flow setup.
A focused functional verification environment built around the open-source SHA256 core from secworks.
I created a structured suite of directed, random, corner-case, and negative fail-case tests, automated through a TCL-driven flow that compiles, runs, checks, and aggregates results.
The goal was to push the core through a wide coverage surface, verify digest correctness, exercise interface timing, and validate the robustness of the verification infrastructure itself.
Implemented a unified verification environment with fully self-checking testbenches.
Automated all compilation and simulation using TCL so that every test runs in a single command.
Verified correct digest generation for standard, multi-block, random, and corner-case stimuli.
Stressed control behavior by injecting malformed sequences, undefined values, and reversed block ordering.
Collected per-test logs and a consolidated summary to rapidly detect regressions.
Treated testbench infrastructure as a verification target by validating mismatch detection and protocol violation handling.
Technical Summary
The SHA-256 core is verified directly against the mathematical definition of its compression function. Each testbench checks that the DUT produces digests matching the iterative hashing rule
$$H^{(0)} = H_{\mathrm{IV}}$$
and for each message block $M^{(i)}$:
$$H^{(i+1)} = F\left(H^{(i)}, M^{(i)}\right)$$
Round Computation
Each round computes two temporary values:
$$T_1 = h + \Sigma_1(e) + \mathrm{Ch}(e, f, g) + K_t + W_t$$
$$T_2 = \Sigma_0(a) + \mathrm{Maj}(a, b, c)$$
Then the working registers update as:
$$h = g,\quad g = f,\quad f = e,\quad e = d + T_1,\quad d = c,\quad c = b,\quad b = a,\quad a = T_1 + T_2$$
Message Schedule
For each block the scheduler generates:
$$W_t = \begin{cases} M_t & t < 16 \\ \sigma_1(W_{t-2}) + W_{t-7} + \sigma_0(W_{t-15}) + W_{t-16} & t \ge 16 \end{cases}$$
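A compact software reference makes these equations concrete and can serve as the source of expected digests. The sketch below is one possible golden model, not the testbench code: it derives the round constants and initial hash from the first primes using integer roots, applies the round and schedule equations above, and adds the working registers back into the state at the end of each block. All helper names are illustrative.

```python
import hashlib
import math

MASK = 0xFFFFFFFF


def first_primes(n):
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def icbrt(n):
    """Integer floor cube root by binary search."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 2)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


# First 32 fractional bits of the cube/square roots of the first primes
K = [icbrt(p << 96) & MASK for p in first_primes(64)]
H_IV = [math.isqrt(p << 64) & MASK for p in first_primes(8)]


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK


def compress(state, block):
    """One application of F: message schedule, 64 rounds, then feed-forward addition."""
    w = [int.from_bytes(block[4 * t:4 * t + 4], "big") for t in range(16)]
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((s1 + w[t - 7] + s0 + w[t - 16]) & MASK)
    a, b, c, d, e, f, g, h = state
    for t in range(64):
        big_s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + big_s1 + ch + K[t] + w[t]) & MASK
        big_s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (big_s0 + maj) & MASK
        h, g, f, e, d, c, b, a = g, f, e, (d + t1) & MASK, c, b, a, (t1 + t2) & MASK
    return [(s + v) & MASK for s, v in zip(state, (a, b, c, d, e, f, g, h))]


def sha256_hex(message):
    """Pad the message, chain compress() over 64-byte blocks, return the hex digest."""
    padded = message + b"\x80" + b"\x00" * ((55 - len(message)) % 64)
    padded += (len(message) * 8).to_bytes(8, "big")
    state = list(H_IV)
    for i in range(0, len(padded), 64):
        state = compress(state, padded[i:i + 64])
    return b"".join(s.to_bytes(4, "big") for s in state).hex()


matches_hashlib = sha256_hex(b"abc") == hashlib.sha256(b"abc").hexdigest()
```

Comparing against `hashlib` on single-block and multi-block messages mirrors the expected-versus-actual digest check described for the regression flow.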
Verification Focus
The environment validates:
correct initialization of internal hash state
correct update propagation through all 64 rounds
multi-block chaining behavior for long messages
final digest stability when digest_valid asserts
robustness under random, corner-case, and adversarial stimuli
TCL Automation
The TCL flow executes a deterministic verification loop:
compile all testbenches
run all simulations
compare expected vs actual digests
merge logs into a single summary report
This provides a repeatable regression pipeline for SHA-256 functional verification.
Verification Coverage
Single-block digest generation for standard vectors
Multi-block message chaining and state propagation
Randomized 512-bit stimuli to probe broad input space behavior
Corner patterns including all-zero, all-one, and alternating data
Mode selection behavior across supported operational modes
Protocol timing correctness for control signals
Intentional failure injection to test mismatch handling and robustness
Automation Workflow
The TCL flow executes the entire suite:
compile all tbs
run all tests
collect logs
summarize results
This produces structured outputs with test names, expected vs actual digests, and overall regression status.
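The compare-and-summarize step could look roughly like the Python sketch below. The actual flow is TCL-driven; this only illustrates the shape of the summary, and `DigestCheck` and `summarize` are hypothetical names, not part of the repository.

```python
from dataclasses import dataclass

@dataclass
class DigestCheck:
    name: str
    expected: str
    actual: str

    @property
    def passed(self) -> bool:
        # Hex digests are compared case-insensitively.
        return self.expected.lower() == self.actual.lower()

def summarize(results):
    """Return (report text, overall pass flag) for a list of DigestCheck."""
    lines = []
    for r in results:
        status = "PASS" if r.passed else "FAIL"
        lines.append(f"{status}  {r.name}  expected={r.expected}  actual={r.actual}")
    failed = sum(not r.passed for r in results)
    total = len(results)
    verdict = "CLEAN" if failed == 0 else "BROKEN"
    lines.append(f"{total - failed}/{total} passed; regression {verdict}")
    return "\n".join(lines), failed == 0
```

Negative fail-case tests would be expected to show up as FAIL here, which is how the mismatch detection of the infrastructure itself gets exercised.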
Repository Card
Analog Circuits & Device-Level Design
Device Modeling using Sentaurus TCAD |
Link
Performed semiconductor device modeling using Synopsys Sentaurus for foundational structures including N-type resistors, PN diodes, and NMOS transistors.
Explored how doping profiles, junction depths, geometry parameters, and physical models impact device characteristics through calibrated simulations and scripted workflows.
Overview
Built parameterized device structures (concentration profiles, implant energies, lateral/vertical dimensions) using Sentaurus Structure Editor and process definition files.
Configured Sentaurus Device with transport and recombination models (SRH, Auger, mobility models, incomplete ionization where relevant) to study semiconductor behavior under applied bias.
Automated simulation runs in Sentaurus Workbench using command-based .cmd and .des scripts for sweeping doping levels, voltages, and geometry parameters.
Analyzed simulation output with Sentaurus Visual/Inspect, examining electrostatic potential maps, electron/hole concentration distributions, electric field intensity, and I–V characteristics.
Extracted device metrics such as diode forward/reverse characteristics, NMOS transfer/output curves, threshold behavior, and resistance scaling for the N-type resistor.
Repository
CMOS Inverter Layout (Magic VLSI) & Ngspice Simulation |
Link
A complete CMOS inverter implementation built using Magic VLSI (SCMOS) for physical layout and ngspice for extracted-device simulation. Covers device construction rules under the SCMOS process, physical layout of PMOS/NMOS devices, contact/tap structures, parasitic-aware extraction, and transient analysis of inverter switching characteristics.
The layout follows the SCMOS ruleset:
PMOS implemented inside an n-well using p-diffusion; body tied to the well tap (VDD).
NMOS implemented directly in the p-substrate using n-diffusion; body tied to substrate tap (GND).
Poly crossing active regions forms the MOS channel; poly, metal1, and contact stack-up follows SCMOS vertical connectivity.
Extraction produces a transistor-level .spice netlist including geometry-derived parasitics. Transient simulation evaluates:
Static noise margins and switching point displacement due to device sizing.
Rise/fall asymmetry from mobility difference (μₙ > μₚ, typically by 2–3×).
Output slew vs. load capacitance and PMOS/NMOS drive ratio.
Propagation delays under 1.8 V operation using level-1 MOS models.
DC Analysis Results

| Parameter | Value |
| --- | --- |
| VOH (Output High Voltage) | 1.800001 V |
| VOL (Output Low Voltage) | 1.885403e-08 V |
| VM (Switching Threshold) | 0.9502793 V |
| Temperature | 27 °C |
| Number of Data Rows | 1801 |
Transient Analysis Results

| Parameter | Value | Unit |
| --- | --- | --- |
| TPHL (High → Low Delay) | 2.823489e-10 | s |
| TPLH (Low → High Delay) | 2.160067e-10 | s |
| Rise Time (trise) | 5.010355e-10 | s |
| Fall Time (tfall) | 5.307928e-10 | s |
| Average Current (iavg) | -1.395092e-06 | A |
| Average Power (pavg) | 2.51117e-06 | W |
| Temperature | 27 | °C |
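As a quick consistency check on the reported numbers, the average propagation delay and the average power can be recomputed from the table values. This is a small sketch using only the figures above; the variable names are mine.

```python
VDD = 1.8                     # supply voltage (V)
t_phl = 2.823489e-10          # high-to-low delay (s)
t_plh = 2.160067e-10          # low-to-high delay (s)
i_avg = -1.395092e-06         # average supply current (A)
p_avg_reported = 2.51117e-06  # reported average power (W)
v_m = 0.9502793               # switching threshold (V)

t_p = 0.5 * (t_phl + t_plh)   # average propagation delay
p_avg = VDD * abs(i_avg)      # average power drawn from the supply
vm_offset = v_m - VDD / 2     # shift of the switching point from mid-rail

print(f"t_p       = {t_p * 1e12:.1f} ps")
print(f"p_avg     = {p_avg * 1e6:.3f} uW (reported {p_avg_reported * 1e6:.3f} uW)")
print(f"VM offset = {vm_offset * 1e3:.1f} mV above VDD/2")
```

The recomputed power agrees with the reported value, and the switching threshold sits about 50 mV above mid-rail.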
The repository includes the Magic layout (.mag), extracted netlists, wrapper files for stimulus, and generated ngspice waveforms.
Repository
Two-Stage CMOS Operational Amplifier with Miller Compensation |
Link
A two-stage CMOS op-amp designed in TSMC 180 nm, using an NMOS differential input pair with PMOS current-mirror load, followed by a common-source second stage.
Frequency compensation is implemented using a Miller capacitor between the first-stage output and the second-stage output node, producing dominant-pole behavior and stable unity-gain operation.
Device dimensions were set from closed-form analog constraints:
Slew-rate requirement → tail bias current and overdrive allocation
GBW requirement → input-pair transconductance and CC relationship
ICMR bounds → saturation margins for the differential pair and tail device
Output swing → overdrive and saturation limits for the second stage
Pole-splitting → ratio gₘ₆/CC and non-dominant pole placement
Simulation results:
Open-loop gain: ~53.1 dB
Unity-gain bandwidth: ~4.35 MHz
Dominant pole: ~9.6 kHz
Phase margin: ~60° with Miller compensation
Slew rate: ~10 V/µs from Ibias/CC
Output swing: ~0.14 V to ~1.03 V (linear region, no distortion at 1 kHz)
CMRR: ~32 dB
PSRR: +64.6 dB / –80.8 dB
Power consumption: ~1 mW with ±2.5 V rails
Operating-point analysis confirms all MOS devices remain in saturation with expected overdrive values, and both transient and AC characteristics match analytical pole/zero predictions for a Miller-compensated two-stage topology.
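The reported figures can be cross-checked against the usual first-order relations for a Miller-compensated two-stage amplifier: GBW ≈ A0 · f_p1, GBW = g_m1 / (2π·CC), and SR = I_tail / CC. The sketch below assumes square-law devices; the implied input-pair overdrive is a derived estimate, not a value taken from the design.

```python
import math

a0_db = 53.1          # open-loop gain (dB)
f_p1 = 9.6e3          # dominant pole (Hz)
gbw_reported = 4.35e6 # unity-gain bandwidth (Hz)
slew_rate = 10e6      # 10 V/us expressed in V/s

# Single-pole approximation: GBW = A0 * f_p1
a0 = 10 ** (a0_db / 20)
gbw_est = a0 * f_p1

# With gm1 = I_tail / Vov (square law), SR / (2*pi*GBW) equals the
# input-pair overdrive Vov, independent of the compensation capacitor.
v_ov_input = slew_rate / (2 * math.pi * gbw_reported)

print(f"A0 = {a0:.0f} V/V, estimated GBW = {gbw_est / 1e6:.2f} MHz")
print(f"Implied input-pair overdrive = {v_ov_input * 1e3:.0f} mV")
```

The estimated GBW lands within about 1% of the simulated 4.35 MHz, which supports the dominant-pole picture.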
A 5-stage CMOS inverter ring used as a voltage-controlled delay line, producing oscillation whose frequency scales with the control voltage.
A 3-stage buffer isolates the oscillator core and restores the internal sine-like waveform into a full-swing CMOS square wave.
The oscillator operates from 0.7–3.0 V control input and shows a monotonic delay reduction with increasing drive strength.
Measured characteristics
Frequency range: 0.724–1.93 GHz
Linear KVCO region: ~2.1 GHz/V for 0.7–1.2 V
Frequency saturation: begins above ~1.8 V as inverter delay approaches its minimum
Core waveform: ~0.3–1.7 V swing with rounded edges
Buffered output: 0–1.8 V square wave, ~50% duty cycle
Startup time: ~0.5–0.8 ns to reach steady oscillation
Simulation sweep: confirmed monotonic f–V relation and early compression through parametric input stepping
Frequency points

| Vctrl (V) | f (GHz) |
| --- | --- |
| 0.7 | 0.724 |
| 0.8 | 1.107 |
| 1.0 | 1.59 |
| 1.2 | 1.76 |
| 1.5 | 1.88 |
| 2.0 | 1.92 |
| 2.5 | 1.928 |
| 3.0 | 1.9298 |
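The tuning gain and per-stage delay can be recovered from the frequency points with a few lines of Python. This sketch uses the standard ring-oscillator relation f = 1 / (2·N·t_d) with N = 5 stages; the function names are illustrative.

```python
POINTS = [(0.7, 0.724), (0.8, 1.107), (1.0, 1.59), (1.2, 1.76),
          (1.5, 1.88), (2.0, 1.92), (2.5, 1.928), (3.0, 1.9298)]  # (V, GHz)

def kvco(v_lo, v_hi):
    """Average tuning gain in GHz/V between two tabulated control voltages."""
    f = dict(POINTS)
    return (f[v_hi] - f[v_lo]) / (v_hi - v_lo)

def stage_delay_ps(f_ghz, n_stages=5):
    """Per-stage delay from f = 1 / (2 * N * t_d), in picoseconds."""
    return 1e3 / (2 * n_stages * f_ghz)

def is_monotonic():
    freqs = [f for _, f in POINTS]
    return all(b > a for a, b in zip(freqs, freqs[1:]))

print(f"KVCO (0.7-1.2 V)   = {kvco(0.7, 1.2):.2f} GHz/V")
print(f"t_d at 0.724 GHz   = {stage_delay_ps(0.724):.1f} ps")
print(f"t_d at 1.9298 GHz  = {stage_delay_ps(1.9298):.1f} ps")
print(f"monotonic f-V      = {is_monotonic()}")
```

The average slope over 0.7–1.2 V comes out near 2.07 GHz/V, in line with the ~2.1 GHz/V quoted above, and the per-stage delay bottoms out near 52 ps.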
Repository
Analog Function Generator with Adjustable Amplitude/Offset/Phase |
Link
A multi-waveform analog function generator built using discrete op-amp blocks (TL082), passive RC networks, and a CD4051 analog multiplexer.
The generator produces sine, square, and triangular outputs and exposes continuous control of amplitude, DC offset, and phase.
Additional AM/PM blocks and a relaxation-oscillator VCO extend the system for modulation experiments.
The signal path is fully modular: each block is buffered to avoid inter-stage loading errors, enabling predictable behavior across a 1 kHz–500 kHz operating band.
Design, analysis, and LTspice simulation of audio power amplifier classes, combining analytical power calculations, circuit-level simulation, and audio-domain validation.
Digital audio signals are preprocessed using FFmpeg and injected directly into LTspice for waveform-accurate listening tests.
Scope
Class B & Class AB push-pull power amplifiers
Complementary BJT output stage under ±20 V supply
Analytical evaluation of input power, output power, and efficiency
Demonstrates crossover distortion and diode-biased reduction
Class A, B, AB audio amplifiers (LTspice)
DC operating-point, AC, and transient analysis
FFmpeg-based MP3 → WAV conversion for PWL/audio voltage sources
Exported output WAV files for subjective audio comparison
Key Results
Class B efficiency: ~67% (severe crossover distortion)
Class AB efficiency: ~69% with significantly reduced distortion
Class A: lowest distortion, highest power dissipation
Class AB: best overall efficiency–audio quality tradeoff
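For reference, the ideal Class B efficiency depends only on the ratio of peak output swing to supply: η = (π/4)·(V_p / V_CC), with a ceiling of π/4 ≈ 78.5%. The sketch below uses that textbook relation, neglecting saturation voltage and bias current; the peak swing it back-calculates is an inference from the ~67% figure, not a value reported in the repository.

```python
import math

def class_b_efficiency(v_peak, v_cc):
    """Ideal Class B efficiency for a sinusoidal output of amplitude v_peak."""
    return math.pi * v_peak / (4 * v_cc)

def peak_for_efficiency(eta, v_cc):
    """Output amplitude that yields a given ideal Class B efficiency."""
    return 4 * eta * v_cc / math.pi

V_CC = 20.0  # supply rail magnitude from the +/-20 V stage
print(f"Max ideal efficiency: {class_b_efficiency(V_CC, V_CC) * 100:.1f} %")
print(f"Swing implied by 67 %: {peak_for_efficiency(0.67, V_CC):.1f} V peak")
```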
The repository includes LTspice schematics, FFmpeg preprocessing workflow, analytical calculations, and simulation/audio outputs.
Repository
CMOS Bandgap Reference (DC, Temp, Line, Monte Carlo) |
Link
LTspice simulation and characterization of a CMOS bandgap reference implemented in an OSU 180 nm CMOS process, with emphasis on Monte Carlo mismatch analysis alongside functional verification.
The work validates temperature compensation, line regulation, startup behavior, and statistical variation of Vref.
Scope
Functional verification
DC operating point and transient startup
PTAT / CTAT cancellation across temperature
Vref stability vs temperature (−40 °C to 140 °C)
Line regulation under supply sweep (2 V–4 V)
Zero-current equilibrium and startup necessity
Monte Carlo mismatch analysis
Statistical Vref variation due to resistor mismatch
Operating-point–based evaluation at nominal conditions
Monte Carlo Results (100 Runs)
Mean Vref: 1.20892 V
Minimum Vref: 1.19687 V
Maximum Vref: 1.22191 V
Standard deviation (σ): 4.60 mV (≈ 3800 ppm of the mean)
Peak-to-peak spread: ~25 mV (≈ 2.1% of the mean)
The results indicate that reference accuracy is dominated by passive matching, consistent with an untrimmed CMOS bandgap implementation. No curvature correction or trimming was applied.
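The relative spread figures follow directly from the Monte Carlo statistics. A short sketch of the arithmetic, using only the values listed above:

```python
v_mean = 1.20892   # V
v_min = 1.19687    # V
v_max = 1.22191    # V
sigma = 4.60e-3    # V

p2p = v_max - v_min                  # peak-to-peak spread (V)
p2p_percent = 100 * p2p / v_mean     # relative to the mean
sigma_ppm = 1e6 * sigma / v_mean     # one-sigma spread in ppm
three_sigma_mv = 3 * sigma * 1e3     # +/-3 sigma band (mV)

print(f"peak-to-peak = {p2p * 1e3:.2f} mV ({p2p_percent:.2f} %)")
print(f"sigma        = {sigma_ppm:.0f} ppm of mean")
print(f"+/-3 sigma   = +/-{three_sigma_mv:.1f} mV")
```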
The repository includes LTspice schematics, simulation directives, waveform data, and Monte Carlo result plots.
Repository
Precision PID Controller Design using Operational Amplifiers |
Link
An analog PID controller built using high-linearity op-amps (LT1007 / TL082) and RC networks, implemented entirely in continuous time and validated through LTspice.
The design focuses on stable low-frequency integration, controlled differentiation without noise peaking, and diode-based output limiting for robust transient behavior.
Two complete controller variants were implemented: one minimal, one extended with gain scaling and anti-windup.
Measured / designed characteristics
Differential stage: unity-gain differential amplifier with high CMRR for clean error sensing
Integrator: 10 ms time constant → \(K_i \approx 100\ \text{s}^{-1}\), \(f_c \approx 16\ \text{Hz}\)
Loop-gain boost ≈ 9.5 dB
Derivative network: RC shaping with controlled high-frequency roll-off to prevent noise amplification
Output swing protection: diode clamps maintaining bounded actuation signal under large transients
Op-amp choices: LT1007 for low noise and precision; TL082 as a low-cost, wide-bandwidth alternative
Simulation: full closed-loop Bode, transient, load-step and saturation recovery tests in LTspice
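The integrator numbers follow from the time constant alone: K_i = 1/τ and the unity-gain frequency f_c = 1/(2πτ). A minimal sketch of that arithmetic, assuming an ideal op-amp integrator with transfer function K_i/s (function names are illustrative):

```python
import math

TAU = 10e-3  # integrator time constant R*C (s)

k_i = 1 / TAU                  # integral gain (1/s)
f_c = 1 / (2 * math.pi * TAU)  # frequency where |K_i / (j*2*pi*f)| = 1

def integrator_gain_db(f_hz):
    """Magnitude of the ideal integrator K_i/s at frequency f, in dB."""
    return 20 * math.log10(k_i / (2 * math.pi * f_hz))

print(f"K_i = {k_i:.0f} 1/s, f_c = {f_c:.1f} Hz")
print(f"gain at 1 Hz = {integrator_gain_db(1.0):.1f} dB")
```

Below f_c the integrator adds gain at 20 dB per decade, which is what drives the steady-state error toward zero.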
Second PID variant
10× front-end gain for small-signal plant feedback
Dual-integrator configuration for deeper low-frequency suppression
Anti-windup: diode shunts + soft-limiting network to prevent integrator runaway
Stable recovery under saturation and high-error conditions
Design intent
preserve linearity and phase margin across low-frequency operation
condition derivative action to avoid overshoot due to high-frequency noise
offer two architectures: a clean textbook PID and a high-authority PID with controlled limiting
Repository
Robotics and Applied ML (Secondary)
ANAV for Martian Surface Exploration / GNSS-Denied Environments (ISRO IRoC-U 2025) |
Link
A sub-2 kg autonomous quadrotor designed for GNSS-denied navigation, visual–inertial localization, mapping, and safe-zone landing, using onboard compute, stereo sensing, and redundant measurement sources.
Built a <2 kg quadrotor integrating Jetson Nano for onboard processing and Pixhawk 4 for attitude/stability, targeting GNSS-denied missions requiring drift-constrained localization and controlled landing.
Completed ESC calibration, thrust-balancing, and regulated 5 V / 3 A power distribution using BEC modules for stable sensor/compute operation under load variations.
Integrated barometer, optical flow, and stereo-IMU sensing for multi-source position estimation with fallbacks against low-texture drift.
Fused RealSense D435i stereo + IMU using VINS-Fusion (ROS2) and evaluated against ORB-SLAM3, achieving <5 cm drift over ~5 m trajectories in indoor GNSS-denied tests.
Implemented ESP-Now telemetry using ESP32 modules with ~500 m LOS range for transmitting state, estimation residuals, and system health.
Verified autonomous landing on 1.5 m × 1.5 m clear regions and tolerances up to ~15° surface inclination.
Simulated Mars-like flight (~0.38 g gravity, no-GPS) in Webots, validating drift behavior, landing accuracy, sensing degradation, and control limits.
Technical Summary
Integrated Jetson Nano with Pixhawk 4 for onboard computation and flight handling, with calibrated ESCs and thrust mapping ensuring stable lift and attitude control for a <2 kg platform. Power regulation used a 5 V / 3 A BEC, isolating sensor/compute loads from motor-induced voltage drops.
Performed extrinsic and intrinsic calibration for the RealSense D435i (stereo + IMU) and aligned timestamps between Jetson and Pixhawk sources. Evaluated VIO accuracy using VINS-Fusion and ORB-SLAM3, testing sensitivity to feature density, motion blur, low-texture floors, and illumination. Achieved <5 cm drift over ~5 m sequences with optimized IMU noise parameters and RANSAC thresholds.
Connected barometer, optical-flow, and external sensors to Pixhawk over I2C/UART. Configured EKF2 to combine IMU, barometer, and flow when stereo data deteriorates. Implemented consistency checks between VIO and Pixhawk position estimates; deviations above a fixed threshold (~8–10 cm) trigger reliance on flow + barometer only.
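The VIO/EKF consistency check can be sketched as a simple distance gate. This is an illustration only: the function name, the source labels, and the 9 cm default (picked from the middle of the ~8–10 cm range above) are assumptions, not the flight code.

```python
import math

def select_position_source(vio_xyz, ekf_xyz, threshold_m=0.09):
    """Pick the position source based on VIO vs. Pixhawk EKF agreement.

    Returns (source, deviation_m): 'vio' while the two estimates agree,
    'flow+baro' once they diverge past the threshold.
    """
    deviation = math.dist(vio_xyz, ekf_xyz)
    source = "vio" if deviation <= threshold_m else "flow+baro"
    return source, deviation

# Example: 5 cm disagreement keeps VIO, 12 cm falls back.
print(select_position_source((1.00, 2.00, 0.50), (1.03, 2.04, 0.50)))
print(select_position_source((1.00, 2.00, 0.50), (1.12, 2.00, 0.50)))
```

A real implementation would likely add hysteresis or a hold time so the source does not chatter when the deviation hovers near the threshold.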
Implemented long-range ESP-Now telemetry between two ESP32 modules. Achieved ~500 m line-of-sight operation and <15 ms median latency. Data included estimated position, VIO confidence, EKF residuals, battery, and attitude.
Developed a method for landing region selection using disparity maps and IMU tilt. Evaluated a 1.5 m × 1.5 m safe area requirement; system rejected regions with irregular height profiles or slopes >15°. Confirmed consistent landings on textured and partially textured surfaces.
Conducted GNSS-denied simulations in Webots, setting gravity to 0.38 g to approximate Martian conditions. Assessed altitude holding, drift accumulation, and safe-area approach across multiple terrains. Logged estimator drift, thrust reserve, and landing dispersion to validate repeatability under constrained sensing.
A computer-vision–driven Rubik’s Cube solver built around color detection, face reconstruction, and algorithmic solution generation.
The system extracts cube state using calibrated imaging and solves it via a Kociemba two-phase search, which operates over the full 43,252,003,274,489,856,000 (~4.3×10¹⁹) state space of a standard 3×3 cube.
Live Demo: interactive cube visualization and solver interface.
Blog Series (PID – Project in Detail): detailed explanation of the color-space math, permutation constraints, cube group theory, and the intuition behind the solving algorithm.
Vision Processing and Cube State Extraction
Performed HSV-based per-face calibration with adjustable saturation/value envelopes to stabilize under variable illumination.
Applied contour filtering and grid isolation after morphological denoising to lock onto a valid 3×3 cell arrangement.
Executed homography-based perspective correction and grid segmentation, assigning colors by mean-HSV dominance.
Combined all six captures into a canonical 54-character cube state string, checked for:
valid center-orientation mapping,
edge/corner permutation parity,
orientation sum constraints (edges mod 2, corners mod 3).
Solution Generation and Simulation
Used a Unity visualization environment for state verification, stepwise execution, and intermediate-move replay.
Integrated Kociemba’s two-phase algorithm with explicit details:
Phase 1: reduces the cube into the H subgroup by constraining edge orientation (2¹¹ states), corner orientation (3⁷ states), and UD-slice edge placement (12 choose 4).
Search explores ≈ 2.2×10⁹ coordinate combinations (2¹¹ × 3⁷ × 495) but prunes aggressively using precomputed coordinate tables.
Phase 2: solves from H to the identity using restricted move set and coordinated distance tables over ≈ 2×10¹⁰ admissible states.
Typical generated solutions fall in the 18–22 move range (quarter-turn metric), with occasional optimal-length sequences for favorable states.
Viewer supports interactive updates, solution playback, and direct manipulation of state representations.
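The sizes of the two search spaces can be checked with integer arithmetic. The Phase 2 count below uses the standard decomposition of the subgroup H (8! corner permutations × 8! non-slice edge permutations × 4! slice-edge permutations, halved for parity); the variable names are mine.

```python
from math import comb, factorial

# Phase 1 coordinates: edge orientation, corner orientation, UD-slice placement
phase1 = 2**11 * 3**7 * comb(12, 4)

# Phase 2 (subgroup H): corner perm x non-slice edge perm x slice edge perm,
# divided by 2 because corner and edge permutation parities must match
phase2 = factorial(8) * factorial(8) * factorial(4) // 2

total = phase1 * phase2

print(f"phase 1 coordinates: {phase1:,}")
print(f"phase 2 states:      {phase2:,}")
print(f"full cube group:     {total:,}")
```

The product of the two counts reproduces the full state count quoted above, which is a handy sanity check on the coordinate definitions.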
Repositories
MRI-Based Alzheimer’s & MCI Classification using 3D CNNs |
Link
Implemented a full 3D medical-imaging classification pipeline for Alzheimer’s, MCI, and cognitively normal subjects using PyTorch/MONAI. Focused on volumetric preprocessing, stable normalization across scanners, and architecture search over 3D convolutional backbones.
Designed a unified DICOM/NIfTI preprocessing flow with voxel-size normalization, spatial reorientation, intensity Z-scoring, Nyúl histogram standardization, and optional radiomic-feature augmentation.
Built data transforms with 3D affine jitter, elastic deformation, anisotropic scaling, and bias-field augmentation to model scanner variability.
Implemented Med3DNet-style 3D CNNs with custom heads: channel-progressive blocks, SE/CBAM attention, depth-scheduled 3D convolutions, and dropout tuned via Bayesian optimization.
Used MONAI’s sliding-window inference, smart-cache loading, and mixed Gaussian/Rician noise regularization for stable training on full MRI volumes.
Performed Bayesian hyperparameter search over learning rates, kernel schedules, convolution depths, and ensemble configurations.
Achieved >93% accuracy on held-out structural MRI volumes with strong stability under cross-scanner shifts due to aggressive normalization and augmentation.
Tools: PyTorch • MONAI • NiBabel • 3D CNNs • Bayesian Optimization • Medical Image Preprocessing
PPO-Based Reinforcement Learning for Autonomous Racing on AWS DeepRacer |
Link
Built and fine-tuned continuous-action PPO agents on AWS SageMaker for camera-based autonomous racing. Focused on reward shaping, action-space optimization, and stability constraints that reduce off-track drift and maximize progress-per-step. Achieved sub-2-minute lap times, reaching top global leaderboard ranks in 2024.
Trained end-to-end vision policies using clipped PPO (v4) with a shallow convolutional encoder and continuous steering/speed control.
Designed multiple reward families emphasizing centerline stability, heading agreement, curvature-aware waypoint tracking, and velocity-weighted progress.
Used distance-band shaping (0.1/0.25/0.5× track-width thresholds) to stabilize early learning and suppress divergence near edges.
Added steering smoothness constraints to reduce high-jerk trajectories while allowing aggressive straight-line acceleration.
Tuned PPO hyperparameters (entropy annealing, clipping ε, GAE λ, advantage normalization) to avoid policy collapse in long-horizon tasks.
Evaluated robustness under simulated perturbations via waypoint jitter, curvature sweeps, and speed-limit randomization.
Final optimized agent consistently produced <2 min laps, outperforming default baselines.
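The distance-band shaping can be sketched in the form of a DeepRacer reward function. The band thresholds come from the description above; the reward values attached to each band are placeholders, and the other reward terms (heading, curvature, smoothness) are left out.

```python
def reward_function(params):
    """Distance-band reward: higher reward the closer the car is to the centerline."""
    track_width = params["track_width"]
    distance = params["distance_from_center"]

    # Leaving the track gets a near-zero reward regardless of position.
    if not params["all_wheels_on_track"]:
        return 1e-3

    # Bands at 0.1x, 0.25x and 0.5x of the track width (reward values assumed).
    if distance <= 0.1 * track_width:
        reward = 1.0
    elif distance <= 0.25 * track_width:
        reward = 0.5
    elif distance <= 0.5 * track_width:
        reward = 0.1
    else:
        reward = 1e-3
    return float(reward)

print(reward_function({"track_width": 1.0, "distance_from_center": 0.05,
                       "all_wheels_on_track": True}))
```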
Autonomous Multi-Sensor Robot Simulation (GPS/IMU/LiDAR/2-DOF Vision) |
Link
A fully simulated 4-wheel autonomous robot equipped with GPS, 9-axis IMU, 2-D LiDAR, ultrasonic distance sensors, and a 2-DOF camera system (linear + rotary actuation). Implements global-position tracking, local mapping, object detection via camera streams, and reactive obstacle avoidance with minimal control logic. All sensing, actuation, and navigation behaviors are implemented inside the simulation stack.
4-wheel ground platform with independent velocity control for smooth turn/translation behavior.
GPS provides global (x,y) estimates; IMU provides orientation & angular velocity; LiDAR provides local ranging for free-space detection.
2-DOF camera module (linear rail + rotary joint) models active vision for object detection and viewpoint planning.
Distance sensors around the chassis give short-range obstacle feedback for collision-free local motion.
Robot supports simple teleoperation mappings (↑ ↓ ← → for locomotion; W/S/A/D for camera actuation) and autonomous wandering modes.
Designed as a baseline multi-sensor testbed for evaluating classic robotics behaviors without advanced SLAM or learning methods.
Repositories
Differential-Drive Kinematics & Odometry Robot
A two-wheel system where encoder increments produce linear and angular motion through standard differential-drive relations.
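Wheel displacements follow
\[ \Delta s_L = r\,\Delta\phi_L, \qquad \Delta s_R = r\,\Delta\phi_R, \]
with \(r = 0.025\ \text{m}\).
Linear and rotational increments arise from
\[ v = \frac{\Delta s_L + \Delta s_R}{2}, \qquad \omega = \frac{\Delta s_R - \Delta s_L}{b}, \]
where \(b = 0.09\ \text{m}\).
Pose \((x, y, \theta)\) updates through
\[ x' = x + v\cos\theta, \qquad y' = y + v\sin\theta, \qquad \theta' = \theta + \omega. \]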
Encoder drift directly affects integration of (Δx,Δy,Δθ), giving the accumulated trajectory purely from wheel motion.
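The update rule translates directly into a per-step odometry function. A minimal sketch using the wheel radius and axle length given above (the function name and the tuple-based pose are my choices):

```python
import math

R_WHEEL = 0.025  # wheel radius r (m)
B_AXLE = 0.09    # wheel separation b (m)

def odometry_step(pose, dphi_l, dphi_r):
    """Advance (x, y, theta) by one pair of encoder increments (radians)."""
    x, y, theta = pose
    ds_l = R_WHEEL * dphi_l
    ds_r = R_WHEEL * dphi_r
    v = (ds_l + ds_r) / 2           # linear increment
    omega = (ds_r - ds_l) / B_AXLE  # rotational increment
    return (x + v * math.cos(theta),
            y + v * math.sin(theta),
            theta + omega)

# Equal increments drive straight; opposite increments rotate in place.
print(odometry_step((0.0, 0.0, 0.0), 1.0, 1.0))
print(odometry_step((0.0, 0.0, 0.0), -1.0, 1.0))
```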
Line-Follower Robot
A two-sensor contrast system that adjusts wheel velocities according to inequalities between left and right reflectance values.
Let \(I_L\) and \(I_R\) denote the two IR readings. Straight motion occurs when
\[ |I_L - I_R| \approx 0. \]
Left steering is triggered by
\[ I_L > I_R, \qquad I_L \in [I_{\min}, I_{\max}], \]
implemented by reducing or reversing the left wheel.
Right steering is triggered by
\[ I_R > I_L, \qquad I_R \in [I_{\min}, I_{\max}], \]
applied symmetrically to the right wheel.
The motion law is a simple state determined by the ordering of sensor values:
\[ I_L \gtrless I_R \implies \text{turn left/right}, \qquad I_L \approx I_R \implies \text{forward}. \]
Obstacle-Avoidance Robot
A proximity-based motion rule where wheel speeds depend on whether any sensor exceeds a threshold.
Six sensors yield values \(p_i\). Forward motion holds when
\[ p_i \le \tau \quad \forall i, \]
for some threshold \(\tau\).
If any sensor satisfies
\[ p_j > \tau, \]
the left wheel reverses, creating a turning motion away from the detected obstacle.
The velocity pair \((v_L, v_R)\) is therefore piecewise: both wheels drive forward while \(\max_i p_i \le \tau\), and the left wheel reverses otherwise.
Exploration emerges purely from repeated evaluation of \(\max_i p_i\) and switching of wheel direction.
Wall-Follower Robot
A proximity-driven motion rule based on simple comparisons involving front-facing and left-side sensor values.
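Let \(f\) denote the front sensor reading and \(\ell\) the left sensor reading, with threshold \(\tau\):
\[ \text{front wall: } f > \tau, \qquad \text{left wall: } \ell > \tau. \]
Turning in place arises when
\[ f > \tau, \]
implemented as
\[ (v_L, v_R) = (v_{\max}, -v_{\max}). \]
Forward motion occurs when
\[ f \le \tau, \qquad \ell > \tau, \]
giving
\[ (v_L, v_R) = (v_{\max}, v_{\max}). \]
Left steering (curving back toward the wall) occurs when
\[ f \le \tau, \qquad \ell \le \tau, \]
with
\[ (v_L, v_R) = \left(\tfrac{1}{8} v_{\max},\ v_{\max}\right). \]
Position estimates use
\[ \Delta s = \frac{s_L + s_R}{2}, \qquad \theta = \frac{s_R - s_L}{d}, \]
\[ x' = x + \Delta s\cos\theta, \qquad y' = y + \Delta s\sin\theta, \]
allowing detection of when \((x, y)\) enters the target region.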
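The three wall-following cases map onto a small decision function. A sketch of the rule as stated, with hypothetical names; sensor readings are assumed to grow as a wall gets closer.

```python
def wall_follower_command(front, left, tau, v_max):
    """Return (v_left, v_right) for the three wall-following cases."""
    if front > tau:
        # Wall ahead: turn in place.
        return (v_max, -v_max)
    if left > tau:
        # Wall on the left, path clear: drive straight.
        return (v_max, v_max)
    # No wall ahead or to the left: slow the left wheel to curve.
    return (v_max / 8, v_max)

print(wall_follower_command(front=0.9, left=0.2, tau=0.5, v_max=6.0))
print(wall_follower_command(front=0.1, left=0.8, tau=0.5, v_max=6.0))
print(wall_follower_command(front=0.1, left=0.2, tau=0.5, v_max=6.0))
```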
Robotrix-2k25 - Stereo Vision Based 3D Hoop Control |
Link
A simulation-based control project developed for the Robotrix-2k25 finals. A ball is shot in random directions and with varying forces, and the robot must reposition a 3-axis hoop to intercept it. Ball position cannot be accessed directly; only two stereo cameras mounted on the backboard are available. Ball 3D position is reconstructed via color segmentation, stereo disparity, and camera→world transforms, followed by 3D trajectory prediction and PID-driven actuator control.
Stereo Vision + 3D Reconstruction + Predictive Control Workflow
Detect the ball using HSV color filtering on two synchronized camera frames.
Extract pixel centroids from both sensors and compute disparity d=xl−xr.
Recover depth using the stereo pinhole model Z=(fB)/d.
Reconstruct (X,Y,Z) in camera coordinates and map into world frame using fixed transforms.
Estimate velocity from successive frames and predict future positions using projectile equations.
Command the hoop actuators through 3 PID controllers (X, Y, Z) to align with the predicted intercept point.
Loop at camera frame rate until shot completes.
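The reconstruction and prediction steps above can be sketched as follows. This assumes rectified pinhole cameras with focal length `f_px` (pixels), principal point `(cx, cy)`, a world frame with gravity along −z, and no drag; those parameters and all function names are illustrative, and the camera→world transform is omitted.

```python
def triangulate(xl, yl, xr, f_px, baseline_m, cx, cy):
    """Recover (X, Y, Z) in the left-camera frame from matched centroids."""
    d = xl - xr                      # disparity in pixels
    if d <= 0:
        raise ValueError("non-positive disparity: point cannot be triangulated")
    z = f_px * baseline_m / d        # Z = f*B / d
    x = (xl - cx) * z / f_px
    y = (yl - cy) * z / f_px
    return (x, y, z)

def estimate_velocity(p_prev, p_curr, dt):
    """Finite-difference velocity between two successive world positions."""
    return tuple((c - p) / dt for p, c in zip(p_prev, p_curr))

def predict(p, v, t, g=9.81):
    """Ballistic prediction in a z-up world frame after t seconds."""
    return (p[0] + v[0] * t,
            p[1] + v[1] * t,
            p[2] + v[2] * t - 0.5 * g * t * t)

# Example: 40 px disparity with f = 800 px and B = 0.1 m gives Z = 2 m.
print(triangulate(420, 240, 380, f_px=800, baseline_m=0.1, cx=320, cy=240))
```

The predicted position at the time the ball crosses the hoop plane would then serve as the setpoint for the three PID loops.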
Technical Summary
Ball detection uses HSV thresholding around the known orange color signature:
A vision-based system for automated grocery-item quality and quantity assessment. The solution integrates a unified dataset (multiple Roboflow sources aggregated and re-annotated) and a consolidated training run using a CNN-based detector (YOLOv7 backbone). The pipeline evaluates produce freshness, packaging correctness, text/OCR extraction, and item count/brand verification.
Core Capabilities Implemented
Constructed a merged multi-domain dataset (FMCG, produce, OTC, personal care, household items) using Roboflow pipelines; standardized annotations across label types.
Designed a complete smart vision quality pipeline following the GRiD specification:
High-resolution image acquisition with normalization, light balancing, and noise suppression.
Preprocessing: brightness/contrast normalization, color correction, background segmentation.
OCR extraction using contour-guided ROI selection for brand name, pack size, label info, MRP, expiration dates.
Freshness scoring for fruits/vegetables via color-shift analysis, texture deviation metrics, spoilage cue detection, and geometric deformation checks.
Classification using deep CNN feature embeddings + auxiliary SVM for edge cases requiring shallow decision boundaries.
Brand recognition & count estimation using object-level shape/size features, multi-crop inference, and IR-style logical rules (simulated) as required by the event’s Use Case 3.
Ran OCR confidence filtering and text-normalization passes.
Designed a decision engine cross-checking extracted attributes against the product database (brand, freshness index, label validity, count correctness).
Added a continuous data logging + feedback loop as recommended in event guidelines for improving classification reliability over time.
Submitted simulation videos demonstrating:
OCR output validation
Freshness detection on vegetables/fruits
Packaging/label integrity checks
Count & product-category recognition
Technical Summary
The system follows the GRiD 6.0 Smart Vision architecture:
Preprocessing Pipeline: Images undergo intensity normalization, edge-aware smoothing, and segmentation to isolate foreground products. This supports text regions, geometric features, and surface attributes required for OCR and quality scoring.
Feature Extraction: Text regions are processed using OCR; geometric features (edges, contours, size ratios), color-space transformations, and texture descriptors support defect/freshness detection. Deep CNN embeddings from the trained model are used for brand/category classification, while SVM layers assist with high-similarity items.
Classification and Decision Rules: Outputs are checked against a product database for correctness. Freshness of produce uses color variance, texture irregularities, bruise signatures, and abnormal shape metrics. Count estimation uses object-level consistency checks aligned with the event’s “IR-based counting” specification.
Output & Feedback: Detected attributes (brand, count, OCR text, expiry date, freshness index) are logged. A feedback loop stores misclassified samples for incremental dataset improvement.
Hey, I'm Jagadeesh
I work mainly on digital/analog hw design: RTL, ckt-lvl & basic arch
Learning open-source IC flows; building/testing small HW blocks
I also work with MCU/SBCs in basic sys-design for robotics/prototyping
Active in comps, open-source, hw/sw co-design challenges
My Key Interests:
- Analog and Digital Circuit Design
- RTL Verification, Debugging, and Validation
- Compute and Signal Processing Systems
- Amplifiers, Oscillators, and Control
- Hardware Architecture and Microarchitecture
- Robotics and Computer Vision
Feel free to check out my projects above,
├─nd if you’re interested in collaborating/discussing HW/ML/robotics,
└─let’s connect :)
வணக்கம் உலகம்! This is Jagadeesh.
Design of High-Performance Q1.7 Fixed-Point Quantized CNN Hardware Accelerator with Microarchitecture Optimization of 3-Stage Pipelined Systolic MAC Arrays for Lightweight Inference
A fully synthesizable INT8 CNN accelerator for CIFAR-10, built around a 3-stage pipelined systolic-array MAC microarchitecture optimized from a 6-design PPA benchmark. Includes a complete quantization workflow (PTQ/QAT), a 2-cycle ready/valid protocol, ROM automation, FP32↔RTL accuracy checks, and a hardware image-processing suite.
Current Project Overview
Duration: Individual, Ongoing
Tools: Verilog (Icarus Verilog, Yosys) | Python (TensorFlow, NumPy) | Scripting (TCL, Perl)
8-bit Quantized CNN Hardware Accelerator: Open-source, Modular, & Optimized for Inference
Project Link Verilog | Basic Architecture | Digital Electronics
Designed a fully synthesizable shallow Res-CNN for CIFAR-10, evaluated against eight reference CNNs, and achieved a Pareto-optimal trade-off across throughput, latency, and accuracy.
Implemented systolic-array processing elements using 8-bit CSA–MBE MAC units with FSM-driven control logic and a 2-cycle ready/valid handshake. Verified end-to-end datapath behavior through structured testbenches.
Performed detailed post-training quantization (PTQ) and quantization-aware training (QAT) studies. Quantizing from Q1.31 to Q1.3 allowed exploration of precision vs. accuracy trends, and a Q1.7 PTQ model retained ~84% accuracy (less than 1% drop) while reducing the model memory footprint by 4× (≈52 kB).
Developed automated scripts (TCL + Python) to generate 14 coefficient ROMs and 3 RGB input ROMs, enabling seamless hardware ingestion of model parameters and images. Verified TensorFlow / FP32 ↔ RTL output consistency and automated full-pipeline inference execution.
Built a compact digital image-processing toolkit (edge detection, denoising, filtering, enhancement) and an MLP classifier for (E)MNIST datasets. Achieved >75% accuracy with real-time GUI visualization for interactive experimentation.
Technical Summary
Designed a fully synthesizable INT8 CNN accelerator (Q1.7 PTQ) for CIFAR-10, optimized for throughput, latency determinism, and precision efficiency. Implemented a 2-cycle ready/valid handshake for all inter-module transactions and FSM-based control sequencing for deterministic pipeline timing. Trained 8 CNNs (TensorFlow, identical augmentation & LR scheduling w/ vanilla Adam optimizer) and performed architecture-level DSE via Pareto analysis, selecting 2 optimal variants including a ResNet-style residual CNN.
PTQ/QAT comparisons were conducted across Q1.31, Q1.15, Q1.7, and Q1.3. Q1.7 PTQ (1 integer bit, 7 fractional bits; step 2⁻⁷ ≈ 0.0078) gave the best accuracy–memory trade-off (Q1.31: ~84%, ~210 kB | Q1.7: ~83%, ~52 kB | Q1.3: ~78%, ~26 kB), achieving ~84% top-1 accuracy, <1% loss, and ≈52 kB total (≈17×3 kB RGB inputs).
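For intuition, Q1.7 quantization maps a real value to an 8-bit signed code with a step of 2⁻⁷, saturating at the ends of the range. The sketch below is a generic fixed-point helper, not the project's PTQ script; it uses Python's built-in `round`, which rounds ties to even, and the actual flow may round differently.

```python
FRAC_BITS = 7
SCALE = 1 << FRAC_BITS        # 128 codes per unit
Q_MIN, Q_MAX = -128, 127      # signed 8-bit code range

def quantize_q1_7(x):
    """Map a real value to a saturated Q1.7 integer code."""
    code = round(x * SCALE)
    return max(Q_MIN, min(Q_MAX, code))

def dequantize_q1_7(code):
    """Map a Q1.7 integer code back to its real value."""
    return code / SCALE

for value in (0.5, -0.3, 1.0, -1.0):
    code = quantize_q1_7(value)
    print(f"{value:+.4f} -> code {code:+4d} -> {dequantize_q1_7(code):+.7f}")

# Going from 32-bit to 8-bit words shrinks storage by 4x: ~210 kB -> ~52.5 kB.
print(210 / 4)
```

Values at or above +1.0 saturate to 127/128, which is why weights are usually scaled into [-1, 1) before conversion.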
The 3-stage pipelined systolic-array convolution core employs Processing Elements (PEs) built around MAC units composed of 8-bit signed Carry-Save Adders (CSA) and Modified Booth-Encoded (MBE) multipliers, arranged in a 2D grid for high spatial reuse and single-cycle accumulation. All 14 coefficient ROMs and 3 RGB input ROMs were auto-generated via a Python/TCL automation flow handling coefficient quantization, packing, and open-source EDA simulation. Verified bit-accurate correlation between TensorFlow FP32 and RTL fixed-point inference layer-wise; an IEEE-754 single-precision CNN variant validated numeric consistency.
Integrated image-processing modules (edge detection, denoising, filtering, contrast enhancement) form a Verilog-based hardware preprocessing pipeline, feeding an MLP classifier evaluated on the (E)MNIST (52+)10 ByClass datasets. The MLP shares the preprocessing and automation flow, with an additional IEEE-754 64-bit FP variant for precision benchmarking. A Tkinter GUI enables interactive character input, and preprocessing visualization via Matplotlib
High-Speed 3-Stage Pipelined Systolic Array-Based MAC Architectures
Digital Logic Design | Synthesis
Benchmarked six different 8-bit signed adder–multiplier architectures using PPA metrics (power, performance, and area) on the Sky130 PDK, and analyzed their architectural trade-offs in the context of low-power convolution workloads.
Designed a 3-stage pipelined systolic MAC based on a CSA–MBE multiplier structure, achieving substantial improvements over a naïve conv3 baseline: 66.3% lower delay, 3.1× higher area efficiency, and 82.2% lower typical power.
Implemented a 2D systolic PE-grid supporting general convolution and GEMM operations, with verified behavior under zero-padding and same-padding configurations. GEMM optimization reduced power consumption by 44.6% for matrix dimension (N = 3).
Integrated a 648-bit scan chain spanning all pipeline and control registers, enabling full DFT/ATPG support with only 14.5% cell-area overhead, ensuring manufacturability and high test coverage.
Technical Summary
Benchmarked six 8-bit signed adder and multiplier architectures for systolic-array MACs targeting CNN/GEMM workloads using a fully open-source ASIC flow (Yosys + OpenROAD/OpenLane) on the Google-SkyWater 130nm PDK (Sky130HS PDK @25°C_1.8V). Evaluated PPA (Power, Performance, Area) and latency/throughput/area metrics under a constant synthesis and layout environment with fixed constraints and floorplan parameters (FP_CORE_UTIL = 30 %, PL_TARGET_DENSITY = 0.36, 10 ns clock, CTS/LVS/DRC/Antenna enabled)
Adders:
Multipliers:
Final MAC integrates an 8-bit signed CSA adder and 8-bit signed MBE multiplier in a 3×3 convolution/GEMM core using a 3-stage pipelined systolic array (sampling → truncation/flipping → MAC accumulation). Verified via RTL testbench and post-synthesis timing across zero/same-padding modes. Automated GDS/DEF generation and PPA reporting for all architectures ensured fully reproducible, environment-consistent results
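As a rough illustration of what the Modified Booth-Encoded (MBE) multiplier computes, the sketch below models radix-4 Booth recoding for 8-bit signed operands in software. It is a behavioral model only: the function names are hypothetical, the carry-save accumulation of partial products is replaced by a plain sum, and nothing here reflects the actual RTL structure.

```python
def booth_radix4_digits(b: int, bits: int = 8):
    """Recode a signed `bits`-wide multiplier into radix-4 Booth digits in {-2..2}."""
    assert bits % 2 == 0
    assert -(1 << (bits - 1)) <= b < (1 << (bits - 1))
    u = (b & ((1 << bits) - 1)) << 1   # two's-complement pattern with implicit b[-1] = 0
    digits = []
    for i in range(bits // 2):
        triplet = (u >> (2 * i)) & 0b111   # bits b[2i+1], b[2i], b[2i-1]
        b_m1 = triplet & 1
        b_0 = (triplet >> 1) & 1
        b_p1 = (triplet >> 2) & 1
        digits.append(-2 * b_p1 + b_0 + b_m1)
    return digits

def mbe_multiply(a: int, b: int, bits: int = 8) -> int:
    """Sum the bits/2 partial products d_i * a * 4^i (a CSA tree would do this in hardware)."""
    return sum(d * a * (4 ** i) for i, d in enumerate(booth_radix4_digits(b, bits)))

def mac(acc: int, a: int, b: int) -> int:
    """One multiply-accumulate step built on the Booth model."""
    return acc + mbe_multiply(a, b)
```

The point of the recoding is that an 8-bit multiplier needs only four partial products instead of eight, each being 0, ±a, or ±2a.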
Comparative study (Small Scale Ops) :
CSA-MBE pair Systolic Array Conv vs Naïve Conv (3x3 Kernel on 5x5 Image @CLK_PERIOD_20_ns)
Single MAC re-use vs Systolic 4-PE Grid (2x2 matrix multiplication)
Single MAC re-use vs Systolic 9-PE Grid (3x3 matrix multiplication)
DFT (Scan Chain) Add-On
Integrated a 648-bit full-scan chain across all pipeline, control, and output registers in the 3×3 systolic MAC/convolution datapath. Every state element is replaced with a single-bit scan DFF (SE/SI/SO), enabling serial load/unload of internal state.
Flattened register groups (kernel flip buffer, px16/px8 slices, ker8 slices, prod_s2 pipeline, row/col counters, valid pipe, output registers, done flag) into one contiguous chain, ensuring deterministic bit ordering and simple scan stitching.
Verified scan behavior through shift–capture–shift TB patterns and ensured functional transparency: scan mode (SE=1) freezes datapath updates, while normal mode (SE=0) preserves original functionality.
Total overhead from scan insertion is ~14.5% in standard cells, with no change to functional timing or systolic pipeline throughput at the target clock.
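The shift–capture–shift sequence can be pictured with a small behavioral model of one serial chain. This is a sketch under assumptions: the `ScanChain` class is hypothetical, the bit ordering is arbitrary, and the capture step uses a stand-in function (bitwise inversion) instead of the real datapath logic.

```python
class ScanChain:
    """Behavioral model of a single serial scan chain."""

    def __init__(self, length: int):
        self.regs = [0] * length   # regs[0] is nearest SI, regs[-1] drives SO

    def shift(self, si: int) -> int:
        """One clock with SE=1: every flop takes its neighbour's value."""
        so = self.regs[-1]
        self.regs = [si] + self.regs[:-1]
        return so

    def load(self, pattern):
        """Shift in a full pattern so that regs == pattern afterwards."""
        for bit in reversed(pattern):
            self.shift(bit)

    def capture(self, next_state):
        """One clock with SE=0: flops capture the functional next state."""
        self.regs = list(next_state(self.regs))

    def unload(self):
        """Shift out all bits (refilling with zeros) and return them in chain order."""
        return [self.shift(0) for _ in range(len(self.regs))][::-1]

chain = ScanChain(648)
pattern = [i % 2 for i in range(648)]
chain.load(pattern)
chain.capture(lambda regs: [b ^ 1 for b in regs])   # stand-in for the real logic
observed = chain.unload()
```

Loading and unloading a 648-bit chain each take 648 shift clocks, which is the usual cost of a single long chain compared with several shorter ones.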
Repositories
ViSiON – Verilog for Image Processing and Simulation-based Inference Of Neural Networks
This repo includes all related projects as submodules in one place
ImProVe – IMage PROcessing using VErilog: A collection of image processing algorithms implemented in Verilog, including geometric transformations, color space conversions, and other foundational operations.
NeVer – NEural NEtwork on VERilog: A hardware-implemented MLP in Verilog for character recognition on (E)MNIST, alongside a lightweight CNN for CIFAR-10 image classification
MOVe – Math Ops in VErilog
CORDIC Algorithm – Implements Coordinate Rotation Digital Computer (CORDIC) algorithms in Verilog for efficient hardware-based calculation of sine, cosine, tangent, square root, magnitude, and more.
Systolic Array Matrix Multiplication – Verilog implementation of matrix multiplication using systolic arrays to enable parallel computation and hardware-level performance optimization. Each processing element leverages a Multiply-Accumulate (MAC) unit for core operations.
Hardware Multiply-Accumulate Unit – Implements and compares 8-bit multipliers and 8-bit adders in synthesizable Verilog, analyzing their area, timing, and power characteristics in MAC datapath architectures.
Posit Arithmetic (Python) – Currently using fixed-point arithmetic; considering Posit as an alternative to IEEE 754 for better precision and dynamic range. Still working through the trade-off.
Storage and Buffer Modules
RAM1KB – A 1KB (1024 x 8-bit) memory module in Verilog with write-once locking for even addresses. Includes a randomized testbench. Also forms the base for a ROM3KB variant to store 32×32 RGB CIFAR-10 image data.
FIFO Buffer – Synchronous FIFO with parameterizable depth, single clock domain, and standard full/empty flag logic.
RISC-V & MIPS Microarchitectures - SC / MC / Pipelined / Dual-Issue Superscalar | Link
• Built a two-wide in-order superscalar RISC processor with parallel IF–ID–EX–MEM–WB lanes and independent pipeline registers per lane
• Designed dual 16-bit instruction fetch per cycle with inter-lane dependency checks, hazard suppression, and load-use stall handling
• Implemented 4R2W register file, RAW/WAW detection, branch squashing, & multi-port memory for concurrent fetch and data access
• Evaluated SC/MC/5-stage pipelined designs via directed programs (assembly with RISCV GCC Toolchain & QEMU reference), analyzing CPI (1/3.8/1.6), cycle counts, and hazard overhead
RV32I RISC-V Core (TL-Verilog, Single-Cycle Implementation)
Tools: Makerchip | TL-Verilog | Verilator
x0 = 0 invariance.
x30 and x31.
RV32I RISC-V Core (Verilog, Single-Cycle Implementation)
Tools: Verilog | Icarus Verilog | ModelSim | Quartus Prime
LB, LBU, LH, LHU, LW, SB, SH, SW, with correct zero/sign extension and RMW correctness.
MIPS Microarchitectures - SC / MC / 5-Stage Pipeline
Tools: Verilog | Icarus Verilog | ModelSim | GTKWave
Implemented three 32-bit MIPS processors:
Single-Cycle: CPI = 1.0, PC increments by +4 each cycle.
Multi-Cycle: Instruction class cycle counts:
Pipeline (IF–ID–EX–MEM–WB): forwarding, hazard detection, 1-cycle load-use stall, 1-cycle taken-branch flush. → Benchmark CPI ≈ 1.1–1.2
Executed Harris & Harris benchmark (18 instructions). Correctly wrote 0x00000007 to memory addresses 0x50 and 0x54.
Included self-checking benches, .mem loading infrastructure, full waveforms, and verification logs.
Comparison Table RISCV
Comparison Table MIPS
Dual-Issue 16-bit RISC Superscalar Processor (In-Order)
Tools: Verilog | Icarus Verilog | GTKWave
r0 = 0 hardwired.
Repositories
Pipelined ALU with Scan-Chain Integration | Link
• Designed non-pipelined/pipelined/scan-enabled 4-stage ALU; pipeline FFs replaced with scan FFs for scan-in/capture/scan-out
• Gate-level timing analysis in Yosys/OpenSTA (Sky130) with clock uncertainty, I/O delays, & input slew; ∼1.7x fmax gain with pipelining
• RTL2GDS flow: scan vs no-scan, single vs dual scan, CTS skew tightening, util/density & floorplan stress; closed timing throughout
• Analyzed IO-driven routing effects; worst-case pinning increased clock wire length by >2x despite CTS/placement optimization
• Recovered clock routing via pin-arch optimization (>50% clk WL ↓); PDN stress at signoff showed +21% total & +58% switching power
A small arithmetic and comparison unit used to quantitatively study the effects of pipelining, scan-chain insertion, and physical design constraints on area, timing, routing, clocking, and power. The project compares RTL-only variants using Yosys and OpenSTA, and extends the scan-pipelined design through full RTL-to-GDS physical design using OpenLane (Sky130).
Design Variants
All variants implement identical arithmetic and comparison logic; only register, scan, and physical constraints differ.
Synthesis & Area Results (Yosys)
Area overhead
Scan insertion adds ~3 logic gates per flip-flop, visible as increased NAND/OR cell counts.
Timing Results (OpenSTA, baseline constraints)
Hold timing is clean in all variants; scan logic increases minimum delay and improves hold margin.
Pessimistic STA (Interface-Aware Constraints)
Constraints added:
Physical Design Evolution (RTL → GDS, OpenLane / Sky130)
The scan-pipelined variant was progressively stressed and optimized through ten controlled PD experiments (E1–E10). Each experiment modified one dominant physical knob, while maintaining a 4.0 ns clock and clean signoff unless explicitly stated.
E1–E10 PD Experiment Summary
Routing Geometry Across PD Stages
Timing Closure (Post-Route / Signoff)
Power Impact (Signoff, RCX)
CTS-stage power remained unchanged; PDN stress manifested only at signoff.
Extended Key Takeaways (RTL + PD)
Repository
AHB–APB Bridge with Self-Checking Verification | Link
• Designed a parameterizable AHB-Lite to APB bridge with FSM-based control supporting single & burst read/write transactions
• Implemented address/data latching, write buffering, read return, and burst sequencing, handling pipelined and non-pipelined accesses
• Built a self-checking SV testbench with macro-controlled test modes (single/burst R/W) and assertion-based data validation
• Verified protocol correctness across all transaction types; additionally designed & verified standalone (I2C/SPI/UART) peripheral controllers
A parameterizable AHB-Lite to APB bridge implemented in SystemVerilog, with a self-checking verification environment. The design translates AHB transactions into APB protocol sequences using FSM-based control and supports both single and burst transfers.
AHB–APB Bridge – RTL Design & Verification
Tools: SystemVerilog | Icarus Verilog | GTKWave
Designed a configurable AHB-Lite → APB bridge supporting single and burst read/write transactions.
Implemented FSM-based control logic handling address/data latching, write buffering, read-data return, and burst sequencing.
Supports pipelined and non-pipelined AHB accesses, with correct handling of transaction type and transfer continuation.
Built a self-checking SystemVerilog testbench with compile-time macro selection for:
Integrated assertion-based data validation, protocol correctness checks, and automatic PASS/FAIL reporting.
Verified functionality using Icarus Verilog, with per-test VCD waveform generation for debug and inspection.
Repository
Design & Formal Verification of Parameterizable Fixed-Point CORDIC IP | Link
• Implemented shift-add datapath with all 6 modes rotation/vectoring (circular/linear/hyperbolic); width/iter/angle frac/output width–shift scaling swept across configs
• Built trig/mag/atan2/mul/div/exp wrappers; observed ∼e-5 RMS (@32b, 16iter) baseline vs double-precision references
• Proved handshake, deadlock-free bounded liveness, range safety, symmetry & monotonicity via SystemVerilog assertions (SymbiYosys/Yices2)
• Auto-generated atan tables & param files via Python; FuseSoC-packaged core with documented sensitivity, error trends & failure regions
• Built drop-in core variants (pipelined/SIMD/multi-issue); implemented a QAM16 demodulator using the CORDIC core
A fully synthesizable, parameterizable fixed-point CORDIC soft IP supporting rotation and vectoring modes, with dedicated wrappers for trigonometric and vector operations.
The project emphasizes formal verification of control, protocol, and mathematical structure, complemented by simulation-based accuracy characterization and parameter sensitivity analysis.
Project Overview
Duration: Individual
Tools: Verilog | SystemVerilog Assertions (SVA) | SymbiYosys (Yices2) | Icarus Verilog | Python | FuseSoC
CORDIC IP – Architecture & Capabilities
Implemented a shift–add CORDIC core supporting 6 operating modes, covering
2 algorithms × 3 coordinate systems:
Algorithms
Architectures
This enables the following functional coverage:
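Circular: sin, cos, tan, atan2, magnitude
Linear: mult, div
Hyperbolic: sinh, cosh, tanh, exp, ln
Linear rotation: mult
Linear vectoring: div
Circular rotation: sin, cos, tan
Circular vectoring: atan2, magnitude
Hyperbolic rotation: sinh, cosh, tanh, exp
Hyperbolic vectoring: ln
Designed a single iterative datapath reused across all modes, with: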
ITER + 1 cycles)
start / busy / valid handshake
Fully parameterized design:
Built standalone wrapper modules for common functions that apply:
while preserving a uniform control interface across all operations.
Extended functionality beyond trigonometric primitives to include:
exp, ln
mult, div
sinh, cosh, tanh
achieving ~1e-5 absolute precision under typical configurations.
Auto-generated lookup tables and parameter headers using Python for:
Packaged the core and wrappers using FuseSoC, with documented:
Accuracy Characterization & Parameter Sensitivity
Circular rotation mode (sin/cos) converges exponentially with iteration count until limited by fixed-point quantization.
With WIDTH = 32, ITER = 16, and 30-bit angle precision:
Circular vectoring mode (magnitude) shows similar convergence:
WIDTH = 32, ITER = 16
Linear mode (mult/div) reaches target precision rapidly:
Hyperbolic mode (exp, sinh, cosh, tanh, ln):
Tangent and atan2 remain numerically ill-conditioned:
tan(x) error dominated by 1 / cos(x) near ±π/2
atan2(y,x) sensitive near (x,y) → (0,0)
Identified and documented deterministic numerical failure regions:
Formal Verification
The core iterative datapath and control logic have been formally verified:
Formal analysis is focused on the base CORDIC iteration engine and control FSM.
Extended mathematical functions (linear and hyperbolic modes, transcendental wrappers) are designed and validated through simulation and numerical analysis, and are intentionally out of scope for formal proof.
Drop-In CORDIC Cores
Family A — Control-Oriented / Iterative
Family B — Throughput-Oriented / Feed-Forward
Family C — SIMD / Spatial / Vectorized
Family D — Multi-Issue / Time-Interleaved
Technical Summary
The CORDIC core implements an iterative micro-rotation algorithm using only shifts and additions.
Internal state updates follow the standard recurrence:
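Rotation mode (direction set by the sign of $z_i$):
$$x_{i+1} = x_i - d_i\,(y_i \gg i)$$
$$y_{i+1} = y_i + d_i\,(x_i \gg i)$$
$$z_{i+1} = z_i - d_i\,\arctan(2^{-i})$$
$$d_i = \begin{cases} +1 & \text{if } z_i \ge 0 \\ -1 & \text{if } z_i < 0 \end{cases}$$
Vectoring mode (direction set by the sign of $y_i$):
$$x_{i+1} = x_i + d_i\,(y_i \gg i)$$
$$y_{i+1} = y_i - d_i\,(x_i \gg i)$$
$$z_{i+1} = z_i + d_i\,\arctan(2^{-i})$$
$$d_i = \begin{cases} +1 & \text{if } y_i \ge 0 \\ -1 & \text{if } y_i < 0 \end{cases}$$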
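The two recurrences can be exercised with a short integer-only software model. This is a sketch, not the IP itself: the function names, the 30 fractional bits, and the way the gain is compensated (pre-scaling in rotation mode, post-scaling in vectoring mode) are assumptions chosen for illustration.

```python
import math

def _tables(iters: int, frac_bits: int):
    """Fixed-point unit, arctan(2^-i) table, and the CORDIC gain for `iters` steps."""
    one = 1 << frac_bits
    atan_table = [round(math.atan(2.0 ** -i) * one) for i in range(iters)]
    gain = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(iters))
    return one, atan_table, gain

def cordic_rotate(angle: float, iters: int = 16, frac_bits: int = 30):
    """Circular rotation mode: returns (cos, sin) for |angle| within about 1.74 rad."""
    one, atan_table, gain = _tables(iters, frac_bits)
    x, y, z = round(one / gain), 0, round(angle * one)   # gain-compensated start vector
    for i in range(iters):
        d = 1 if z >= 0 else -1
        x, y, z = x - d * (y >> i), y + d * (x >> i), z - d * atan_table[i]
    return x / one, y / one

def cordic_vector(x0: float, y0: float, iters: int = 16, frac_bits: int = 30):
    """Circular vectoring mode: returns (magnitude, angle) for x0 > 0."""
    one, atan_table, gain = _tables(iters, frac_bits)
    x, y, z = round(x0 * one), round(y0 * one), 0
    for i in range(iters):
        d = 1 if y >= 0 else -1
        x, y, z = x + d * (y >> i), y - d * (x >> i), z + d * atan_table[i]
    return x / one / gain, z / one

cos_v, sin_v = cordic_rotate(0.5)
mag, ang = cordic_vector(0.3, 0.4)
```

With 16 iterations the residual angle is bounded by roughly arctan(2⁻¹⁵) ≈ 3×10⁻⁵, so errors of that order are expected from this model. Python's `>>` on negative integers is an arithmetic shift, which matches the signed shift a hardware datapath would use.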
The datapath is fully parameterized by:
WIDTH)
ITER)
Wrappers & Scaling Semantics
Wrappers initialize the core with gain-compensated vectors and extract results via deterministic right shifts:
sin / cos: final y / x outputs
tan: extended-precision ratio output
mag: final x magnitude
atan2: accumulated angle output
Correct scaling is mandatory; mismatched shifts reproducibly cause saturation, amplitude loss, or numerical collapse.
Formal Verification Strategy
Formal verification focuses on structural correctness, not numerical accuracy.
Common properties (all modules):
start and valid
≤ ITER + 1 cycles)
busy
Rotation mode assertions:
sin(-x), cos(-x))
Vectoring mode assertions:
All properties are written in purely parametric SVA, proven using SymbiYosys + Yices2, and scale automatically with configuration.
Numerical accuracy is intentionally validated via simulation and reference comparison, not formal proof.
Repository
Peripheral Serial Communication Protocols - I2C / SPI / UART-TX Link
Implemented a collection of peripheral serial communication interfaces in Verilog, focusing on synthesizable, parameterizable controllers suitable for FPGA/ASIC integration. Each protocol includes a cleanly modularized interface, configurable timing parameters, and testbench-driven validation with waveform inspection.
I2C Master Controller – Implements a single-master, multi-slave I²C bus supporting standard-mode timings, programmable SCL low/high periods, ACK/NACK handling, and clock stretching detection. Features deterministic START/STOP generation, byte-wise transfers, and address+data framing logic.
SPI Master (Modes 0–3) – Supports CPOL/CPHA mode selection, 8-bit full-duplex transfers, configurable SCLK division, selectable slave-select behavior, and MSB-first shifting. Designed with a compact FSM and separate TX/RX shift registers for predictable cycle behavior.
UART TX Soft-Core IP – Lightweight serial transmitter with baud-rate generator, start/stop framing, data-valid gating, and FIFO-less single-byte serialization. Fully synthesizable and intended as a drop-in peripheral for SoCs, teaching cores, or FPGA peripheral sets.
Repositories
Basic Python Tool for ISCAS’85/’89 Benchmark Analysis & Fault-Modeling | Link
A work-in-progress open-source Python package implementing foundational DFT/Fault-modeling utilities for ISCAS’85 and ISCAS’89 benchmark circuits.
Includes automatic Verilog netlist generation, random testbench creation, serial/parallel fault simulation, fault collapsing, SCOAP metric computation, and initial ATPG experimentation. PODEM implementation is under development.
.v) files and matching randomized testbenches from ISCAS netlists; injects stuck-at faults using ID-indexed assign overrides appended to the generated modules.
Repositories
Predictive ML EDA Framework for Routability Congestion Prediction | Link
WORK IN PROGRESS
Repository
FIR Accelerator for Microwatt | OpenPOWER Hardware Hackathon Link
A parameterizable FIR filter acceleration block designed for Microwatt/OpenFrame systems. Implements a sequential multiply-accumulate datapath, programmable coefficients, and a clean Wishbone-Lite register interface for CPU control. The design was accepted for potential fabrication during the hackathon review. The final taped-out submission was not completed due to timing constraints.
FIR Accelerator - Sequential Fixed-Point Filtering Unit
Duration: Hackathon submission Tools: Verilog | Icarus Verilog | Microwatt/OpenFrame
Technical Summary
The FIR accelerator directly implements the discrete-time convolution
$$y[n] = \sum_{k=0}^{N-1} h[k]\,x[n-k]$$
using a cycle-by-cycle accumulation loop driven by an internal finite-state controller.
Internal State Representation
Let
Sequential MAC Update
Each tap contributes $A_{i+1} = A_i + X_i \cdot H_i$.
At the start of each computation $A_0 = 0$.
Output Formation
After all $N$ taps have been processed the output is $y[n] = \mathrm{sat}_W(A_N)$, where $\mathrm{sat}_W(\cdot)$ denotes saturation to the configured output bit width $W$.
Control Semantics
The number of cycles per output is exactly $N + c$, where $c$ is a small constant FSM overhead.
Fixed-Point Behavior
Let input samples use format $Q_d$ and coefficients use $Q_c$. The accumulator width satisfies
$$A_W \ge d + c + \lceil \log_2 N \rceil$$
to avoid intermediate overflow before the final saturation step. No dynamic scaling or normalization is applied; the arithmetic is strictly linear.
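A small software model makes the accumulate-then-saturate behavior and the width bound concrete. This is a sketch under assumptions: the function names are hypothetical, $d$ and $c$ are treated as the signed word widths of samples and coefficients (here $c$ is the coefficient width, not the FSM overhead constant), and two's-complement saturation is assumed for $\mathrm{sat}_W$.

```python
import math

def acc_width(d: int, c: int, n_taps: int) -> int:
    """Minimum accumulator width from A_W >= d + c + ceil(log2 N)."""
    return d + c + math.ceil(math.log2(n_taps))

def saturate(value: int, width: int) -> int:
    """Clamp to the signed two's-complement range of `width` bits."""
    hi = (1 << (width - 1)) - 1
    lo = -(1 << (width - 1))
    return max(lo, min(hi, value))

def fir_output(samples, coeffs, out_width: int) -> int:
    """One output y[n]; samples[k] holds x[n-k], coeffs[k] holds h[k]."""
    acc = 0                              # A_0 = 0
    for x, h in zip(samples, coeffs):    # one tap per cycle: A_{i+1} = A_i + X_i * H_i
        acc += x * h
    return saturate(acc, out_width)      # y[n] = sat_W(A_N)

width_8tap = acc_width(8, 8, 8)
y_small = fir_output([1, 2, 3], [4, 5, 6], 16)
y_clipped = fir_output([127] * 4, [127] * 4, 16)
```

For 8-bit samples, 8-bit coefficients, and 8 taps the bound gives a 19-bit accumulator; the last call shows a sum (64516) that fits the accumulator but is clamped at the 16-bit output.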
Repository
ASIC RTL2GDS Flow Projects | Link 1 | Link 2
A collection of beginner-friendly ASIC design experiments where I am still learning the full RTL to GDS flow. I automated synthesis steps, generated schematic views, pushed complete flows through OpenLane and OpenROAD, produced layout snapshots, and verified the final GDS outputs. Both designs behave as compact test vehicles for understanding physical design stages and polishing my flow setup.
ASIC RTL2GDS Practice Projects
Duration: Individual Tools: Yosys | OpenLane | OpenROAD | Magic | KLayout
Repository Cards
SHA256 Core Functional Verification | Link
A focused functional verification environment built around the open-source SHA256 core from secworks. I created a structured suite of directed, random, corner-case, and negative fail-case tests, automated through a TCL-driven flow that compiles, runs, checks, and aggregates results. The goal was to push the core through a wide coverage surface, verify digest correctness, exercise interface timing, and validate the robustness of the verification infrastructure itself.
Verification Summary
Duration: Individual Tools: Verilog | Icarus Verilog | TCL automation
Technical Summary
The SHA-256 core is verified directly against the mathematical definition of its compression function.
Each testbench checks that the DUT produces digests matching the iterative hashing rule
$$H^{(0)} = H_{IV}$$
and for each message block M(i):
$$H^{(i+1)} = F\left(H^{(i)}, M^{(i)}\right)$$
Round Computation
Each round computes two temporary values:
$$T_1 = h + \Sigma_1(e) + \mathrm{Ch}(e, f, g) + K_t + W_t$$
$$T_2 = \Sigma_0(a) + \mathrm{Maj}(a, b, c)$$
Then the working registers update as:
$$\begin{aligned}
h &= g \\
g &= f \\
f &= e \\
e &= d + T_1 \\
d &= c \\
c &= b \\
b &= a \\
a &= T_1 + T_2
\end{aligned}$$
Message Schedule
For each block the scheduler generates:
$$W_t = \begin{cases} M_t & t < 16 \\ \sigma_1(W_{t-2}) + W_{t-7} + \sigma_0(W_{t-15}) + W_{t-16} & t \ge 16 \end{cases}$$
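A reference model along these lines is one way to produce expected digests for a testbench. The sketch below follows the round and schedule equations above and cross-checks itself against Python's `hashlib`; it is an illustrative golden model, not the checker used in the repository. The round constants and initial hash are derived with exact integer roots of the first 64 primes, and all additions are modulo 2³².

```python
import hashlib
import math

MASK = 0xFFFFFFFF

def _primes(n):
    found, cand = [], 2
    while len(found) < n:
        if all(cand % p for p in found):
            found.append(cand)
        cand += 1
    return found

def _icbrt(n):
    """Exact integer cube root by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

PRIMES = _primes(64)
H_IV = [math.isqrt(p << 64) & MASK for p in PRIMES[:8]]   # frac(sqrt(p)) * 2^32
K = [_icbrt(p << 96) & MASK for p in PRIMES]              # frac(cbrt(p)) * 2^32

def _rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def compress(state, block):
    """F(H, M): message schedule, 64 rounds, then feed-forward addition."""
    w = [int.from_bytes(block[4 * t:4 * t + 4], "big") for t in range(16)]
    for t in range(16, 64):
        s0 = _rotr(w[t - 15], 7) ^ _rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = _rotr(w[t - 2], 17) ^ _rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((s1 + w[t - 7] + s0 + w[t - 16]) & MASK)
    a, b, c, d, e, f, g, h = state
    for t in range(64):
        big_s1 = _rotr(e, 6) ^ _rotr(e, 11) ^ _rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + big_s1 + ch + K[t] + w[t]) & MASK
        big_s0 = _rotr(a, 2) ^ _rotr(a, 13) ^ _rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (big_s0 + maj) & MASK
        h, g, f, e, d, c, b, a = g, f, e, (d + t1) & MASK, c, b, a, (t1 + t2) & MASK
    return [(s + v) & MASK for s, v in zip(state, (a, b, c, d, e, f, g, h))]

def sha256_hex(message: bytes) -> str:
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += (8 * len(message)).to_bytes(8, "big")
    state = list(H_IV)
    for i in range(0, len(padded), 64):
        state = compress(state, padded[i:i + 64])
    return "".join(f"{v:08x}" for v in state)

digest_abc = sha256_hex(b"abc")
```

Single-block, multi-block, and boundary-length messages (55, 56, 64 bytes) are the cases where padding changes the block count, so they are natural corner cases to include in a directed suite.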
Verification Focus
The environment validates:
digest_valid asserts
TCL Automation
The TCL flow executes a deterministic verification loop:
This provides a repeatable regression pipeline for SHA-256 functional verification.
Verification Coverage
Automation Workflow
The TCL flow executes the entire suite:
This produces structured outputs with test names, expected vs actual digests, and overall regression status.
Repository Card
Analog Circuits & Device-Level Design
Device Modeling using Sentaurus TCAD | Link
Performed semiconductor device modeling using Synopsys Sentaurus for foundational structures including N-type resistors, PN diodes, and NMOS transistors. Explored how doping profiles, junction depths, geometry parameters, and physical models impact device characteristics through calibrated simulations and scripted workflows.
Overview
.cmd and .des scripts for sweeping doping levels, voltages, and geometry parameters.
Repository
CMOS Inverter Layout (Magic VLSI) & Ngspice Simulation | Link
A complete CMOS inverter implementation built using Magic VLSI (SCMOS) for physical layout and ngspice for extracted-device simulation.
Covers device construction rules under the SCMOS process, physical layout of PMOS/NMOS devices, contact/tap structures, parasitic-aware extraction, and transient analysis of inverter switching characteristics.
The layout follows the SCMOS ruleset:
Extraction produces a transistor-level .spice netlist including geometry-derived parasitics.
Transient simulation evaluates:
DC Analysis Results
Transient Analysis Results
The repository includes the Magic layout (.mag), extracted netlists, wrapper files for stimulus, and generated ngspice waveforms.
Repository
Two-Stage CMOS Operational Amplifier with Miller Compensation | Link
A two-stage CMOS op-amp designed in TSMC 180 nm, using an NMOS differential input pair with PMOS current-mirror load, followed by a common-source second stage. Frequency compensation is implemented using a Miller capacitor between the first-stage output and the second-stage output node, producing dominant-pole behavior and stable unity-gain operation.
Device dimensions were set from closed-form analog constraints:
Simulation results:
Operating-point analysis confirms all MOS devices remain in saturation with expected overdrive values, and both transient and AC characteristics match analytical pole/zero predictions for a Miller-compensated two-stage topology.
Repository
5-Stage CMOS Ring-Oscillator VCO | Link
A 5-stage CMOS inverter ring used as a voltage-controlled delay line, producing oscillation whose frequency scales with the control voltage. A 3-stage buffer isolates the oscillator core and restores the internal sine-like waveform into a full-swing CMOS square wave.
The oscillator operates from 0.7–3.0 V control input and shows a monotonic delay reduction with increasing drive strength.
Measured characteristics
Frequency points
Repository
Analog Function Generator with Adjustable Amplitude/Offset/Phase | Link
A multi-waveform analog function generator built using discrete op-amp blocks (TL082), passive RC networks, and a CD4051 analog multiplexer. The generator produces sine, square, and triangular outputs and exposes continuous control of amplitude, DC offset, and phase. Additional AM/PM blocks and a relaxation-oscillator VCO extend the system for modulation experiments.
The signal path is fully modular: each block is buffered to avoid inter-stage loading errors, enabling predictable behavior across a 1 kHz–500 kHz operating band.
Measured characteristics
Signal-generation architecture
Representative measurements
Repository
Audio Power Amplifiers (Class A, B, AB) | Link
Design, analysis, and LTspice simulation of audio power amplifier classes, combining analytical power calculations, circuit-level simulation, and audio-domain validation. Digital audio signals are preprocessed using FFmpeg and injected directly into LTspice for waveform-accurate listening tests.
Scope
Key Results
The repository includes LTspice schematics, FFmpeg preprocessing workflow, analytical calculations, and simulation/audio outputs.
Repository
CMOS Bandgap Reference (DC, Temp, Line, Monte Carlo) | Link
LTspice simulation and characterization of a CMOS bandgap reference implemented in an OSU 180 nm CMOS process, with emphasis on Monte Carlo mismatch analysis alongside functional verification. The work validates temperature compensation, line regulation, startup behavior, and statistical variation of Vref.
Scope
Monte Carlo Results (100 Runs)
The results indicate that reference accuracy is dominated by passive matching, consistent with an untrimmed CMOS bandgap implementation. No curvature correction or trimming was applied.
The repository includes LTspice schematics, simulation directives, waveform data, and Monte Carlo result plots.
Repository
Precision PID Controller Design using Operational Amplifiers | Link
An analog PID controller built using high-linearity op-amps (LT1007 / TL082) and RC networks, implemented entirely in continuous time and validated through LTspice. The design focuses on stable low-frequency integration, controlled differentiation without noise peaking, and diode-based output limiting for robust transient behavior.
Two complete controller variants were implemented: one minimal, and one extended with gain scaling and anti-windup.
Measured / designed characteristics
Differential stage: unity-gain differential amplifier with high CMRR for clean error sensing
Integrator: 10 ms time constant →
Derivative network: RC shaping with controlled high-frequency roll-off to prevent noise amplification
Output swing protection: diode clamps maintaining bounded actuation signal under large transients
Op-amp choices: LT1007 for low noise and precision; TL082 as a low-cost, wide-bandwidth alternative
Simulation: full closed-loop Bode, transient, load-step and saturation recovery tests in LTspice
Second PID variant
Design intent
Repository
Robotics and Applied ML (Secondary)
ANAV for Martian Surface Exploration / GNSS-Denied Environments (ISRO IRoC-U 2025) | Link
A sub-2 kg autonomous quadrotor designed for GNSS-denied navigation, visual–inertial localization, mapping, and safe-zone landing, using onboard compute, stereo sensing, and redundant measurement sources.
Duration: Team-Based (ISRO RIG), Ongoing
Tools: Jetson Nano | Pixhawk 4 | RealSense D435i | ESP32 (ESP-Now) | ORB-SLAM3 | VINS-Fusion | ROS2
Autonomous Quadrotor for GPS-Denied Operation
Technical Summary
Integrated Jetson Nano with Pixhawk 4 for onboard computation and flight handling, with calibrated ESCs and thrust mapping ensuring stable lift and attitude control for a <2 kg platform. Power regulation used a 5 V / 3 A BEC, isolating sensor/compute loads from motor-induced voltage drops.
Performed extrinsic and intrinsic calibration for the RealSense D435i (stereo + IMU) and aligned timestamps between Jetson and Pixhawk sources. Evaluated VIO accuracy using VINS-Fusion and ORB-SLAM3, testing sensitivity to feature density, motion blur, low-texture floors, and illumination. Achieved <5 cm drift over ~5 m sequences with optimized IMU noise parameters and RANSAC thresholds.
Connected barometer, optical-flow, and external sensors to Pixhawk over I2C/UART. Configured EKF2 to combine IMU, barometer, and flow when stereo data deteriorates. Implemented consistency checks between VIO and Pixhawk position estimates; deviations above a fixed threshold (~8–10 cm) trigger reliance on flow + barometer only.
Implemented long-range ESP-Now telemetry between two ESP32 modules. Achieved ~500 m line-of-sight operation and <15 ms median latency. Data included estimated position, VIO confidence, EKF residuals, battery, and attitude.
Developed a method for landing region selection using disparity maps and IMU tilt. Evaluated a 1.5 m × 1.5 m safe area requirement; system rejected regions with irregular height profiles or slopes >15°. Confirmed consistent landings on textured and partially textured surfaces.
Conducted GNSS-denied simulations in Webots, setting gravity to 0.38 g to approximate Martian conditions. Assessed altitude holding, drift accumulation, and safe-area approach across multiple terrains. Logged estimator drift, thrust reserve, and landing dispersion to validate repeatability under constrained sensing.
Repositories
ANAV – ISRO IRoC-U 2025 Autonomous Drone System
RU83C – Rubik’s Cube Solving Robot | Link
A computer-vision–driven Rubik’s Cube solver built around color detection, face reconstruction, and algorithmic solution generation. The system extracts cube state using calibrated imaging and solves it via a Kociemba two-phase search, which operates over the full 43,252,003,274,489,856,000 (~4.3×10¹⁹) state space of a standard 3×3 cube.
Vision Processing and Cube State Extraction
Performed HSV-based per-face calibration with adjustable saturation/value envelopes to stabilize under variable illumination.
Applied contour filtering and grid isolation after morphological denoising to lock onto a valid 3×3 cell arrangement.
Executed homography-based perspective correction and grid segmentation, assigning colors by mean-HSV dominance.
Combined all six captures into a canonical 54-character cube state string, checked for:
Solution Generation and Simulation
Used a Unity visualization environment for state verification, stepwise execution, and intermediate-move replay.
Integrated Kociemba’s two-phase algorithm with explicit details:
Typical generated solutions fall in the 18–22 move range (quarter-turn metric), with occasional optimal-length sequences for favorable states.
Viewer supports interactive updates, solution playback, and direct manipulation of state representations.
Repositories
MRI-Based Alzheimer’s & MCI Classification using 3D CNNs | Link
Implemented a full 3D medical-imaging classification pipeline for Alzheimer’s, MCI, and cognitively normal subjects using PyTorch/MONAI.
Focused on volumetric preprocessing, stable normalization across scanners, and architecture search over 3D convolutional backbones.
Tools: PyTorch • MONAI • NiBabel • 3D CNNs • Bayesian Optimization • Medical Image Preprocessing
PPO-Based Reinforcement Learning for Autonomous Racing on AWS DeepRacer | Link
Built and fine-tuned continuous-action PPO agents on AWS SageMaker for camera-based autonomous racing.
Focused on reward shaping, action-space optimization, and stability constraints that reduce off-track drift and maximize progress-per-step.
Achieved sub-2-minute lap times, reaching top global leaderboard ranks in 2024.
Tools: AWS SageMaker • DeepRacer Simulator • Clipped PPO • Continuous RL • Policy Gradient Optimization
Repositories
Autonomous Multi-Sensor Robot Simulation (GPS/IMU/LiDAR/2-DOF Vision) | Link
A fully simulated 4-wheel autonomous robot equipped with GPS, 9-axis IMU, 2-D LiDAR, ultrasonic distance sensors, and a 2-DOF camera system (linear + rotary actuation).
Implements global-position tracking, local mapping, object detection via camera streams, and reactive obstacle avoidance with minimal control logic.
All sensing, actuation, and navigation behaviors are implemented inside the simulation stack.
Repositories
Differential-Drive Kinematics & Odometry Robot
A two-wheel system where encoder increments produce linear and angular motion through standard differential-drive relations.
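$\Delta s_L = r\,\Delta\phi_L$, $\Delta s_R = r\,\Delta\phi_R$, with $r = 0.025\,\text{m}$.
$v = \dfrac{\Delta s_L + \Delta s_R}{2}$, $\omega = \dfrac{\Delta s_R - \Delta s_L}{b}$, where $b = 0.09\,\text{m}$.
$x' = x + v\cos\theta$, $y' = y + v\sin\theta$, $\theta' = \theta + \omega$.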
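The relations above can be sketched as a single odometry step. The wheel radius and wheel base are the values given above; the function name and the `(x, y, theta)` tuple layout are illustrative assumptions:

```python
import math

WHEEL_RADIUS_M = 0.025  # r
WHEEL_BASE_M = 0.09     # b


def odometry_step(pose, dphi_left, dphi_right):
    """Advance (x, y, theta) by one pair of encoder increments in radians."""
    x, y, theta = pose
    ds_left = WHEEL_RADIUS_M * dphi_left
    ds_right = WHEEL_RADIUS_M * dphi_right

    v = (ds_left + ds_right) / 2.0                # distance moved this step
    omega = (ds_right - ds_left) / WHEEL_BASE_M   # heading change this step

    return (x + v * math.cos(theta),
            y + v * math.sin(theta),
            theta + omega)


# Equal increments on both wheels drive straight along the current heading.
pose = odometry_step((0.0, 0.0, 0.0), 1.0, 1.0)
```

Here `v` and `omega` are per-step displacements, not rates, which matches the update equations as written.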
Line-Follower Robot
A two-sensor contrast system that adjusts wheel velocities according to inequalities between left and right reflectance values.
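$|I_L - I_R| \approx 0$.
$I_L > I_R,\ I_L \in [I_{\min}, I_{\max}]$, implemented by reducing or reversing the left wheel.
$I_R > I_L,\ I_R \in [I_{\min}, I_{\max}]$, applied symmetrically to the right wheel.
$I_L \lessgtr I_R \implies$ turn left/right, $I_L \approx I_R \implies$ forward.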
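A minimal sketch of this comparison rule. It assumes that "approximately equal" means within a tolerance `tol` and that the slowed wheel is scaled by a factor `slow`; both parameters and the function name are hypothetical:

```python
def line_follower_speeds(i_left, i_right, v_max=1.0, tol=0.05, slow=0.0):
    """Return (v_left, v_right) from two reflectance readings."""
    if abs(i_left - i_right) <= tol:
        return (v_max, v_max)          # readings balanced: go forward
    if i_left > i_right:
        return (slow * v_max, v_max)   # reduce the left wheel
    return (v_max, slow * v_max)       # symmetric case: reduce the right wheel
```

A negative `slow` reverses the wheel instead of only reducing it, which gives a tighter turn.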
Obstacle-Avoidance Robot
A proximity-based motion rule where wheel speeds depend on whether any sensor exceeds a threshold.
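$p_i \le \tau \ \forall i$, for some threshold $\tau$.
$p_j > \tau$, the left wheel reverses, creating a turning motion away from the detected obstacle.
$(v_L, v_R) = \begin{cases} (v_{\max},\, v_{\max}), & \max_i p_i \le \tau, \\ (-v_{\max},\, v_{\max}), & \max_i p_i > \tau. \end{cases}$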
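The piecewise rule can be sketched directly. The function name and the default `v_max` are illustrative assumptions:

```python
def avoid_speeds(proximities, tau, v_max=1.0):
    """Return (v_left, v_right) given proximity readings and threshold tau."""
    if max(proximities) > tau:
        # At least one sensor sees an obstacle: reverse the left wheel to turn.
        return (-v_max, v_max)
    # All readings are at or below the threshold: drive straight.
    return (v_max, v_max)
```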
Wall-Follower Robot
A proximity-driven motion rule based on simple comparisons involving front-facing and left-side sensor values.
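front wall: $f > \tau$, left wall: $\ell > \tau$.
$f > \tau$, implemented as $(v_L, v_R) = (v_{\max},\, -v_{\max})$.
$f \le \tau,\ \ell > \tau$, giving $(v_L, v_R) = (v_{\max},\, v_{\max})$.
$f \le \tau,\ \ell \le \tau$, with $(v_L, v_R) = (\tfrac{1}{8} v_{\max},\, v_{\max})$.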
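The three cases can be sketched as one decision function. The function name and the default `v_max` are illustrative assumptions:

```python
def wall_follower_speeds(front, left, tau, v_max=1.0):
    """Return (v_left, v_right) from front and left proximity readings."""
    if front > tau:
        return (v_max, -v_max)      # wall ahead: spin in place
    if left > tau:
        return (v_max, v_max)       # wall on the left, path clear: go straight
    return (v_max / 8.0, v_max)     # wall lost: arc left to find it again
```

Checking the front sensor first gives it priority, so a wall ahead always triggers the spin regardless of the left reading.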
$\Delta s = \dfrac{s_L + s_R}{2}$, $\theta = \dfrac{s_R - s_L}{d}$,
$x' = x + \Delta s\cos\theta$, $y' = y + \Delta s\sin\theta$,
allowing detection of when $(x, y)$ enters the target region.
Robotrix-2k25 - Stereo Vision Based 3D Hoop Control | Link
A simulation-based control project developed for the Robotrix-2k25 finals.
A ball is shot in random directions and with varying forces, and the robot must reposition a 3-axis hoop to intercept it.
Ball position cannot be accessed directly; only two stereo cameras mounted on the backboard are available.
Ball 3D position is reconstructed via color segmentation, stereo disparity, and camera→world transforms, followed by 3D trajectory prediction and PID-driven actuator control.
Stereo Vision + 3D Reconstruction + Predictive Control Workflow
Technical Summary
Ball detection uses HSV thresholding around the known orange color signature:
lower_hsv = (h_min, s_min, v_min)
upper_hsv = (h_max, s_max, v_max)
Contours and circle fitting yield pixel centers $(x_l, y_l)$ and $(x_r, y_r)$.
Stereo disparity:
$d = x_l - x_r$
Depth recovery:
$Z = \dfrac{f \cdot B}{d}$
3D coordinates in camera frame:
$X = \dfrac{(x - c_x)\,Z}{f}$
$Y = \dfrac{(y - c_y)\,Z}{f}$
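A minimal sketch of the disparity-to-3D step. It assumes rectified cameras, a focal length `f` in pixels, a baseline in meters, and the left image as the reference for $(x, y)$; the function name and the numbers in the usage line are hypothetical:

```python
def triangulate(xl, yl, xr, f, baseline, cx, cy):
    """Recover (X, Y, Z) in the left-camera frame from a stereo match."""
    d = xl - xr                     # disparity in pixels
    if d <= 0:
        # Zero or negative disparity means a bad match or a point at infinity.
        raise ValueError("non-positive disparity")
    z = f * baseline / d            # depth along the optical axis
    x = (xl - cx) * z / f           # back-project through the pinhole model
    y = (yl - cy) * z / f
    return (x, y, z)


# Hypothetical values: 500 px focal length, 10 cm baseline, 20 px disparity.
point = triangulate(340.0, 260.0, 320.0, f=500.0, baseline=0.1,
                    cx=320.0, cy=240.0)
```

Depth is inversely proportional to disparity, so a one-pixel error matters far more for distant balls than for near ones.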
Coordinates are transformed into the world frame:
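$P_{\text{world}} = R\,P_{\text{cam}} + t$
Relative ball→hoop position:
$P_{\text{rel}} = P_{\text{ball}} - P_{\text{hoop}}$
Velocity estimation (finite differences):
$v_x = (X_2 - X_1)/\Delta t$
$v_y = (Y_2 - Y_1)/\Delta t$
$v_z = (Z_2 - Z_1)/\Delta t$
Projectile prediction:
$x(t) = v_x t + x_0$
$y(t) = v_y t + y_0$
$z(t) = v_z t + z_0 - \tfrac{1}{2} g t^2$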
The target hoop position is chosen where the predicted trajectory intersects the hoop’s capture volume.
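A minimal sketch of the velocity estimate and ballistic prediction. It assumes the world $z$ axis points up, $g = 9.81\,\text{m/s}^2$, and a simplified capture test that solves for when the ball falls through a horizontal plane at height `z_target`. The helper names and the plane-crossing simplification are assumptions, not the project's actual capture-volume logic:

```python
import math

G = 9.81  # m/s^2, assumed gravity along -z


def estimate_velocity(p1, p2, dt):
    """Finite-difference velocity between two 3D positions dt seconds apart."""
    return tuple((b - a) / dt for a, b in zip(p1, p2))


def predict_position(p0, v, t, g=G):
    """Ballistic position t seconds after p0, given initial velocity v."""
    x0, y0, z0 = p0
    vx, vy, vz = v
    return (x0 + vx * t, y0 + vy * t, z0 + vz * t - 0.5 * g * t * t)


def time_to_height(p0, v, z_target, g=G):
    """Time at which the ball descends through z_target, or None if it never does."""
    z0, vz = p0[2], v[2]
    disc = vz * vz - 2.0 * g * (z_target - z0)
    if disc < 0:
        return None                        # apex is below the target height
    return (vz + math.sqrt(disc)) / g      # later root: the descending crossing
```

Evaluating `predict_position` at the time returned by `time_to_height` gives a candidate $(x, y)$ target for the hoop.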
PID control per axis:
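$u(t) = K_p\,e(t) + K_i \int e(t)\,dt + K_d\,\dfrac{de(t)}{dt}$
Errors:
$e_x = x_{\text{target}} - x_{\text{hoop}}$
$e_y = y_{\text{target}} - y_{\text{hoop}}$
$e_z = z_{\text{target}} - z_{\text{hoop}}$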
Each PID output drives its respective sliding joint through velocity commands.
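A minimal discrete-time sketch of the per-axis PID law. It assumes a fixed sample period `dt`, rectangular integration, and a backward-difference derivative; the class name and the gains in the usage lines are hypothetical:

```python
class PID:
    """Discrete PID controller for one axis."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt):
        """Return the control output for the current error sample."""
        self.integral += error * dt
        # No derivative on the first sample, since there is no previous error.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# One controller per sliding joint; the output is used as a velocity command.
pid_x = PID(kp=2.0, ki=0.5, kd=0.1)
command = pid_x.update(error=1.0, dt=0.1)
```

Keeping one `PID` instance per axis keeps the integral and derivative state of the three joints independent.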
Repository
Smart Vision Grocery Quality & Quantity Analysis - Flipkart GRiD 6.0 | Link
Qualified in Round 1 of GRiD 6.0.
A vision-based system for automated grocery-item quality and quantity assessment.
The solution integrates a unified dataset (multiple Roboflow sources aggregated and re-annotated) and a consolidated training run using a CNN-based detector (YOLOv7 backbone). The pipeline evaluates produce freshness, packaging correctness, text/OCR extraction, and item count/brand verification.
Core Capabilities Implemented
Technical Summary
The system follows the GRiD 6.0 Smart Vision architecture:
Image Acquisition:
Uniform lighting normalization, noise filtering, and contrast stabilization ensure consistent input quality.
Preprocessing Pipeline:
Images undergo intensity normalization, edge-aware smoothing, and segmentation to isolate foreground products.
This supports text regions, geometric features, and surface attributes required for OCR and quality scoring.
Feature Extraction:
Text regions are processed using OCR; geometric features (edges, contours, size ratios), color-space transformations, and texture descriptors support defect/freshness detection.
Deep CNN embeddings from the trained model are used for brand/category classification, while SVM layers assist with high-similarity items.
Classification and Decision Rules:
Outputs are checked against a product database for correctness.
Freshness of produce uses color variance, texture irregularities, bruise signatures, and abnormal shape metrics.
Count estimation uses object-level consistency checks aligned with the event’s “IR-based counting” specification.
Output & Feedback:
Detected attributes (brand, count, OCR text, expiry date, freshness index) are logged.
A feedback loop stores misclassified samples for incremental dataset improvement.
Repository
$ cat ~/about_me.txt
$ env --familiar | grep TOOLS
Tools I’m familiar with and use
$ git stats --all
$ sudo wisdom