📜 DocETL: Powering Complex Document Processing Pipelines
DocETL is a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers:
🌟 Community Projects
📚 Educational Resources
🚀 Getting Started
There are two main ways to use DocETL:
1. 🎮 DocWrangler, the Interactive UI Playground (Recommended for Development)
DocWrangler helps you iteratively develop your pipeline:
DocWrangler is hosted at docetl.org/playground. To run the playground locally, you can either use Docker (`make docker`) or set up the development environment manually. See the Playground Setup Guide for detailed instructions.
2. 📦 Python Package (For Production Use)
If you want to use DocETL as a Python package:
Prerequisites
Create a `.env` file in your project directory:

To see examples of how to use DocETL, check out the tutorial.
2. 🎮 DocWrangler Setup
To run DocWrangler locally, you have two options:
Option A: Using Docker (Recommended for Quick Start)
The easiest way to get the DocWrangler playground running:
Create `.env` in the root directory (for the backend Python server that executes pipelines):

Create `.env.local` in the `website` directory (for DocWrangler UI features like improve prompt and chatbot):

This will:
To clean up Docker resources (note that this will delete the Docker volume):
AWS Bedrock
This framework supports integration with AWS Bedrock. To enable:
Configure AWS credentials:
Test your AWS credentials:
Run with AWS support:
Or using Docker Compose:
Environment variables:
- `AWS_PROFILE`: Your AWS CLI profile (default: `default`)
- `AWS_REGION`: AWS region (default: `us-west-2`)

Bedrock models are prefixed with `bedrock`. See the LiteLLM docs for more details.

Option B: Manual Setup (Development)
For development or if you prefer not to use Docker:
Clone the repository:
Set up environment variables in `.env` in the root/top-level directory (for the backend Python server):
```bash
OPENAI_API_KEY=your_api_key_here  # Used by DocETL pipeline execution engine

# BACKEND configuration
BACKEND_ALLOW_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
BACKEND_HOST=localhost
BACKEND_PORT=8000
BACKEND_RELOAD=True

# FRONTEND configuration
FRONTEND_HOST=0.0.0.0
FRONTEND_PORT=3000

# Host port mapping for docker-compose (if not set, defaults are used in docker-compose.yml)
FRONTEND_DOCKER_COMPOSE_PORT=3031
BACKEND_DOCKER_COMPOSE_PORT=8081

# Supported text file encodings
TEXT_FILE_ENCODINGS=utf-8,latin1,cp1252,iso-8859-1
```
If you prefer using uv directly instead of Make:
Start the development server:
Visit http://localhost:3000/playground to access the interactive UI.
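The `.env` settings above include comma-separated lists such as `BACKEND_ALLOW_ORIGINS` and `TEXT_FILE_ENCODINGS`. The following is a minimal sketch, not DocETL's actual loader, of how such values could be parsed and how an ordered encoding list might be used to decode a file. The helper names `parse_env`, `split_list`, and `decode_with_fallback` are hypothetical, and the idea that encodings are tried in order is an assumption for illustration.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comment lines."""
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop a trailing inline comment introduced by a space followed by '#'.
        value = value.split(" #", 1)[0].strip()
        settings[key.strip()] = value
    return settings


def split_list(value: str) -> list[str]:
    """Split a comma-separated setting into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def decode_with_fallback(data: bytes, encodings: list[str]) -> tuple[str, str]:
    """Try each encoding in order (assumed behavior); return (text, encoding used)."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the configured encodings could decode the data")


# Sample values copied from the .env shown above.
SAMPLE_ENV = """\
OPENAI_API_KEY=your_api_key_here  # Used by DocETL pipeline execution engine
# BACKEND configuration
BACKEND_ALLOW_ORIGINS=http://localhost:3000,http://127.0.0.1:3000
BACKEND_PORT=8000
TEXT_FILE_ENCODINGS=utf-8,latin1,cp1252,iso-8859-1
"""

settings = parse_env(SAMPLE_ENV)
origins = split_list(settings["BACKEND_ALLOW_ORIGINS"])
encodings = split_list(settings["TEXT_FILE_ENCODINGS"])

# b"caf\xe9" is not valid UTF-8, so the next encoding in the list is tried.
text, used = decode_with_fallback(b"caf\xe9", encodings)
print(origins, int(settings["BACKEND_PORT"]), used)
```

In Python, `latin1` can decode any byte sequence, so under this try-in-order assumption the entries listed after it would never be reached; order matters if you customize the list.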
🛠️ Development Setup
If you’re planning to contribute or modify DocETL, you can verify your setup by running the test suite:
For detailed documentation and tutorials, visit our documentation.