ElectroFind Documentation

Technical Documentation

Complete overview of the ElectroFind system — data pipeline, AI model, architecture, and API integrations.

1. Project Overview

ElectroFind is a final-year B.E. Artificial Intelligence project that addresses electronics price comparison by combining models trained offline with real-time web data aggregation. The system has two core layers:

Offline AI Layer

50,000+ products collected from Amazon & Flipkart
Text vectorization with TF-IDF
Cosine similarity for product matching
Duplicate detection across platforms
Buy Score model training

Real-Time Layer

Live search via Oxylabs Realtime Scraper API
Google Shopping via Serper.dev
5 platforms queried in parallel
Results scored and ranked in under 3s
Dynamic price and availability data

2. Data Pipeline

The offline data pipeline was built to collect, clean, and transform a large dataset of electronics products for model training.

Web Scraping

Automated collection of 50,000+ electronics product listings from Amazon and Flipkart spanning 10 categories including smartphones, laptops, and audio.

Raw Storage

Data stored in structured CSV format with fields: title, price, rating, review_count, category, platform, ASIN/product_id, image_url.

Data Cleaning

Remove duplicates, handle null values, normalize price formats (₹ to $), strip HTML from titles, standardize category labels.
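
The price normalization step could look roughly like the sketch below. It is not taken from the project: the name `normalize_price` matches the helper the pipeline calls, but the parsing rules and the fixed `INR_PER_USD` rate are assumptions made for illustration. A real pipeline would more likely use an exchange rate dated to the scrape.

```python
import re
from typing import Optional

# Assumed fixed conversion rate, for illustration only.
INR_PER_USD = 83.0

# First number in the string, allowing thousands separators and decimals.
_NUMBER_RE = re.compile(r'\d[\d,]*(?:\.\d+)?')
# A rupee marker (₹, Rs, Rs., INR) directly in front of the amount.
_INR_RE = re.compile(r'(?:₹|\brs\.?|\binr)\s*\d', re.IGNORECASE)


def normalize_price(raw: object) -> Optional[float]:
    """Parse a scraped price such as '₹1,29,999' or '$1,299.00' into USD.

    Returns None when no number can be found. Strings without a rupee
    marker are assumed to be in USD already.
    """
    if raw is None:
        return None
    text = str(raw)
    match = _NUMBER_RE.search(text)
    if match is None:
        return None
    amount = float(match.group(0).replace(',', ''))
    if _INR_RE.search(text):
        amount /= INR_PER_USD
    return round(amount, 2)
```

Returning `None` for unparseable values keeps the decision about dropping or imputing those rows in the null-handling step rather than hiding it inside the parser.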

Feature Engineering

Extract brand names, storage variants, color attributes, and model numbers from title strings using regex and NLP rules.
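
As a rough illustration of rule-based extraction, the sketch below pulls brand, storage, and color out of a title. The vocabularies, the `extract_features` name, and the storage pattern are hypothetical and far smaller than what a 50,000-product dataset would need. Model-number extraction is left out because it depends heavily on per-brand naming rules.

```python
import re
from typing import Dict, Optional

# Small illustrative vocabularies (assumptions, not the project's lists).
KNOWN_BRANDS = ("apple", "samsung", "oneplus", "sony", "dell", "hp", "lenovo")
KNOWN_COLORS = ("black", "white", "blue", "silver", "gold", "green", "red")

# Matches capacities such as "256GB", "256 GB" or "1TB".
STORAGE_RE = re.compile(r'\b(\d+)\s?(gb|tb)\b', re.IGNORECASE)


def extract_features(title: str) -> Dict[str, Optional[str]]:
    """Return brand, storage and color found in a product title, or None."""
    words = re.findall(r'[a-z0-9]+', title.lower())
    brand = next((w for w in words if w in KNOWN_BRANDS), None)
    color = next((w for w in words if w in KNOWN_COLORS), None)
    match = STORAGE_RE.search(title)
    storage = f"{match.group(1)}{match.group(2).upper()}" if match else None
    return {"brand": brand, "storage": storage, "color": color}
```

One known limitation: the first capacity in the title wins, so a title that lists RAM before storage ("8GB RAM, 256GB") would report the RAM size. Handling that would need an extra rule.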

Vectorization

Apply TF-IDF vectorization to cleaned product titles, generating sparse feature matrices for similarity computation.

Export

Clean dataset exported as processed CSV and pickle files for use in model training and the similarity engine.

# Simplified data pipeline (Python)
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load raw scraped data
df = pd.read_csv('raw_electronics.csv')

# Clean titles: lowercase, then keep only letters, digits and spaces
df['clean_title'] = (
    df['title'].str.lower()
    .str.replace(r'[^a-z0-9 ]', '', regex=True)
    .str.strip()
)

# Remove exact duplicates of the cleaned title within each platform
df = df.drop_duplicates(subset=['clean_title', 'platform'])

# Normalize prices (normalize_price is a project helper defined elsewhere)
df['price_usd'] = df['price'].apply(normalize_price)

# TF-IDF vectorization over unigrams and bigrams
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
tfidf_matrix = vectorizer.fit_transform(df['clean_title'])

3. AI / ML Model

The AI model serves two purposes: cross-platform product matching (deduplication) and validation of the Buy Score signal weights.

Cosine Similarity Matching

TF-IDF vectors of product titles
Cosine similarity threshold: 0.85
Matches same product across platforms
94% accuracy on held-out test set
Used for real-time deduplication

Buy Score Validation

Correlation analysis on 50K products
Rating weight: 40pts (strongest signal)
Review volume (log-scaled): 30pts
Price-value (relative, lower=better): 30pts
Validated against user purchase intent data

# Cosine similarity matching
import math  # used by buy_score below

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def find_matches(query_title, product_matrix, vectorizer, threshold=0.85):
    query_vec = vectorizer.transform([query_title])
    similarities = cosine_similarity(query_vec, product_matrix).flatten()
    matches = np.where(similarities >= threshold)[0]
    return matches, similarities[matches]

# Buy Score calculation
def buy_score(price, rating, reviews, min_p, max_p):
    r_score  = (rating / 5.0) * 40
    rv_score = min(math.log10(reviews + 1) / math.log10(100_000) * 30, 30)
    p_score  = (1 - (price - min_p) / (max_p - min_p)) * 30 if max_p > min_p else 15
    return round(r_score + rv_score + p_score, 1)

4. System Architecture

API Layer (Next.js)

/api/search — main endpoint
Parallel fetch via Promise.all
40s per-platform timeout
Score calculation & sort
JSON response

Data Sources

Oxylabs — Amazon, Flipkart, Best Buy, eBay
Serper.dev — Google Shopping
Parsed product objects
Unified Product type
Error isolation per platform

Frontend (Next.js 14)

App Router with RSC
Client search page
Framer Motion animations
Tailwind CSS + ShadCN
Responsive mobile-first

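// Simplified API route flow
import { NextRequest, NextResponse } from "next/server";

export async function GET(req: NextRequest) {
  const q = req.nextUrl.searchParams.get("q");
  if (!q) {
    return NextResponse.json({ error: "Missing query parameter q" }, { status: 400 });
  }

  // 1. Fetch all platforms in parallel (Oxylabs platforms: 40s each, Serper: 15s)
  const [oxyResult, serperResult] = await Promise.all([
    fetchAllPlatforms(q),    // Amazon + Flipkart + Best Buy + eBay
    fetchSerperShopping(q),  // Google Shopping (with images)
  ]);

  // 2. Merge, score and sort
  const all    = [...oxyResult.products, ...serperResult.products];
  const scored = calculateScores(all).sort((a, b) => b.score - a.score);

  // 3. Collect per-platform errors (assumes each result exposes an errors array)
  const errors = [...oxyResult.errors, ...serperResult.errors];

  return NextResponse.json({ products: scored, errors, total: scored.length });
}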

5. Buy Score Algorithm

Every product receives a Buy Score (0–100) calculated from three signals validated against our training dataset.

Signal	Weight	Formula
Star Rating	40 pts	(rating / 5.0) × 40
Review Volume	30 pts	min(log10(reviews+1) / log10(100k) × 30, 30)
Price Value	30 pts	(1 – (price – minP) / (maxP – minP)) × 30

72–100: Excellent Buy
45–71: Good Value
0–44: Consider Options
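
As a worked example of the formula: a product rated 4.5 with 999 reviews that is also the cheapest in its result set scores (4.5 / 5.0) × 40 = 36, plus log10(1000) / log10(100,000) × 30 = 18, plus 30 for price, giving 84, which lands in the Excellent Buy band. Mapping a score to its band can be sketched as below. The function name is hypothetical, and since scores are rounded to one decimal, the treatment of values between bands (for example 71.5) is an assumption: here each band starts at its lower bound.

```python
def buy_score_label(score: float) -> str:
    """Map a Buy Score in [0, 100] to its band label.

    Assumption: fractional scores between the published bands fall into
    the lower band, so 71.5 is still "Good Value".
    """
    if not 0 <= score <= 100:
        raise ValueError(f"Buy Score out of range: {score}")
    if score >= 72:
        return "Excellent Buy"
    if score >= 45:
        return "Good Value"
    return "Consider Options"
```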

6. API Integrations

Oxylabs Realtime Scraper

amazon_search — geo: 10001 (NYC zip)
flipkart_search — geo: India
bestbuy_search — geo: 10001
ebay_search — geo: 10001
parse: true for structured JSON
40s per-platform AbortController timeout

Serper.dev Shopping API

POST /shopping endpoint
Returns imageUrl, price, rating, ratingCount
X-API-KEY header authentication
gl=us, hl=en, num=10
Free tier: 2,500 queries/month
15s timeout with AbortController
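
The Serper settings listed above can be summarized as a request description. The sketch below only builds the request and sends nothing. The host in `SERPER_SHOPPING_URL` and the JSON body field names are assumptions based on Serper's public documentation rather than on this project's code, and `build_serper_request` is a hypothetical helper name.

```python
import json
from typing import Any, Dict

# Assumed endpoint host; the document only states "POST /shopping".
SERPER_SHOPPING_URL = "https://google.serper.dev/shopping"


def build_serper_request(query: str, api_key: str) -> Dict[str, Any]:
    """Describe the shopping request without performing any network I/O."""
    return {
        "method": "POST",
        "url": SERPER_SHOPPING_URL,
        "headers": {
            "X-API-KEY": api_key,  # header-based authentication
            "Content-Type": "application/json",
        },
        # gl / hl / num mirror the settings listed in this section.
        "body": json.dumps({"q": query, "gl": "us", "hl": "en", "num": 10}),
        "timeout_seconds": 15,
    }
```

Keeping request construction separate from sending makes the parameters easy to unit-test without spending any of the monthly query quota.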

7. Search Flow

User Input

User enters query + category on home page or search page. Enter key or button click triggers navigation to /search?q=...&category=...

URL Navigation

Next.js router pushes to /search. SearchSection reads query from URL params via useSearchParams() and auto-triggers search on mount.

Parallel API Fetch

GET /api/search fires Promise.all([fetchAllPlatforms, fetchSerperShopping]). Each Oxylabs platform request has its own 40s AbortController; the Serper request uses a 15s timeout.
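
The route itself is TypeScript, built on Promise.all and AbortController. For readers following the Python side of the project, the same pattern (parallel requests, a timeout per source, and errors isolated per platform) can be sketched with asyncio. All names here are hypothetical, and the fetchers stand in for the real scraper calls.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional, Tuple

Fetcher = Callable[[str], Awaitable[List[dict]]]


async def fetch_isolated(
    name: str, fetcher: Fetcher, query: str, timeout_s: float
) -> Tuple[List[dict], Optional[str]]:
    """Run one platform fetch; convert timeouts and failures into an error string."""
    try:
        products = await asyncio.wait_for(fetcher(query), timeout=timeout_s)
        return products, None
    except asyncio.TimeoutError:
        return [], f"{name}: timed out after {timeout_s}s"
    except Exception as exc:  # one failing platform must not fail the others
        return [], f"{name}: {exc}"


async def search_all(
    query: str, sources: Dict[str, Tuple[Fetcher, float]]
) -> Tuple[List[dict], List[str]]:
    """Query every source concurrently and merge products and errors."""
    results = await asyncio.gather(
        *(fetch_isolated(name, fetcher, query, timeout_s)
          for name, (fetcher, timeout_s) in sources.items())
    )
    products = [p for items, _ in results for p in items]
    errors = [err for _, err in results if err is not None]
    return products, errors
```

Because each task catches its own exceptions, `asyncio.gather` never sees a failure, so a slow or broken platform only removes its own results and adds one entry to `errors`.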

Score & Sort

calculateScores() applies Buy Score formula across all products. Results sorted descending by score before returning JSON.
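
A Python sketch of this step is shown below, reusing the Buy Score formula from section 3. Taking the minimum and maximum price from the current result set is an assumption about how the price signal is made relative, and the product field names (`price`, `rating`, `reviews`) are hypothetical.

```python
import math
from typing import Dict, List


def buy_score(price: float, rating: float, reviews: int,
              min_p: float, max_p: float) -> float:
    """Buy Score (0-100): rating 40 pts, review volume 30 pts, price value 30 pts."""
    r_score = (rating / 5.0) * 40
    rv_score = min(math.log10(reviews + 1) / math.log10(100_000) * 30, 30)
    # With a single price point there is no spread, so award the midpoint.
    p_score = (1 - (price - min_p) / (max_p - min_p)) * 30 if max_p > min_p else 15
    return round(r_score + rv_score + p_score, 1)


def calculate_scores(products: List[Dict]) -> List[Dict]:
    """Attach a score to every product and return them sorted best-first."""
    if not products:
        return []
    prices = [p["price"] for p in products]
    min_p, max_p = min(prices), max(prices)
    scored = [
        {**p, "score": buy_score(p["price"], p["rating"], p["reviews"], min_p, max_p)}
        for p in products
    ]
    return sorted(scored, key=lambda p: p["score"], reverse=True)
```

One consequence of a relative price signal is that the same product can receive a different score in a different search, since the cheapest and most expensive items in the result set change.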

Client Render

SearchSection receives products, sets activePlatforms from returned results. User can filter by platform, sort, and toggle filter panel.

Product Card

Each ProductCard shows rank badge, platform badge, title, price, rating, Buy Score pill with color coding, and View Deal link.
