ElectroFind Documentation

Technical Documentation

Complete overview of the ElectroFind system — data pipeline, AI model, architecture, and API integrations.

1. Project Overview

ElectroFind is a final-year B.E. Artificial Intelligence project that solves the electronics price comparison problem using a combination of offline AI-trained models and real-time web data aggregation. The system has two core layers:

Offline AI Layer

  • 50,000+ products collected from Amazon & Flipkart
  • Text vectorization with TF-IDF
  • Cosine similarity for product matching
  • Duplicate detection across platforms
  • Buy Score model training

Real-Time Layer

  • Live search via Oxylabs Realtime Scraper API
  • Google Shopping via Serper.dev
  • 5 platforms queried in parallel
  • Results scored and ranked in under 3s
  • Dynamic price and availability data

2. Data Pipeline

The offline data pipeline was built to collect, clean, and transform a large dataset of electronics products for model training.

01

Web Scraping

Automated collection of 50,000+ electronics product listings from Amazon and Flipkart spanning 10 categories including smartphones, laptops, and audio.

02

Raw Storage

Data stored in structured CSV format with fields: title, price, rating, review_count, category, platform, ASIN/product_id, image_url.

03

Data Cleaning

Remove duplicates, handle null values, normalize price formats (₹ to $), strip HTML from titles, standardize category labels.

04

Feature Engineering

Extract brand names, storage variants, color attributes, and model numbers from title strings using regex and NLP rules.
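
A minimal sketch of this kind of rule-based extraction. The `BRANDS` and `COLORS` vocabularies, both regex patterns, and the `extract_features` name are illustrative assumptions; the project's actual rules are not shown in this document. The model-number pattern only covers hyphenated codes such as SM-S918B.

```python
import re

# Hypothetical vocabularies; the real project lists are not documented here.
BRANDS = ["samsung", "apple", "sony", "oneplus", "hp", "dell"]
COLORS = ["black", "white", "blue", "silver", "green", "red"]

# Storage variants such as "128GB", "1 TB", "512 gb"
STORAGE_RE = re.compile(r"\b(\d+)\s?(gb|tb)\b", re.IGNORECASE)
# Hyphenated model codes that contain at least one digit after the hyphen
MODEL_RE = re.compile(r"\b[A-Za-z]{2,}-[A-Za-z0-9]*\d[A-Za-z0-9]*\b")


def extract_features(title: str) -> dict:
    """Pull brand, storage, color and model number out of a product title."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    brand = next((b for b in BRANDS if b in words), None)
    color = next((c for c in COLORS if c in words), None)

    storage_match = STORAGE_RE.search(title)
    storage = (
        f"{storage_match.group(1)}{storage_match.group(2).upper()}"
        if storage_match
        else None
    )

    model_match = MODEL_RE.search(title)
    model = model_match.group(0) if model_match else None

    return {"brand": brand, "storage": storage, "color": color, "model": model}


features = extract_features("Samsung Galaxy S23 Ultra SM-S918B 256 GB Phantom Black")
print(features)
```

Matching brands and colors against whole tokens rather than substrings avoids false hits such as "hp" inside "headphones".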

05

Vectorization

Apply TF-IDF vectorization to cleaned product titles, generating sparse feature matrices for similarity computation.

06

Export

Clean dataset exported as processed CSV and pickle files for use in model training and the similarity engine.

# Simplified data pipeline (Python)
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Load raw scraped data
df = pd.read_csv('raw_electronics.csv')

# Clean titles: lowercase, keep only letters, digits and spaces
df['clean_title'] = (
    df['title']
    .str.lower()
    .str.replace(r'[^a-z0-9 ]', '', regex=True)
    .str.strip()
)

# Remove exact duplicates of the cleaned title within each platform
df = df.drop_duplicates(subset=['clean_title', 'platform'])

# Normalize prices (normalize_price is a project helper, not shown here)
df['price_usd'] = df['price'].apply(normalize_price)

# TF-IDF vectorization over unigrams and bigrams
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
tfidf_matrix = vectorizer.fit_transform(df['clean_title'])
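
The `normalize_price` helper called in the pipeline is not shown. A minimal sketch of what it might look like, assuming raw prices arrive as strings such as "₹1,29,999" or "$1,299.00" and that rupee amounts are converted at a fixed rate. The `INR_PER_USD` value is a placeholder, not the rate used by the project.

```python
import math
import re

# Placeholder exchange rate (assumption); replace with the rate used for the dataset.
INR_PER_USD = 83.0

# First number in the string, with optional thousands separators and decimals
AMOUNT_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")
# Markers that identify a rupee price
RUPEE_RE = re.compile(r"₹|\brs\b|\binr\b", re.IGNORECASE)


def normalize_price(raw) -> float:
    """Parse a scraped price and return it in USD, or NaN if it cannot be parsed."""
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return math.nan

    text = str(raw).strip()
    match = AMOUNT_RE.search(text)
    if match is None:
        return math.nan

    amount = float(match.group(0).replace(",", ""))
    if RUPEE_RE.search(text):
        amount /= INR_PER_USD
    return round(amount, 2)


print(normalize_price("$1,299.00"))
print(normalize_price("₹8,300"))
print(normalize_price("Currently unavailable"))
```

Returning NaN instead of raising keeps `df['price'].apply(normalize_price)` from failing on a single malformed row; such rows can then be dropped or imputed in the null-handling step.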

3. AI / ML Model

The AI model serves two purposes: cross-platform product matching (deduplication) and validation of the Buy Score signal weights.

Cosine Similarity Matching

  • TF-IDF vectors of product titles
  • Cosine similarity threshold: 0.85
  • Matches same product across platforms
  • 94% accuracy on held-out test set
  • Used for real-time deduplication

Buy Score Validation

  • Correlation analysis on 50K products
  • Rating weight: 40pts (strongest signal)
  • Review volume (log-scaled): 30pts
  • Price-value (relative, lower=better): 30pts
  • Validated against user purchase intent data
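
# Cosine similarity matching
import math

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def find_matches(query_title, product_matrix, vectorizer, threshold=0.85):
    # Project the query into the same TF-IDF space as the product matrix
    query_vec = vectorizer.transform([query_title])
    similarities = cosine_similarity(query_vec, product_matrix).flatten()
    # Keep the row indices whose similarity meets the threshold
    matches = np.where(similarities >= threshold)[0]
    return matches, similarities[matches]

# Buy Score calculation
def buy_score(price, rating, reviews, min_p, max_p):
    # Rating: linear in the 0-5 star rating, up to 40 points
    r_score  = (rating / 5.0) * 40
    # Review volume: log-scaled, capped at 30 points (reached at ~100,000 reviews)
    rv_score = min(math.log10(reviews + 1) / math.log10(100_000) * 30, 30)
    # Price value: cheapest result gets 30, most expensive 0; 15 if all prices are equal
    p_score  = (1 - (price - min_p) / (max_p - min_p)) * 30 if max_p > min_p else 15
    return round(r_score + rv_score + p_score, 1)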
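
To make the similarity threshold concrete, the sketch below reproduces the TF-IDF and cosine computation on a toy example using only the standard library. It is a simplified illustration, not the project's matching code: it uses unigrams only (the pipeline uses unigrams and bigrams), and the sample titles are made up. The IDF formula follows scikit-learn's default smoothed form, ln((1 + n) / (1 + df)) + 1.

```python
import math
from collections import Counter


def tfidf_vectors(titles):
    """Return one {term: weight} dict per title, using tf * smoothed idf."""
    docs = [title.lower().split() for title in titles]
    n_docs = len(docs)
    # Document frequency: in how many titles each term appears
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


titles = [
    "apple iphone 15 128gb black",   # listing on platform A
    "apple iphone 15 128gb black",   # same product on platform B
    "apple iphone 15 128gb blue",    # different color variant
    "sony wh-1000xm5 headphones",    # unrelated product
]
vectors = tfidf_vectors(titles)
for i in range(1, len(titles)):
    print(f"{titles[i]!r}: {cosine(vectors[0], vectors[i]):.3f}")
```

Identical titles score 1.0 and titles with no shared terms score 0.0; near-duplicates such as color variants fall in between, which is the range where the choice of threshold decides whether two listings are merged.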

4. System Architecture

API Layer (Next.js)

  • /api/search — main endpoint
  • Parallel fetch via Promise.all
  • 40s per-platform timeout (15s for Serper.dev)
  • Score calculation & sort
  • JSON response

Data Sources

  • Oxylabs — Amazon, Flipkart, Best Buy, eBay
  • Serper.dev — Google Shopping
  • Parsed product objects
  • Unified Product type
  • Error isolation per platform

Frontend (Next.js 14)

  • App Router with RSC
  • Client search page
  • Framer Motion animations
  • Tailwind CSS + ShadCN
  • Responsive mobile-first
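
// Simplified API route flow
import { NextRequest, NextResponse } from "next/server";

export async function GET(req: NextRequest) {
  const q = req.nextUrl.searchParams.get("q");
  if (!q) {
    return NextResponse.json({ error: "Missing query parameter: q" }, { status: 400 });
  }

  // 1. Fetch all sources in parallel; each enforces its own timeout
  //    (40s per Oxylabs platform, 15s for Serper)
  const [oxyResult, serperResult] = await Promise.all([
    fetchAllPlatforms(q),    // Amazon + Flipkart + Best Buy + eBay
    fetchSerperShopping(q),  // Google Shopping (with images)
  ]);

  // 2. Merge, score and sort
  const all    = [...oxyResult.products, ...serperResult.products];
  const scored = calculateScores(all).sort((a, b) => b.score - a.score);

  // 3. Collect per-platform errors (assumes each result exposes an `errors` array)
  const errors = [...oxyResult.errors, ...serperResult.errors];

  return NextResponse.json({ products: scored, errors, total: scored.length });
}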
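
The combination of parallel fetching, a timeout per source, and error isolation per platform can be illustrated in Python with `asyncio`. This is a sketch of the pattern only, not the Next.js route itself: `fake_fetch` stands in for a real scraper call, and the delays and timeouts are shortened so the example finishes quickly.

```python
import asyncio


async def fake_fetch(platform: str, delay: float) -> list:
    """Stand-in for a real scraper call: waits, then returns one product."""
    await asyncio.sleep(delay)
    if platform == "ebay":
        raise RuntimeError("upstream error")  # simulate a failing platform
    return [{"platform": platform, "title": f"{platform} result"}]


async def fetch_isolated(platform: str, delay: float, timeout: float) -> dict:
    """Run one fetch under its own timeout; report failures instead of raising."""
    try:
        products = await asyncio.wait_for(fake_fetch(platform, delay), timeout)
        return {"platform": platform, "products": products, "error": None}
    except asyncio.TimeoutError:
        return {"platform": platform, "products": [], "error": "timeout"}
    except Exception as exc:
        return {"platform": platform, "products": [], "error": str(exc)}


async def search_all():
    jobs = [
        fetch_isolated("amazon", delay=0.01, timeout=0.2),
        fetch_isolated("flipkart", delay=0.5, timeout=0.05),  # exceeds its timeout
        fetch_isolated("ebay", delay=0.01, timeout=0.2),      # raises an error
    ]
    results = await asyncio.gather(*jobs)  # all jobs run concurrently
    products = [p for r in results for p in r["products"]]
    errors = {r["platform"]: r["error"] for r in results if r["error"]}
    return products, errors


products, errors = asyncio.run(search_all())
print(products)
print(errors)
```

Because every job catches its own exceptions, one slow or failing platform cannot reject the whole batch; the response still carries the products from the sources that succeeded, plus a record of what went wrong.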

5. Buy Score Algorithm

Every product receives a Buy Score (0–100) calculated from three signals validated against our training dataset.

Signal          Weight    Formula
Star Rating     40 pts    (rating / 5.0) × 40
Review Volume   30 pts    log10(reviews+1) / log10(100k) × 30
Price Value     30 pts    (1 – (price – minP) / (maxP – minP)) × 30

  • 72–100: Excellent Buy
  • 45–71: Good Value
  • 0–44: Consider Options
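
A worked example of the score and its band, as a minimal sketch. `buy_score` repeats the formula from section 3; `score_band` is a hypothetical helper, and treating fractional scores between two bands (for example 71.5) as belonging to the lower band is an assumption.

```python
import math


def buy_score(price, rating, reviews, min_p, max_p):
    """Buy Score (0-100) from rating, review volume and relative price."""
    r_score = (rating / 5.0) * 40
    rv_score = min(math.log10(reviews + 1) / math.log10(100_000) * 30, 30)
    p_score = (1 - (price - min_p) / (max_p - min_p)) * 30 if max_p > min_p else 15
    return round(r_score + rv_score + p_score, 1)


def score_band(score):
    """Map a Buy Score to its label (hypothetical helper; thresholds from the bands above)."""
    if score >= 72:
        return "Excellent Buy"
    if score >= 45:
        return "Good Value"
    return "Consider Options"


# A $500 product rated 4.5 stars with 9,999 reviews,
# in a result set whose prices range from $400 to $800:
#   rating:  4.5 / 5 * 40                   = 36.0
#   reviews: log10(10,000) / log10(100,000) = 4/5, times 30 = 24.0
#   price:   (1 - (500 - 400) / (800 - 400)) * 30           = 22.5
score = buy_score(price=500, rating=4.5, reviews=9_999, min_p=400, max_p=800)
print(score, score_band(score))
```

The price term is relative to the current result set, so the same product can score differently from one search to the next depending on which other listings are returned.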

6. API Integrations

Oxylabs Realtime Scraper

  • amazon_search — geo: 10001 (NYC zip)
  • flipkart_search — geo: India
  • bestbuy_search — geo: 10001
  • ebay_search — geo: 10001
  • parse: true for structured JSON
  • 40s per-platform AbortController timeout
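
The settings above translate into one request payload per platform. The sketch below only builds the payloads and sends nothing. The source names and geo values come from the list above; the endpoint URL and the `source`, `query`, `geo_location`, and `parse` field names reflect Oxylabs' request format as understood here and should be checked against the current Oxylabs documentation. `PLATFORM_SOURCES` and `build_oxylabs_payload` are hypothetical names.

```python
import json

# Assumed Realtime endpoint; verify against the Oxylabs documentation.
OXYLABS_ENDPOINT = "https://realtime.oxylabs.io/v1/queries"

# platform -> (source name, geo setting), taken from the list above
PLATFORM_SOURCES = {
    "amazon": ("amazon_search", "10001"),
    "flipkart": ("flipkart_search", "India"),
    "bestbuy": ("bestbuy_search", "10001"),
    "ebay": ("ebay_search", "10001"),
}


def build_oxylabs_payload(platform: str, query: str) -> dict:
    """Build the JSON body for one platform search (hypothetical helper)."""
    source, geo = PLATFORM_SOURCES[platform]
    return {
        "source": source,
        "query": query,
        "geo_location": geo,
        "parse": True,  # ask for structured JSON instead of raw HTML
    }


payloads = {name: build_oxylabs_payload(name, "usb c hub") for name in PLATFORM_SOURCES}
print(json.dumps(payloads["amazon"], indent=2))
```

Keeping the platform settings in one table means adding or removing a platform is a one-line change, and each payload can be sent as an independent request with its own timeout.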

Serper.dev Shopping API

  • POST /shopping endpoint
  • Returns imageUrl, price, rating, ratingCount
  • X-API-KEY header authentication
  • gl=us, hl=en, num=10
  • Free tier: 2,500 queries/month
  • 15s timeout with AbortController
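
A minimal sketch of the request and of mapping one result onto a unified product shape. Nothing is sent over the network. The URL assumes the `/shopping` endpoint lives on `google.serper.dev`; the `title` response field and the keys of the unified product dict are assumptions, while `imageUrl`, `price`, `rating`, and `ratingCount` come from the list above. `build_serper_request` and `to_product` are hypothetical names.

```python
import json

# Assumed host for the /shopping endpoint; verify against the Serper.dev documentation.
SERPER_SHOPPING_URL = "https://google.serper.dev/shopping"


def build_serper_request(query: str, api_key: str):
    """Return (url, headers, body) for a Google Shopping search."""
    headers = {"X-API-KEY": api_key, "Content-Type": "application/json"}
    body = json.dumps({"q": query, "gl": "us", "hl": "en", "num": 10})
    return SERPER_SHOPPING_URL, headers, body


def to_product(item: dict) -> dict:
    """Map one shopping result onto a unified product dict (assumed shape)."""
    return {
        "platform": "google_shopping",
        "title": item.get("title"),  # field name is an assumption
        "price": item.get("price"),
        "rating": item.get("rating"),
        "reviews": item.get("ratingCount"),
        "image_url": item.get("imageUrl"),
    }


url, headers, body = build_serper_request("noise cancelling headphones", "YOUR_API_KEY")
sample = {"title": "Example headphones", "price": "$199.99", "rating": 4.6,
          "ratingCount": 1200, "imageUrl": "https://example.com/img.jpg"}
print(url, json.loads(body))
print(to_product(sample))
```

Using `.get` for every field lets a result with a missing rating or image still produce a product, with `None` in place of the absent value.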
