ESP32 AI Object Detector

Category: AI | Levels: Beginner, Intermediate, Advanced

Run machine learning inference directly on the ESP32 to detect objects in a live camera feed. Start with simple colour-based detection, progress to TensorFlow Lite Micro person detection, and finish with a multi-class detector publishing JSON alerts to MQTT.

Overview

In this beginner project you will use the ESP32-CAM to capture frames and detect objects by colour using pixel intensity thresholding in the RGB565 frame buffer. A red LED illuminates when a high percentage of the frame contains pixels in a target colour range (for example, orange for fire or green for a plant). No ML model is needed; this demonstrates the fundamental principle of image-based detection before moving to neural networks in higher levels.

Components
  • 1× ESP32-CAM module (AI Thinker) — OV2640 camera; PSRAM built-in
  • 1× FTDI USB-to-Serial adapter — 3.3 V; used for programming and Serial Monitor
  • 1× Red LED and 220 ohm resistor — Detection indicator
  • 1× Jumper wires
Wiring
Component Pin          | ESP32 Pin      | Notes
FTDI TX                | GPIO 3 (U0RXD) | Programming and Serial
FTDI RX                | GPIO 1 (U0TXD) |
FTDI GND               | GND            |
FTDI 3.3 V             | 3.3 V          | Use 5 V pin if module requires it
GPIO 0 to GND          | GPIO 0         | Hold LOW during upload; remove after
Red LED anode via 220R | GPIO 33        | Onboard LED on AI Thinker
Arduino Code
esp32-ai-object-detector_beginner.ino
// ESP32-CAM AI Object Detector - Beginner
// Colour blob detection using RGB565 pixel thresholding
// No ML model required — demonstrates image processing fundamentals

#include "esp_camera.h"

// AI Thinker ESP32-CAM pin map
#define PWDN_GPIO 32
#define RESET_GPIO -1
#define XCLK_GPIO 0
#define SIOD_GPIO 26
#define SIOC_GPIO 27
#define Y9_GPIO 35
#define Y8_GPIO 34
#define Y7_GPIO 39
#define Y6_GPIO 36
#define Y5_GPIO 21
#define Y4_GPIO 19
#define Y3_GPIO 18
#define Y2_GPIO 5
#define VSYNC_GPIO 25
#define HREF_GPIO 23
#define PCLK_GPIO 22

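// GPIO 33 drives the onboard red LED on the AI Thinker board, which is
// wired active LOW; invert the HIGH/LOW in loop() if you rely on it.
const int LED_PIN = 33;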
// Detection threshold: fraction of frame pixels in target range
const float DETECT_THRESHOLD = 0.05; // 5 percent of frame

void initCamera() {
  camera_config_t config = {}; // zero-initialise so unset fields are not garbage
  config.ledc_channel = LEDC_CHANNEL_0;
  config.ledc_timer   = LEDC_TIMER_0;
  config.pin_d0 = Y2_GPIO; config.pin_d1 = Y3_GPIO;
  config.pin_d2 = Y4_GPIO; config.pin_d3 = Y5_GPIO;
  config.pin_d4 = Y6_GPIO; config.pin_d5 = Y7_GPIO;
  config.pin_d6 = Y8_GPIO; config.pin_d7 = Y9_GPIO;
  config.pin_xclk  = XCLK_GPIO; config.pin_pclk  = PCLK_GPIO;
  config.pin_vsync = VSYNC_GPIO; config.pin_href  = HREF_GPIO;
  config.pin_sscb_sda = SIOD_GPIO; config.pin_sscb_scl = SIOC_GPIO;
  config.pin_pwdn  = PWDN_GPIO;  config.pin_reset = RESET_GPIO;
  config.xclk_freq_hz = 20000000;
  config.pixel_format = PIXFORMAT_RGB565;
  config.frame_size   = FRAMESIZE_QVGA; // 320x240
  config.jpeg_quality = 12;
  config.fb_count     = 1;
  esp_err_t err = esp_camera_init(&config);
  if (err != ESP_OK) {
    Serial.printf("Camera init failed: 0x%x\n", err);
  }
}

// Count pixels where red channel dominates (simple fire/orange detection)
float detectOrange(camera_fb_t *fb) {
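  const uint8_t *bytes = fb->buf;
  size_t total = fb->width * fb->height;
  size_t matches = 0;
  for (size_t i = 0; i < total; i++) {
    // Assumption: the esp32-camera driver stores RGB565 high byte first,
    // so build each pixel from two bytes rather than casting to uint16_t*.
    // If colours look wrong on your module, swap the two bytes.
    uint16_t px = (uint16_t)((bytes[2 * i] << 8) | bytes[2 * i + 1]);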
    // RGB565: R=bits 15-11, G=bits 10-5, B=bits 4-0
    uint8_t r = (px >> 11) & 0x1F;
    uint8_t g = (px >> 5)  & 0x3F;
    uint8_t b =  px        & 0x1F;
    // Orange: high red, moderate green, low blue (scaled to 5/6-bit)
    if (r > 20 && g > 15 && g < 40 && b < 8) matches++;
  }
  return (float)matches / total;
}

void setup() {
  Serial.begin(115200);
  pinMode(LED_PIN, OUTPUT);
  initCamera();
  Serial.println("Colour detector ready.");
}

void loop() {
  camera_fb_t *fb = esp_camera_fb_get();
  if (!fb) { delay(100); return; }
  float ratio = detectOrange(fb);
  esp_camera_fb_return(fb);
  bool detected = ratio > DETECT_THRESHOLD;
  digitalWrite(LED_PIN, detected ? HIGH : LOW);
  Serial.printf("Orange ratio: %.3f  Detected: %s\n",
    ratio, detected ? "YES" : "no");
  delay(200);
}
How It Works
01

RGB565 Pixel Format: The OV2640 outputs frames in RGB565 format: 16 bits per pixel with 5 bits red, 6 bits green, and 5 bits blue packed into a 16-bit integer. Extracting each channel requires bit-shifting and masking: red = (pixel >> 11) & 0x1F, green = (pixel >> 5) & 0x3F, blue = pixel & 0x1F.
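
The bit layout above can be sketched as a small helper that unpacks one pixel and scales each channel to the familiar 0-255 range. This is an illustrative sketch, not part of the project sketch: the Rgb888 struct and the unpackRgb565() name are hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper type: one pixel expanded to 8 bits per channel.
struct Rgb888 {
  uint8_t r, g, b;
};

// Unpack a 16-bit RGB565 value and scale each channel to 0-255.
Rgb888 unpackRgb565(uint16_t px) {
  uint8_t r5 = (px >> 11) & 0x1F;  // 5-bit red, 0-31
  uint8_t g6 = (px >> 5) & 0x3F;   // 6-bit green, 0-63
  uint8_t b5 = px & 0x1F;          // 5-bit blue, 0-31
  // Replicate the top bits into the low bits so full scale maps to 255.
  Rgb888 out;
  out.r = static_cast<uint8_t>((r5 << 3) | (r5 >> 2));
  out.g = static_cast<uint8_t>((g6 << 2) | (g6 >> 4));
  out.b = static_cast<uint8_t>((b5 << 3) | (b5 >> 2));
  return out;
}
```

Working in 8-bit values makes thresholds easier to compare with colour pickers, at the cost of a few extra shifts per pixel.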

02

Colour Threshold Detection: The detectOrange() function scans all 76,800 pixels of a QVGA (320x240) frame and counts those matching orange colour criteria: high red, moderate green, low blue. Dividing the match count by total pixels gives a normalised detection ratio between 0.0 and 1.0.

03

Detection Threshold Tuning: A 5 percent threshold (0.05) triggers detection when more than 3,840 of the 76,800 pixels are orange-coloured. Adjust DETECT_THRESHOLD based on your scene: lower it to detect small objects, raise it to avoid false positives from ambient orange light or warm-coloured walls.
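
To see what a given threshold means in pixels, the arithmetic can be sketched as below. The minMatchesToTrigger() helper is hypothetical and takes the threshold as a whole-number percentage to avoid floating-point rounding; it assumes the sketch's strict comparison (ratio > threshold).

```cpp
#include <cstddef>

// Smallest number of matching pixels for which matches / total exceeds
// thresholdPercent / 100, given the strict ">" comparison in loop().
size_t minMatchesToTrigger(size_t width, size_t height,
                           unsigned thresholdPercent) {
  size_t total = width * height;
  // ratio > p/100  is equivalent to  matches * 100 > total * p
  return (total * thresholdPercent) / 100 + 1;
}
```

For a QVGA frame at 5 percent this gives 3,841 pixels, one more than 5 percent of 76,800, because a frame with exactly 3,840 matches does not exceed the threshold.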

04

Frame Rate Limitation: Scanning a full QVGA frame takes approximately 5-20 ms depending on ESP32 clock speed. The 200 ms delay caps detection at just under 5 FPS, balancing responsiveness against processor load. Remove the delay for the maximum detection rate.
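
A rough way to reason about the loop rate is to add the fixed delay to the per-frame work. The estimateFps() helper below is a hypothetical back-of-envelope sketch; the work time you pass in is an assumption you would measure on your own board, for example with millis().

```cpp
// Upper bound on loop iterations per second when each pass spends
// delayMs in delay() plus workMs on capture and pixel scanning.
double estimateFps(double delayMs, double workMs) {
  return 1000.0 / (delayMs + workMs);
}
```

With a 200 ms delay and 20 ms of work the estimate is about 4.5 FPS; with the delay removed, the same 20 ms of work would allow up to 50 FPS if the camera can deliver frames that quickly.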

Applications
  • Fire or flame colour detection for early fire warning
  • Green plant health monitoring by colour saturation
  • Fruit ripeness detection by colour change over days
  • Sports ball colour tracking for position analysis
Troubleshooting

Camera init fails with error

Verify the AI Thinker pin map matches your module. Different ESP32-CAM modules (e.g. M5Stack CAM) use different GPIO assignments. Check that GPIO 0 is pulled LOW for upload and released before running the sketch.

False positives in normal lighting

Increase DETECT_THRESHOLD from 0.05 to 0.15 or higher. Also narrow the colour range criteria: increase the red minimum from 20 to 25 and reduce the green and blue maximums.
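
One way to experiment with narrower ranges is to move the limits out of the if statement into a small struct. This is a sketch under assumptions: ColourRange, inRange(), and the ORANGE_STRICT values are hypothetical and only illustrate the tuning direction described above; ORANGE_DEFAULT mirrors the limits used in detectOrange().

```cpp
#include <cstdint>

// Hypothetical colour profile in raw RGB565 channel units
// (red and blue 0-31, green 0-63).
struct ColourRange {
  uint8_t rMin, gMin, gMax, bMax;
};

// True when the pixel lies strictly inside the profile's limits.
bool inRange(uint16_t px, const ColourRange &c) {
  uint8_t r = (px >> 11) & 0x1F;
  uint8_t g = (px >> 5) & 0x3F;
  uint8_t b = px & 0x1F;
  return r > c.rMin && g > c.gMin && g < c.gMax && b < c.bMax;
}

// Same limits as the beginner sketch.
const ColourRange ORANGE_DEFAULT = {20, 15, 40, 8};
// Illustrative stricter profile: higher red minimum, lower green/blue maximums.
const ColourRange ORANGE_STRICT = {25, 15, 32, 5};
```

A dull orange pixel with red 23 passes the default profile but is rejected by the strict one, which is the effect you want when warm walls cause false positives.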

Detection ratio is always 0.000

Confirm PIXFORMAT_RGB565 is set. If PIXFORMAT_JPEG is accidentally used, the raw buffer contains JPEG bytes rather than pixel data and all threshold comparisons will fail.

ESP32-CAM crashes repeatedly

The ESP32-CAM requires a stable 5 V supply at minimum 500 mA. Powering from the FTDI 3.3 V pin is insufficient for the camera module. Use the 5 V pin of the FTDI or a dedicated USB power source.

Upgrades
  • Add multiple colour profiles switchable via a button for different detection targets
  • Add a web endpoint that serves the current frame with coloured pixel overlay
  • Add motion detection by comparing consecutive frames pixel-by-pixel
  • Add a buzzer that sounds when the detection ratio exceeds the threshold
FAQ

You need an ESP32-CAM module (AI Thinker), an FTDI USB-to-Serial adapter, a red LED with a 220 ohm resistor, and jumper wires.

Only the Advanced stage uses Wi-Fi. Beginner and Intermediate builds run offline on the ESP32 with USB power.

Start with Beginner if you are new to AI projects. Use Intermediate for on-device person detection with a relay output, and Advanced for multi-class detection with an MJPEG stream and MQTT alerts.

Overview

The intermediate build runs a TensorFlow Lite Micro MobileNetV1 person-detection model on the ESP32-CAM. The quantised int8 model (approximately 250 KB) is stored in flash and evaluates each QVGA greyscale frame to produce a person-present confidence score. When confidence exceeds 70 percent, a relay triggers an output and the result is printed to Serial. This demonstrates genuine on-device ML inference with no cloud connectivity.

Components
  • 1× ESP32-CAM module — PSRAM required for TFLite model buffers
  • 1× FTDI programmer
  • 1× Relay module — Person-detected output trigger
Wiring
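Component Pin   | ESP32 Pin        | Notes
Camera and FTDI | Same as beginner |
Relay IN        | GPIO 12          | Person detected output. GPIO 12 is a boot strapping pin, so the module may fail to boot if the relay input pulls it HIGH at reset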
Arduino Code
esp32-ai-object-detector_intermediate.ino
// ESP32-CAM AI Object Detector - Intermediate
// TensorFlow Lite Micro person detection
// Board: AI Thinker ESP32-CAM
// Required library: TensorFlow Lite ESP32 by Eloquent Arduino
// Model: person_detect_model_data.h (included in the library examples)

#include "esp_camera.h"
#include <eloquent_tinyml.h>
#include <eloquent_tinyml/tensorflow.h>
// Include the quantised person detection model (from TFLite library examples):
// #include "person_detect_model_data.h"
// For demonstration this sketch shows the integration pattern.
// Flash the model array from the EloquentTinyML library to use.

const int RELAY = 12;
const float CONFIDENCE_THRESHOLD = 0.70f;

// Camera init (same pin map as beginner)
void initCamera(){
  camera_config_t c={};
  c.ledc_channel=LEDC_CHANNEL_0; c.ledc_timer=LEDC_TIMER_0;
  c.pin_d0=5;c.pin_d1=18;c.pin_d2=19;c.pin_d3=21;
  c.pin_d4=36;c.pin_d5=39;c.pin_d6=34;c.pin_d7=35;
  c.pin_xclk=0;c.pin_pclk=22;c.pin_vsync=25;c.pin_href=23;
  c.pin_sscb_sda=26;c.pin_sscb_scl=27;c.pin_reset=-1;c.pin_pwdn=32;
  c.xclk_freq_hz=20000000;
  c.pixel_format=PIXFORMAT_GRAYSCALE;
  c.frame_size=FRAMESIZE_QVGA;
  c.jpeg_quality=12; c.fb_count=1;
  esp_camera_init(&c);
}

// EloquentTinyML wraps TFLite Micro interpreter
// Eloquent::TF::MobileNet<96, 96, 1, 270000> tflite;
// Instantiate with model dimensions: 96x96 greyscale input, 270KB arena

void setup(){
  Serial.begin(115200);
  pinMode(RELAY, OUTPUT); digitalWrite(RELAY, HIGH);
  initCamera();
  // tflite.begin(g_person_detect_model_data);
  // Serial.printf("TFLite begin: %s\n", tflite.isOk() ? "OK" : tflite.exception.toString());
  Serial.println("Person detector ready.");
}

void loop(){
  camera_fb_t *fb = esp_camera_fb_get();
  if (!fb){ delay(100); return; }

  // Resize 320x240 greyscale frame to 96x96 for model input
  // tflite.setInput(fb->buf, fb->len);
  // tflite.run();
  // float personScore  = tflite.output(1);
  // float nopersonScore= tflite.output(0);

  // Placeholder output for illustration:
  float personScore = 0.0f; // replace with tflite.output(1)
  esp_camera_fb_return(fb);

  bool detected = personScore > CONFIDENCE_THRESHOLD;
  digitalWrite(RELAY, detected ? LOW : HIGH);
  Serial.printf("Person confidence: %.2f  Detected: %s\n",
    personScore, detected ? "YES" : "no");
  delay(500);
}
How It Works
01

MobileNetV1 Person Detection Model: The TFLite Micro person detection model is a quantised MobileNetV1 trained on the Visual Wake Words dataset. The int8 quantised version is approximately 250 KB. It takes a 96x96 greyscale image as input and outputs two confidence scores: one for person-present and one for no-person.

02

Frame Resizing to 96x96: The camera captures QVGA (320x240) but the model requires 96x96. A bilinear or nearest-neighbour downscale resizes the greyscale frame before inference. EloquentTinyML handles this resize automatically when the input dimensions are specified in the template parameters.
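
If you need to do the downscale yourself, a nearest-neighbour resize is the simplest option. The sketch below is a minimal illustration, not the library's implementation; resizeNearest() is a hypothetical helper, and it stretches the 4:3 frame to a square without preserving aspect ratio.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Nearest-neighbour downscale of a row-major 8-bit greyscale image.
// Each destination pixel copies the source pixel at the scaled position.
std::vector<uint8_t> resizeNearest(const uint8_t *src, size_t srcW,
                                   size_t srcH, size_t dstW, size_t dstH) {
  std::vector<uint8_t> dst(dstW * dstH);
  for (size_t y = 0; y < dstH; y++) {
    size_t sy = y * srcH / dstH;  // source row for this output row
    for (size_t x = 0; x < dstW; x++) {
      size_t sx = x * srcW / dstW;  // source column for this output column
      dst[y * dstW + x] = src[sy * srcW + sx];
    }
  }
  return dst;
}
```

On the ESP32-CAM you would call it as resizeNearest(fb->buf, fb->width, fb->height, 96, 96), which yields the 9,216 bytes a 96x96 greyscale input expects. Cropping a centred 240x240 square first would avoid the horizontal squeeze.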

03

TFLite Micro Inference: The TFLite Micro interpreter allocates a tensor arena in PSRAM (270 KB for this model). Each inference call copies the input frame to the input tensor, runs the MobileNetV1 convolution layers, and fills the output tensor with two softmax confidence scores summing to 1.0.
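
With an int8 model the output tensor holds quantised integers rather than floats, so a conversion step sits between the tensor and the confidence score. A minimal sketch, assuming the standard TFLite affine quantisation real = scale * (q - zero_point); dequantise() is a hypothetical helper, and the scale and zero point should be read from your model's output tensor rather than hard-coded.

```cpp
#include <cstdint>

// Convert one quantised int8 output value to a real-valued score.
float dequantise(int8_t q, float scale, int zeroPoint) {
  return scale * static_cast<float>(static_cast<int>(q) - zeroPoint);
}
```

Assuming a scale of 1/256 and a zero point of -128 (typical for an int8 softmax output, but an assumption here), a raw value of 52 maps to 180/256, about 0.70, which is right at the relay threshold. Wrapper libraries may perform this conversion for you.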

04

Confidence Threshold Output: A 70 percent person confidence threshold balances sensitivity and false positive rate. Lower thresholds detect more people but trigger more false alarms; higher thresholds miss detections in challenging lighting but produce fewer false alarms. Tune based on acceptable trade-off for the application.

Applications
  • Occupancy detection for lighting and HVAC automation
  • Security camera person alert without cloud processing
  • People counter for retail footfall analytics
  • Hands-free faucet or soap dispenser activation
Troubleshooting

TFLite begin returns out of memory error

The ESP32-CAM with 4 MB PSRAM is required. Ensure PSRAM is enabled in the Arduino board settings. Reduce tensor arena size if using an ESP32 module without external PSRAM; some smaller models fit in 100 KB internal SRAM.

Person confidence is always near 0.5 (random)

The model requires greyscale input (PIXFORMAT_GRAYSCALE). If RGB or JPEG data is fed to the model, output is meaningless. Verify the pixel format and confirm the frame buffer contains actual greyscale intensity values.

Inference is very slow (5-10 seconds per frame)

TFLite Micro on the ESP32 without acceleration takes 400-800 ms per inference. Enable ESP-NN (Espressif's optimised neural network kernels) through Espressif's esp-tflite-micro port rather than plain TFLite Micro; this can reduce inference time to roughly 50-150 ms.

Upgrades
  • Replace MobileNetV1 with MobileNetV2 for improved accuracy at similar model size
  • Add face detection running after person detection for a two-stage pipeline
  • Add a web MJPEG stream showing bounding box overlays on the detected person
  • Add MQTT publishing of detection events with timestamp and confidence score

Overview

The advanced build runs multi-class object detection using the ESP-DL framework with a COCO-subset MobileNet SSD model. Detected object labels and bounding boxes are overlaid on a JPEG frame and served via a web MJPEG stream. Each detection event is also published as a JSON MQTT message containing object class, confidence, and bounding box coordinates, enabling integration with Node-RED automation rules.

Components
  • 1× ESP32-S3 DevKit with PSRAM — ESP32-S3 runs ESP-DL faster than ESP32 due to AI acceleration instructions
  • 1× OV2640 camera module — Standard ESP32-CAM or DVP camera connector
  • 1× MQTT broker
  • 1× Node-RED — Automation based on detected objects
Wiring
Component Pin | ESP32 Pin                       | Notes
Camera module | DVP connector on ESP32-S3 board | Consult specific ESP32-S3 board pinout
Arduino Code
esp32-ai-object-detector_advanced.ino
// ESP32-S3 AI Object Detector - Advanced (ESP-DL multi-class + MJPEG + MQTT)
// ESP-DL: https://github.com/espressif/esp-dl
// Requires ESP-IDF 5.0+ and ESP32-S3 with PSRAM
// This sketch shows the integration pattern using ESP-DL C++ API

#include <WiFi.h>
#include <PubSubClient.h>
#include <ArduinoJson.h>
// ESP-DL includes (install via IDF component manager):
// #include "dl_image.hpp"
// #include "human_face_detect_msr01.hpp"  // or object detect model
// #include "object_detect_picobdet_o1.hpp" // PicoDet COCO 80-class model

WiFiClient wifiClient; PubSubClient mqtt(wifiClient);
const char* SSID="YourSSID", *PASS="YourPass";
const char* MQTT_HOST="192.168.1.100";

// Simulated detection result struct — replace with ESP-DL output in IDF project
struct Detection {
  int category; float confidence;
  int x, y, w, h;
};

const char* LABELS[] = {
  "person","bicycle","car","motorcycle","airplane","bus","train","truck",
  "boat","traffic light","fire hydrant","stop sign"
};

void publishDetection(const Detection &d){
  StaticJsonDocument<192> doc;
  // Guard both ends of the range so a negative category cannot index LABELS
  doc["label"]      = (d.category >= 0 && d.category < 12) ? LABELS[d.category] : "object";
  doc["confidence"] = d.confidence;
  doc["x"] = d.x; doc["y"] = d.y;
  doc["w"] = d.w; doc["h"] = d.h;
  char buf[192]; serializeJson(doc,buf);
  mqtt.publish("ai/detection",buf);
  Serial.printf("Detected: %s (%.0f%%)\n",
    doc["label"].as<const char*>(), d.confidence*100);
}

void setup(){
  Serial.begin(115200);
  WiFi.begin(SSID,PASS);
  while(WiFi.status()!=WL_CONNECTED) delay(500);
  mqtt.setServer(MQTT_HOST,1883);
  // In full ESP-IDF project:
  // camera_init(...);
  // ObjectDetector detector;
  // detector.load_model(picoDet_model_data);
  Serial.printf("AI detector ready. IP: %s\n",WiFi.localIP().toString().c_str());
}

void loop(){
  if(!mqtt.connected()) mqtt.connect("AIDetector");
  mqtt.loop();
  // In full ESP-IDF project:
  // camera_fb_t *fb = esp_camera_fb_get();
  // auto results = detector.run((uint8_t*)fb->buf, fb->width, fb->height);
  // for (auto &r : results) {
  //   Detection d = {r.category, r.score, r.x, r.y, r.w, r.h};
  //   publishDetection(d);
  // }
  // esp_camera_fb_return(fb);
  delay(500);
}
How It Works
01

ESP-DL Framework: ESP-DL is Espressif's hardware-accelerated deep learning library for ESP32-S3 and ESP32-P4. On the ESP32-S3 it uses the vector extensions of the Xtensa LX7 CPU to accelerate 8-bit integer matrix multiply operations. PicoDet-XS achieves 30+ FPS on QVGA frames on the ESP32-S3 versus 2-3 FPS on standard TFLite Micro.

02

PicoDet COCO Object Detection: PicoDet is a lightweight real-time object detection model from Baidu. The XS variant (anchor-free, 0.7 M parameters) detects 80 COCO object categories and outputs class ID, confidence score, and bounding box (x, y, width, height) for each detected object above the threshold.
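
The "above the threshold" step can be sketched as a simple filter over the detector's results. This is an illustration only: the Detection struct mirrors the one in the sketch above, and filterByConfidence() is a hypothetical helper, not an ESP-DL call.

```cpp
#include <vector>

// Same fields as the Detection struct in the advanced sketch.
struct Detection {
  int category;
  float confidence;
  int x, y, w, h;
};

// Keep only detections whose confidence is at least minConfidence.
std::vector<Detection> filterByConfidence(const std::vector<Detection> &in,
                                          float minConfidence) {
  std::vector<Detection> out;
  for (const Detection &d : in) {
    if (d.confidence >= minConfidence) out.push_back(d);
  }
  return out;
}
```

Running the filter before publishing or drawing keeps low-confidence boxes out of both the MQTT topic and the overlay.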

03

MJPEG Bounding Box Overlay: After inference, bounding box rectangles are drawn on the uncompressed frame using simple line-drawing routines. The annotated frame is then JPEG-encoded and served via the web MJPEG stream endpoint. Browsers display the annotated stream at up to 10 FPS with class labels and confidence percentages overlaid.
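
A line-drawing routine for the box outline can be as small as the sketch below. It assumes the frame is available as a row-major RGB565 buffer before JPEG encoding; drawRect() is a hypothetical helper and draws a one-pixel outline only, with no label text.

```cpp
#include <cstdint>
#include <vector>

// Draw a 1-pixel rectangle outline into a row-major RGB565 image.
// The box is clamped to the image, so a partly off-screen box is drawn
// along the image border instead of writing out of bounds.
void drawRect(std::vector<uint16_t> &img, int imgW, int imgH,
              int x, int y, int w, int h, uint16_t colour) {
  if (w <= 0 || h <= 0 || imgW <= 0 || imgH <= 0) return;
  int x0 = x < 0 ? 0 : x;
  int y0 = y < 0 ? 0 : y;
  int x1 = (x + w - 1 >= imgW) ? imgW - 1 : x + w - 1;
  int y1 = (y + h - 1 >= imgH) ? imgH - 1 : y + h - 1;
  if (x0 > x1 || y0 > y1) return;  // box lies entirely outside the image
  for (int cx = x0; cx <= x1; cx++) {  // top and bottom edges
    img[y0 * imgW + cx] = colour;
    img[y1 * imgW + cx] = colour;
  }
  for (int cy = y0; cy <= y1; cy++) {  // left and right edges
    img[cy * imgW + x0] = colour;
    img[cy * imgW + x1] = colour;
  }
}
```

The caller is expected to pass a buffer of exactly imgW * imgH pixels; the clamping protects against boxes that the detector reports partly outside the frame.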

04

MQTT Detection Events: Each unique detection (new object or object type change) publishes a JSON payload to ai/detection. Node-RED subscribes and routes events: "person" triggers a light scene, "car" triggers driveway recording, "dog" triggers an alert to the owner's phone. Confidence threshold filtering prevents false positive event flooding.
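
The "publish only on change" rule can be sketched as a small gate that remembers the last published label. This is one possible interpretation, stated as an assumption: EventGate and shouldPublish() are hypothetical names, and the sketch tracks a single label rather than several simultaneous objects.

```cpp
#include <string>

// Decides whether a detection should be published: it must meet the
// confidence threshold and differ from the last published label.
struct EventGate {
  std::string lastLabel;
  float minConfidence;

  bool shouldPublish(const std::string &label, float confidence) {
    if (confidence < minConfidence) return false;  // too uncertain
    if (label == lastLabel) return false;          // same object type as before
    lastLabel = label;
    return true;
  }
};
```

In the sketch above you would create one gate, for example EventGate gate{"", 0.5f}, and call publishDetection() only when gate.shouldPublish(...) returns true. Clearing lastLabel after a few empty frames would let the same object trigger again when it reappears.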

Applications
  • Smart doorbell with AI object classification (person, car, animal)
  • Retail shelf out-of-stock detection using product category recognition
  • Industrial quality control defect detection on conveyor belts
  • Wildlife camera trap with species identification and counting
Troubleshooting

ESP-DL component not found

ESP-DL requires ESP-IDF and is not directly available in Arduino IDE. Install via the IDF component manager: idf.py add-dependency espressif/esp-dl. Use ESP-IDF project structure rather than Arduino for the advanced build.

Detection FPS drops below 5 on ESP32-S3

Ensure the tensor arena is allocated in PSRAM with heap_caps_malloc(arena_size, MALLOC_CAP_SPIRAM). If the model is small enough, keep its data in internal RAM for fast access (DRAM_ATTR places data in internal DRAM; IRAM_ATTR applies to code, not data). Reduce input resolution from QVGA to QQVGA (160x120) for faster inference at the cost of detection range.

All objects classified as "person"

The model confidence threshold may be too low; all detections above the threshold are reported. Increase the minimum confidence to 0.5 or higher. Also verify the model was loaded correctly; a corrupted model produces random high-confidence outputs for the first class.

Upgrades
  • Add a 7-inch LVGL touch screen for a standalone AI camera display
  • Add Edge Impulse model export for a custom-trained object detector on your own objects
  • Add frame-to-frame tracking to count unique objects passing through the camera view
  • Add SD card video recording triggered by specific detected object classes