Linpeng Tang

Data-Centric AI / AI Infra / Agentic RL
  • 💡 Data&AI Systems Expert: Billion-scale systems
    for biometrics, industrial AI, and AI4S
  • 🎓 Princeton CS PhD | Meta (Facebook) Systems | Moqi Co-founder & CTO

Hi, I am an AI researcher and engineer at the Shanghai Institute of Advanced Algorithms Research, focusing on the intersection of LLMs and data systems (Data-Centric AI).

I have long worked on the deep integration of AI and systems: from building massive-scale multimedia distribution systems at Meta (Facebook) and billion-scale biometric recognition systems to developing the MyScaleDB AI database and data infrastructure for LLMs and AI4S. Currently, I focus on using AI and data infrastructure together with Agentic RL to build AI data flywheels, with the goal of creating intelligent systems that can be reliably deployed in high-value, complex real-world scenarios.

I hold a Ph.D. in Computer Science from Princeton University, advised by Prof. Kai Li. My work has been recognized with honors including the WAIC (World AI Conference) SAIL Award and first place in KDDCup.

Technical Thoughts

  • The Physical World Data Flywheel: Hierarchical Vision AI System Design and Optimization
  • Modeling Physical AI: From VLA to World Model
  • Agentic RL (Part III): Architecture Analysis of Verl and SkyRL to Retool-RL Case Practice
  • From Data Processing to the Experience Flywheel: The Next Stage for LLM Data Engineering
  • Agentic RL (Part II): RL Systems for Real-World Tasks
  • Agentic RL (Part I): A New Paradigm for Self-Evolving LLMs

Selected Recognition

  • 🏆 WAIC (World AI Conference) SAIL Award, 2024
  • 🥇 First Prize, HICOOL Global Entrepreneur Summit, 2022
  • ⚙️ Flagship Systems: Data-Centric AI Platform, MyScale AI Database,
    Billion-scale national fingerprint database
  • 📚 Top-tier Publications: NSDI, KDD, FAST, CIKM Best Student Paper,
    KDDCup 1st Place

Experience

Institute of Advanced Algorithms Research
Data&AI Center | 2024 – Present

Moqi Technology
Co-founder & CTO | 2016 – 2024

Meta
Research Consultant | 2013 – 2016

HP Labs Beijing
Research Intern | 2011 – 2012

Education

Princeton University
Ph.D. in Computer Science | 2012 – 2018
Advisor: Prof. Kai Li (Member of National Academy of Engineering)

Shanghai Jiao Tong University
B.S. in Computer Science, ACM Class | 2008 – 2012

Products & Projects

Data-Centric AI Platform

2024 – Present

  • Led overall product architecture design and key project delivery, guiding the team to build a new generation of Agentic LLM & Data infrastructure.
  • Pioneered a multimodal data intelligence pipeline system built on agents and the DataFlow data preparation framework. With 150+ built-in intelligent operators, it supports automated pipeline orchestration through natural-language conversation, enabling efficient and flexible processing of massive heterogeneous data.
  • To address the high-risk hallucination problem of LLMs in scientific and industrial scenarios, built a high-fidelity data synthesis and feedback system based on multi-tier verifiers (including rule filtering, knowledge graphs, and simulations).
  • Disrupted the traditional data engineering paradigm, which consumes 90% of human effort, significantly lowering the barrier to producing AI-ready datasets. Deployed in multiple benchmark scenarios, including industrial manufacturing, multimodal corpus management, and scientific corpora, substantially reducing the cost for enterprises to build specialized agents and LLMs.

MyScaleDB AI Database

2020 – Present

  • Responsible for defining the product's technical architecture and leading core vector search algorithm design and core engine R&D, creating a world-leading open-source AI database system.
  • Pioneered the concept of an AI database in the industry, achieving integrated management and joint retrieval of PB-level structured and unstructured data (vectors, graphs, text, spatio-temporal, etc.) within a single SQL kernel built on a columnar data engine.
  • Developed the MSTG vector engine in-house and combined it with a high-performance NVMe SSD memory caching mechanism for software-hardware co-optimization. Achieved a 10x increase in vector data storage density while maintaining millisecond-level latency for complex joint queries.
  • Deployed in large-scale knowledge base construction for industrial manufacturing, AI for Science, and financial decision support, providing strong cost-effectiveness for massive corpora, and widely used in a global SaaS.

Contactless Fingerprint & Palmprint Capture Device

2018 – 2022

  • Led the product definition of the world's first large-area, high-quality contactless fingerprint and palmprint capture terminal. Guided the team to overcome core technical challenges such as 3D reconstruction and complex optical image enhancement.
  • Combined binocular vision with an in-house structured light system to achieve sub-millimeter 3D reconstruction of fingers. Introduced multi-source, multi-band optical designs and deep learning image enhancement algorithms, substantially reducing the effect of ambient light interference.
  • Resolved the pain points and technical bottlenecks of traditional contact-based capture, launching contactless capture terminals and driving a generational upgrade in security biometric capture hardware.

Massive Fingerprint Identification System

2015 – 2022

  • Responsible for core system architecture design and deep learning model R&D for massive fingerprint and palmprint matching.
  • Pioneered a multi-scale vector representation scheme and introduced an Active Deep Learning mechanism to drive model self-optimization and iteration. With joint CPU and GPU acceleration, overcame the technical bottleneck of indexing multi-scale features at the 100-billion scale.
  • Improved the speed, accuracy, and automation of massive complex biometric feature retrieval by over 100 times. Successfully deployed at the National Fingerprint Center, generating significant social impact.

Video Popularity Prediction System

2015 – 2017

  • Responsible for high-performance algorithm design and implementation for Facebook's massive-scale video traffic trend prediction.
  • Developed a high-performance time-series probabilistic prediction model and tightly integrated it with the underlying video compression strategy flow and real-time cache scheduling pipeline.
  • Achieved accurate real-time prediction of large-scale video popularity, improving prediction accuracy by over 10%. This enabled Facebook to adopt smarter video compression and conversion schemes and more efficient cache scheduling, reducing system resource consumption while improving the viewing experience on the platform.

RIPQ Caching System

2013 – 2015

  • Responsible for core algorithm design and system implementation of a large-scale cache scheduling system based on SSD storage.
  • Pioneered the Restricted Insertion Priority Queue (RIPQ) caching algorithm, which addresses at a fundamental level the write amplification from non-sequential writes and the sharp performance drops that traditional cache eviction mechanisms suffer on Solid State Drives (SSDs).
  • Built a next-generation caching system with extremely low write amplification and high throughput. Deployed in Facebook's global CDN edge nodes and core caching systems, increasing cache hit rates by over 20% in large-scale concurrent environments, reducing network request latency, and saving substantial bandwidth costs.