Market Report
Product code: 1872684
China Autonomous Driving Data Closed Loop Research Report, 2025
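Key Points
From 2023 to 2025, the share of synthetic data rose from 20%-30% to 50%-60%, making it a core resource for filling in long-tail scenarios.
Toolchains that automate the whole process from collection to deployment are being rolled out step by step, helping to cut costs and improve efficiency.
Efficient collaboration within the vehicle-cloud integrated data closed loop is the key to fast iteration.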
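At its core, the autonomous driving data closed loop is a cyclic optimization system of "collection, transmission, processing, training, and deployment". In 2025 the industry is moving quickly from the "0→1" stage into an era of "high quality and high efficiency", and the core challenges center on long-tail scenario coverage and cost control. OEMs and Tier 1 suppliers are actively building their own data closed-loop solutions. Through efficient data collection, processing, and analysis, they continuously improve autonomous driving algorithms and thereby significantly raise the accuracy and stability of intelligent driving systems.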
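The efficiency of acquiring high-quality data determines how fast intelligent driving evolves. Data sources in the automotive field currently include triggered data uploads from mass-production vehicles, collection of high-value, scenario-specific data by dedicated collection vehicles, engineering practices that reconstruct the physical world from real roadside data, and data synthesis based on world models. The core path to large-scale application of autonomous driving is for real data to lay the foundation of basic capabilities and for synthetic data to push past capability limits. From 2023 to 2025 the ratio of real to synthetic data in autonomous driving training data changed markedly, shifting gradually from an early real-data-dominated model to a hybrid model with a growing share of synthetic data.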
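The autonomous driving data closed loop has moved from an early focus on single links (for example, improving annotation efficiency) to an end-to-end automated architecture covering "collection, annotation, training, simulation, and deployment". The core breakthrough is the use of large AI models and cloud-edge collaboration to remove data flow barriers, which lets the closed loop evolve on its own.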
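The essence of the vehicle-cloud integrated data closed loop is to build a collaborative system of "lightweight vehicle side + intelligent cloud side" and remove data flow barriers so that intelligent vehicles can evolve continuously. The vehicle side collects environmental perception data (road conditions, vehicle operation data, and so on) in real time and uploads it to the cloud after de-identification, encryption, and compression. The cloud processes massive volumes of data (PB/EB scale) and performs annotation, model training, and algorithm optimization. It then generates new capabilities and delivers them to the vehicle side as OTA upgrades.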
This report surveys and analyzes China's automotive industry and summarizes trends in the autonomous driving data closed loop.
Glossary
Data Closed-Loop Research: Synthetic Data Accounts for Over 50%, Full-process Automated Toolchain Gradually Implemented
Key Points:
From 2023 to 2025, the proportion of synthetic data increased from 20%-30% to 50%-60%, making it a core resource for filling gaps in long-tail scenarios.
A full-process automated toolchain, from collection to deployment, is gradually being implemented, helping reduce costs and improve efficiency.
Efficient collaboration of the vehicle-cloud integrated data closed-loop is a key factor in achieving faster iterations.
The essence of the autonomous driving data closed loop is a cyclic optimization system of "collection-transmission-processing-training-deployment". In 2025, the industry is accelerating from the "0->1" stage to the "high-quality and high-efficiency" era, with the core challenges centered on long-tail scenario coverage and cost control. OEMs and Tier 1 suppliers are actively establishing their own data closed-loop solutions. Through efficient data collection, processing, and analysis, they continuously improve autonomous driving algorithms, thereby significantly enhancing the accuracy and stability of intelligent driving systems.
The efficiency of acquiring high-quality data determines how fast intelligent driving evolves. Currently, data sources in the automotive field include triggered data transmission from mass-produced vehicles, high-value scenario-specific data gathered by collection vehicles, engineering practices that reconstruct the physical world from real roadside data, and data synthesis technology based on world models. The core path to large-scale application of autonomous driving technology is that real data anchors basic capabilities while synthetic data breaks through capability boundaries. From 2023 to 2025, the ratio of real data to synthetic data in autonomous driving training data changed significantly, gradually shifting from a real data-dominated model in the early stage to a hybrid model with an increasing proportion of synthetic data.
2023: Real data dominates, synthetic data gets started (synthetic data accounts for 20%-30%). Real data remained the mainstay and was used mainly for basic scenario training, but it covered long-tail scenarios poorly. For example, Tesla relied in its early stage on real road test data from over one million vehicles, yet extreme scenarios (such as a pedestrian darting into the road in heavy rain) were collected inefficiently. Synthetic data accounted for about 20%-30% and was used mainly to supplement long-tail scenarios. Experiments by Applied Intuition show that adding 30% synthetic data in which cyclists appear frequently to real data significantly improved the perception model's recognition accuracy (mAP score) for cyclists.
2024: Accelerated penetration of synthetic data (proportion rises to 40%-50%). Synthetic data was upgraded from an "auxiliary tool" to a "core production material". Its penetration rate rising to 40%-50% marks the entry of intelligent driving into a new data-driven paradigm. At the end of 2024, the Shanghai High-level Autonomous Driving Demonstration Zone launched a plan for 100 data collection vehicles; with a hybrid model of "real data collection + world model-generated virtual data", the proportion of synthetic data is close to 50%. As another example, Nvidia DRIVE Sim generates synthetic data of distant objects (100-350 meters) to address sparse real-world annotations: after 92,000 synthetic images were added, the detection accuracy (F1 score) for vehicles 200 meters away improved by 33%.
2025: Synthetic data overtakes real data (accounts for over 50%). The ratio of synthetic data to real data moves towards "5:5" or even higher. Academician Wu Hequan pointed out that 90% of the training for L4/L5 uses simulation data, with only 10%-20% of real data retained as a "gene pool" to avoid model deviation. Li Auto is an example of innovative use of synthetic data. It uses world models to reconstruct historical scenarios and expand them into variants (such as rendering an ordinary intersection under rainy-night and foggy conditions), and automatically generates extreme cases for cyclic training. At Li Auto the proportion of synthetic data exceeds 90%; it replaces real-vehicle testing and is used to verify reliability.
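The year-by-year shares above translate into concrete data volumes once a real-data pool is fixed. The following is a minimal sketch of that arithmetic, assuming the share is measured as synthetic samples divided by all training samples; the function names and the one-million-frame pool are hypothetical and not taken from the report.

```python
def synthetic_needed(n_real: int, target_pct: int) -> int:
    """Synthetic samples to add so that they make up target_pct percent of the mix.

    Derived from share = s / (s + n_real), i.e. s = n_real * pct / (100 - pct).
    Integer arithmetic is used so the result is rounded up exactly.
    """
    if not 0 <= target_pct < 100:
        raise ValueError("target_pct must be in [0, 100)")
    denominator = 100 - target_pct
    return (n_real * target_pct + denominator - 1) // denominator  # ceiling division


def synthetic_share_pct(n_real: int, n_synth: int) -> float:
    """Share of synthetic samples in the combined set, in percent."""
    return 100.0 * n_synth / (n_real + n_synth)


if __name__ == "__main__":
    n_real = 1_000_000  # hypothetical pool of real frames
    for pct in (20, 30, 40, 50, 60):  # shares quoted for 2023-2025
        n_synth = synthetic_needed(n_real, pct)
        print(f"{pct}% synthetic -> add {n_synth:,} synthetic frames")
```

Note how the required volume grows faster than the share itself: moving from a 50% to a 90% synthetic share multiplies the synthetic volume ninefold for the same real-data pool.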
According to Lang Xianpeng of Li Auto, the company's effective real-vehicle test mileage in 2023 was about 1.57 million kilometers, at a cost of 18 yuan per kilometer. By the first half of 2025, a total of 40 million kilometers had been tested, including only 20,000 kilometers of real-vehicle testing and 38 million kilometers of synthetic data. The test cost dropped to an average of 0.5 yuan per kilometer. Test quality is also high: a single scenario can be generalized into many variants, and tests can be fully replayed.
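The figures quoted above can be turned into rough totals. This is a back-of-the-envelope sketch that assumes the quoted per-kilometer costs apply uniformly to the quoted mileages; the totals are derived here for illustration and are not reported in the source.

```python
# Figures quoted in the text (Li Auto, per Lang Xianpeng).
REAL_KM_2023 = 1_570_000          # effective real-vehicle test mileage, 2023
YUAN_PER_KM_2023 = 18.0           # cost per tested kilometer, 2023
TOTAL_KM_2025_H1 = 40_000_000     # total tested mileage by the first half of 2025
YUAN_PER_KM_2025_H1 = 0.5         # average cost per tested kilometer, 2025 H1


def total_cost(km: float, yuan_per_km: float) -> float:
    """Total test cost in yuan, assuming a uniform per-kilometer cost."""
    return km * yuan_per_km


if __name__ == "__main__":
    cost_2023 = total_cost(REAL_KM_2023, YUAN_PER_KM_2023)
    cost_2025 = total_cost(TOTAL_KM_2025_H1, YUAN_PER_KM_2025_H1)
    print(f"2023 total:    {cost_2023:,.0f} yuan")
    print(f"2025 H1 total: {cost_2025:,.0f} yuan")
    print(f"per-km cost ratio: {YUAN_PER_KM_2023 / YUAN_PER_KM_2025_H1:.0f}x")
    print(f"mileage ratio:     {TOTAL_KM_2025_H1 / REAL_KM_2023:.1f}x")
```

Under these assumptions the 2023 program comes to roughly 28.3 million yuan and the 2025 H1 program to 20 million yuan, so about 25 times the mileage was covered for less total spend.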
The advantages of synthetic data lie not only in cost and efficiency but also in a value density that goes beyond human experience. Synthetic data can be generated in batches at extremely low cost, which matches the high-frequency training needs of AI well; it can also independently produce extreme corner cases that "humans have not experienced but that comply with physical laws".
The autonomous driving data closed loop has shifted from an early focus on single links (such as improving annotation efficiency) to an end-to-end automated architecture covering "collection-annotation-training-simulation-deployment". The core breakthrough is the use of large AI models and cloud-edge collaboration technology to remove data flow barriers, enabling the closed loop to evolve on its own.
LiangDao Intelligence's LD Data Factory is a full-link 4D ground truth solution from collection to delivery. The LD Data Factory toolchain has been delivered to more than a dozen automotive OEMs and Tier 1 suppliers in China, Germany, and Japan. Its automated 4D annotation software has automatically annotated more than 3,300 hours of road-collected data for customers, producing high-quality 4D continuous-frame ground truth; by the middle of 2025, LiangDao Intelligence had delivered more than 55 million frames of data to a well-known German luxury car brand.
LD Data Factory integrates "data collection, automated annotation, manual annotation, quality control, and performance evaluation". The toolchain includes AI preprocessing and VLM-assisted collection, an automated annotation module for target detection, a full-process closed loop for automatic quality inspection, and hybrid cloud and private deployment. Its core modules are coordinated through a unified data management platform that handles data management and task collaboration. They include time synchronization and spatial calibration, distributed storage and indexing services, the visual annotation platform LDEditor (full-stack annotation), the automated quality control module LD Validator, and the perception performance evaluation module LD KPI.
MindFlow's main products currently include an integrated data annotation platform, a data management platform (including a vector database), and a model training platform, covering the entire value chain from raw data to model deployment. Users can complete the whole algorithm development process in one place without switching between multiple tools or platforms. The technical highlights of its third-generation MindFlow SEED platform include support for 4D point cloud annotation (lane lines, segmentation), RPA automated processes, and AI pre-annotation covering more than 4,000 functional modules.
Currently, MindFlow empowers customers including SAIC Group, Changan Automobile, Great Wall Motors, Geely Automobile, FAW Group, Li Auto, Huawei, Bosch, ECARX, MAXIEYE, NavInfo and RoboSense.
The essence of the vehicle-cloud integrated data closed loop is to build a collaborative system of "vehicle-side lightweight + cloud-side intelligence", remove data flow barriers, and enable the continuous evolution of intelligent vehicles. The vehicle side is responsible for real-time collection of environmental perception data (such as road conditions and vehicle operation data), which is uploaded to the cloud after desensitization, encryption, and compression. The cloud processes massive amounts of data (PB/EB level), performs annotation, model training, and algorithm optimization, generates new capabilities, and delivers them to the vehicle side as OTA upgrades.
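The vehicle-side preparation step described above can be sketched with the Python standard library. This is an illustrative outline only: the record fields, the salt, and the rounding precision are hypothetical, and the encryption step is deliberately left out because the standard library has no authenticated cipher. A vetted cryptography library would be applied to the compressed bytes, since ciphertext itself compresses poorly.

```python
import hashlib
import json
import zlib


def desensitize(record: dict, salt: bytes) -> dict:
    """Replace the VIN with a salted hash and coarsen the GPS position."""
    clean = dict(record)
    vin = clean.pop("vin")
    clean["vehicle_id"] = hashlib.sha256(salt + vin.encode("utf-8")).hexdigest()[:16]
    clean["lat"] = round(clean["lat"], 3)  # roughly 100 m resolution
    clean["lon"] = round(clean["lon"], 3)
    return clean


def pack(record: dict, salt: bytes) -> bytes:
    """Desensitize, serialize, and compress one record for upload."""
    raw = json.dumps(desensitize(record, salt), sort_keys=True).encode("utf-8")
    # Assumption: encryption would be applied to this compressed payload.
    return zlib.compress(raw, 9)


def unpack(blob: bytes) -> dict:
    """Cloud-side inverse of pack() for the compression and serialization steps."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


if __name__ == "__main__":
    sample = {  # hypothetical trigger record
        "vin": "LXXXXXXXXXXXXXXXX",
        "lat": 31.230416,
        "lon": 121.473701,
        "event": "hard_brake",
        "speed_kmh": 62.5,
    }
    blob = pack(sample, salt=b"demo-salt")
    print(len(blob), "bytes ->", unpack(blob))
```

Hashing with a salt keeps records from the same vehicle linkable on the cloud side without exposing the VIN itself; whether that level of linkability is acceptable depends on the applicable data regulations.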
The ExceedData data closed-loop solution is a vehicle-cloud integrated solution that has been adopted for mass production by more than 15 automotive OEMs and is deployed in more than 30 mainstream models.
The ExceedData data closed-loop solution consists of the vehicle-side edge computing engine (vCompute), edge data engine (vADS), and edge database (vData), together with the cloud-side algorithm development tool (vStudio), cloud computing engine (vAnalyze), and cloud management platform (vCloud). The solution can reduce data transmission costs by 75%, cloud storage costs by 90%, and cloud computing costs by 33%. According to calculations from one OEM working with ExceedData, total cost can be reduced by 85%.
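How the three per-category reductions combine into one total depends on how spending is split across transmission, storage, and computing. The sketch below shows that weighted sum; the cost shares are hypothetical placeholders, not ExceedData's or any OEM's actual cost structure, and the sketch does not attempt to reproduce the 85% figure.

```python
from typing import Dict

# Per-category reductions quoted in the text.
REDUCTION = {"transmission": 0.75, "storage": 0.90, "compute": 0.33}


def total_reduction(cost_share: Dict[str, float], reduction: Dict[str, float]) -> float:
    """Weighted overall reduction: sum of (share of baseline cost * reduction)."""
    if abs(sum(cost_share.values()) - 1.0) > 1e-9:
        raise ValueError("cost shares must sum to 1")
    return sum(cost_share[name] * reduction[name] for name in cost_share)


if __name__ == "__main__":
    # Hypothetical baseline cost split, for illustration only.
    share = {"transmission": 0.3, "storage": 0.6, "compute": 0.1}
    print(f"overall reduction: {total_reduction(share, REDUCTION):.1%}")
```

Because the result is a weighted average of the three figures, the overall reduction always falls between 33% and 90%, and it approaches the upper end only when storage dominates the baseline cost.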
Among OEMs, Xpeng Motors is one example. Its self-built "cloud-side model factory" has a computing power reserve of 10 EFLOPS in 2025, and the end-to-end iteration cycle has been shortened to an average of 5 days, supporting a rapid closed loop from cloud-side pre-training to vehicle-side model deployment.
Xpeng launched China's first 72-billion-parameter multimodal world base model for L4 highly autonomous driving, which has chain-of-thought (CoT) reasoning capabilities and can simulate human common-sense reasoning and generate control signals. Through model distillation, the capabilities of the base model are transferred to a small vehicle-side model, enabling personalized deployment of a model that is "small in size and high in intelligence".
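The text does not say which distillation objective Xpeng uses. As a point of reference, the classic soft-label formulation trains the small model to match the large model's temperature-softened output distribution. The following is a minimal pure-Python sketch of that loss, with hypothetical logits and temperature.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]   # hypothetical base-model logits
    student = [2.5, 1.5, 0.2]   # hypothetical vehicle-side model logits
    print("loss:", distillation_loss(teacher, student))
```

A higher temperature spreads probability mass over the non-top classes, so the student also learns how the teacher ranks the alternatives; the T^2 factor keeps gradient magnitudes comparable across temperatures.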
High-value data (such as corner cases) is first screened by a rule engine on the vehicle side. The cloud then uses synthetic data generation technologies (such as GANs and diffusion models) to fill data gaps and improve model generalization. At the same time, end-to-end (E2E) and VLA models integrate multimodal inputs to output control commands directly, relying on cloud-side large-model training (such as Xpeng's 72-billion-parameter base model) to achieve lightweight deployment on the vehicle side.
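A vehicle-side rule engine of the kind mentioned above can be as simple as a table of named predicates evaluated against each frame. The sketch below illustrates the idea; the signal names, thresholds, and rules are hypothetical and are not drawn from any vendor's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Frame:
    """Hypothetical per-frame signals available on the vehicle."""
    speed_kmh: float
    decel_ms2: float            # positive value means braking
    driver_takeover: bool
    min_detection_conf: float   # lowest confidence among detected objects


Rule = Callable[[Frame], bool]

# Hypothetical trigger rules; thresholds are illustrative only.
RULES: Dict[str, Rule] = {
    "hard_brake": lambda f: f.decel_ms2 >= 6.0,
    "takeover": lambda f: f.driver_takeover,
    "low_confidence": lambda f: f.min_detection_conf < 0.3,
}


def triggered(frame: Frame) -> List[str]:
    """Names of all rules that fire for this frame."""
    return [name for name, rule in RULES.items() if rule(frame)]


def should_upload(frame: Frame) -> bool:
    """A frame is flagged as high-value if at least one rule fires."""
    return bool(triggered(frame))


if __name__ == "__main__":
    frames = [
        Frame(speed_kmh=60.0, decel_ms2=7.2, driver_takeover=False, min_detection_conf=0.9),
        Frame(speed_kmh=40.0, decel_ms2=1.0, driver_takeover=False, min_detection_conf=0.8),
    ]
    for frame in frames:
        print(triggered(frame), should_upload(frame))
```

Keeping the rules in a plain table means new triggers can be added without touching the evaluation loop, which suits a component that is updated over OTA.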
As the entire intelligent driving system becomes model-based, car companies are pursuing "better cost, higher efficiency, and more stable services" in the data closed loop. The delivery model for intelligent driving is shifting rapidly from delivering code for single-vehicle deployment to one centered on subscription-based cloud services. An efficiently coordinated vehicle-cloud integrated data closed loop is the key for intelligent vehicles to achieve faster AI-driven iteration.
Glossary