Dataocean AI Sets New Standards in Dataset Quality at Interspeech

Dataocean AI’s New Offerings

In the rapidly evolving landscape of artificial intelligence, particularly in foundation models and Generative AI, the demand for high-quality datasets has become increasingly crucial. As industries navigate the complexities of real-world data, it is evident that improving the models themselves is not the only path to better performance; the quality of the training data matters as well. Dataocean AI, a global provider of AI data solutions, has recognised this need and has officially launched its latest offerings: premium off-the-shelf datasets. The launch reinforces the company’s position as a leader in AI data.

Introducing the Massively Multilingual Speech Corpus

At Interspeech 2024, Dataocean AI introduced its “Massively Multilingual Speech Corpus.” This dataset features recordings from 215,891 speakers, amounting to a total of 259,672 hours of audio across more than 100 languages. Alongside this corpus, the company presented curated datasets in several European languages, including English, French, Spanish, Turkish, and Swedish. The company highlights the diversity and accuracy of these datasets, which are intended to improve the performance of AI models in sectors such as smart finance, AI assistance, in-cabin technologies, smart home applications, and other emerging AI applications.

Commitment to High Precision in Data Collection

Dataocean AI’s datasets stand out for their ability to deliver high precision across numerous fields. The company’s data collection process draws on an extensive global network of native speakers who record in over 200 languages. A dedicated team of native and professional speakers records with high-fidelity equipment in professional studios, and the company also collects data in diverse environments, including indoor, outdoor, and in-cabin settings.

Advanced Data Labelling Techniques

For data labelling, Dataocean AI uses a self-developed platform that incorporates a human-in-the-loop approach. Its team of experts includes scholars and specialists from diverse fields, who have created more than 1,100 speech datasets that meet stringent quality benchmarks. This dedication to quality aligns with the changing needs of the AI industry.

Expanding Dataset Capabilities

In addition to its speech datasets, Dataocean AI offers over 1,600 high-quality training datasets for which it holds the intellectual property rights. These datasets cover a wide range of areas, including foundation models, autonomous driving, finance, healthcare, and law. Furthermore, the company’s self-developed data processing platform, DOTS, features more than 200 algorithms and hundreds of data processing tools. The platform supports functions such as automated and assisted data labelling, helping customers reduce costs and improve efficiency.

Ensuring Compliance and Data Security

Dataocean AI has also prioritised data security and compliance, including compliance with stringent regulations such as the European GDPR. The company has earned ISO 9001 and ISO 27001 certifications, indicating that its operations meet recognised standards for quality management and information security.

Empowering AI with Live Data Collection

Alongside high-quality datasets, Dataocean AI is committed to enhancing large language models (LLMs) through live data collection. This service covers data for pre-training, supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and model evaluation.

Dataocean AI’s mission is to deliver comprehensive data solutions that enable partners and clients to build reliable and adaptable AI models. This unwavering commitment to excellence remains central to the company’s vision of driving innovation in the AI sector.