125 experts in 72 hours: expert-led dataset creation for an AI data platform

Overview
A US-based human data company building datasets for AI labs needed to run domain-specific workflows across legal, healthcare, wellness, and software engineering.
Their existing pipelines handled simple tasks well but broke down once the work required real judgment.
The problem: scale was solved, judgment wasn't
As tasks grew more complex, output quality became inconsistent.
Synthetic approaches handled structure but failed on nuance. General contributor pools could complete tasks, but their outputs needed filtering and rework before they were usable.
The result: a hard ceiling on the kinds of datasets the company could produce, and slower iteration on the ones they were already building.
"The issue wasn't getting responses. It was being able to trust them once tasks required actual judgment."
— Head of Workforce
The approach: verified experts, plugged in directly
We gave the team access to a pool of verified domain experts and integrated them straight into their existing task workflows.
For software engineering alone, we sourced and vetted 125 qualified profiles in 72 hours, each ready for ongoing part-time work.
Across legal, healthcare, wellness, and software engineering, every expert was screened for domain background and consistency before being deployed on live tasks.
Execution: 1,200 tasks in 60 hours
In a single 60-hour window, experts completed 1,200 tasks across all four domains.
Tasks ranged from structured evaluations to open-ended, multi-step reasoning. Most demanded explanation, not just answers.
The biggest unlock came from involving experts earlier in the process. Instead of just completing tasks, they gave feedback on task design itself — flagging unclear prompts, missing context, and likely failure modes before tasks went out at scale.
"Having experts shape the tasks, not just complete them, was the turning point."
— Head of Workforce
The impact: usable outputs on the first pass
The most immediate change was reliability. Tasks that previously took multiple passes started landing in a single pass, with outputs usable without heavy post-processing.
Rapid sourcing also removed a major operational bottleneck. The team could move from idea to execution without waiting on a recruiting cycle.
The largest gains showed up in the hard cases — open-ended, ambiguous tasks where outputs had historically degraded. With verified experts, consistency held even when the task didn't.
"The biggest difference showed up in the hard cases. Outputs stopped breaking down when things weren't clear-cut."
— Head of Workforce
Example: a legal workflow
In one legal workflow, experts reviewed contract clauses, identified sources of liability, explained their reasoning, and proposed revisions.
Phrasing varied across submissions, but the answers were grounded in real-world practice and directionally consistent — making them directly usable as training signal.
Takeaway
Once tasks require reasoning, depth matters more than scale.
A smaller pool of verified experts produced more usable data than larger, unfiltered contributor groups. The difference showed up in consistency, in edge-case handling, and in how quickly outputs could be put to use downstream.