Centralised validation and AI report generation:
Accelerating statistical dissemination at an NSO
A. Kanyesigye, I. Atwiine, F. Kayondo, and L. Mugula
Uganda Bureau of Statistics (UBOS), Kampala, Uganda
A National Statistical Office holds vast quantities of data, yet producing a single policy brief can take days. The bottleneck is not computation: it is that the underlying data lives in disconnected, incompatible systems. Poverty survey results arrive as Excel spreadsheets, census figures as PDF reports, price indices from administrative databases, sector statistics as SDMX feeds. Each has a different schema, update cadence, and quality profile. Before any analyst can write a paragraph, they must locate, extract, reconcile, and manually validate figures from several of these sources simultaneously. This paper presents a pipeline developed at the Uganda Bureau of Statistics (UBOS) that eliminates this bottleneck by combining automated multi-source validation with AI-driven report generation.
The pipeline ingests data from all source systems into a single centralised analytical layer through format-specific parsers and automated type detection. Before any data is admitted, it passes through a validation toolkit that runs structural checks (schema conformity, type enforcement, metadata completeness), statistical checks (range validation, confidence interval reconciliation, 3-sigma anomaly detection per indicator and disaggregation level), and a cleaning stage that logs every transformation with a full audit trail. A key principle is that no value is silently altered: rows requiring human judgment are tagged and quarantined rather than modified. A weighted quality score across structural, statistical, and conformity dimensions gates data into the centralised layer, which is maintained as a single trusted source of truth aligned to SDMX metadata standards.
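The statistical checks and the quality gate described above can be illustrated with a minimal sketch. It is not the UBOS toolkit itself: the function names, the row layout (`indicator`, `level`, `value`), the weights (0.4, 0.4, 0.2), and the 0.8 admission threshold are all assumptions chosen for illustration, and the demo rows are synthetic. It uses only the Python standard library.

```python
from collections import defaultdict
from statistics import mean, pstdev


def three_sigma_outliers(values, k=3.0):
    """Return positions of values more than k standard deviations from the mean."""
    if len(values) < 2:
        return set()
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return set()
    return {i for i, v in enumerate(values) if abs(v - mu) > k * sigma}


def screen_rows(rows):
    """Split rows into (admitted, quarantined, audit_log) without altering any value."""
    # Group row positions per indicator and disaggregation level.
    groups = defaultdict(list)
    for pos, row in enumerate(rows):
        groups[(row["indicator"], row["level"])].append(pos)

    flagged = set()
    for positions in groups.values():
        values = [rows[p]["value"] for p in positions]
        flagged.update(positions[i] for i in three_sigma_outliers(values))

    admitted, quarantined, audit_log = [], [], []
    for pos, row in enumerate(rows):
        if pos in flagged:
            # Suspicious rows are tagged and set aside, never modified.
            quarantined.append(row)
            audit_log.append({"row": pos, "action": "quarantine", "reason": "3-sigma"})
        else:
            admitted.append(row)
    return admitted, quarantined, audit_log


def quality_score(structural, statistical, conformity, weights=(0.4, 0.4, 0.2)):
    """Weighted score over three dimensions, each in [0, 1]; weights are hypothetical."""
    w_struct, w_stat, w_conf = weights
    return w_struct * structural + w_stat * statistical + w_conf * conformity


def passes_gate(score, threshold=0.8):
    """Admit a source only if its score reaches the (hypothetical) threshold."""
    return score >= threshold


# Synthetic demo data (not real survey figures): 19 plausible values and one spike.
rows = [
    {"indicator": "demo_rate", "level": "region_a", "value": 20.0 + (i % 5) * 0.5}
    for i in range(19)
]
rows.append({"indicator": "demo_rate", "level": "region_a", "value": 95.0})

admitted, quarantined, audit_log = screen_rows(rows)
statistical = len(admitted) / len(rows)  # share of rows passing the 3-sigma check
score = quality_score(1.0, statistical, 1.0)
```

One caveat of the 3-sigma rule is that it needs enough observations per group to be meaningful: with a population standard deviation, no value in a group of ten or fewer can exceed three standard deviations from the mean, so small disaggregation cells would likely need a different check.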
Report generation operates directly against this centralised layer. Given a target document type — policy brief, statistical bulletin, or press release — a large language model retrieves the required indicators, generates data-grounded visualisations (bar charts, trend lines, maps via Plotly/Matplotlib), and produces structured prose following a configurable schema. Because the AI operates on already-validated, centralised data, there is no per-document extraction or re-validation step. What previously required days of manual compilation from disparate systems is reduced to minutes, and the same centralised layer can serve multiple document types concurrently.
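The schema-driven generation step can be sketched in the same spirit. The section names, the prompt wording, and the `llm` callable below are hypothetical placeholders rather than the system's actual schema or model interface; the sketch only shows how a configurable schema might map one set of validated indicators onto several document types.

```python
from typing import Callable, Dict, List

# Hypothetical section layouts per document type; the real schema is configurable.
DOCUMENT_SCHEMAS: Dict[str, List[str]] = {
    "policy_brief": ["summary", "key_findings", "policy_implications"],
    "statistical_bulletin": ["summary", "key_findings", "methodology_note"],
    "press_release": ["headline", "key_findings"],
}


def build_prompt(section: str, indicators: Dict[str, float]) -> str:
    """Compose a prompt that exposes only validated figures to the model."""
    facts = "\n".join(f"- {name}: {value}" for name, value in sorted(indicators.items()))
    return f"Write the '{section}' section using only these validated figures:\n{facts}"


def generate_document(
    doc_type: str,
    indicators: Dict[str, float],
    llm: Callable[[str], str],
) -> Dict[str, str]:
    """Return a mapping of section name to generated text for one document type."""
    if doc_type not in DOCUMENT_SCHEMAS:
        raise ValueError(f"unknown document type: {doc_type}")
    return {
        section: llm(build_prompt(section, indicators))
        for section in DOCUMENT_SCHEMAS[doc_type]
    }


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model call, used here so the sketch runs offline."""
    return f"[draft] {prompt}"
```

Because the model is passed in as a plain callable, the same indicator dictionary can be rendered as a policy brief, a bulletin, or a press release by changing only `doc_type`, which mirrors the claim that one centralised layer serves several document types.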
The system is demonstrated on Uganda National Household Survey poverty indicators and 2024 Census preliminary results. We report end-to-end generation time versus manual authoring, quality score distributions across ingested sources, and domain-expert ratings of AI-generated policy briefs on accuracy, completeness, and usability. The results show that centralised, validated data is not merely a quality improvement: it is the architectural prerequisite that makes fast, reliable AI dissemination possible.
Keywords: Statistical data integration, Report generation, Data validation.
References
- [1] J. Reis and M. Housley (2022). Fundamentals of Data Engineering. O’Reilly Media.
- [2] M.D. Wilkinson et al. (2016). The FAIR guiding principles for scientific data management. Scientific Data, 3, 160018.