Longitudinal Synthetic Data Generation from Causal Structures
Abstract
Robust assessment of temporal causal-inference models is hampered by the lack of benchmark datasets whose underlying mechanisms are fully known. We introduce the Causal Synthetic Data Generator (CSDG), an open-source tool that creates longitudinal sequences governed by user-defined structural causal graphs with autoregressive dynamics. By allowing fine-grained control over confounding intensity, treatment policies, intervention timing, and noise, CSDG furnishes a flexible, domain-agnostic test-bed for stress-testing causal-learning algorithms. To demonstrate its utility, we generate synthetic cohorts for a one-step-ahead outcome-forecasting task and compare classical linear regression with encoder-decoder recurrent networks (vanilla RNN, LSTM, and GRU). The results reveal how predictive accuracy degrades as causal complexity increases, underscoring the need for models that explicitly exploit causal structure. Beyond forecasting, CSDG naturally extends to counterfactual data generation and bespoke causal graphs, paving the way for comprehensive, reproducible benchmarks across diverse application contexts.
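To make the idea of generating longitudinal data from a known causal graph concrete, here is a minimal sketch. It is not CSDG's actual API or implementation: the three-node graph (confounder C, treatment A, outcome Y), the function names `simulate_unit` and `simulate_cohort`, and every parameter name and default value are assumptions chosen for illustration. It only shows how confounding intensity, a treatment policy, autoregressive dynamics, and noise could each be exposed as a separate knob.

```python
import math
import random


def simulate_unit(n_steps, confounding=1.0, effect=2.0, ar=0.5,
                  noise_sd=0.1, rng=None):
    """Simulate one unit from a toy graph: C -> A, C -> Y, A -> Y.

    Hypothetical illustration only; names and defaults are assumptions.
    """
    if rng is None:
        rng = random.Random()
    c_prev, y_prev = 0.0, 0.0
    rows = []
    for t in range(n_steps):
        # Time-varying confounder with AR(1) dynamics.
        c = ar * c_prev + rng.gauss(0.0, 1.0)
        # Treatment policy: treatment probability rises with the confounder.
        p_treat = 1.0 / (1.0 + math.exp(-confounding * c))
        a = 1 if rng.random() < p_treat else 0
        # Outcome depends on its own past, the treatment, and the confounder.
        y = ar * y_prev + effect * a + confounding * c + rng.gauss(0.0, noise_sd)
        rows.append({"t": t, "C": c, "A": a, "Y": y})
        c_prev, y_prev = c, y
    return rows


def simulate_cohort(n_units, n_steps, seed=0, **kwargs):
    """Simulate a cohort of independent units using one seeded generator."""
    rng = random.Random(seed)
    return [simulate_unit(n_steps, rng=rng, **kwargs) for _ in range(n_units)]


if __name__ == "__main__":
    cohort = simulate_cohort(n_units=3, n_steps=5, seed=42, confounding=1.5)
    for row in cohort[0]:
        print(row)
```

Setting `confounding=0.0` in this sketch makes treatment assignment a fair coin flip and removes the confounder's effect on the outcome, which is one simple way to vary causal complexity while keeping the true treatment effect (`effect`) known.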
Keywords:
Benchmarks, Causal Inference, Longitudinal Data, Synthetic Data Generation, Time Series
References
Arkhangelsky, D. and Imbens, G. Causal models for longitudinal and panel data: a survey. The Econometrics Journal 27 (3): C1–C61, 06, 2024.
Balkus, S. and Hejazi, N. CausalTables.jl: Simulating and storing data for statistical causal inference in Julia. Journal of Open Source Software vol. 10, pp. 7580, 02, 2025.
Box, G. E. and Jenkins, G. M. Time Series Analysis: Forecasting and Control. Holden-Day series in time series analysis and digital processing. Holden-Day, 1970.
Bun, M., Gaboardi, M., Neunhoeffer, M., and Zhang, W. Continual release of differentially private synthetic data from longitudinal data collections. Proc. ACM Manag. Data 2 (2), 2024.
Cheng, L., Guo, R., Moraffah, R., Sheth, P., Candan, K. S., and Liu, H. Evaluation methods and measures for causal algorithms. IEEE Transactions on Artificial Intelligence vol. 3, pp. 924–943, 2022.
Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014.
Diggle, P. J., Heagerty, P. J., Liang, K.-Y., and Zeger, S. L. Analysis of Longitudinal Data. Oxford University Press, Oxford, UK, 2002.
Elman, J. L. Finding structure in time. Cognitive Science 14 (2): 179–211, 1990.
Enders, W. Applied Econometric Time Series. John Wiley & Sons, Hoboken, New Jersey, 2010.
Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation 9 (8): 1735–1780, 11, 1997.
Kaddour, J., Lynch, A., Liu, Q., Kusner, M. J., and Silva, R. Causal machine learning: A survey and open problems, 2022.
Kühnel, L., Schneider, J., Perrar, I., Adams, T., Moazemi, S., Prasser, F., Nöthlings, U., Fröhlich, H., and Fluck, J. Synthetic data generation for a longitudinal cohort study–evaluation, method extension and reproduction of published data analysis results. Scientific Reports 14 (1): 14412, 2024.
Lütkepohl, H. New Introduction to Multiple Time Series Analysis. Springer Science & Business Media, 2005.
Melnychuk, V., Frauen, D., and Feuerriegel, S. Causal transformer for estimating counterfactual outcomes. In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022.
Mendis, K., Wickramasinghe, M., and Marasinghe, P. Multivariate time series forecasting: A review. In Proceedings of the 2024 2nd Asia Conference on Computer Vision, Image Processing and Pattern Recognition. pp. 1–9, 2024.
Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.
Pearl, J. The Book of Why: The New Science of Cause and Effect. Basic Books, New York, 2018.
Rubin, D. B. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66 (5): 688–701, 1974.
Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2. NIPS’14. MIT Press, Cambridge, MA, USA, pp. 3104–3112, 2014.
Wright, S. Correlation and causation. Journal of Agricultural Research 20 (7): 557–585, 1921.
Published
29/09/2025
How to Cite
ANGERUZZI, Alessandro S.; ALBERTINI, Marcelo K. Longitudinal Synthetic Data Generation from Causal Structures. In: SYMPOSIUM ON KNOWLEDGE DISCOVERY, MINING AND LEARNING (KDMILE), 13., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 49-56. ISSN 2763-8944. DOI: https://doi.org/10.5753/kdmile.2025.247519.
