Journal of Information and Data Management

Biophysical Chemistry of Macromolecules Research Group at the State University of Maringá

Diego de Souza Lima — 2024-02-21

The interdisciplinary field of Biophysical Chemistry, which applies concepts from Physical Chemistry to describe biological phenomena, is essential for modern molecular biology advancements. This approach enables the description of biological systems in terms of their constituent parts, such as atoms and molecules, facilitating a structural understanding of their characteristics. Nonetheless, to describe such large systems, computational methods are needed. The Biophysical Chemistry of Macromolecules research group at the State University of Maringá is dedicated to investigating such systems, mainly protein-ligand complexes, through bioinformatics approaches combined with experimental techniques to validate in silico results. The main purpose of the research projects is to develop applications for drug discovery in the context of antimicrobial, antiviral, antifungal, and antihyperglycemic agents, with the aim of advancing the field of bioinformatics in Brazil.

G3B3 to GP15: From early years to Health Informatics Research Group at Paulista University - UNIP

Renato Massaharu Hassunuma — 2024-02-16

This article summarizes the history and intellectual production from G3B3 (Grupo de Estudos em Bioinformática Estrutural) to GP15 (Grupo de Pesquisa em Informática em Saúde), conducted at Paulista University - UNIP, Bauru campus. In the early years, G3B3 developed several activities until the team was converted into GP15. One of the group's research lines is the computational simulation of biomolecules. Together with the second line of research, entitled "Development of teaching material and research using computational resources", the production of this group is mainly related to the development of scripts for the visualization of biomolecules and the production of digital books. As of early 2022, the team consists of five university professors and four students from the Biomedicine Course at Universidade Paulista - UNIP, Bauru campus. Together, the team has published more than 50 books with different publishers, some of which are aimed at the development of scripts for computer simulation software. All these books are adopted by course teachers in different subjects.

The Barroso Research lab: biomolecular interactions, computing, and data-driven science to understand and engineer biological and pharmaceutical systems in a global academic partnership

Fernando L. Barroso da Silva — 2024-04-16

Biomolecular interactions, high-throughput computing, and data-driven science have been the central research foundations of the Barroso Research laboratory. We have been developing and applying innovative computational technology, offering a rational computational-based approach to the investigation of protein systems, and discovering key disease-related protein mechanisms, therapeutic agents, biomarkers, and proteins for specific applications and their controlled release. Born in 2001 at the School of Pharmaceutical Sciences at Ribeirão Preto with transdisciplinarity and internationalization in its genes, the laboratory has always been well integrated with research groups in Europe, the US, and Latin America. Students from different fields and places have been forged in this environment at the crossroads of Structural Bioinformatics, Molecular Biophysics, Biological Physics, Physical Chemistry, Engineering, Medicine, Food, and Pharma. The more than 50 scientific papers published in high-impact journals, book chapters, and conference talks reflect our contributions to expanding knowledge and advancing Bioinformatics as an important tool to understand nature and guide innovations.

ACDBio: The Biological Data Computational Analysis group at ICMC/USP, IFSP, and Barretos Cancer Hospital

Adenilso Simao — 2024-02-17

Recent advances in biological and health technology have resulted in vast amounts of digital data. However, the major challenge is interpreting such data to find valuable knowledge. Computing is essential for this, since quick data processing and analysis, allied with knowledge extraction techniques, make it possible to work effectively with large biological datasets. In this context, the ACDBio group works on the computational analysis of biological data from different sources, aiming to find new information and knowledge in the data or to answer questions that remain open. So far, the group has worked on several challenging topics, such as identifying significant genes for cancer and the topological analysis of genes in interaction networks, among others. The group uses computational techniques such as complex networks and their algorithms, machine learning, and topological data analysis. This article aims to present the ACDBio group and the main research topics worked on by its members. We also present the main results and the future work expected by the group.

Water quality in marine and freshwater environments: a metagenomics approach

Carolina O. P. Gil — 2024-02-19

In this article, we review the work carried out by the UFRJ microbiology laboratory related to water quality and microorganisms associated with different aquatic ecosystems. We place water at the center of the One Health concept, because water connects different living beings and the environment. We selected papers published between 2012 and 2022 by the UFRJ microbiology laboratory related to bioinformatics, genomic, and metagenomic analysis. We describe the main impacts on aquatic environments, the microorganisms involved in biogeochemical cycles, microorganisms as bioindicators, and their resistance genes. Finally, we identify the microorganisms that were most abundant across all the articles studied and point out some public policies that we consider important to maintain water quality and reduce anthropic impacts.

Bioinformatics of infectious and chronic diseases at the Center for Technological Development in Health of Fiocruz

Nicolas Carels — 2024-02-17

One of the purposes of bioinformatics is data mining and integration to solve fundamental scientific challenges. With this concern in mind, we have been investigating biological systems including viruses, bacteria, fungi, protozoans, plants, insects, and animals. Gradually, we moved from basic questions on genome organization to applications in infectious and chronic diseases by integrating interactome and RNA-seq data with modeling techniques such as Flux Balance Analysis, structural modeling, Boolean modeling, system dynamics, and computational biology from a systems biology perspective. At the moment, we focus on the rational therapy of cancer assisted by RNA sequencing, network modeling, and structural modeling.

Let them eat cake: when the small aims at being LARGE or the empowering effects of bioinformatics in NGS wonderland

Gabriel M. Yazbeck — 2024-02-17

This report summarizes the path (and pitfalls) followed by the Genetic Resources Laboratory (LARGE-UFSJ), with the aid of bioinformatics, in the field of massive DNA data analyses and its application to the conservation of biodiversity, particularly of Neotropical migratory fish. We use the metaphor of DNA sequencing as the cake, both as a prized delicacy formerly inaccessible to the masses, as in the infamous "let them eat cake", scornfully exclaimed by Marie-Antoinette during a bread shortage in the French Revolution, and as a means to achieve rapid growth for small research groups, like the plot device in Lewis Carroll's Alice in Wonderland. Next-Generation Sequencing (NGS) methods have been known to promote a true revolution in the Life Sciences, empowering groups with limited resources to explore the relatively new, still unknown and often surprising world of genetic sequences. Indeed, we argue for the inertia-breaking potential of NGS and give our group's trajectory as a testimony. It all began with the fortuitous union of providential fish DNA big data gathered by Genetics professor Dr. Yazbeck and Computer Science professor Dr. Sachetto's curiosity about biological research, along with the wit of some young researchers. Our initial NGS challenge was to provide the assembly and annotation of the first mitochondrial genome for the Anostomidae fish family. LARGE's NGS research program was able to promote the characterization of what was then arguably the highest number of microsatellite DNA markers for the flagship species Salminus brasiliensis (dourado) and Brycon orbignyanus (piracanjuba), useful in environmental applications for conservation (green biotechnology). We have also made these large raw datasets, as well as extensive derived results such as genomic assemblies and gene annotations for these fish, freely available to the scientific community in data repositories such as GenBank, SRA and FigShare. Technological spin-offs with applications in the environmental protection and food production fields have also been devised as a direct consequence of the availability of such rich and diverse data.

Bioinformatics and Computational Biology Research at the Computer Science Department at UFMG

Diego Mariano — 2024-02-16

Bioinformatics is an emerging research field that encompasses the use of computational methods, algorithms, and tools to solve life science problems. At the Laboratory of Bioinformatics and Systems (LBS), our research lines include the use of graph-based algorithms to improve the prediction of the structure and function of macromolecules, the detection of molecular recognition patterns, the application of mathematical models and artificial intelligence techniques to assist enzyme engineering, and the development of models, algorithms, and tools. Additionally, the group has played a role in scientific outreach and in spreading bioinformatics in Brazil. In this article, we summarize the 20 years of Bioinformatics and Computational Biology research conducted by our group at LBS in the Department of Computer Science at the Universidade Federal de Minas Gerais (DCC-UFMG).

Computational Biology Laboratory - Combi-Lab

Karina dos Santos Machado — 2024-02-16

This article presents the Computational Biology - Combi-Lab research group at the Universidade Federal do Rio Grande (FURG), which started its activities in 2011. The main objective of the group is to bring together researchers and students who are interested in all aspects of Computational Biology. Specifically, the group aims to develop, improve and use sophisticated statistical, computational, and mathematical methods to contribute to the advancement of this research area. This article provides an overview of the Combi-Lab timeline from its founding to the present day, highlighting various articles and discussing the future of the group. More importantly, joint projects and collaborators are presented, and their contribution to the development of Bioinformatics is explained. In conclusion, as we look to the past and face the challenges of the future, we hold fast to our goal of becoming a solid and leading reference in Computational Biology at our university and in our community, and of giving back to society as much as we can.

NBioinfo: Establishing a Bioinformatics Core in a University-based General Hospital in South Brazil

Mariana Recamonde-Mendoza — 2024-02-16

Bioinformatics is an indispensable discipline for current research in life and medical sciences. The increasing volume and complexity of biological data and the growing tendency for open data and data reuse projects have made computer-based analytical tools central to these research fields. However, it is an intrinsically interdisciplinary field that requires a multitude of skill sets for using bioinformatics tools or undertaking research toward developing new methods. There is still a lack of skilled human resources to meet the numerous and growing application possibilities, which represents a bottleneck in many research projects. This paper reports our efforts to create the Núcleo de Bioinformática (NBioinfo, or Bioinformatics Core) at the Hospital de Clínicas de Porto Alegre (HCPA), a major public university hospital in Brazil. NBioinfo aims to serve as a hub for research and interaction in Bioinformatics and Computational Biology at HCPA, institutionally developing these areas of knowledge and promoting scientific advances triggered by bioinformatics. We briefly present our research group's history and goals, and describe our activities toward providing HCPA with competencies in these fields. We also describe the scientific and methodological challenges recently faced by our group and the advances promoted by scientific collaborations and research projects developed at NBioinfo.

Usage of the Bag Distance Filtering with In-Memory Metric Trees

Sergio Luis Sardi Mergen — 2024-04-17

Metric trees are efficient indexing structures for multidimensional objects defined in terms of a metric space. One possible application is string similarity search, using the edit distance as the metric function. A previous work proposes clustering objects under leaf nodes and using the bag distance as a filtering step before the edit distance is computed. Cost predictions estimate that the filtering pays off in practical scenarios. The work has important implications when data resides on secondary storage, where nodes have a fixed size that aligns with disk pages. In this paper, we expand the discussion by using the bag distance filtering step for in-memory metric trees, where the clusters have no size constraints. We adjust existing metric trees to support leaf nodes with arbitrary cluster sizes and incorporate parameters based on size and density to decide when a leaf node should be subdivided. Experiments show that cluster size can have a substantial impact during both index construction and search. We report the gains achieved in terms of processing cost and the number of distance computations when using the most suitable values for the cluster size and density parameters.

Built-up Integration: A New Terminology and Taxonomy for Managing Information On-the-fly

Maria Helena Franciscatto — 2024-02-19

Obtaining useful data to meet specific query requirements usually demands integrating data sources at query time, which is known as on-the-fly integration. Currently, many studies address this concept by discovering useful data sources in an ad-hoc manner and merging them to provide actionable information to the end user. This set of steps, however, lacks standardization in its identification, since the steps are described in the literature under many different names. Hence, without a unified nomenclature and knowledge organization, development in the area may be considerably impaired. This paper proposes a novel term called Built-up Integration, aiming at knowledge regulation, and a taxonomy that embraces a set of common tasks observed in studies that select and integrate sources on-the-fly. As a result of the taxonomy, we demonstrate how Built-up Integration features can be found in the literature, through an exemplification with related studies. We also highlight research opportunities regarding Built-up Integration, as a way to guide future development in a subdomain of Data Integration.

Machine Learning Model Explainability supported by Data Explainability: a Provenance-Based Approach

Rosana Leandro de Oliveira — 2024-02-21

The task of explaining the result of Machine Learning (ML) predictive models has become critically important nowadays, given the necessity to improve the results' reliability. Several techniques have been used to explain the prediction of ML models, and some research works explore the use of data provenance in ML cycle phases. However, there is a gap in relating the provenance data with model explainability provided by Explainable Artificial Intelligence (XAI) techniques. To address this issue, this work presents an approach to capture provenance data, mainly in the pre-processing phase, and relate it to the results of explainability techniques. To support that, a relational data model was also proposed and is the basis for our concept of data explainability. Furthermore, a graphic visualization was developed to better present the improved technique. The experiments' results showed that the improvement of the ML explainability techniques was reached mainly by the understanding of the attributes' derivation, which built the model, enabled by data explainability.

The Impact of Representation Learning on Unsupervised Graph Neural Networks for One-Class Recommendation

Marcos Paulo Silva Gôlo — 2024-02-22

We present a Graph Neural Network (GNN) using link prediction for One-class Recommendation. Traditional recommender systems require positive and negative interactions to recommend items to users, but negative interactions are scarce, making it challenging to cover the scope of non-recommendations. Our proposed approach explores One-Class Learning (OCL) to overcome this limitation by using only one class (positive interactions) to train and predict whether or not a new example belongs to the training class in enriched heterogeneous graphs. The paper also proposes an explainability model and performs a qualitative evaluation by applying the t-SNE algorithm to the learned embeddings. The analysis of the methods in a two-dimensional projection showed that our enriched graph neural network proposal was the only one that could separate the representations of users and items. Moreover, the proposed explainability method showed that the user nodes connected with the predicted item are the most important for recommending this item to another user. Another conclusion from the experiments is that the nodes added to enrich the graph also impact the recommendation.

Using Non-Local Connections to Augment Knowledge and Efficiency in Multiagent Reinforcement Learning: an Application to Route Choice

Ana L. C. Bazzan — 2024-02-29

Providing timely information to drivers is proving valuable in urban mobility applications. There have been several attempts to tackle this question, from the transportation engineering as well as the computer science point of view. In this paper we use reinforcement learning to let driver agents learn how to select a route. In previous works, vehicles and the road infrastructure exchange information to allow drivers to make better informed decisions. In the present paper, we provide extensions in two directions. First, we use non-local information to augment the knowledge that some elements of the infrastructure have. By non-local we mean information that does not come from the immediate neighborhood. This is done by constructing a graph in which the elements of the infrastructure are connected according to a similarity measure regarding patterns. Patterns here relate to a set of different attributes: we consider not only travel time, but also the emission of gases. The second extension refers to the environment: the road network now contains signalized intersections. Our results show that using augmented information leads to more efficiency. In particular, we measure travel time and emission of CO over time, and show that the agents learn to use routes that reduce both these measures and that, when non-local information is used, the learning task is accelerated.

Two Meta-learning approaches for noise filter algorithm recommendation

Pedro B. Pio — 2024-02-23

Preprocessing techniques can increase the predictive performance, or even allow the use, of Machine Learning (ML) algorithms. This occurs because many of these techniques, such as noise removal or filtering, can improve the quality of a dataset. However, it is not simple to identify which preprocessing techniques to apply to a given dataset. This work presents two approaches to recommend a noise filtering technique using meta-learning. Meta-learning is an automated machine learning (AutoML) method that can, based on a set of features extracted from a dataset, induce a meta-model able to predict the most suitable technique to be applied to a new dataset. The first approach returns a ranking of the noise filter techniques using regression models. The second sequentially applies multiple meta-models to decide the most suitable noise filter technique for a particular dataset. For both approaches we extract the meta-features from synthetic datasets and use as meta-label the F1-score value obtained by different ML algorithms when applied to these datasets. For the experiments, eight noise filtering techniques were used. The experimental results indicated that the ranking approach achieved a higher performance gain than the baseline, while the second obtained higher predictive performance. The ranking-based approach also ranked the best algorithm in the top-3 positions with high predictive accuracy.

Legal Document Segmentation and Labeling Through Named Entity Recognition Approaches

Gabriel M. C. Guimarães — 2024-02-23

The document segmentation task allows us to divide documents into smaller parts, known as segments, which can then be labeled within different categories. This problem can be divided into two steps: the extraction and the labeling of these segments. We tackle the problem of document segmentation and segment labeling focusing on official gazettes or legal documents. They have a structure that can benefit from token classification approaches, especially Named Entity Recognition (NER), since they are divided into labeled segments. In this study, we use word-based and sentence-based CRF, CNN-CNN-LSTM and CNN-biLSTM-CRF models to bring together text segmentation and token classification. To validate our experiments, we propose a new annotated data set named PersoSEG composed of 127 documents in Portuguese from the Official Gazette of the Federal District, published between 2001 and 2015, with a Krippendorff's alpha agreement coefficient of 0.984. As a result, we observed a better performance for word-based models, especially with the CRF architecture, which achieved an average F1-Score of 75.65% for 12 different categories of segments.

Empirical Comparison of EEG Signal Classification Techniques through Genetic Programming-based AutoML: An Extended Study

Icaro M. Miranda — 2024-02-27

Machine Learning (ML) applications using complex data often need multiple preprocessing techniques and predictive models to find a solution that meets their needs. In this context, Automated Machine Learning (AutoML) techniques help to provide automated data preparation and modeling and improve ML pipelines. AutoML can follow different strategies, among them Genetic Programming (GP). GP stands out for its ability to create pipelines of arbitrary format, with high interpretability and the ability to customize information from the data domain context. This paper presents a comparative study of two AutoML approaches optimized with GP for the time series classification problem and its characterization through four domain-based feature sets. We selected Electroencephalogram (EEG) signals as a case study due to their high complexity, spatial and temporal co-variance, and non-stationarity. Our data characterization shows that using only spectral or time-domain features is unsuitable for achieving high-performance pipelines. Our results reveal how AutoML can generate more accurate and interpretable solutions than the literature's complex or ad hoc models. The proposed approach facilitates the analysis of dimensional reduction through fitness convergence, tree depth, and generated features.

How to balance financial returns with metalearning for trend prediction

Alvaro Valentim Pereira de Menezes Bandeira — 2024-02-27

The prediction of market price movement is an essential tool for decision-making in trading scenarios. However, there are several candidate methods for this task. Metalearning can be an important ally for the automatic selection of methods, which can be machine learning algorithms for classification tasks, named here classification algorithms. In this work, we present the use of metalearning for classification in market movement prediction and elaborate new analyses of its statistical implications. Different setups and metrics were evaluated for the meta-target selection. Cumulative return was the metric that achieved the best meta and base-level results. According to the experimental results, metalearning was a competitive selection strategy for predicting market price movement. This work is an extension of Bandeira et al. [2022].
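To make the role of cumulative return as a meta-target more concrete, the sketch below compounds per-period returns for a set of up/down predictions and labels a dataset with the algorithm that earns the most. The trading rule (hold the asset only when an upward move is predicted), the function names, and the toy numbers are all assumptions made for illustration; the abstract does not specify the setup used in the paper.

```python
def cumulative_return(returns, predictions):
    """Compound the period returns earned while following the predictions.

    Assumed rule: prediction 1 means "hold the asset this period",
    prediction 0 means "stay out" (zero return for that period).
    """
    growth = 1.0
    for r, p in zip(returns, predictions):
        if p == 1:
            growth *= 1.0 + r
    return growth - 1.0


def pick_meta_target(returns, predictions_by_algorithm):
    """Return the name of the algorithm with the highest cumulative return.

    In a metalearning setting, this name would be the label (meta-target)
    attached to the dataset's meta-features.
    """
    return max(
        predictions_by_algorithm,
        key=lambda name: cumulative_return(returns, predictions_by_algorithm[name]),
    )


if __name__ == "__main__":
    period_returns = [0.10, -0.05, 0.02]   # hypothetical asset returns
    candidates = {
        "algorithm_a": [1, 0, 1],          # hypothetical predictions
        "algorithm_b": [1, 1, 1],
    }
    print(pick_meta_target(period_returns, candidates))
```

Ranking by cumulative return rather than by accuracy rewards an algorithm for being right on the periods with the largest price moves, which is one plausible reason a financial metric could serve well as a selection criterion.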

Instance hardness measures for classification and regression problems

Gustavo P. Torquette — 2024-02-27

While the most common approach in Machine Learning (ML) studies is to analyze the performance achieved on a dataset through summary statistics, a fine-grained analysis at the level of its individual instances can provide valuable information for the ML practitioner. For instance, one can inspect whether the instances which are hardest to have their labels predicted might have any quality issues that should be addressed beforehand; or one may identify the need for more powerful learning methods for addressing the challenge imposed by one or a set of instances. This paper formalizes and presents a set of meta-features for characterizing which instances of a dataset are the hardest to have their label predicted accurately and why they are so, aka instance hardness measures. While there are already measures able to characterize instance hardness in classification problems, there is a lack of work devoted to regression problems. Here we present and analyze instance hardness measures for both classification and regression problems according to different perspectives, taking into account the particularities of each of these problems. For validating our results, synthetic datasets with different sources and levels of complexity are built and analyzed, indicating what kind of difficulty each measure is able to better quantify. A Python package containing all implementations is also provided.
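As a rough illustration of what an instance hardness measure looks like for classification, the sketch below computes k-Disagreeing Neighbors (kDN), a measure from the wider instance hardness literature: the fraction of an instance's k nearest neighbors that carry a different label. Whether this exact measure is among those in the paper's Python package is not stated in the abstract, so the function name, signature, and toy data here are assumptions.

```python
import math


def kdn(X, y, k=3):
    """k-Disagreeing Neighbors score for every instance.

    X is a sequence of numeric feature tuples, y the matching labels.
    Scores near 1 flag instances surrounded by other classes (hard);
    scores near 0 flag instances deep inside their own class (easy).
    Requires at least two instances.
    """
    scores = []
    for i, xi in enumerate(X):
        # Euclidean distance to every other instance, nearest first.
        dists = sorted(
            (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        disagree = sum(y[j] != y[i] for j in neighbours)
        scores.append(disagree / len(neighbours))
    return scores


if __name__ == "__main__":
    # Two well-separated groups plus one class-1 point placed inside class 0.
    X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (0.5, 0.5)]
    y = [0, 0, 0, 1, 1, 1, 1]
    for point, label, score in zip(X, y, kdn(X, y, k=3)):
        print(point, label, round(score, 2))
```

The last point, a class-1 instance sitting among class-0 neighbors, receives the maximum score, which is the kind of instance a practitioner might inspect for label noise.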

LiPSet: A Comprehensive Dataset of Labeled Portuguese Public Bidding Documents

Mariana O. Silva — 2024-04-05

Collecting, processing, and organizing governmental public documents pose significant challenges due to their diverse sources and formats, complicating data analysis. In this context, this work introduces LiPSet, a comprehensive dataset of labeled documents from Brazilian public bidding processes in Minas Gerais state. We provide an overview of the data collection process and present a methodology for data labeling that includes a meta-classifier to assist in the manual labeling process. Next, we perform an exploratory data analysis to summarize the key features and contributions of the LiPSet dataset. We also showcase a practical application of LiPSet by employing it as input data for classifying bidding documents. The results of the classification task exhibit promising performance, demonstrating the potential of LiPSet for training neural network models. Finally, we discuss various applications of LiPSet and highlight the primary challenges associated with its utilization.

Datasets for Portuguese Legal Semantic Textual Similarity

Daniel da Silva Junior — 2024-04-05

The Brazilian judiciary faces a significant workload, leading to prolonged durations for legal proceedings. In response, the Brazilian National Council of Justice introduced the Resolution 469/2022, which provides formal guidelines for document and process digitalization, thereby creating the opportunity to implement automatic techniques in the legal field. These techniques aim to assist with various tasks, especially managing the large volume of texts involved in law procedures. Notably, Artificial Intelligence (AI) techniques open room to process and extract valuable information from textual data, which could significantly expedite the process. However, one of the challenges lies in the scarcity of datasets specific to the legal domain required for various AI techniques. Obtaining such datasets is difficult as they require some expertise for labeling. To address this challenge, this article presents four datasets from the legal domain: two include unlabelled documents and metadata, while the other two are labeled using a heuristic approach designed for use in textual semantic similarity tasks. Additionally, the article presents a small ground truth dataset generated from domain expert annotations to evaluate the effectiveness of the proposed heuristic labeling process. The analysis of the ground truth labels highlights that conducting semantic analysis of domain-specific texts can be challenging, even for domain experts. Nonetheless, the comparison between the ground truth and heuristic labels demonstrates the utility and effectiveness of the heuristic labeling approach.

Wiki Evolution dataset applicability: English Wikipedia revision articles represented by quality attributes

Ana Luiza Sanches — 2024-04-05

This paper presents the creation of the Wikipedia article's evolution dataset. This dataset is a set of revisions of articles, represented by quality attributes and quality classification. This dataset can be used for studies regarding automatic quality classification that consider the article revision history as well as understanding how the content and quality of articles evolve over time in this collaborative platform. To illustrate a potential application, this study provides a practical example of utilizing a Machine Learning model trained on the constructed dataset.

Workflow for the acquisition, processing, and dissemination of Brazilian public data focused on education

Abílio Nogueira Barros — 2024-04-05

This article aims to demonstrate the process of creating public databases focused on the educational and population areas. It describes the process of obtaining data from official government sources such as INEP (National Institute for Educational Studies and Research) and IBGE (Brazilian Institute of Geography and Statistics), the procedures for data adaptation and optimization to create their historical series, as well as the best practices followed for their development and the generated metadata. It highlights the specificities of the education and population themes, reporting the challenges and peculiarities of each dataset. It also reports the results that can already be directly obtained from each dataset and how, when combined, they can track indicators of the National Education Plan, one of the largest Brazilian public policies focused on education.

Indicators and Municipal Data: A Database for Evaluating the Efficiency of Public Expenditures

Paula Guelman Davis — 2024-04-06

This article describes the construction of a database with financial and operational data from Brazilian municipalities. Public data were collected regarding expenses by function (education, health, public security, among others), indicators and other data that reflect the municipal situation in the areas of education, health, public security, development, sanitation and finance. Data from various sources were integrated and transformed to allow studies on the correlation between performance indicators of the effectiveness of public governance and the corresponding expenditures, in order to follow up and assess the effects of public policies.