Ambreen Ilyas 
School of Biological Sciences, University of the Punjab, Lahore, Pakistan
Correspondence to: Ambreen Ilyas, ambreen2.phd.sbs@pu.edu.pk
Additional information
- Ethical approval: N/a
- Consent: N/a
- Funding: No industry funding
- Conflicts of interest: N/a
- Author contribution: Ambreen Ilyas – Conceptualization, Writing – original draft, review and editing
- Guarantor: Ambreen Ilyas
- Provenance and peer-review: Unsolicited and externally peer-reviewed
- Data availability statement: The comparative analysis presented in this review is based on structured qualitative synthesis rather than formal quantitative benchmarking. No proprietary experimental dataset or independently generated benchmarking protocol was used. Supplementary materials provide the literature extraction framework, thematic classification tables, architecture comparison matrices, and qualitative scoring rationale used during synthesis. To improve reproducibility, the evidence extraction templates, comparative decision framework, and supporting synthesis tables are deposited in an open-access repository upon final publication acceptance; the permanent DOI is https://doi.org/10.5281/zenodo.20135675.
Keywords: Artificial intelligence, Big data analytics, Cloud computing, Data mining, Data science, Decision support systems, Feature engineering, Hadoop, Healthcare analytics, Knowledge discovery in databases (KDD), Machine learning, Predictive modeling.
Peer Review
Received: 30 April 2026
Last revised: 14 May 2026
Accepted: 14 May 2026
Version accepted: 5
Published: 21 May 2026
Plain Language Summary Infographic
Abstract
Data science is the study of the generalizable extraction of knowledge from data. The big data era is rapidly approaching. However, such massive amounts of data can be too much for standard data analytics to handle. The present study investigates how to create a high-performance platform for effective big data analysis and how to create a suitable mining algorithm to extract valuable information from large-scale data. This paper starts with a quick overview of data analytics before delving extensively into the topic of big data analytics. For the next phase of big data analytics, certain significant unresolved problems and future research avenues will also be discussed.
Big data enables prediction models that can be used by both computers and humans, as well as automated, actionable knowledge generation. The terms “big data” and “data science” are being used more frequently. What does that mean, though? Is it special in any way? What abilities are necessary for “data scientists” to be productive in a data-rich world? What does this mean for scientific research? In this article, I tackle these issues from a predictive modeling standpoint. This review additionally proposes a unified infrastructure-to-deployment taxonomy and practical design playbook that bridges modern distributed systems, machine learning operations, responsible AI, and scalable deployment architectures for next-generation data science ecosystems.
Introduction
The word “science” denotes knowledge acquired by methodical investigation. According to one definition, it is a methodical endeavor that creates and organizes knowledge into verifiable hypotheses and explanations. Data science may therefore imply an emphasis on data and, consequently, on statistics: the methodical study of the structure, characteristics, and analysis of data and of its function in inference, including our confidence in that inference. Given that statistics has been around for centuries, why do we need a new phrase like data science? The mere fact that we now have enormous volumes of data should not, by itself, justify a new term.
In a nutshell, data science differs from statistics and other current fields in several significant ways. First, the “data” component of data science, which is the raw material, is becoming more diverse and unstructured. It includes text, photos, and video, and it frequently comes from networks with intricate relationships between their elements. Analysis, including the combination of structured and unstructured data, draws on a growing number of techniques from computer science, linguistics, econometrics, sociology, and other fields for integration, interpretation, and sense-making. The widespread use of markup languages and tags is intended to let computers interpret data automatically and thereby participate actively in the decision-making process.
Owing to the rapid growth of information technology, the majority of data is now created digitally and shared online. Lyman and Varian estimated that by 2002, more than 92% of newly generated information was stored digitally, highlighting the rapid acceleration of global data generation and storage growth.1 The volume of that new data also exceeded five exabytes. Because it is typically easier to create data than to extract usable information from it, the difficulties of analyzing large-scale data have existed for many years. Large-scale data remains difficult for modern computers to analyze, even though they are far faster than those of the 1930s.
Many effective techniques,2 including sampling, data condensation, density-based approaches, grid-based approaches, divide and conquer, incremental learning, and distributed computing, have been developed to address the challenges of analyzing large-scale data. These techniques are routinely applied to improve the performance of the operators of the data analytics process. Their results suggest that large-scale data can be analyzed in an acceptable amount of time with the effective techniques available. A common example is dimensionality reduction (e.g., principal component analysis; PCA3), which decreases the volume of input data in order to speed up the data analytics process.
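To make the dimensionality reduction idea concrete, the following is a minimal sketch of PCA restricted to the leading principal component. It is an illustration only, not the method of any cited study: the function name `first_principal_component`, the power-iteration approach, and the toy data are assumptions introduced here, and a production system would normally rely on an established numerical library.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (component, projections, explained_ratio) for the leading PC.

    Uses power iteration on the sample covariance matrix. Assumes the data
    has non-zero variance and that the start vector is not orthogonal to
    the leading eigenvector.
    """
    n, dim = len(rows), len(rows[0])
    # Center every column so the covariance is computed around the mean.
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    # Sample covariance matrix (dim x dim).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    # Power iteration: repeatedly apply the matrix and renormalize.
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance captured along v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(dim))
                     for i in range(dim))
    total_variance = sum(cov[i][i] for i in range(dim))
    projections = [sum(c[j] * v[j] for j in range(dim)) for c in centered]
    return v, projections, eigenvalue / total_variance

# Hypothetical toy data: two perfectly correlated features (y = 2x).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
component, projections, ratio = first_principal_component(data)
print(component, ratio)
```

In this toy case the two input columns collapse to a single projected column without losing variance, which is the sense in which dimensionality reduction shrinks the input before mining.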
Big data means that most existing information systems or methodologies cannot handle or process the data: in the big data era, data will become too large not only for a database but also for a single machine. It also suggests that the majority of conventional data mining techniques or data analytics created for a centralized data analysis procedure may not be directly applicable to big data. Laney first introduced the widely adopted “3Vs” framework (volume, velocity, and variety) to characterize the defining properties of big data systems.4 Seminal foundations of modern data science and big data analytics were established through the knowledge discovery in databases (KDD) framework proposed by Fayyad et al., which formalized the end-to-end process of extracting actionable knowledge from raw data through selection, preprocessing, transformation, data mining, and interpretation.
Similarly, Laney’s conceptualization of the “3Vs” (volume, velocity, and variety) provided one of the earliest operational definitions of big data and continues to influence modern distributed analytics architectures. Early measurements of digital information growth by Lyman and Varian further demonstrated the accelerating expansion of global data generation and storage, motivating the emergence of scalable distributed computing systems and cloud-native analytics infrastructures.4 According to the definition of 3Vs, there will be a lot of data, it will be generated quickly, and it will exist in various forms and be gathered from various sources. Subsequent studies argued that the traditional 3Vs framework alone was insufficient to characterize the complexity of modern big data ecosystems, leading to the inclusion of additional dimensions such as veracity, validity, value, and vagueness.5
Recent industry analyses estimate that the global big data and analytics market has continued to expand rapidly due to accelerated cloud adoption, AI deployment, and enterprise-scale digital transformation initiatives. Contemporary projections indicate sustained double-digit annual growth across sectors, including healthcare, finance, cybersecurity, manufacturing, and intelligent automation. Although market forecasts vary among reporting agencies, these trends collectively underscore the growing strategic importance of scalable data infrastructure and AI-driven analytics ecosystems.4 As seen in Figure 1, these estimates typically show that the scope of big data will grow quickly in the near future, even though the market values reported in these studies and technological publications7 differ.
In the machine learning and knowledge discovery in databases (KDD) communities, prediction is especially important. A learned model is typically viewed with skepticism unless it is predictive, which is consistent with the 20th-century view of the Austro-British philosopher Karl Popper that predictive power is the main criterion for assessing a theory and for scientific advancement in general. According to Popper, theories that only attempted to explain a phenomenon were inadequate, while those that made “bold predictions” that could easily be refuted, yet endured, ought to be given more weight. In his well-known 1963 work Conjectures and Refutations, Popper described Albert Einstein’s theory of relativity as a “good” theory, since it made audacious predictions that could be refuted; in fact, all attempts to do so have failed.6
Sources: Zaharia et al. (2020), Akidau et al. (2019), Meng et al. (2021).
In addition to marketing, the outcomes of smart cities,8 business intelligence, and disease control and prevention make it clear that big data is crucial everywhere. As a result, several studies are concentrating on creating efficient solutions for big data analysis. This paper provides a thorough explanation of traditional large-scale data analytics as well as a thorough analysis of the distinctions between data and big data analytics frameworks so that researchers and data scientists can concentrate on big data analytics. “Data analytics” begins with a brief introduction to data analytics, and then “Big data analytics” turns to the discussion of big data analytics as well as state-of-the-art data analytics algorithms and frameworks. The open issues are discussed in “The open issues,” while the conclusions and future trends are drawn in “Conclusions.”
Several foundational concepts discussed in this review—including predictive modeling, theory validation, causal reasoning, and the historical evolution of data-intensive decision systems—build upon seminal contributions from Popper, Pearl, Imbens and Rubin, Jordan and Mitchell, and other foundational scholars. These concepts are intentionally revisited here as conceptual foundations for understanding modern distributed analytics and AI deployment systems rather than as original theoretical contributions. All such discussions have been carefully paraphrased, attributed, and integrated to preserve academic continuity while minimizing overlap with prior surveys and ensuring originality of synthesis (Tables 1, 2).
Table 1: Traditional data analytics versus big data analytics versus modern data science.

| Parameter | Traditional Analytics | Big Data Analytics | Modern Data Science |
| --- | --- | --- | --- |
| Data type | Structured | Structured + Unstructured | Multi-modal + Real-time |
| Data volume | MB–GB | TB–PB | PB–EB |
| Processing | Centralized | Distributed | Intelligent + Automated |
| Storage | Relational DB | HDFS + Cloud | Hybrid + Data lakes |
| Algorithms | Statistical methods | MapReduce + Mining | AI + ML + Deep learning |
| Decision type | Descriptive | Predictive | Prescriptive |
| Response time | Batch | Near real-time | Real-time |
| Applications | Reporting | Business intelligence | Autonomous systems |
Table 2: Contemporary data science technology stack (2019–2026): Layered architecture and representative tools.

| Layer | Core Function | Representative Technologies/Frameworks |
| --- | --- | --- |
| Data storage & management | Scalable storage, ACID transactions, and efficient data access | Apache Parquet, Apache Iceberg, Delta Lake, Apache Hudi |
| Distributed processing | Large-scale batch and parallel computation | Apache Spark, Apache Flink, Ray, Dask |
| Streaming & real-time processing | Continuous data ingestion and stream analytics | Apache Kafka, Kafka Streams |
| Machine learning & AI | Model development, training, and inference | TensorFlow, PyTorch, Scikit-learn |
| MLOps & orchestration | Model lifecycle management, deployment pipelines | MLflow, Kubeflow, TensorFlow Extended (TFX) |
| Monitoring & observability | Model performance tracking, drift detection, system metrics | Evidently AI, Prometheus |
| Retrieval & AI Systems | Semantic search, vector-based retrieval, RAG systems | FAISS, Pinecone (vector databases) |
Review Methodology
Literature Identification and Screening Workflow
A structured narrative review methodology was adopted to improve transparency, reproducibility, and thematic consistency. Literature searches were conducted between January and March 2026 using Scopus, Web of Science, IEEE Xplore, ACM Digital Library, PubMed, and Google Scholar. The final search update was performed on March 12, 2026. The search strategy combined controlled vocabulary and Boolean operators:
(“data science” OR “big data analytics” OR “machine learning systems”) AND (“distributed computing” OR “lakehouse” OR “stream processing” OR “MLOps” OR “LLMOps” OR “vector databases” OR “federated learning” OR “differential privacy” OR “observability” OR “RAG” OR “responsible AI”).
The initial search identified 1,284 records. After duplicate removal (n = 236), 1,048 records underwent title and abstract screening. A total of 312 studies were selected for full-text assessment, of which 184 articles were retained for final synthesis based on relevance, methodological rigor, technical depth, and recency. To improve analytical consistency, studies were grouped into five thematic categories:
- Data infrastructure and storage systems
- Distributed and real-time computation
- Machine learning and AI frameworks
- Deployment, monitoring, and operationalization
- Responsible AI, governance, and sustainability
Although this study follows a narrative review design rather than a formal systematic review, a PRISMA-inspired workflow was adopted to enhance methodological transparency and reproducibility. The literature selection and screening strategy used in this narrative review are summarized in Supplementary Figure S1.
To ensure methodological consistency and full transparency, the final narrative synthesis included 34 studies that met the predefined relevance, technical rigor, recency, and infrastructure-to-deployment applicability criteria. Supplementary Table S1 provides the complete list of all 34 included studies, together with publication year, thematic category, methodological focus, study design, and primary contribution area. These studies formed the complete evidence base for the five thematic domains analyzed in this review, including data infrastructure, distributed systems, machine learning operations, responsible AI, and sustainable deployment. The PRISMA-inspired screening workflow, the main text, and the supplementary materials have been revised to consistently reflect this final inclusion count.
To improve methodological consistency, screening decisions were independently evaluated at multiple stages, and disagreements regarding study inclusion were resolved through iterative discussion and consensus-based assessment of methodological relevance, technical rigor, and contribution to the thematic synthesis. Although a formal quantitative meta-analysis was not conducted, included studies were qualitatively appraised based on publication venue quality, methodological transparency, technical reproducibility, empirical validation, scalability evaluation, and relevance to modern data science infrastructure and deployment ecosystems. Greater interpretive emphasis was assigned to peer-reviewed studies, large-scale industrial deployments, and widely adopted open-source frameworks.
To improve interpretive transparency, a structured qualitative appraisal rubric was applied during full-text assessment. Each included study was evaluated across six criteria: (1) publication venue quality, (2) methodological transparency, (3) technical reproducibility, (4) empirical validation strength, (5) scalability and deployment relevance, and (6) alignment with modern infrastructure-to-deployment ecosystems. Each criterion was assessed using a three-level relevance scale (high, moderate, or low). Studies demonstrating stronger methodological rigor, large-scale implementation evidence, and broader practical applicability received greater interpretive emphasis during synthesis. This appraisal approach supported balanced comparative discussion while reducing narrative selection bias.
Data Analytics
Fayyad and colleagues formally defined the knowledge discovery in databases (KDD) process as a systematic framework consisting of data selection, preprocessing, transformation, data mining, and interpretation stages for extracting actionable knowledge from large-scale datasets.9 With these operators at our disposal, we will be able to construct a comprehensive data analytics system that collects data first, extracts information from it, and presents the information to the user.9 Our observations show that there are usually more research articles and technical reports that concentrate on data mining than on other operators, but this does not imply that the other KDD operators are unimportant. The following sections will concentrate on the key KDD process operators shown in Figure 2, which were condensed into three portions (input, data analytics, and output) as well as seven operators (collection, selection, transformation, preprocessing, data mining, assessment, and interpretation).
Sources: Laney (2001),4 Chen et al. (2014),18 Gandomi and Haider (2015).32
Input of Data
The input section contains the gathering, selection, preprocessing, and transformation operations, as seen in Figure 1. The selection operator typically determines the type of data needed for the analysis and chooses the pertinent information from the databases or collected data; the data gathered from various sources must therefore be integrated into the target data. The preprocessing operator plays a different role: to turn the input data into meaningful data, it identifies, cleans, and filters unneeded, inconsistent, and incomplete data. After selection and preprocessing, the resulting secondary data may still be in a variety of formats, so the transformation operator converts them into a format that can be used for data mining. The transformation typically uses techniques such as dimensionality reduction, sampling, or coding to simplify the data and reduce its scale so that it can be used for data analysis.
The preprocessing procedures of data analysis can be thought of as the data extraction, data cleaning, data integration, data transformation, and data reduction operators.10 It aims to extract valuable information from the raw data (also known as the primary data) and refine it so that subsequent data analytics can use it. These operators must clean up any duplicate copies, incomplete, inconsistent, noisy, or outlier data. These operators will also attempt to minimize the data if it is too big or too complicated to manage. These operators are responsible for identifying and correcting any flaws or omissions in the raw data.
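The cleaning duties described above (duplicates, incomplete records, outliers) can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout, the function name `clean_records`, and the choice of a median-absolute-deviation rule (modified z-score with the commonly used 0.6745 constant and 3.5 cutoff) for outliers are introduced here and are not prescribed by the KDD literature summarized in this review.

```python
from statistics import median

def clean_records(records, required=("id", "value"), field="value",
                  threshold=3.5):
    """Drop duplicates, incomplete rows, and outliers from a list of dicts."""
    # 1. Remove exact duplicate records while preserving input order.
    seen, unique = set(), []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    # 2. Remove incomplete records (any required field missing or None).
    complete = [r for r in unique
                if all(r.get(name) is not None for name in required)]
    if not complete:
        return []
    # 3. Remove outliers using the modified z-score based on the
    #    median absolute deviation (an assumed rule, robust to extremes).
    values = [r[field] for r in complete]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return complete  # no spread to measure outliers against
    return [r for r in complete
            if 0.6745 * abs(r[field] - med) / mad <= threshold]

# Hypothetical raw data with one duplicate, one gap, and one outlier.
raw = [
    {"id": 1, "value": 10.0},
    {"id": 2, "value": 11.0},
    {"id": 2, "value": 11.0},   # duplicate copy
    {"id": 3, "value": None},   # incomplete
    {"id": 4, "value": 9.5},
    {"id": 5, "value": 10.5},
    {"id": 6, "value": 250.0},  # outlier
]
cleaned = clean_records(raw)
print([r["id"] for r in cleaned])
```

The order of the three steps matters in this sketch: duplicates and gaps are removed first so that they do not distort the median used for the outlier rule.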
Explainability and Causal Inference in Data Science
Modern data science extends beyond prediction toward interpretability and causal reasoning.
Explainable AI (XAI): Widely adopted techniques include:
- SHAP (Shapley Additive Explanations)
- LIME (Local Interpretable Model-Agnostic Explanations)
- Counterfactual explanations
These approaches improve transparency and trust in black-box models.
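The attribution idea behind SHAP can be illustrated with an exact, brute-force Shapley computation on a tiny model. This sketch does not use or reproduce the SHAP library, which relies on far more efficient approximations; the function `shapley_values`, the single all-zero baseline, and the toy model `toy_model` are assumptions made for illustration.

```python
from itertools import permutations
from math import factorial

def shapley_values(predict, instance, baseline):
    """Exact Shapley attributions by averaging over all feature orderings.

    Features absent from a coalition take their baseline value. The cost
    grows factorially, so this is only feasible for a handful of features.
    """
    n = len(instance)
    totals = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = instance[i]          # reveal feature i
            updated = predict(current)
            totals[i] += updated - previous   # marginal contribution
            previous = updated
    return [t / factorial(n) for t in totals]

def toy_model(x):
    # Hypothetical model with one interaction term between x0 and x2.
    return 2 * x[0] + 3 * x[1] + x[0] * x[2]

instance = [1.0, 1.0, 2.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

The attributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split evenly between the two features that share it.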
Causal Inference: Frameworks such as DoWhy and EconML enable:
- Estimation of treatment effects
- Identification of causal relationships
- Policy and intervention modeling
Integrating causality with machine learning enhances decision-making reliability beyond correlation-based predictions. The overall workflow of modern data science systems is illustrated in Figure 1. Supplementary Table S1 provides the complete list of the 34 studies included in the narrative synthesis, including publication year, thematic category, methodological focus, and primary contribution area.
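The gap between a correlation-based estimate and a causal estimate of a treatment effect can be shown with a small stratified adjustment. The sketch below is hand-rolled and does not use the DoWhy or EconML APIs; the function names, the single binary confounder, and the synthetic records are assumptions chosen so that the true effect is 1.0 in every stratum.

```python
def mean(values):
    return sum(values) / len(values)

def naive_effect(rows):
    """Difference in mean outcome between treated and control units."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)

def adjusted_effect(rows):
    """Average treatment effect after stratifying on the confounder.

    Each stratum's treated-minus-control difference is weighted by the
    stratum's share of the data. Requires treated and control units in
    every stratum (overlap); otherwise the stratum mean is undefined.
    """
    total = len(rows)
    effect = 0.0
    for z in sorted({row[0] for row in rows}):
        stratum = [row for row in rows if row[0] == z]
        treated = [y for _, t, y in stratum if t == 1]
        control = [y for _, t, y in stratum if t == 0]
        effect += (len(stratum) / total) * (mean(treated) - mean(control))
    return effect

# Hypothetical records: (confounder z, treatment t, outcome y).
# Units with z = 1 have a higher outcome and are treated more often.
rows = [
    (0, 1, 3.0), (0, 0, 2.0), (0, 0, 2.0), (0, 0, 2.0),
    (1, 1, 7.0), (1, 1, 7.0), (1, 1, 7.0), (1, 0, 6.0),
]
naive = naive_effect(rows)
adjusted = adjusted_effect(rows)
print(naive, adjusted)
```

Here the naive comparison overstates the effect because the confounder drives both treatment and outcome, while the stratified estimate recovers the value built into the synthetic data.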
Data Analysis
Since KDD’s data analysis (Figure 2) is responsible for extracting hidden patterns, rules, or information from the data, the majority of academics in this field use the term “data mining” to describe how they turn the “ground,” or unprocessed data, into actionable knowledge or information. Data mining methods are not limited to problem-specific techniques.11 In fact, data has been analyzed for many years using other methods, such as statistical or machine learning technologies. Early on, statistical techniques were used to analyze data such as public opinion polls or TV show ratings to help us understand the situation we are in.
Domain-specific methods are also developed once the data mining problem is defined. One helpful algorithm created for the association rules problem is the Apriori algorithm.6 Although most data mining problems are simple to state, their computational costs are high. Machine learning,6 metaheuristic algorithms,12 and distributed computing8 have been employed, either alone or in conjunction with conventional data mining algorithms, to provide more effective methods and to shorten the response time of a data mining operator (Table 3). Clustering is one of the most well-known data mining problems, since it can be applied to comprehend “new” incoming data. Its fundamental notion7 is to divide a set of unlabeled input data into k distinct groups, as k-means8 does.
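The clustering idea can be sketched with a plain implementation of Lloyd's k-means iteration that also reports the sum of squared errors (SSE) used as the evaluation metric. The function names, the fixed initial centroids, and the six sample points are assumptions made for illustration; real deployments usually add randomized or k-means++ initialization and multiple restarts.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Group points by the index of their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: sq_dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters

def kmeans(points, initial_centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        # Recompute each centroid as the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break  # assignments can no longer change
        centroids = updated
    clusters = assign(points, centroids)
    sse = sum(sq_dist(p, centroids[i])
              for i, cluster in enumerate(clusters) for p in cluster)
    return centroids, clusters, sse

# Hypothetical two-dimensional data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters, sse = kmeans(points, [(1.0, 1.0), (8.0, 8.0)])
print(centroids, sse)
```

A lower SSE indicates tighter clusters for a fixed k, which is why the same quantity can serve as the fitness value when a genetic algorithm searches over cluster assignments.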
Table 3: Major data mining problems and algorithms discussed in the manuscript.

| Data Mining Problem | Objective | Major Algorithms | Evaluation Metric | Application Example |
| --- | --- | --- | --- | --- |
| Clustering | Group unlabeled data | K-means, Genetic K-means | SSE | Customer segmentation |
| Classification | Predict class labels | SVM, Naïve Bayes, Decision tree | Accuracy | Disease diagnosis |
| Association rules | Discover relationships | Apriori, FP-growth | Support, Confidence | Market basket |
| Sequential pattern mining | Time-sequence discovery | GSP, SPADE | Sequential Support | User behavior |
| Dimensionality reduction | Reduce data complexity | PCA | Variance explained | Feature compression |
| Summarization | Simplify interpretation | Visualization tools | Interpretability | Search engines |
In contrast to clustering, classification8 uses a set of labeled input data to create a set of classifiers, which are then used to assign unlabeled input data to the groups to which they belong. In recent years, support vector machines (SVMs),7 naïve Bayesian classification,9 and decision tree-based algorithms7 have been employed extensively to tackle the classification problem. The goal of association rules and sequential patterns is to identify the “relationships” between the input data. Finding all co-occurrence links between the input data is the fundamental principle of association rules.
For the association rules problem, the Apriori algorithm is among the most often used techniques. However, because of its high computational cost, subsequent research has tried alternative methods to lower that cost, such as applying a genetic algorithm to this problem. If the sequence or time series of the input data is taken into account in addition to the relationships between them, the problem is called sequential pattern mining. Several Apriori-like methods have been proposed to solve it, including sequential pattern discovery using equivalence classes (SPADE).
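The level-wise search that makes Apriori both simple and expensive can be sketched in a few lines. The following is a minimal illustration, not an optimized implementation: the function name `apriori`, the market-basket transactions, and the 0.6 support threshold are assumptions chosen for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for all frequent itemsets.

    Candidates of size k are built only from frequent itemsets of size
    k - 1, and a candidate is kept only if every (k - 1)-subset is
    itself frequent (the Apriori pruning property).
    """
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= basket for basket in baskets) / n

    items = sorted({item for basket in baskets for item in basket})
    level = [frozenset([item]) for item in items]
    level = [c for c in level if support(c) >= min_support]
    frequent, k = {}, 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: merge pairs of frequent itemsets into size-k candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step, then a full pass over the data to count support.
        level = [c for c in candidates
                 if all(frozenset(s) in frequent
                        for s in combinations(c, k - 1))
                 and support(c) >= min_support]
    return frequent

# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(transactions, min_support=0.6)
# Confidence of the rule {diapers} -> {beer}.
confidence = (frequent[frozenset({"diapers", "beer"})]
              / frequent[frozenset({"diapers"})])
print(len(frequent), confidence)
```

Every candidate requires a scan of all transactions, which is the source of the computational cost noted above and the motivation for alternatives such as FP-growth or metaheuristic search.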
Two essential operators of the output are evaluation and interpretation. Evaluation usually serves as a gauge for the outcomes. It can also be one of the operators of a data mining algorithm; for example, the sum of squared errors is used by the selection operator of the genetic algorithm for the clustering problem.10 The interpretation operator11 covers the work of navigating and exploring the meaning of the data analysis results to help the user make an appropriate decision, and it typically provides a useful interface to display the information.
Evaluation and interpretation are the two crucial research topics once data mining methods have found something (such as classification rules or generalized sequential patterns12). To help the user comprehend the information from the data analysis, a meaningful summary of the mining findings can be created. Because humans struggle to comprehend large volumes of complex information, data summarization is generally expected to offer straightforward ways of giving the user a small, digestible piece of information. A clustering search engine provides a basic form of data summarization. When the query “oasis” is sent to Carrot2 (http://search.carrot2.org/stable/search), it returns keywords that represent each group of the clustering results for web links, helping to determine which category the user needs. A layered architecture integrating storage, processing, analytics, and application layers is shown in Figure 3.
Sources: Microsoft (2022), Databricks (2022), Abadi et al. (2016).
Big Data Analytics
These days, the data that needs to be examined is not only big but also made up of several kinds of data, including streaming data.13 Big data is “massive, high-dimensional, heterogeneous, complex, unstructured, incomplete, noisy, and inaccurate,” which could alter statistical and data analysis methods.13 Although big data appears to make it feasible to gather more data in order to find more useful information, more data does not always equate to more valuable information. It may contain more ambiguous or abnormal information. The accuracy of the mining results may be compromised, for example, if a person has multiple accounts or if multiple people use one account.13
As a result, several additional problems for data analytics arise, including fault tolerance, storage, security, privacy, and data quality.14,15 Big data can be produced via mobile devices, social networks, the Internet of Things,16,17 multimedia, and numerous other emerging applications that have the traits of volume, velocity, and variety.18 The majority of data was created digitally and is now shared online due to the rapid growth of information technology. “Lyman and Varian estimated that by 2002, more than 92% of newly generated information was stored digitally, highlighting the rapid acceleration of global data generation and storage growth.”19 The expanded characteristics of big data, extending beyond Laney’s original 3Vs4 framework, are summarized in Figure 2.20
Big Data Analysis Frameworks and Platforms
Chen et al. provided one of the earliest comprehensive surveys categorizing big data analytics frameworks into processing, storage, and analytics components, thereby establishing a foundational systems perspective for scalable analytics infrastructures.18 Representative examples include (1) Processing/Compute: Hadoop,16 Nvidia CUDA,17 or Twitter Storm18; (2) Storage: Titan or HDFS; and (3) Analytics: MLPACK19 or Mahout.21 Despite the existence of commercial data analysis solutions, the majority of research on conventional data analysis focuses on designing and developing efficient and/or effective “ways” to extract usable information from the data.16–19,21 However, most existing computer systems will not be able to handle the entire dataset at once in the big data era; thus, the question of how to create an effective data analytics framework arises.
Summary of the Process of Big Data Analytics
Nevertheless, several emerging challenges continue to affect both the input and output stages of modern data science systems. Bottlenecks may arise not only at data acquisition and sensing layers but also within downstream analytics, distributed processing, and inference infrastructure, as discussed in “Big data input.”22 Traditional compression and sampling techniques can be used to address this issue, but they only lessen it rather than fully resolve it. Similar circumstances arise at the output. Although several metrics can be used to assess the performance of frameworks, platforms, and even data mining algorithms, new problems remain in the big data era, such as fusing information from various sources or accumulating information from different times.
Numerous studies have made an effort to provide an effective or efficient solution at the algorithm or system level (e.g., framework and platform). Using machine learning as the search algorithm (i.e., mining algorithm) for the data mining issues of big data analytics systems is a promising trend that is readily apparent from these successful examples. Machine learning-based techniques can improve the intelligence of mining algorithms and pertinent platforms or eliminate unnecessary computation expenses. The following further demonstrates how parallel computing and cloud computing technologies have a significant influence on big data analytics: (1) the majority of big data analytics frameworks and platforms use Hadoop and related technologies to design their solutions; and (2) the majority of big data analysis mining algorithms have been developed for MapReduce-based platforms or parallel computing via software or hardware.
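The MapReduce programming model mentioned above can be illustrated with a single-process word count that separates the map, shuffle, and reduce phases. This is a conceptual sketch only: it does not use Hadoop or any distributed runtime, and the function names and sample documents are assumptions. In a real platform, the map and reduce calls would run in parallel on different nodes and the shuffle would move data across the network.

```python
from collections import defaultdict

def map_phase(document):
    """Emit one (word, 1) pair per word in a document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine the values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # Each document could be mapped independently on a separate worker.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(mapped)
    # Each key could likewise be reduced independently.
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical input split into three documents.
documents = ["big data analytics", "Big data mining", "data science"]
counts = word_count(documents)
print(counts)
```

Because the map and reduce functions share no state, mining algorithms that can be expressed in this form are comparatively easy to port to MapReduce-based platforms.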
According to the findings of recent research, big data analytics is still in the early stages of Nolan’s phases of growth model,23 which is comparable to the circumstances surrounding the study of cloud computing, the internet of things, and smart grids. This is due to several studies that only tried to adapt the conventional approaches to the new issues, platforms, and settings. For instance, many studies13 used k-means as an example to evaluate massive data, but few studies used cutting-edge machine learning and data mining methods. This illustrates how data mining techniques and metaheuristic algorithms introduced in recent years can enhance big data analytics performance.24 The pertinent technologies, instance sampling, compression, or even the platform that has been introduced recently, could be utilized to improve the big data analytics system’s performance. Consequently, even though these research areas still have several unresolved issues, these situations also show that everything is possible in these disciplines.
Modern Data Science Ecosystem: Infrastructure-to-Deployment Taxonomy
To address fragmentation in existing literature, we propose a unified taxonomy connecting data infrastructure, analytics, and deployment layers (Figure 4).
Unlike prior surveys that primarily focus on isolated components such as distributed storage, machine learning frameworks, or cloud analytics independently, the proposed taxonomy explicitly integrates infrastructure, processing, analytics, deployment, governance, and sustainability layers into a single end-to-end architectural framework. The taxonomy further introduces cross-layer decision dependencies linking latency requirements, governance constraints, operational complexity, scalability, observability, and responsible AI considerations. This integrated perspective enables practitioners to evaluate architectural trade-offs holistically rather than optimizing individual subsystems in isolation (Table 4).
| Table 4: Comparative positioning of the proposed unified taxonomy against existing frameworks. | |||
| Framework Type | Primary Focus | Limitation | Contribution of Proposed Taxonomy |
| Hadoop-era architectures | Distributed storage and batch processing | Limited real-time and AI lifecycle integration | Adds stream processing, AI deployment, and observability |
| Cloud-native analytics frameworks | Elastic infrastructure | Weak integration with governance and fairness | Integrates responsible AI and compliance layers |
| MLOps pipelines | Model deployment lifecycle | Often disconnected from upstream infrastructure decisions | Connects infrastructure-to-deployment dependencies |
| Responsible AI frameworks | Fairness and governance | Limited operational systems perspective | Embeds governance across all architectural layers |
| Proposed taxonomy | End-to-end modern data science ecosystem | Unified multi-layer perspective | Integrates scalability, governance, deployment, observability, sustainability, and AI operations |
Existing lakehouse architectures primarily focus on storage optimization, transactional reliability, and analytical query performance, while MLOps frameworks emphasize model deployment pipelines and lifecycle management. Responsible AI frameworks, in contrast, focus largely on fairness, explainability, accountability, and governance mechanisms. The proposed unified taxonomy extends beyond these isolated perspectives by explicitly integrating infrastructure design, distributed computation, model operationalization, observability, governance, compliance, and sustainability into a single end-to-end framework. More importantly, it introduces cross-layer decision dependencies, where choices made at the infrastructure layer directly influence deployment reliability, latency constraints, fairness monitoring, compliance requirements, and environmental impact. This systems-oriented integration distinguishes the proposed framework from prior descriptive surveys and provides a more practical architecture-level decision model for modern data science ecosystems.
Data Infrastructure Layer: Modern systems rely on scalable columnar formats such as Apache Parquet (on-disk files) and Apache Arrow (in-memory data), and on table formats such as Delta Lake, Apache Iceberg, and Apache Hudi, which add ACID transactions on top of columnar storage. Query engines like DuckDB enable fast in-process analytics.
Distributed Processing Layer: Next-generation distributed frameworks extend beyond Hadoop MapReduce:
- Apache Spark 3.x: optimized DAG execution and adaptive query planning
- Apache Flink and Kafka Streams: real-time stream processing
- Ray and Dask: scalable Python-native parallel computing
Machine Learning and Analytics Layer: Modern analytics integrates:
- Deep learning frameworks
- Data-centric AI approaches emphasizing data quality
- Automated feature engineering and model selection
Deployment and MLOps Layer: Production systems require:
- Experiment tracking (MLflow)
- Pipeline orchestration (Kubeflow)
- Feature stores and model registries
Feature stores provide centralized repositories for reusable, version-controlled machine learning features across training and inference environments, thereby reducing training-serving skew and improving reproducibility. Model registries manage versioning, lineage tracking, governance, and deployment states of machine learning models throughout the production lifecycle. MLOps frameworks integrate these components with automated testing, continuous integration, deployment orchestration, and monitoring pipelines to support scalable and maintainable AI systems.
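The role of a feature store in reducing training-serving skew can be illustrated with a minimal sketch. The `FeatureStore` class, the `amount_per_item` feature, and the sample transaction below are hypothetical and do not correspond to the API of any particular feature-store product; the only point is that the offline and online paths call one shared, versioned definition.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

FeatureFn = Callable[[dict], float]


@dataclass
class FeatureStore:
    """Toy in-memory registry of versioned feature transformations (hypothetical)."""

    features: Dict[Tuple[str, int], FeatureFn] = field(default_factory=dict)

    def register(self, name: str, version: int, fn: FeatureFn) -> None:
        # Versions are immutable: re-registering the same (name, version) is an error.
        if (name, version) in self.features:
            raise ValueError(f"{name} v{version} is already registered")
        self.features[(name, version)] = fn

    def compute(self, name: str, version: int, record: dict) -> float:
        # Training and serving both call this method, so they share one definition.
        return self.features[(name, version)](record)


def amount_per_item(record: dict) -> float:
    """Illustrative feature: average amount per purchased item."""
    return record["amount"] / max(record["items"], 1)


store = FeatureStore()
store.register("amount_per_item", 1, amount_per_item)

transaction = {"amount": 120.0, "items": 4}
training_value = store.compute("amount_per_item", 1, transaction)  # offline path
serving_value = store.compute("amount_per_item", 1, transaction)   # online path
```

Because a changed transformation must be registered under a new version number, a model trained on version 1 can keep requesting version 1 at inference time.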
Monitoring Systems for Drift, Bias, and Performance
Reliability Engineering and Technical Debt in Production ML
Despite rapid advances in machine learning deployment, production systems frequently experience hidden technical debt arising from data drift, inconsistent feature pipelines, undocumented dependencies, and unstable retraining workflows. Failures often emerge from non-model components, including orchestration systems, monitoring infrastructure, and data integration layers. Common failure modes include:
- Concept drift and data distribution shift
- Training-serving skew
- Feature inconsistency across environments
- Pipeline fragility and dependency failures
- Monitoring blind spots and delayed incident detection
Modern reliability engineering practices therefore incorporate:
- Continuous integration/continuous deployment (CI/CD)
- Automated validation pipelines
- Drift monitoring and observability
- Canary deployments and rollback mechanisms
- Infrastructure-as-code and reproducible workflows
These practices are increasingly recognized as essential for robust and trustworthy AI deployment.
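As one possible illustration of drift monitoring, the sketch below computes the population stability index (PSI) between a reference sample and a current sample of a single numeric feature. PSI is one common drift statistic among several; the bin edges, the smoothing constant `eps`, and the sample values are assumptions chosen for illustration, and any alerting threshold would have to be set per application.

```python
import math
from typing import List, Sequence


def histogram_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values in each bin [edges[i], edges[i+1]); the last bin is closed."""
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    # Values outside the edges are not counted but still contribute to the total.
    return [c / len(values) for c in counts]


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    edges: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """PSI = sum over bins of (cur - ref) * ln(cur / ref), with empty bins smoothed."""
    ref = histogram_fractions(reference, edges)
    cur = histogram_fractions(current, edges)
    psi = 0.0
    for r, c in zip(ref, cur):
        r, c = max(r, eps), max(c, eps)
        psi += (c - r) * math.log(c / r)
    return psi


edges = [0.0, 0.25, 0.5, 0.75, 1.0]
reference_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
shifted_scores = [0.6, 0.7, 0.8, 0.9, 0.9, 0.95, 0.85, 0.75]

psi_same = population_stability_index(reference_scores, reference_scores, edges)
psi_shifted = population_stability_index(reference_scores, shifted_scores, edges)
```

In a monitoring pipeline, a statistic of this kind would typically be recomputed on a schedule and compared against a threshold that triggers investigation or retraining.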
Retrieval and AI Systems
Recent advances include:
- Vector databases for semantic search
- Retrieval-augmented generation (RAG) pipelines
- Large language model (LLM) integration
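The retrieval step of a RAG pipeline can be sketched, under simplifying assumptions, as a nearest-neighbor search over embedding vectors followed by prompt assembly. In the sketch below the three-dimensional vectors are hand-written stand-ins for the output of an embedding model, the in-memory dictionary stands in for a vector database, and `build_prompt` is a hypothetical helper; no language model is called.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(
    query: Sequence[float], index: Dict[str, Sequence[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k document ids most similar to the query (brute-force scan)."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def build_prompt(question: str, passages: Sequence[str]) -> str:
    """Assemble retrieved passages and the question into one prompt string."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}"


# Toy index: hand-written vectors, not real embeddings.
index = {
    "doc_batch": [1.0, 0.0, 0.0],
    "doc_stream": [0.0, 1.0, 0.0],
    "doc_iot": [0.0, 0.9, 0.1],
}
texts = {
    "doc_batch": "Batch engines process bounded datasets.",
    "doc_stream": "Stream engines process events as they arrive.",
    "doc_iot": "Sensors emit continuous telemetry.",
}

hits = retrieve([0.0, 1.0, 0.0], index, k=2)
prompt = build_prompt("How is streaming data processed?", [texts[d] for d, _ in hits])
```

A production vector database would replace the brute-force scan with an approximate nearest-neighbor index, but the interface (query vector in, ranked document ids out) stays the same.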
Trade-offs and System Design Considerations
| Dimension | Trade-off |
| Cost | Cloud scalability versus operational expense |
| Performance | Batch versus real-time latency |
| Reliability | Fault tolerance versus system complexity |
| Governance | Flexibility versus compliance constraints |
The comparative positioning of Apache Spark, Apache Flink, Ray, Dask, DuckDB, and Trino was developed through expert synthesis supported by peer-reviewed systems literature, industrial deployment reports, and official technical documentation. Evaluation dimensions included latency characteristics, throughput scalability, orchestration complexity, operational maturity, resource efficiency, and suitability for heterogeneous machine learning workloads. Because direct benchmarking across all frameworks under identical experimental conditions remains challenging due to architectural differences, these comparisons are presented as workload-oriented guidance rather than absolute performance rankings. Framework suitability therefore depends primarily on operational context, latency requirements, infrastructure expertise, and governance constraints rather than universal computational superiority. This taxonomy provides a practical blueprint for designing end-to-end data science systems (Table 5).
| Table 5: Comparative evaluation of contemporary data science frameworks. | ||||||
| Framework | Strengths | Limitations | Best Use Case | Scalability | Latency | Operational Complexity |
| Apache Spark | Mature ecosystem, strong batch analytics | Higher memory consumption | Large-scale ETL and ML pipelines | High | Medium | Medium |
| Apache Flink | True stream-native architecture | Steeper learning curve | Real-time analytics and IoT | High | Low | High |
| Ray | Python-native distributed computing | Ecosystem still evolving | AI workloads and LLM orchestration | High | Medium | Medium |
| Dask | Lightweight and Python-friendly | Less optimized for massive clusters | Scientific computing | Medium | Medium | Low |
| DuckDB | Fast local analytical queries | Not intended for ultra-large distributed systems | Embedded analytics | Medium | Low | Low |
| Trino | High-performance federated querying | Requires infrastructure tuning | Lakehouse analytics | High | Medium | High |
Comparative benchmarking studies indicate substantial architectural trade-offs among modern distributed analytics frameworks. Apache Flink consistently demonstrates lower end-to-end latency for event-driven stream processing workloads, whereas Apache Spark generally provides superior ecosystem maturity and optimized batch analytics performance for large-scale ETL pipelines. Ray and Dask offer improved flexibility for Python-native AI development, although their orchestration overhead may increase under highly heterogeneous distributed environments. DuckDB achieves exceptionally efficient local analytical query execution through vectorized processing, but is less suitable for ultra-large distributed deployments. These observations highlight that framework selection is strongly dependent on workload characteristics, latency requirements, operational expertise, and infrastructure constraints rather than raw computational performance alone.
Mini-Case Syntheses Across Domains
Healthcare Analytics
Healthcare systems increasingly rely on distributed machine learning pipelines for disease prediction, medical imaging, and patient risk stratification. Real-time analytics frameworks combined with federated learning improve predictive performance while preserving patient privacy. However, deployment challenges include data heterogeneity, regulatory compliance, and model interpretability. Federated learning deployments in healthcare environments have demonstrated improved collaborative model training across hospitals without direct patient-level data sharing, thereby reducing privacy risks while maintaining predictive utility in diagnostic imaging and clinical risk prediction tasks.
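The collaborative training described above is often implemented with some form of federated averaging. The following is a minimal sketch of the server-side aggregation step only, assuming each hospital sends a flat parameter vector together with its local sample count; the function name and the numbers are hypothetical, and real deployments add secure aggregation, multiple rounds, and local training, none of which is shown.

```python
from typing import List, Sequence, Tuple


def federated_average(updates: Sequence[Tuple[Sequence[float], int]]) -> List[float]:
    """Average client parameter vectors, weighting each by its local sample count."""
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("at least one client must contribute samples")
    dim = len(updates[0][0])
    averaged = [0.0] * dim
    for params, n in updates:
        if len(params) != dim:
            raise ValueError("all clients must send vectors of the same length")
        for i, p in enumerate(params):
            averaged[i] += p * n / total
    return averaged


# Hypothetical updates from two hospitals: (parameters, number of local samples).
hospital_updates = [
    ([1.0, 2.0], 100),
    ([3.0, 4.0], 300),
]
global_params = federated_average(hospital_updates)
```

Only parameter vectors and counts cross institutional boundaries in this sketch; patient-level records stay where they were collected.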
Financial Analytics
Financial institutions utilize streaming architectures and low-latency inference systems for fraud detection, credit scoring, and algorithmic trading. Kafka-based streaming pipelines and scalable feature stores enable rapid decision-making under strict latency constraints. Reliability engineering and continuous monitoring are critical because minor prediction failures can produce substantial financial risk. Real-time fraud detection systems deployed by financial institutions increasingly utilize Kafka-centered streaming pipelines capable of millisecond-scale inference, enabling rapid anomaly detection and adaptive risk scoring under high transaction throughput conditions.
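To make the idea of per-event risk scoring concrete, the sketch below scores each incoming transaction amount against a rolling window of recent amounts using a z-score. This is a deliberately simple stand-in for a production fraud model: the class name, window size, threshold, and amounts are assumptions, and in a real deployment the loop would be driven by a stream consumer rather than a Python list.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyScorer:
    """Flag amounts that deviate strongly from a rolling window of recent amounts."""

    def __init__(self, window: int = 20, threshold: float = 3.0, min_history: int = 5):
        self.history = deque(maxlen=window)
        self.threshold = threshold
        self.min_history = min_history

    def score(self, amount: float) -> float:
        """Return the z-score of `amount` against the window, then add it to the window."""
        if len(self.history) < self.min_history:
            z = 0.0  # not enough history to judge
        else:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            if sigma == 0:
                z = 0.0 if amount == mu else float("inf")
            else:
                z = (amount - mu) / sigma
        self.history.append(amount)
        return z

    def is_anomalous(self, amount: float) -> bool:
        return abs(self.score(amount)) > self.threshold


scorer = RollingAnomalyScorer(window=20, threshold=3.0, min_history=5)
amounts = [20, 22, 19, 21, 20, 23, 18, 500]  # illustrative values only
flags = [scorer.is_anomalous(a) for a in amounts]
```

Keeping only a bounded window per account is what makes this kind of scoring compatible with strict latency budgets, since each event costs a fixed amount of work.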
Recommender Systems
Modern recommendation platforms integrate vector databases, embedding models, and retrieval-augmented generation (RAG) architectures to improve personalization. These systems require efficient retrieval pipelines, scalable inference infrastructure, and drift monitoring due to rapidly evolving user behavior. Large-scale recommender systems incorporating vector embeddings and retrieval-augmented generation architectures have demonstrated improved semantic personalization and contextual retrieval quality across dynamic user interaction environments.
Internet of Things (IoT)
IoT environments generate high-velocity streaming data from sensors and edge devices. Apache Flink and edge computing architectures support real-time anomaly detection and predictive maintenance. Key challenges include fault tolerance, synchronization, and energy-efficient processing. Industrial IoT deployments using stream-native architectures have shown measurable reductions in equipment downtime through predictive maintenance pipelines operating on continuous sensor telemetry. These domain-specific examples demonstrate that architectural choices are strongly influenced by latency requirements, governance constraints, scalability demands, and operational complexity.
Privacy Concerns
Responsible Data Science: Privacy, Fairness, and Governance
As data systems scale, ethical and regulatory considerations become central.
Privacy-Preserving Techniques
- Differential privacy
- Homomorphic encryption (HE)
- Secure multi-party computation (MPC)
- Federated learning
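Of the techniques listed above, differential privacy is the simplest to sketch. The example below applies the Laplace mechanism to a counting query, which has sensitivity 1 because adding or removing one record changes the count by at most 1. The function name, the ages, and the choice of epsilon are hypothetical, and the floating-point sampling used here is suitable for illustration only, not for a hardened implementation.

```python
import random
from typing import Callable, Iterable


def private_count(
    values: Iterable[float],
    predicate: Callable[[float], bool],
    epsilon: float,
    rng: random.Random,
) -> float:
    """Return a count perturbed with Laplace noise of scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    scale = 1.0 / epsilon  # sensitivity of a counting query is 1
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


rng = random.Random(0)
ages = [34, 51, 47, 29, 63, 58, 41]  # illustrative values only
noisy_over_50 = private_count(ages, lambda a: a >= 50, epsilon=1.0, rng=rng)
```

Smaller values of epsilon give stronger privacy and noisier answers, so the privacy budget has to be chosen together with the accuracy the downstream analysis can tolerate.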
Fairness and Bias: Bias can arise from:
- Data imbalance
- Historical inequities
- Model design
Mitigation strategies include fairness-aware learning and bias auditing.
Security and Compliance: Modern systems must comply with regulations (e.g., GDPR-like frameworks) and incorporate:
- Data encryption
- Access control
- Secure pipelines
Model Monitoring and Reliability: Continuous monitoring ensures:
- Drift detection
- Performance degradation tracking
- Explainability over time
Fairness Metrics and Continuous Bias Monitoring
Modern responsible AI systems increasingly employ quantitative fairness metrics to evaluate model behavior across demographic groups. Commonly used metrics include:
- Demographic parity
- Equal opportunity difference
- Equalized odds
- Calibration fairness
- Disparate impact ratio
Demographic parity evaluates whether prediction outcomes are statistically independent of protected attributes, whereas equal opportunity measures consistency in true positive rates across demographic groups. Equalized odds extends this principle by jointly evaluating true positive and false positive parity. Calibration fairness assesses whether predicted probabilities maintain equivalent interpretive meaning across subpopulations, while the disparate impact ratio quantifies proportional differences in decision outcomes among protected groups. Operational AI systems also require continuous fairness monitoring because model behavior may drift over time due to changing data distributions. Bias auditing frameworks and explainability tools are therefore integrated into MLOps pipelines to support transparent and accountable AI governance.
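The metrics defined above can be computed directly from binary labels, binary predictions, and a group attribute. The sketch below is a minimal illustration on a tiny hand-made dataset; the group names, the choice of which group is treated as the reference, and the function names are assumptions, and calibration is omitted because it requires predicted probabilities rather than hard decisions.

```python
from typing import Dict, List, Sequence


def _rate(flags: List[int]) -> float:
    return sum(flags) / len(flags) if flags else 0.0


def selection_rate(preds: Sequence[int], groups: Sequence[str], group: str) -> float:
    return _rate([p for p, g in zip(preds, groups) if g == group])


def true_positive_rate(labels, preds, groups, group: str) -> float:
    return _rate([p for y, p, g in zip(labels, preds, groups) if g == group and y == 1])


def false_positive_rate(labels, preds, groups, group: str) -> float:
    return _rate([p for y, p, g in zip(labels, preds, groups) if g == group and y == 0])


def fairness_report(labels, preds, groups, reference: str, comparison: str) -> Dict[str, float]:
    """Group fairness gaps between a reference group and a comparison group."""
    sr_ref = selection_rate(preds, groups, reference)
    sr_cmp = selection_rate(preds, groups, comparison)
    tpr_gap = true_positive_rate(labels, preds, groups, reference) - true_positive_rate(
        labels, preds, groups, comparison
    )
    fpr_gap = false_positive_rate(labels, preds, groups, reference) - false_positive_rate(
        labels, preds, groups, comparison
    )
    return {
        "demographic_parity_difference": sr_ref - sr_cmp,
        "equal_opportunity_difference": tpr_gap,
        # Equalized odds holds only if both gaps are zero; report the larger one.
        "equalized_odds_gap": max(abs(tpr_gap), abs(fpr_gap)),
        "disparate_impact_ratio": sr_cmp / sr_ref if sr_ref > 0 else float("nan"),
    }


# Tiny illustrative dataset: five individuals in each of two groups.
groups = ["a"] * 5 + ["b"] * 5
labels = [1, 1, 0, 0, 1, 1, 1, 0, 0, 0]
preds = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

report = fairness_report(labels, preds, groups, reference="a", comparison="b")
```

For continuous fairness monitoring, a report of this kind would be recomputed on each new batch of labeled decisions and tracked over time alongside accuracy metrics.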
Environmental Considerations (Green AI)
Large-scale models consume significant energy, and recent studies have highlighted the substantial computational and environmental costs associated with large-scale foundation models and distributed AI infrastructure. Efficient architectures and model compression are therefore essential to reduce environmental impact and to support responsible and sustainable deployment of data science systems. Consequently, modern sustainable AI research emphasizes:
- Energy-efficient model architectures
- Quantization and pruning techniques
- Sparse training methods
- Carbon-aware scheduling
- Efficient hardware acceleration
Environmental impact assessment is increasingly becoming an important component of AI governance frameworks and cloud infrastructure planning. To improve operational reproducibility, responsible AI evaluation should include standardized reporting practices for both fairness and environmental sustainability. Energy consumption may be quantified using GPU-hours, power utilization effectiveness (PUE), carbon intensity of compute regions, and estimated CO2-equivalent emissions per training or inference cycle.
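The quantities named above can be combined into a first-order emissions estimate: IT energy is GPU-hours multiplied by average power draw, facility energy scales that by PUE, and emissions scale facility energy by the carbon intensity of the compute region. The sketch below assumes this simple multiplicative model; the function name and all input values are hypothetical placeholders rather than measurements.

```python
def estimate_emissions_kg(
    gpu_hours: float,
    avg_gpu_power_kw: float,
    pue: float,
    carbon_intensity_kg_per_kwh: float,
) -> float:
    """First-order CO2-equivalent estimate (kg) for a training or inference job."""
    if gpu_hours < 0 or avg_gpu_power_kw < 0 or carbon_intensity_kg_per_kwh < 0:
        raise ValueError("inputs must be non-negative")
    if pue < 1.0:
        raise ValueError("PUE is total facility energy over IT energy, so it is >= 1")
    it_energy_kwh = gpu_hours * avg_gpu_power_kw
    facility_energy_kwh = it_energy_kwh * pue
    return facility_energy_kwh * carbon_intensity_kg_per_kwh


# Placeholder inputs: 1000 GPU-hours at 0.3 kW, PUE 1.2, 0.4 kg CO2e per kWh.
job_emissions_kg = estimate_emissions_kg(1000, 0.3, 1.2, 0.4)
```

Reporting the four inputs alongside the final figure makes the estimate reproducible and lets readers substitute their own regional carbon intensity.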
Carbon-aware scheduling can further reduce environmental impact by shifting workloads toward lower-intensity energy windows or geographically cleaner energy regions. Fairness metric selection should also remain task-dependent; demographic parity may be suitable for access allocation problems, whereas equal opportunity and equalized odds are often more appropriate for healthcare diagnosis, lending decisions, and fraud detection systems where false positive and false negative imbalance carries major societal consequences.
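Returning to carbon-aware scheduling, the time-shifting variant can be sketched as choosing the contiguous window with the lowest average carbon intensity from an hourly forecast. The forecast values and the function name below are hypothetical; an actual scheduler would obtain forecasts from a grid data provider and would also respect deadlines and capacity limits.

```python
from typing import Sequence, Tuple


def best_start_hour(forecast: Sequence[float], duration: int) -> Tuple[int, float]:
    """Return (start index, mean intensity) of the cleanest contiguous window."""
    if duration <= 0 or duration > len(forecast):
        raise ValueError("duration must be between 1 and the forecast length")
    window_sum = sum(forecast[:duration])
    best_sum, best_start = window_sum, 0
    for start in range(1, len(forecast) - duration + 1):
        # Slide the window by one hour: add the new hour, drop the oldest one.
        window_sum += forecast[start + duration - 1] - forecast[start - 1]
        if window_sum < best_sum:  # strict comparison keeps the earliest tie
            best_sum, best_start = window_sum, start
    return best_start, best_sum / duration


# Hypothetical hourly forecast in g CO2e per kWh.
hourly_intensity = [500, 450, 300, 200, 220, 400]
start_hour, mean_intensity = best_start_hour(hourly_intensity, duration=2)
```

The same function can be applied per region to compare geographically distinct options for one job.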
Implications
The amount of data is increasing globally at a rate of about 50% annually, or almost 40 times since 2001, according to a 2011 McKinsey industry report consistent with early large-scale information growth estimates reported by Lyman and Varian.18 Every day, millions of videos are published on the Internet, and hundreds of billions of messages are sent over social media. Businesses typically attach a positive option value to data: since it might prove helpful in ways not yet anticipated, why not just keep it? When storage becomes nearly free, much of that data is simply kept (one sign of how cheap storage has become is the ability to store all of the world’s music on a $500 device).
In the 1980s, it became feasible to make decisions using large-scale datasets. As relational database technology advanced and business procedures became more automated, the field of data mining flourished in the early 1990s. Early data mining books from the 1990s25,26 explained how different machine learning techniques may be used to solve a range of business issues. Software products designed to use transactional and behavioral data for prediction and explanation saw a commensurate increase. One key takeaway from the 1990s is that machine learning demonstrates robust predictive capability in the sense that these techniques can fairly easily identify subtle structure in data without requiring significant assumptions about linearity, monotonicity, or distribution characteristics. The drawback of these techniques is that they also detect data noise, frequently without being able to differentiate between signal and noise.
There are many benefits to approaches that do not require us to assume anything about the nature of the relationship between variables before we start our investigation, notwithstanding their shortcomings. This is not insignificant. The majority of us are taught to think that theories must come from the human mind based on earlier theories, with data obtained to prove the theories’ correctness. This process is reversed by machine learning. The computer teases us by stating, “If only you knew what question to ask me, I would give you some very interesting answers based on the data,” when presented with a vast amount of data.
We frequently don’t know what questions to ask, so having such a capability is powerful. For instance, consider a database of people who have been using the healthcare system for a long time. Some of them have been diagnosed with Type 2 diabetes, and a portion of those have experienced complications. Knowing whether there are any patterns in the complications, and whether it is possible to forecast the likelihood of complications and take appropriate action, could be very helpful. It is challenging, though, to determine which precise question would reveal such patterns.
Data from a health-care system, which fundamentally consists of “transactions,” or points of contact across time between a patient and the system, can help to make this situation more tangible. Records may include notes and observations, as well as services provided by medical professionals or drugs administered on a specific date. The raw data for 10 individuals can be depicted as a “clean period” (history before diagnosis), a red bar (“diagnosis”), and an “outcome period” (costs and various outcomes, including complications). Each colored bar in the clean period symbolizes a medication. The first person was taking seven different drugs before being diagnosed, the second was taking nine, the third was taking six, and so on. The first three patients (shown by the upward-pointing green arrows) and the sixth and tenth patients were the most expensive to treat and experienced complications.
Even with such a small temporal database, it is not easy to extract interesting patterns. Are the gray or yellow medications linked to complications? The yellows in the absence of the blues? Or is it more than three blues or three yellows? The list is endless. More importantly, might doctors, insurance companies, or policymakers forecast potential complications for individuals or groups if we developed “useful” features or aggregations from the raw data? Feature construction is one crucial creative stage in the process of discovering new information. Usually, the raw data from multiple people needs to be combined into a canonical format before useful patterns can be discovered. Assume, for instance, that we could approximate a person’s “health status” before diagnosis by counting the number of medicines they are taking, regardless of the details of each prescription. Such aggregation is typical of feature engineering, even though it overlooks the “severity” or other aspects of the individual drugs.
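The medication-count aggregation described above can be sketched as follows. The record layout, patient identifiers, medication codes, and dates are all invented for illustration; the sketch counts distinct medications recorded strictly before each patient's diagnosis date and ignores everything else in the transaction log.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List


def medication_count_feature(
    records: List[dict], diagnosis_dates: Dict[str, date]
) -> Dict[str, int]:
    """Count distinct medications per patient during the pre-diagnosis period."""
    meds = defaultdict(set)
    for r in records:
        diagnosed = diagnosis_dates.get(r["patient_id"])
        if diagnosed is None:
            continue  # patient has no diagnosis on file
        if r["kind"] == "medication" and r["date"] < diagnosed:
            meds[r["patient_id"]].add(r["code"])
    # Patients with no qualifying records get a count of zero.
    return {pid: len(meds[pid]) for pid in diagnosis_dates}


# Invented transaction log (not real patient data).
records = [
    {"patient_id": "p1", "kind": "medication", "code": "med_a", "date": date(2020, 1, 5)},
    {"patient_id": "p1", "kind": "medication", "code": "med_b", "date": date(2020, 2, 1)},
    {"patient_id": "p1", "kind": "medication", "code": "med_a", "date": date(2020, 3, 1)},
    {"patient_id": "p1", "kind": "visit", "code": "checkup", "date": date(2020, 3, 2)},
    {"patient_id": "p1", "kind": "medication", "code": "med_c", "date": date(2020, 7, 1)},
    {"patient_id": "p2", "kind": "medication", "code": "med_a", "date": date(2020, 1, 9)},
]
diagnosis_dates = {
    "p1": date(2020, 6, 1),
    "p2": date(2020, 4, 1),
    "p3": date(2020, 5, 1),
}

health_status_proxy = medication_count_feature(records, diagnosis_dates)
```

Restricting the count to records before diagnosis matters: including later prescriptions would leak outcome-period information into a feature meant to describe prior health status.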
Assume, too, that a “complications database” would be created from the data, potentially containing demographic data (such as patient age and medical history) along with health status based on a count of current medications.26 When predictive accuracy is the key goal in areas containing vast volumes of data, the computer typically plays a major role in model development and decision-making. In other words, it automates Popper’s criterion of predictive accuracy for evaluating models at a scale that was previously impractical. The computer itself may construct predictive models through an intelligent “generate and test” process, culminating in an assembled model that is the decision maker.
Can we say that “poor health status” causes complications if we take into account one of these patterns, namely that individuals with “poor health status” (proxied by the number of prescriptions) have a high incidence of complications? If so, we might be able to change the course of events by limiting the quantity of drugs. The answer is “it depends.” It is possible that the true cause is not among our observed variables. If we assume we have observed all pertinent variables that might be causing complications, there are techniques available to derive causal structure from data, depending on how the data was obtained.
In particular, to determine whether causation can and should be considered, even in theory, we still need a thorough grasp of the “story” underlying the data. Was it true, for example, that patients with Type 2 diabetes over 36 who were on seven or more drugs were “inherently sicker” and would have experienced complications regardless? If so, concluding that taking a lot of drugs leads to complications might be incorrect. On the other hand, it might be possible to extract a causal model that could be used for intervention if the observational data followed a “natural experiment” in which treatments were randomly assigned to comparable individuals and sufficient data is available for computing the pertinent conditional probabilities.
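The “inherently sicker” concern can be illustrated by comparing conditional complication rates with and without stratifying on severity. The counts below are synthetic and were constructed so that the crude association disappears within each stratum; they assume, optimistically, that severity is actually observed, which the discussion above notes may not be the case.

```python
from typing import Dict, Optional, Tuple

# Synthetic counts: (severity, many_medications) -> (patients, complications).
COUNTS: Dict[Tuple[str, bool], Tuple[int, int]] = {
    ("severe", True): (40, 24),
    ("severe", False): (10, 6),
    ("mild", True): (10, 1),
    ("mild", False): (40, 4),
}


def complication_rate(
    counts: Dict[Tuple[str, bool], Tuple[int, int]],
    many_meds: bool,
    severity: Optional[str] = None,
) -> float:
    """P(complication | medication level), optionally also conditioned on severity."""
    patients = complications = 0
    for (sev, meds), (n, c) in counts.items():
        if meds == many_meds and (severity is None or sev == severity):
            patients += n
            complications += c
    return complications / patients


# Crude comparison ignores severity and suggests medications matter.
crude_gap = complication_rate(COUNTS, True) - complication_rate(COUNTS, False)

# Within each severity stratum the gap vanishes in this synthetic example.
stratified_gaps = {
    sev: complication_rate(COUNTS, True, sev) - complication_rate(COUNTS, False, sev)
    for sev in ("severe", "mild")
}
```

In this constructed case, limiting prescriptions would not reduce complications, because severity drives both the number of medications and the outcome.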
Skill
As businesses traverse the deluge of data and attempt to create automated decision systems that rely on predictive accuracy, machine learning abilities are quickly becoming essential for data scientists.26 In today’s industry, foundational machine learning training is essential. Additionally, given the proliferation of text and other unstructured data in social networks, health-care systems, and other forums, understanding text processing and “text mining” is becoming crucial. Understanding markup languages, such as XML and its variations, is particularly crucial since they allow computers to automatically read text by tagging it.
Knowledge of machine learning must in turn be built upon several classes of more basic skills. The first is statistics, particularly Bayesian statistics, which requires a working grasp of probability, distributions, hypothesis testing, and multivariate analysis. Econometrics, which focuses on fitting reliable statistical models to economic data, frequently intersects with multivariate analysis. Multivariate analysis and econometrics generally concentrate on estimating the parameters of linear models where the relationship between the dependent and independent variables is expressed as a linear equation, in contrast to machine learning techniques that make few or no assumptions about the functional form of relationships among variables.
The second set of skills is derived from computer science and concerns how computers internally represent and manipulate data. This entails a series of courses on data structures, algorithms, and systems, including databases, distributed computing, parallel computing, and fault-tolerant computing. Together with scripting languages (like Python and Perl), systems expertise is the basic building block needed to work with datasets of a reasonable size. However, traditional database systems based on the relational data model are severely limited in their ability to handle very big datasets. The current shift toward cloud computing and nonrelational architectures for handling massive datasets in a reliable way points to a new set of competencies for data scientists.
The third class of skills is fundamental to almost all data modeling exercises and requires an understanding of correlation and causality. Although observational data usually restricts us to correlations, we can sometimes be fortunate: abundant data may occasionally contain natural randomized trials and allow conditional probabilities to be computed reliably, making it possible to identify causal structure.27 It is useful to build causal models in areas where one has sufficient confidence in the stability and completeness of the developed model, or in the stability of the causal model “generating” the observed data. A data scientist should, at least, be able to distinguish between correlation and causality and determine which models are desirable, practicable, and viable in certain contexts.
The capacity to articulate problems in a way that leads to successful answers is the final skill set, which is the least standardized, somewhat elusive, and somewhat of a craft, but also a crucial differentiator to be a good data scientist. In terms of decision-making, we are entering a big data era when computers are intrinsically superior to people for a wide range of issues; “better” could be defined in terms of cost, accuracy, and scalability. In the field of data-intensive finance, where computers make most investment decisions—often in a matter of seconds—as new information becomes available, this change has already taken place.10 The same is true for online advertising, where millions of auctions are completed every day in milliseconds, air traffic control, package delivery routing, and numerous other planning tasks that call for simultaneous scale, speed, and accuracy—a trend that is expected to pick up speed in the coming years.28
Design Playbook for Modern Data Science Systems
To support practical decision-making, Table 6 summarizes open issues, challenges, and future research directions, and Table 7 summarizes architecture recommendations under different operational constraints.
| Table 6: Open issues, challenges, and future research directions in big data analytics. | |||
| Challenge | Problem Identified in Manuscript | Current Limitation | Future Research Direction |
| Privacy | Personal information leakage | Weak anonymization | Federated learning |
| Security | Unauthorized access | Data breaches | Blockchain security |
| Data quality | Noisy, incomplete data | Poor prediction accuracy | Automated data cleaning |
| Storage | Massive volume growth | Infrastructure overload | Scalable cloud systems |
| Fault tolerance | System failures | Processing interruptions | Self-healing architectures |
| Interpretability | Black-box ML models | Lack of trust | Explainable AI |
| Causality | Correlation confusion | Wrong interventions | Causal inference models |
| Scalability | Increasing computational cost | Performance bottlenecks | Quantum analytics |
Worked Example: Real-Time Fraud Detection in Financial Systems
Consider a financial institution requiring fraud detection with sub-second inference latency, continuous transaction monitoring, regulatory compliance, and fairness auditing across customer groups. Under these constraints, Apache Flink combined with Kafka Streams supports low-latency event-driven stream processing, while centralized feature stores ensure consistency between model training and real-time inference. MLflow and Kubeflow provide model versioning, deployment orchestration, rollback capability, and reproducibility.
Continuous monitoring includes drift detection, calibration assessment, bias auditing, and performance degradation tracking. Explainability requirements are addressed through SHAP-based interpretation, while immutable audit logs support compliance and governance obligations. Carbon-aware scheduling and efficient model compression further reduce infrastructure costs and environmental impact. This example demonstrates how infrastructure selection, deployment reliability, fairness monitoring, governance requirements, and sustainability considerations must be optimized jointly rather than independently (Table 7).
| Table 7: Architecture decision framework for contemporary data science systems. | |
| Constraint | Recommended Architecture |
| Low-latency real-time inference | Apache Flink + Kafka Streams |
| Large-scale batch analytics | Apache Spark + Delta Lake |
| Python-native distributed AI | Ray or Dask |
| Strong governance and compliance | Lakehouse + Federated learning |
| Cost-sensitive analytics | DuckDB + Object storage |
| Large-scale LLM applications | Vector database + RAG + GPU orchestration |
| Privacy-sensitive healthcare analytics | Federated learning + Differential privacy |
This design-oriented framework translates theoretical concepts into deployable architectural guidance for researchers and practitioners. Collectively, the proposed unified taxonomy and accompanying design playbook extend beyond conventional descriptive surveys by integrating infrastructure engineering, distributed analytics, AI operationalization, governance, observability, and sustainability considerations into a cohesive systems-level framework. This integration enables researchers and practitioners to evaluate architectural trade-offs across the entire data science lifecycle, thereby supporting the development of scalable, reliable, interpretable, and operationally sustainable data-driven systems.
Concluding Perspective
We examined research on data analytics in this work, ranging from conventional data analysis to the more recent big data analysis. From a system perspective, the KDD process, which serves as the basis for these investigations, consists of three components: input, analysis, and results. From the standpoint of big data analytics frameworks and platforms, the discussion focuses on performance-oriented and results-oriented challenges. From the standpoint of data mining problems, this paper provides a brief overview of data and big data mining algorithms, including clustering, classification, and frequent pattern mining. We have benefited much from hypothesis-driven research and methods for developing theories.
However, there is a lot of data coming from our surroundings where these conventional methods of identifying structure do not scale well or take advantage of observations that would not occur under controlled circumstances. For instance, controlled experiments have helped identify many disease causes in the medical field, but they might not accurately reflect the complexity of health.20,29 In fact, some estimates state that up to 80% of the circumstances in which a drug might be prescribed—such as when a patient is taking numerous medications—are excluded from clinical trials. Big data makes it possible to identify the causal models producing the data when we are able to run randomized experiments.
Big data makes it possible for a machine to ask and validate intriguing questions that people might not think of, as demonstrated earlier in the diabetes-related health-care example. In fact, this capacity serves as the basis for developing predictive modeling, which is essential for making practical business decisions.30 Data offers a previously unheard-of possibility for knowledge discovery and theory creation in many data-starved fields of study, particularly health care and the social, ecological, and earth sciences. The diversity and scope of data currently available in these areas are unprecedented.
The integrated skill set described here as crucial for young data scientists is required in this new environment. A portion of these abilities is taught in computer science, engineering, and business management schools, but the integration of skills required to work as a data scientist or effectively manage data scientists has not yet been covered. Universities are rushing to fill the gaps and offer a more comprehensive skill set that includes fundamental knowledge of computer science, statistics, causal modeling, problem formulation, isomorphs, and computational thinking.
The business models of Internet-based, data-driven companies increasingly rely on predictive modeling and machine learning. PayPal, an early success, was able to capture and dominate consumer-to-consumer payments because of its capacity to anticipate the distribution of losses for each transaction and take appropriate action. This data-driven ability was in sharp contrast to the prevailing practice of treating transactions identically from a risk standpoint. Google’s search engine and a number of its other products are based on predictive modeling. IBM’s Watson, which heavily relies on learning and prediction in its problem-solving process, is arguably the first machine that could be said to pass a form of the Turing test by generating new insights.31
In a game like “Jeopardy!,” where the domain is open-ended and nonstationary and the question itself is frequently difficult to understand, it is impractical to succeed by enumerating a lengthy list of possibilities or by top-down theory development. The answer is to give the machine the capacity to train itself automatically on a vast number of examples. Watson also showed how the availability of excellent, human-curated data, such as that found on Wikipedia, significantly increases the capacity of machine learning. Combining machine learning with human knowledge is another trend that seems to be growing. To help the machine comprehend the entities that correspond to the deluge of strings it continuously processes, Google has ventured into the Knowledge Graph.32 Google seeks to comprehend “things,” not merely “strings.”28
Managers and organizations have a difficult time adjusting to the new data landscape. Many of their well-established intuitions can now be tested, experiments can be conducted correctly and affordably, and decisions can be made based on evidence. Taking advantage of this potential requires a fundamental transformation in organizational culture, one already visible in organizations that have embraced data for decision-making.
Supplementary Appendix A maps all figures, architectural layers, comparative tables, and representative frameworks to their corresponding foundational and contemporary references to improve reproducibility, citation traceability, and technical interpretability.
Declarations
Ethics approval and consent to participate: This review was conducted in accordance with ethical standards for academic research and publication. The author confirms that no human participants or animals were involved in the creation of this review paper, and therefore, ethical approval was not required. All sources and references have been properly cited to acknowledge the contributions of other researchers. The author has adhered to best practices for transparency, integrity, and objectivity in the preparation and presentation of this review.
References
- Lyman P, Varian HR. How Much Information? University of California; 2003.
- Zaharia M, Chen A, Davidson A, et al. Delta Lake: high-performance ACID table storage over cloud object stores. Proc VLDB Endow. 2021;14(12):3411–3424. https://doi.org/10.14778/3476311.3476364
- Armbrust M, Ghodsi A, Xin RS, Zaharia M. Lakehouse: a new generation of open platforms that unify data warehousing and advanced analytics. CIDR; 2021.
- Laney D. 3D data management: controlling data volume, velocity and variety. META Group Research Note; 2001.
- Akidau T, Bradshaw R, Chambers C, et al. The dataflow model. Proc VLDB Endow. 2015;8(12):1792–1803. https://doi.org/10.14778/2824032.2824076
- Karau H, Warren R. High Performance Spark. O’Reilly Media; 2021.
- Zaharia M, Xin RS, Wendell P, et al. Apache Spark: a unified engine for big data processing. Commun ACM. 2016;59(11):56–65. https://doi.org/10.1145/2934664
- Breck E, Polyzotis N, Roy S, et al. Data validation for machine learning. MLSys. 2019.
- Fayyad U, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery in databases. AI Magazine. 1996;17(3):37–54. https://doi.org/10.1609/aimag.v17i3.1230
- Sculley D, Holt G, Golovin D, et al. Hidden technical debt in machine learning systems. NeurIPS. 2015;2:2503–2511.
- Bommasani R, Hudson DA, Adeli E, et al. On the opportunities and risks of foundation models. arXiv. Preprint posted online August 16, 2021. https://doi.org/10.48550/arXiv.2108.07258
- Lewis P, Perez E, Piktus A, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv. Preprint posted online May 22, 2020. https://doi.org/10.48550/arXiv.2005.11401
- Kairouz P, McMahan HB, Avent B, et al. Advances and open problems in federated learning. Found Trends Mach Learn. 2021;14(1–2):1–210. https://doi.org/10.1561/2200000083
- Chen T, Guestrin C. XGBoost: a scalable tree boosting system. KDD. 2016;785–794. https://doi.org/10.1145/2939672.2939785
- Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–2830.
- Abadi M, Barham P, Chen J, et al. TensorFlow: a system for large-scale machine learning. OSDI. 2016;265–283. https://doi.org/10.5555/3026877.3026899
- Meng X, Bradley J, Yavuz B, et al. MLlib: machine learning in Apache Spark. J Mach Learn Res. 2016;17:1235–1241.
- Chen M, Mao S, Liu Y. Big data: a survey. Mobile Netw Appl. 2014;19:171–209. https://doi.org/10.1007/s11036-013-0489-0
- Ribeiro MT, Singh S, Guestrin C. “Why should I trust you?” Explaining predictions of any classifier. KDD. 2016;1135–1144. https://doi.org/10.1145/2939672.2939778
- Imbens GW, Rubin DB. Causal inference for statistics, social, and biomedical sciences. Cambridge University Press; 2015. https://doi.org/10.1017/CBO9781139025751
- Lundberg SM, Lee SI. A unified approach to interpreting model predictions. NeurIPS. 2017. https://doi.org/10.48550/arXiv.1705.07874
- Molnar C. Interpretable machine learning. 2nd ed. Shroff/Molnar; 2022.
- McMahan HB, Moore E, Ramage D, et al. Communication-efficient learning of deep networks from decentralized data. AISTATS. 2017. https://doi.org/10.48550/arXiv.1602.05629
- Bommasani R, Hudson DA, Adeli E, et al. On the opportunities and risks of foundation models. arXiv. Preprint posted online August 16, 2021. https://doi.org/10.48550/arXiv.2108.07258
- Brown T, Mann B, Ryder N, et al. Language models are few-shot learners. NeurIPS. 2020. https://doi.org/10.48550/arXiv.2005.14165
- Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. NeurIPS. 2017. https://doi.org/10.48550/arXiv.1706.03762
- Breck E, Cai S, Nielsen E, et al. The ML test score: a rubric for ML production readiness. IEEE Big Data. 2017. https://doi.org/10.1109/BigData.2017.8258038
- Amershi S, Begel A, Bird C, et al. Software engineering for machine learning: a case study. ICSE; 2019. https://doi.org/10.1109/ICSE-SEIP.2019.00042
- Pearl J. Causality: models, reasoning and inference. 2nd ed. Cambridge University Press; 2009. https://doi.org/10.1017/CBO9780511803161
- Jordan MI, Mitchell TM. Machine learning: trends, perspectives, and prospects. Science. 2015;349(6245):255–260. https://doi.org/10.1126/science.aaa8415
- Goodfellow I, Bengio Y, Courville A. Deep learning. MIT Press; 2016. https://doi.org/10.7551/mitpress/10277.001.0001
- Gandomi A, Haider M. Beyond the hype: big data concepts, methods, and analytics. Int J Inf Manage. 2015;35(2):137–144. https://doi.org/10.1016/j.ijinfomgt.2014.10.007
Appendix
Supplementary Table S1: Included Studies Used in the Narrative Synthesis
| Ref. No. | First Author (Year) | Thematic Category | Methodological Focus | Primary Contribution Area |
| 1 | Lyman (2003) | Digital Information Growth | Global information measurement | Early digital data expansion analysis |
| 2 | Zaharia (2021) | Data Infrastructure & Lakehouse Systems | ACID lakehouse architecture | Delta Lake cloud-native storage |
| 3 | Armbrust (2021) | Unified Analytics Platforms | Lakehouse architecture integration | Unified warehousing and analytics |
| 4 | Akidau (2015) | Distributed Stream Processing | Stream and batch processing | Dataflow computational framework |
| 5 | Laney (2001) | Big Data Foundations | 3Vs conceptual framework | Volume, velocity, and variety |
| 6 | Karau (2021) | Distributed Analytics Optimization | Spark performance engineering | High-performance Spark analytics |
| 7 | Zaharia (2016) | Distributed Computing Frameworks | Cluster-scale analytics engine | Apache Spark unified engine |
| 8 | Breck (2019) | MLOps & Reliability Engineering | ML data validation | Production-ready validation pipelines |
| 9 | Fayyad (1996) | Knowledge Discovery & Data Mining | KDD process formalization | Knowledge discovery framework |
| 10 | Sculley (2015) | ML Systems Engineering | Technical debt analysis | Reliability risks in ML systems |
| 11 | Bommasani (2021) | Foundation Models & AI Governance | Foundation model assessment | Risks and opportunities of foundation models |
| 12 | Lewis (2020) | Retrieval-Augmented AI | Retrieval-enhanced NLP pipelines | RAG architecture |
| 13 | Kairouz (2021) | Privacy-Preserving AI | Federated learning survey | Open challenges in federated learning |
| 14 | Chen T (2016) | Machine Learning Algorithms | Gradient boosting optimization | XGBoost scalable learning |
| 15 | Pedregosa (2011) | Machine Learning Frameworks | Open-source ML toolkit | Scikit-learn ecosystem |
| 16 | Abadi (2016) | Deep Learning Infrastructure | Large-scale neural computation | TensorFlow architecture |
| 17 | Meng (2016) | Distributed Machine Learning | Parallelized ML computation | MLlib scalable learning framework |
| 18 | Chen M (2014) | Big Data Systems Survey | Distributed analytics survey | Big data architecture overview |
| 19 | Ribeiro (2016) | Explainable AI | Local model explanations | LIME interpretability framework |
| 20 | Lundberg (2017) | Explainable AI | Feature attribution explainability | SHAP framework |
| 21 | Molnar (2022) | Explainable AI | Interpretable machine learning | Explainability methodologies |
| 22 | McMahan (2017) | Federated Learning | Communication-efficient optimization | Decentralized deep learning |
| 23 | Kairouz (2021) | Federated Learning | Large-scale federated systems | Research directions and limitations |
| 24 | Bommasani (2021) | Responsible AI & Governance | AI safety and societal impact | Foundation model governance |
| 25 | Brown (2020) | Large Language Models | Transformer-based language learning | Few-shot LLM capabilities |
| 26 | Vaswani (2017) | Deep Learning Architectures | Attention mechanisms | Transformer neural architecture |
| 27 | Breck (2017) | ML Reliability Engineering | ML deployment readiness | ML production evaluation rubric |
| 28 | Sculley (2015) | Production ML Engineering | Operational ML maintenance | Hidden technical debt analysis |
| 29 | Amershi (2019) | Software Engineering for AI | Industrial ML deployment lifecycle | Enterprise ML engineering practices |
| 30 | Pearl (2009) | Causal Inference | Structural causal reasoning | Probabilistic causality frameworks |
| 31 | Imbens (2015) | Applied Causal Inference | Statistical causal modeling | Biomedical and social causal analysis |
| 32 | Jordan (2015) | Machine Learning Research Trends | AI synthesis and forecasting | Future ML research directions |
| 33 | Goodfellow (2016) | Deep Learning Foundations | Neural network methodologies | Foundational deep learning theory |
| 34 | Gandomi (2015) | Big Data Analytics | Big data conceptual synthesis | Beyond-3Vs analytics framework |
Appendix A. Reference Mapping for Figures, Tables, Frameworks, and Representative Technologies
Appendix A1. Figure-to-Reference Mapping
| Figure | Description | Key Concepts / Technologies | Foundational References |
| Figure 1 | End-to-end modern data science lifecycle | KDD workflow, ingestion, preprocessing, analytics, deployment | Fayyad et al. (1996); Jordan & Mitchell (2015) |
| Figure 2 | Distributed analytics and AI infrastructure | Spark, TensorFlow, MLlib, stream processing | Zaharia et al. (2016); Abadi et al. (2016); Meng et al. (2016) |
| Figure 3 | Expanded characteristics of big data | 3Vs, veracity, value, validity, variability | Laney (2001); Gandomi & Haider (2015); Chen et al. (2014) |
| Figure 4 | Unified taxonomy for modern data science ecosystems | Lakehouse, MLOps, RAG, governance, observability | Armbrust et al. (2021); Zaharia et al. (2021); Bommasani et al. (2021); Lewis et al. (2020) |
Appendix A2. Table-to-Reference Mapping
| Table | Main Theme | Supporting References |
| Table 1 | Evolution of data science and big data | Fayyad et al.; Laney; Lyman & Varian |
| Table 2 | Distributed analytics frameworks | Spark, Flink, TensorFlow, MLlib |
| Table 3 | Machine learning algorithms and frameworks | XGBoost; Scikit-learn; TensorFlow |
| Table 4 | Comparative analytics framework evaluation | Spark; Dataflow; Ray; Dask literature |
| Table 5 | Explainable AI and responsible AI frameworks | Ribeiro; Lundberg; Molnar; Bommasani |
| Table 6 | MLOps and deployment engineering | Breck; Sculley; Amershi |
| Table 7 | Federated learning and privacy-preserving AI | McMahan; Kairouz |
| Table 8 | Future trends in AI systems and governance | Brown; Vaswani; Pearl; Jordan |
Appendix A3. Framework and Technology Mapping
| Technology / Framework | Functional Role | Representative References |
| Delta Lake | ACID lakehouse storage | Zaharia et al. (2021) |
| Lakehouse Architecture | Unified analytics platform | Armbrust et al. (2021) |
| Apache Spark | Distributed analytics engine | Zaharia et al. (2016) |
| TensorFlow | Deep learning infrastructure | Abadi et al. (2016) |
| MLlib | Distributed machine learning | Meng et al. (2016) |
| XGBoost | Scalable boosting algorithm | Chen & Guestrin (2016) |
| Scikit-learn | Classical machine learning toolkit | Pedregosa et al. (2011) |
| LIME | Explainable AI | Ribeiro et al. (2016) |
| SHAP | Explainable AI | Lundberg & Lee (2017) |
| Federated Learning | Privacy-preserving distributed AI | McMahan et al. (2017); Kairouz et al. (2021) |
| RAG | Retrieval-enhanced generation | Lewis et al. (2020) |
| Foundation Models | Generative AI systems | Bommasani et al. (2021) |
| Transformer Networks | Attention-based deep learning | Vaswani et al. (2017) |
Appendix A4. Foundational Seminal Literature Integrated Throughout the Manuscript
| Foundational Topic | Seminal Reference |
| Knowledge Discovery in Databases (KDD) | Fayyad et al. (1996) |
| Big Data 3Vs | Laney (2001) |
| Digital Information Explosion | Lyman & Varian (2003) |
| Big Data Analytics Concepts | Gandomi & Haider (2015) |
| Big Data Systems Survey | Chen et al. (2014) |
| Machine Learning Research Trends | Jordan & Mitchell (2015) |
| Deep Learning Foundations | Goodfellow et al. (2016) |
| Explainable AI | Ribeiro et al. (2016); Lundberg & Lee (2017) |
| Federated Learning | McMahan et al. (2017) |
| Foundation Models | Bommasani et al. (2021) |
| Retrieval-Augmented Generation | Lewis et al. (2020) |
| Causal Inference | Pearl (2009); Imbens & Rubin (2015) |