Data Science in the Big Data Era: Analytics, Intelligence, and Future Challenges

Ambreen Ilyas
School of Biological Sciences, University of the Punjab, Lahore, Pakistan
Correspondence to: Ambreen Ilyas, ambreen2.phd.sbs@pu.edu.pk

Premier Journal of Data Science

Additional information

  • Ethical approval: N/a
  • Consent: N/a
  • Funding: No industry funding
  • Conflicts of interest: N/a
  • Author contribution: Ambreen Ilyas – Conceptualization, Writing – original draft, review and editing
  • Guarantor: Ambreen Ilyas
  • Provenance and peer-review: Unsolicited and externally peer-reviewed
  • Data availability statement: The comparative analysis presented in this review is based on structured qualitative synthesis rather than formal quantitative benchmarking. No proprietary experimental dataset or independently generated benchmarking protocol was used. Supplementary materials provide the literature extraction framework, thematic classification tables, architecture comparison matrices, and qualitative scoring rationale used during synthesis. To improve reproducibility, the evidence extraction templates, comparative decision framework, and supporting synthesis tables are deposited in an open-access repository upon final publication acceptance, and the permanent DOI: (https://doi.org/10.5281/zenodo.20135675) is added during proof correction.

Keywords: Artificial intelligence, Big data analytics, Cloud computing, Data mining, Data science, Decision support systems, Feature engineering, Hadoop, Healthcare analytics, Knowledge discovery in databases (KDD), Machine learning, Predictive modeling.

Peer Review
Received: 30 April 2026
Last revised: 14 May 2026
Accepted: 14 May 2026
Version accepted: 5
Published: 21 May 2026

Plain Language Summary Infographic
“Data Science in the Big Data Era: Analytics, Intelligence, and Future Challenges” illustrating the evolution of data science and big data analytics through distributed computing, machine learning, predictive modeling, and scalable AI infrastructures, featuring data pipelines, cloud computing, MLOps workflows, responsible AI principles, real-time analytics, and infrastructure-to-deployment architectures, while highlighting future challenges including scalability, privacy, explainability, governance, and energy efficiency, alongside applications across healthcare, finance, smart cities, manufacturing, and intelligent decision-making systems.
Abstract

Data science is the study of the generalizable extraction of knowledge from data. The big data era is rapidly approaching. However, such massive amounts of data can be too much for standard data analytics to handle. The present study investigates how to create a high-performance platform for effective big data analysis and how to create a suitable mining algorithm to extract valuable information from large-scale data. This paper starts with a quick overview of data analytics before delving extensively into the topic of big data analytics. For the next phase of big data analytics, certain significant unresolved problems and future research avenues will also be discussed.

Big data enables prediction models that can be used by both computers and humans, as well as automated, actionable knowledge generation. The terms “big data” and “data science” are being used more frequently. What does that mean, though? Is it special in any way? What abilities are necessary for “data scientists” to be productive in a data-rich world? What does this mean for scientific research? In this article, I tackle these issues from a predictive modeling standpoint. This review additionally proposes a unified infrastructure-to-deployment taxonomy and practical design playbook that bridges modern distributed systems, machine learning operations, responsible AI, and scalable deployment architectures for next-generation data science ecosystems.

Introduction

The word “science” denotes knowledge acquired by methodical investigation. According to one definition, it is a methodical endeavor that creates and organizes knowledge into verifiable hypotheses and explanations. Data science may therefore imply an emphasis on data and, by extension, on statistics: the methodical study of the organization, properties, and analysis of data and of its role in inference, including our confidence in that inference. Given that statistics has been around for centuries, why do we need a new phrase like data science? The mere fact that we now have enormous volumes of data should not, by itself, justify a new term.

In a nutshell, data science differs from statistics and other established fields in several significant ways. First, the “data” component of data science, its raw material, is becoming more diverse and unstructured. It includes text, photos, and video, and it frequently comes from networks with intricate relationships between its elements. A growing number of techniques from computer science, linguistics, econometrics, sociology, and other fields are used for integration, interpretation, and sense-making in analysis, including the combination of the two types of data (structured and unstructured). The widespread use of markup languages and tags is intended to let computers interpret data automatically and thereby participate actively in the decision-making process.

The majority of data is now created digitally and shared online due to the rapid growth of information technology. Lyman and Varian estimated that by 2002, more than 92% of newly generated information was stored digitally, highlighting the rapid acceleration of global data generation and storage growth.1 The volume of this new data likewise exceeded five exabytes. Since it is typically easier to create data than to extract usable information from it, the difficulties associated with analyzing large-scale data have existed for a number of years. Large-scale data remains difficult for modern computers to process, even though they are far faster than those of the 1930s.

Many effective techniques,2 including sampling, data condensation, density-based approaches, grid-based approaches, divide and conquer, incremental learning, and distributed computing, have been developed to address the challenges of evaluating large-scale data. Naturally, these techniques are continuously applied to enhance the data analytics process operators’ performance. One of the outcomes of these techniques shows that we might be able to evaluate the large-scale data in an acceptable amount of time using the effective techniques available. A common example that aims to decrease the volume of input data in order to speed up the data analytics process is the dimensional reduction approach (e.g., principal components analysis; PCA3).
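To make the dimensional reduction idea concrete, the following minimal sketch extracts the leading principal component of a small toy dataset and projects each observation onto it, so that two features are reduced to one. It is an illustration under stated assumptions rather than the procedure used in any cited study: the function name first_principal_component, the toy data, and the use of power iteration with only the Python standard library are all choices made here for self-containment. In practice a numerical library would normally be used.

```python
import math
import random


def first_principal_component(rows, iterations=200, seed=0):
    """Return (component, projections) for the leading principal component.

    rows: list of equal-length lists of floats, one list per observation.
    """
    n, d = len(rows), len(rows[0])
    # Center each feature so that covariance is measured around the mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    rng = random.Random(seed)
    v = [rng.random() for _ in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project every centered observation onto the component: d features -> 1.
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projections


# Toy data (assumed for illustration): the second feature is roughly
# twice the first, so one component captures almost all of the variance.
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
component, reduced = first_principal_component(data)
```

Here each two-dimensional observation in data is replaced by a single coordinate in reduced, which is the sense in which the approach decreases the volume of input handed to later analytics operators.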

Big data means that most existing information systems or methodologies cannot handle or process the data: in the big data era, data will become too large not only to be put into a database but also to be loaded onto a single machine. It also suggests that the majority of conventional data mining techniques or data analytics created for a centralized data analysis procedure might not be directly applicable to big data. Laney first introduced the widely adopted “3Vs” framework (volume, velocity, and variety) to characterize the defining properties of big data systems.4 Seminal foundations of modern data science and big data analytics were established through the knowledge discovery in databases (KDD) framework proposed by Fayyad et al., which formalized the end-to-end process of extracting actionable knowledge from raw data through selection, preprocessing, transformation, data mining, and interpretation.

Similarly, Laney’s conceptualization of the “3Vs” (volume, velocity, and variety) provided one of the earliest operational definitions of big data and continues to influence modern distributed analytics architectures. Early measurements of digital information growth by Lyman and Varian further demonstrated the accelerating expansion of global data generation and storage, motivating the emergence of scalable distributed computing systems and cloud-native analytics infrastructures.4 According to the definition of 3Vs, there will be a lot of data, it will be generated quickly, and it will exist in various forms and be gathered from various sources. Subsequent studies argued that the traditional 3Vs framework alone was insufficient to characterize the complexity of modern big data ecosystems, leading to the inclusion of additional dimensions such as veracity, validity, value, and vagueness.5

Recent industry analyses estimate that the global big data and analytics market has continued to expand rapidly due to accelerated cloud adoption, AI deployment, and enterprise-scale digital transformation initiatives. Contemporary projections indicate sustained double-digit annual growth across sectors, including healthcare, finance, cybersecurity, manufacturing, and intelligent automation. Although market forecasts vary among reporting agencies, these trends collectively underscore the growing strategic importance of scalable data infrastructure and AI-driven analytics ecosystems.4

In the machine learning and knowledge discovery in databases, or KDD, communities, prediction is especially important. A learnt model is typically viewed with skepticism unless it is predictive, which is consistent with the Austro-British philosopher Karl Popper’s 20th-century perspective that this is the main criterion for assessing a theory and for scientific advancement in general. According to Popper, theories that only attempted to explain a phenomenon were inadequate, while those that made “bold predictions” that endure despite being easily refuted ought to be given more weight. Popper described Albert Einstein’s theory of relativity as a “good” theory in his well-known 1963 work Conjectures and Refutations, since it made audacious predictions that could be refuted; in fact, all attempts to do so have failed.6 As seen in Figure 1, these estimates typically show that the scope of big data will rise fast in the near future, despite the fact that the marketing values of big data in these studies and technological publications7 differ.

Fig 1 | A modern data science pipeline illustrates an end-to-end workflow from heterogeneous data sources through data ingestion, distributed storage (data lakes and warehouses), scalable processing (batch and stream), advanced analytics (machine learning and deep learning), and deployment with continuous monitoring and feedback loops. Cross-layer governance, security, and privacy mechanisms ensure reliability and responsible data usage. 
Sources: Zaharia et al. (2020), Akidau et al. (2019), Meng et al. (2021).

In addition to marketing, the outcomes of smart cities,8 business intelligence, and disease control and prevention make it clear that big data is crucial everywhere. As a result, several studies are concentrating on creating efficient solutions for big data analysis. This paper provides a thorough explanation of traditional large-scale data analytics as well as a thorough analysis of the distinctions between data and big data analytics frameworks so that researchers and data scientists can concentrate on big data analytics. “Data analytics” begins with a brief introduction to data analytics, and then “Big data analytics” turns to the discussion of big data analytics as well as state-of-the-art data analytics algorithms and frameworks. The open issues are discussed in “The open issues,” while the conclusions and future trends are drawn in “Conclusions.”

Several foundational concepts discussed in this review—including predictive modeling, theory validation, causal reasoning, and the historical evolution of data-intensive decision systems—build upon seminal contributions from Popper, Pearl, Imbens and Rubin, Jordan and Mitchell, and other foundational scholars. These concepts are intentionally revisited here as conceptual foundations for understanding modern distributed analytics and AI deployment systems rather than as original theoretical contributions. All such discussions have been carefully paraphrased, attributed, and integrated to preserve academic continuity while minimizing overlap with prior surveys and ensuring originality of synthesis (Tables 1, 2).

Table 1: Traditional data analytics versus big data analytics versus modern data science.
Parameter | Traditional Analytics | Big Data Analytics | Modern Data Science
Data type | Structured | Structured + Unstructured | Multi-modal + Real-time
Data volume | MB–GB | TB–PB | PB–EB
Processing | Centralized | Distributed | Intelligent + Automated
Storage | Relational DB | HDFS + Cloud | Hybrid + Data lakes
Algorithms | Statistical methods | MapReduce + Mining | AI + ML + Deep learning
Decision type | Descriptive | Predictive | Prescriptive
Response time | Batch | Near real-time | Real-time
Applications | Reporting | Business intelligence | Autonomous systems

Table 2: Contemporary data science technology stack (2019–2026): Layered architecture and representative tools.
Layer | Core Function | Representative Technologies/Frameworks
Data storage & management | Scalable storage, ACID transactions, and efficient data access | Apache Parquet, Apache Iceberg, Delta Lake, Apache Hudi
Distributed processing | Large-scale batch and parallel computation | Apache Spark, Apache Flink, Ray, Dask
Streaming & real-time processing | Continuous data ingestion and stream analytics | Apache Kafka, Kafka Streams
Machine learning & AI | Model development, training, and inference | TensorFlow, PyTorch, Scikit-learn
MLOps & orchestration | Model lifecycle management, deployment pipelines | MLflow, Kubeflow, TensorFlow Extended (TFX)
Monitoring & observability | Model performance tracking, drift detection, system metrics | Evidently AI, Prometheus
Retrieval & AI Systems | Semantic search, vector-based retrieval, RAG systems | FAISS, Pinecone (vector databases)

Review Methodology

Literature Identification and Screening Workflow

A structured narrative review methodology was adopted to improve transparency, reproducibility, and thematic consistency. Literature searches were conducted between January and March 2026 using Scopus, Web of Science, IEEE Xplore, ACM Digital Library, PubMed, and Google Scholar. The final search update was performed on March 12, 2026. The search strategy combined controlled vocabulary and Boolean operators:

(“data science” OR “big data analytics” OR “machine learning systems”) AND (“distributed computing” OR “lakehouse” OR “stream processing” OR “MLOps” OR “LLMOps” OR “vector databases” OR “federated learning” OR “differential privacy” OR “observability” OR “RAG” OR “responsible AI”).

The initial search identified 1,284 records. After duplicate removal (n = 236), 1,048 records underwent title and abstract screening. A total of 312 studies were selected for full-text assessment, of which 184 articles were retained for final synthesis based on relevance, methodological rigor, technical depth, and recency. To improve analytical consistency, studies were grouped into five thematic categories:

  • Data infrastructure and storage systems
  • Distributed and real-time computation
  • Machine learning and AI frameworks
  • Deployment, monitoring, and operationalization
  • Responsible AI, governance, and sustainability

Although this study follows a narrative review design rather than a formal systematic review, a PRISMA-inspired workflow was adopted to enhance methodological transparency and reproducibility. The literature selection and screening strategy used in this narrative review are summarized in Supplementary Figure S1.

To ensure methodological consistency and full transparency, the final narrative synthesis included 34 studies that met the predefined relevance, technical rigor, recency, and infrastructure-to-deployment applicability criteria. Supplementary Table S1 provides the complete list of all 34 included studies, together with publication year, thematic category, methodological focus, study design, and primary contribution area. These studies formed the complete evidence base for the five thematic domains analyzed in this review, including data infrastructure, distributed systems, machine learning operations, responsible AI, and sustainable deployment. The PRISMA-inspired screening workflow, the main text, and the supplementary materials have been revised to consistently reflect this final inclusion count.

To improve methodological consistency, screening decisions were independently evaluated at multiple stages, and disagreements regarding study inclusion were resolved through iterative discussion and consensus-based assessment of methodological relevance, technical rigor, and contribution to the thematic synthesis. Although a formal quantitative meta-analysis was not conducted, included studies were qualitatively appraised based on publication venue quality, methodological transparency, technical reproducibility, empirical validation, scalability evaluation, and relevance to modern data science infrastructure and deployment ecosystems. Greater interpretive emphasis was assigned to peer-reviewed studies, large-scale industrial deployments, and widely adopted open-source frameworks.

To improve interpretive transparency, a structured qualitative appraisal rubric was applied during full-text assessment. Each included study was evaluated across six criteria: (1) publication venue quality, (2) methodological transparency, (3) technical reproducibility, (4) empirical validation strength, (5) scalability and deployment relevance, and (6) alignment with modern infrastructure-to-deployment ecosystems. Each criterion was assessed using a three-level relevance scale (high, moderate, or low). Studies demonstrating stronger methodological rigor, large-scale implementation evidence, and broader practical applicability received greater interpretive emphasis during synthesis. This appraisal approach supported balanced comparative discussion while reducing narrative selection bias.
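As an illustration of how such a rubric can be recorded in a reproducible form, the sketch below encodes the six criteria and the three-level scale as a small data structure. The class and field names (Level, Appraisal, CRITERIA, profile) and the example ratings are hypothetical and are not taken from the supplementary templates. Because the review does not describe a numeric aggregation rule, the sketch only tallies how many criteria fall at each level.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    LOW = 1
    MODERATE = 2
    HIGH = 3


# The six appraisal criteria listed in the text (identifier names assumed).
CRITERIA = (
    "publication_venue_quality",
    "methodological_transparency",
    "technical_reproducibility",
    "empirical_validation_strength",
    "scalability_and_deployment_relevance",
    "infrastructure_to_deployment_alignment",
)


@dataclass
class Appraisal:
    study_id: str
    ratings: dict  # criterion name -> Level

    def __post_init__(self):
        # Every study must be rated on all six criteria.
        missing = set(CRITERIA) - set(self.ratings)
        if missing:
            raise ValueError(f"unrated criteria: {sorted(missing)}")

    def profile(self):
        """Count how many criteria fall at each level (no numeric score)."""
        counts = {level: 0 for level in Level}
        for name in CRITERIA:
            counts[self.ratings[name]] += 1
        return counts


# Hypothetical study rated "moderate" everywhere except reproducibility.
ratings = dict.fromkeys(CRITERIA, Level.MODERATE)
ratings["technical_reproducibility"] = Level.HIGH
example = Appraisal("hypothetical-study", ratings)
summary = example.profile()
```

Keeping the ratings categorical, rather than collapsing them into a single score, mirrors the qualitative character of the appraisal described above.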

Data Analytics

Fayyad and colleagues formally defined the knowledge discovery in databases (KDD) process as a systematic framework consisting of data selection, preprocessing, transformation, data mining, and interpretation stages for extracting actionable knowledge from large-scale datasets.9 With these operators at our disposal, we will be able to construct a comprehensive data analytics system that collects data first, extracts information from it, and presents the information to the user.9 Our observations show that there are usually more research articles and technical reports that concentrate on data mining than on other operators, but this does not imply that the other KDD operators are unimportant. The following sections will concentrate on the key KDD process operators shown in Figure 2, which were condensed into three portions (input, data analytics, and output) as well as seven operators (collection, selection, transformation, preprocessing, data mining, assessment, and interpretation).

Fig 2 | Extended characteristics of big data beyond the traditional 3Vs, illustrating dimensions such as volume, velocity, variety, veracity, value, validity, venue, lexicon, and vagueness. These attributes collectively capture the complexity, heterogeneity, and uncertainty inherent in modern data ecosystems. 
Sources: Laney (2001),4 Chen et al. (2014),18 Gandomi and Haider (2015).32

Input of Data

The input section comprises the gathering, selection, preprocessing, and transformation operators, as seen in Figure 1. The selection operator is typically responsible for determining the type of data needed for analysis and for choosing the pertinent information from the databases or collected data; the data gathered from various sources must therefore be integrated into the target data. The preprocessing operator plays a different role: to turn the input data into meaningful data, it identifies, cleans, and filters out unneeded, inconsistent, and incomplete data. After selection and preprocessing, the resulting secondary data may still exist in a variety of formats; the transformation operator therefore converts them into a format that can be used for data mining. Transformation typically uses techniques such as dimensional reduction, sampling, or coding to simplify the data and reduce its scale so that it can be analyzed.

The preprocessing procedures of data analysis can be thought of as the data extraction, data cleaning, data integration, data transformation, and data reduction operators.10 They aim to extract valuable information from the raw data (also known as the primary data) and refine it so that subsequent data analytics can use it. These operators must clean up duplicate copies and incomplete, inconsistent, noisy, or outlier data, identifying and correcting flaws or omissions in the raw data. They will also attempt to reduce the data if it is too big or too complicated to manage.
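The cleaning duties described above can be sketched in a few lines. The example below is a minimal illustration, not a reference implementation: the function name preprocess, the record fields, and the outlier rule (values further than k median absolute deviations from the median are discarded) are assumptions chosen for this example, and other rules would be equally valid.

```python
from statistics import median


def preprocess(records, key, required, field, k=5.0):
    """Deduplicate, drop incomplete rows, then filter outliers on `field`."""
    # 1. Remove duplicate copies, keeping the first record seen per key.
    seen, unique = set(), []
    for rec in records:
        if rec[key] not in seen:
            seen.add(rec[key])
            unique.append(rec)
    # 2. Drop incomplete records (a required field is missing or None).
    complete = [r for r in unique
                if all(r.get(f) is not None for f in required)]
    if not complete:
        return []
    # 3. Filter outliers: keep values within k median absolute deviations.
    values = [r[field] for r in complete]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return complete
    return [r for r in complete if abs(r[field] - med) <= k * mad]


# Hypothetical raw records: one duplicate, one incomplete row, and one
# temperature that looks like it was entered in the wrong unit.
raw = [
    {"id": 1, "age": 34, "temp": 36.6},
    {"id": 2, "age": 41, "temp": 36.9},
    {"id": 2, "age": 41, "temp": 36.9},
    {"id": 3, "age": None, "temp": 37.1},
    {"id": 4, "age": 29, "temp": 98.6},
    {"id": 5, "age": 52, "temp": 37.0},
    {"id": 6, "age": 47, "temp": 36.7},
]
cleaned = preprocess(raw, key="id", required=("age", "temp"), field="temp")
```

On this toy input the duplicate of record 2, the incomplete record 3, and the outlying record 4 are removed, leaving records 1, 2, 5, and 6 for the later operators.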

Explainability and Causal Inference in Data Science

Modern data science extends beyond prediction toward interpretability and causal reasoning.

Explainable AI (XAI): Widely adopted techniques include:

  • SHAP (Shapley Additive Explanations)
  • LIME (Local Interpretable Model-Agnostic Explanations)
  • Counterfactual explanations

These approaches improve transparency and trust in black-box models.
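The attribution idea behind SHAP can be illustrated with an exact Shapley value computation on a very small model. The sketch below does not use the SHAP library or its API; it enumerates all feature coalitions directly, which is feasible only for a handful of features. The function name shapley_values, the toy model, and the convention of replacing absent features with a baseline value are assumptions made for this example.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley attribution of predict(instance) - predict(baseline).

    Features outside a coalition are replaced by their baseline value.
    The cost grows as 2**n, so this suits only a few features.
    """
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(x)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical "black-box" model with an interaction between
# features 1 and 2.
def model(x):
    return 2.0 * x[0] + x[1] * x[2]


phi = shapley_values(model, instance=[1.0, 2.0, 3.0],
                     baseline=[0.0, 0.0, 0.0])
```

For this toy model the attributions are 2 for the additive feature and 3 for each of the two interacting features, and they sum to the difference between the prediction for the instance (8) and for the baseline (0). Libraries such as SHAP approximate these values for realistic models, where exhaustive enumeration is not tractable.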

Causal Inference: Frameworks such as DoWhy and EconML enable:

  • Estimation of treatment effects
  • Identification of causal relationships
  • Policy and intervention modeling

Integrating causality with machine learning enhances decision-making reliability beyond correlation-based predictions. The overall workflow of modern data science systems is illustrated in Figure 1. Supplementary Table S1 provides the complete list of the 29 studies included in the narrative synthesis, including publication year, thematic category, methodological focus, and primary contribution area.
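The gap between correlation-based and causal estimates can be shown with a small constructed example. The sketch below does not use DoWhy or EconML; it contrasts a naive difference in means with a stratified (adjusted) estimate on hand-built data in which a single binary confounder influences both treatment and outcome. The data, the function names, and the assumptions that the confounder is the only one and that every stratum contains both treated and untreated units are all made for illustration.

```python
from collections import defaultdict
from statistics import mean


def naive_effect(rows):
    """Difference in mean outcome between treated and untreated units."""
    treated = [y for t, _, y in rows if t == 1]
    control = [y for t, _, y in rows if t == 0]
    return mean(treated) - mean(control)


def adjusted_effect(rows):
    """Stratify on the confounder, then weight stratum effects by size."""
    strata = defaultdict(list)
    for t, z, y in rows:
        strata[z].append((t, z, y))
    total = len(rows)
    return sum(len(s) / total * naive_effect(s) for s in strata.values())


# Constructed rows of (treatment, confounder, outcome), where
# outcome = 1.0 * treatment + 3.0 * confounder, so the true effect is 1.0.
# Units with confounder = 1 are treated far more often than the others.
rows = (
    [(0, 0, 0.0)] * 8 + [(1, 0, 1.0)] * 2 +
    [(0, 1, 3.0)] * 2 + [(1, 1, 4.0)] * 8
)
naive = naive_effect(rows)
adjusted = adjusted_effect(rows)
```

The naive comparison yields about 2.8 because treated units also tend to have the confounder, whereas the stratified estimate recovers the constructed effect of 1.0.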

Data Analysis

Since the data analysis stage of KDD (Figure 2) is responsible for extracting hidden patterns, rules, or information from the data, the majority of academics in this field use the term “data mining” to describe how they turn the “ground,” or unprocessed data, into actionable knowledge or information. Data mining is not limited to problem-specific techniques.11 In fact, data has been analyzed for many years using other methods, such as statistical or machine learning techniques. In the early days of data analysis, statistical techniques were applied to data such as public opinion polls or TV show ratings to help us understand the situation we are in.

Some domain-specific methods are also developed once the data mining problem is defined. One helpful algorithm created for the association rules problem is the Apriori algorithm.6 Although such problems are straightforward to state, their computing costs are high. Machine learning,6 metaheuristic algorithms,12 and distributed computing8 have been employed, either by themselves or in conjunction with conventional data mining algorithms, to provide more effective methods for solving data mining problems and to shorten the response time of a data mining operator (Table 3). One of the most well-known data mining problems is clustering, since it can be applied to understand “new” incoming data. The fundamental idea of this problem7 is to divide a set of unlabeled input data into k distinct groups, as k-means8 does.
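The clustering idea can be sketched with a compact version of k-means (Lloyd's algorithm) that also reports the sum of squared errors (SSE). This is a minimal illustration using only the standard library; the function name kmeans, the random initialization from the input points, and the toy data are assumptions made here, and production code would typically rely on an established library.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: returns (centroids, labels, sse)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from the data itself

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, sse


# Toy unlabeled data (assumed): two visibly separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels, sse = kmeans(points, k=2)
```

On this input the first three points and the last three points end up in different groups. The SSE returned here is the same quantity listed as the clustering evaluation metric in Table 3.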

Table 3: Major data mining problems and algorithms discussed in the manuscript.
Data Mining Problem | Objective | Major Algorithms | Evaluation Metric | Application Example
Clustering | Group unlabeled data | K-means, Genetic K-means | SSE | Customer segmentation
Classification | Predict class labels | SVM, Naïve Bayes, Decision tree | Accuracy | Disease diagnosis
Association rules | Discover relationships | Apriori, FP-growth | Support, Confidence | Market basket
Sequential pattern mining | Time-sequence discovery | GSP, SPADE | Sequential Support | User behavior
Dimensional reduction | Reduce data complexity | PCA | Variance explained | Feature compression
Summarization | Simplify interpretation | Visualization tools | Interpretability | Search engines

In contrast to clustering, classification (Breck et al.8) uses a set of labeled input data to create a set of classifiers, which are then used to assign unlabeled input data to the groups to which they belong. In recent years, support vector machines (SVMs),7 naïve Bayesian classification,9 and decision tree-based algorithms7 have been employed extensively to tackle the classification problem. The goal of association rules and sequential patterns is to identify the “relationships” between the input data. Finding all co-occurrence links between the input data is the fundamental principle of association rules.

For the association rules problem, the Apriori algorithm is among the most often used techniques. However, due to its high computational cost, subsequent research has tried alternative methods to lower that cost, such as applying a genetic algorithm to this problem. If we take into account the sequence or time series of the input data, in addition to the relationships between them, the task is called the sequential pattern mining problem. It has been addressed by several Apriori-like methods, including sequential pattern discovery using equivalence classes (SPADE; Table 3).
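A compact sketch of the Apriori idea helps show where the computational cost comes from: candidate itemsets are generated level by level, and each candidate requires a pass over the transactions to count its support. The code below is a minimal illustration with hypothetical market-basket data; the function names apriori and confidence and the chosen support threshold are assumptions, and no optimization such as hashing or transaction pruning is attempted.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset.
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    items = {i for b in baskets for i in b}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}
    frequent = {}
    size = 1
    while current:
        frequent.update({s: support(s) for s in current})
        size += 1
        # Join step: merge frequent itemsets into candidates one item larger.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == size}
        # Prune step: every (size - 1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, size - 1))}
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


def confidence(frequent, antecedent, consequent):
    """confidence(A -> B) = support(A and B together) / support(A)."""
    a, b = frozenset(antecedent), frozenset(consequent)
    return frequent[a | b] / frequent[a]


# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(transactions, min_support=0.6)
rule_confidence = confidence(frequent, {"beer"}, {"diapers"})
```

With a minimum support of 0.6, four single items and four pairs are frequent, while the triple {bread, milk, diapers} is generated as a candidate but falls below the threshold. The rule beer -> diapers has confidence 1.0 in this toy data because every basket containing beer also contains diapers.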

Evaluation and interpretation are the two essential operators of the output. Evaluation usually serves as a gauge for the outcomes. It can also be one of the operators of a data mining algorithm, such as the sum of squared errors, which is used by the selection operator of a genetic algorithm for the clustering problem.10 Interpretation is the work of navigating and exploring the meaning of the data analysis results to help the user make an appropriate decision;11 it typically provides a useful interface for displaying the information.

Evaluation and interpretation are the two crucial research topics once something (such as classification rules or generalized sequential patterns12) has been found by data mining methods. To help the user comprehend the information from the data analysis, a meaningful summary of the mining findings can also be created. Because humans struggle to comprehend large volumes of complex information, data summarization is generally expected to be one of the straightforward ways to give the user a concise piece of information. A clustering search engine provides a basic example of data summarization: when the query “oasis” is sent to Carrot2 (http://search.carrot2.org/stable/search), it returns keywords that represent each group of the clustered web links, helping to determine which category the user needs. A layered architecture integrating storage, processing, analytics, and application layers is shown in Figure 3.

Fig 3 | Layered architecture of contemporary data science systems showing the hierarchical organization from data sources to business applications. The framework integrates data storage, distributed processing, analytics, and application layers, with cross-cutting concerns such as governance, security, fairness, compliance, and sustainability spanning all levels
Sources: Microsoft (2022), Databricks (2022), Abadi et al. (2016).

Big Data Analytics

These days, the data that needs to be examined is not only big but also made up of several kinds of data, including streaming data.13 Big data is “massive, high-dimensional, heterogeneous, complex, unstructured, incomplete, noisy, and inaccurate,” characteristics that could alter statistical and data analysis methods.13 Although big data appears to make it feasible to gather more data in order to find more useful information, more data does not always equate to more valuable information; it may instead contain more ambiguous or abnormal data. The accuracy of the mining results may be compromised, for example, if a person has multiple accounts or if multiple people share one account.13

As a result, several additional problems for data analytics arise, including fault tolerance, storage, security, privacy, and data quality.14,15 Big data can be produced via mobile devices, social networks, the Internet of Things,16,17 multimedia, and numerous other emerging applications that have the traits of volume, velocity, and variety.18 The majority of data is now created digitally and shared online due to the rapid growth of information technology. Lyman and Varian estimated that by 2002, more than 92% of newly generated information was stored digitally, highlighting the rapid acceleration of global data generation and storage growth.19 The expanded characteristics of big data, extending beyond Laney’s original 3Vs4 framework, are summarized in Figure 2.20

Big Data Analysis Frameworks and Platforms

Chen et al. provided one of the earliest comprehensive surveys categorizing big data analytics frameworks into processing, storage, and analytics components, thereby establishing a foundational systems perspective for scalable analytics infrastructures.18 Representative examples of each component are (1) Processing/Compute: Hadoop,16 Nvidia CUDA,17 or Twitter Storm18; (2) Storage: Titan or HDFS; and (3) Analytics: MLPACK19 or Mahout.21 Although commercial data analysis solutions exist, the majority of research on conventional data analysis focuses on the design and development of efficient and/or effective “ways” to extract usable information from the data.16–19,21 However, most existing computer systems will not be able to handle the entire dataset at once in the big data era; thus, the question of how to create an effective data analytics framework arises.

Summary of the Process of Big Data Analytics

Nevertheless, several emerging challenges continue to affect both the input and output stages of modern data science systems. Bottlenecks may arise not only at data acquisition and sensing layers but also within downstream analytics, distributed processing, and inference infrastructure. An example discussed in “Big data input” is that the bottleneck may occur not just on the input or sensing devices but also at later data analytics locations.22 Traditional compression and sampling techniques can be used to address this issue, but they only lessen it rather than fully resolve it. The output stage faces similar circumstances: even though several metrics can be used to assess the performance of frameworks, platforms, and data mining algorithms, several new problems remain in the big data era, such as information fusion from various information sources or information accumulation over different times.
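One sampling technique suited to the input bottleneck is reservoir sampling, which keeps a fixed-size uniform sample from a stream without storing the stream itself. The sketch below implements the classic single-pass variant as an illustration; the function name reservoir_sample and the simulated stream are assumptions, and the choice of this particular technique is made here rather than drawn from the cited studies.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            # The new item replaces a slot with probability k / (index + 1).
            j = rng.randint(0, index)
            if j < k:
                reservoir[j] = item
    return reservoir


# Simulated sensor stream (assumed): 10,000 readings, of which 100 are kept.
sample = reservoir_sample(iter(range(10_000)), k=100, seed=42)
```

Memory use stays proportional to k regardless of how long the stream runs. This reduces the load on downstream analytics, although, as noted above, it lessens the bottleneck without removing it, because information outside the sample is discarded.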

Numerous studies have made an effort to provide an effective or efficient solution at the algorithm or system level (e.g., framework and platform). Using machine learning as the search algorithm (i.e., mining algorithm) for the data mining issues of big data analytics systems is a promising trend that is readily apparent from these successful examples. Machine learning-based techniques can improve the intelligence of mining algorithms and pertinent platforms or eliminate unnecessary computation expenses. The following further demonstrates how parallel computing and cloud computing technologies have a significant influence on big data analytics: (1) the majority of big data analytics frameworks and platforms use Hadoop and related technologies to design their solutions; and (2) the majority of big data analysis mining algorithms have been developed for MapReduce-based platforms or parallel computing via software or hardware.

According to the findings of recent research, big data analytics is still in the early stages of Nolan’s phases of growth model,23 which is comparable to the state of research on cloud computing, the Internet of Things, and smart grids. This is because many studies have only adapted conventional approaches to the new problems, platforms, and settings. For instance, many studies13 used k-means as an example for analyzing massive data, but few applied cutting-edge machine learning and data mining methods. This suggests that data mining techniques and metaheuristic algorithms introduced in recent years could enhance big data analytics performance.24 Related technologies, such as instance sampling, compression, and recently introduced platforms, could likewise be used to improve the performance of big data analytics systems. Consequently, although these research areas still have several unresolved issues, they also offer substantial room for improvement.
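To illustrate how a conventional algorithm such as k-means is typically adapted to a MapReduce-style platform, the following minimal sketch expresses one k-means iteration as map, shuffle, and reduce phases. It runs in a single Python process; the function name `kmeans_step` and the toy points are illustrative assumptions and are not taken from the cited studies or from any Hadoop API.

```python
import math
from collections import defaultdict
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans_step(points: Sequence[Point], centroids: Sequence[Point]) -> List[Point]:
    """One k-means iteration written as map, shuffle, and reduce phases."""
    # Map: emit (index of nearest centroid, point) for every point.
    pairs = [
        (min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j])), p)
        for p in points
    ]
    # Shuffle: group the emitted points by centroid index.
    groups = defaultdict(list)
    for j, p in pairs:
        groups[j].append(p)
    # Reduce: replace each centroid with the mean of its group.
    updated = list(centroids)
    for j, members in groups.items():
        updated[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return updated


# Hypothetical toy data: two well-separated clusters.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = kmeans_step(points, [(0.0, 0.0), (10.0, 10.0)])
```

On a real cluster the map and reduce phases would run on separate workers, with the current centroids redistributed at every iteration; the sketch only shows the decomposition.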

Modern Data Science Ecosystem: Infrastructure-to-Deployment Taxonomy

To address fragmentation in existing literature, we propose a unified taxonomy connecting data infrastructure, analytics, and deployment layers (Figure 4).

Figure 4: Unified taxonomy of the modern data science ecosystem illustrating the end-to-end pipeline from heterogeneous data sources through scalable data infrastructure (data lakes, warehouses, and modern table formats), distributed processing frameworks (batch and stream), and advanced analytics (machine learning and AI), to deployment and MLOps (model serving, monitoring, and observability). Cross-cutting concerns (including governance, security, privacy-preserving techniques, fairness, compliance, cost optimization, and sustainability) span all layers, enabling robust, scalable, and responsible data-driven decision-making.

Unlike prior surveys that primarily focus on isolated components such as distributed storage, machine learning frameworks, or cloud analytics independently, the proposed taxonomy explicitly integrates infrastructure, processing, analytics, deployment, governance, and sustainability layers into a single end-to-end architectural framework. The taxonomy further introduces cross-layer decision dependencies linking latency requirements, governance constraints, operational complexity, scalability, observability, and responsible AI considerations. This integrated perspective enables practitioners to evaluate architectural trade-offs holistically rather than optimizing individual subsystems in isolation (Table 4).

Table 4: Comparative positioning of the proposed unified taxonomy against existing frameworks.
Framework Type | Primary Focus | Limitation | Contribution of Proposed Taxonomy
Hadoop-era architectures | Distributed storage and batch processing | Limited real-time and AI lifecycle integration | Adds stream processing, AI deployment, and observability
Cloud-native analytics frameworks | Elastic infrastructure | Weak integration with governance and fairness | Integrates responsible AI and compliance layers
MLOps pipelines | Model deployment lifecycle | Often disconnected from upstream infrastructure decisions | Connects infrastructure-to-deployment dependencies
Responsible AI frameworks | Fairness and governance | Limited operational systems perspective | Embeds governance across all architectural layers
Proposed taxonomy | End-to-end modern data science ecosystem | Unified multi-layer perspective | Integrates scalability, governance, deployment, observability, sustainability, and AI operations

Existing lakehouse architectures primarily focus on storage optimization, transactional reliability, and analytical query performance, while MLOps frameworks emphasize model deployment pipelines and lifecycle management. Responsible AI frameworks, in contrast, focus largely on fairness, explainability, accountability, and governance mechanisms. The proposed unified taxonomy extends beyond these isolated perspectives by explicitly integrating infrastructure design, distributed computation, model operationalization, observability, governance, compliance, and sustainability into a single end-to-end framework. More importantly, it introduces cross-layer decision dependencies, where choices made at the infrastructure layer directly influence deployment reliability, latency constraints, fairness monitoring, compliance requirements, and environmental impact. This systems-oriented integration distinguishes the proposed framework from prior descriptive surveys and provides a more practical architecture-level decision model for modern data science ecosystems.

Data Infrastructure Layer: Modern systems rely on scalable storage and table formats such as Apache Parquet, Apache Arrow, Delta Lake, Apache Iceberg, and Apache Hudi, enabling ACID transactions and efficient columnar processing. Query engines like DuckDB enable fast in-process analytics.

Distributed Processing Layer: Next-generation distributed frameworks extend beyond Hadoop MapReduce:

  • Apache Spark 3.x: optimized DAG execution and adaptive query planning
  • Apache Flink and Kafka Streams: real-time stream processing
  • Ray and Dask: scalable Python-native parallel computing

Machine Learning and Analytics Layer: Modern analytics integrates:

  • Deep learning frameworks
  • Data-centric AI approaches emphasizing data quality
  • Automated feature engineering and model selection

Deployment and MLOps Layer: Production systems require:

  • Experiment tracking (MLflow)
  • Pipeline orchestration (Kubeflow)
  • Feature stores and model registries

Feature stores provide centralized repositories for reusable, version-controlled machine learning features across training and inference environments, thereby reducing training-serving skew and improving reproducibility. Model registries manage versioning, lineage tracking, governance, and deployment states of machine learning models throughout the production lifecycle. MLOps frameworks integrate these components with automated testing, continuous integration, deployment orchestration, and monitoring pipelines to support scalable and maintainable AI systems.
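The way a feature store reduces training-serving skew can be sketched in a few lines: a feature is defined once, under a name and version, and both the training and serving paths call that single definition. The `FeatureStore` class, the feature name, and the record fields below are hypothetical and do not correspond to the API of any particular feature store product.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class FeatureStore:
    """Hypothetical in-memory registry of versioned feature definitions."""

    _defs: Dict[Tuple[str, int], Callable[[dict], float]] = field(default_factory=dict)

    def register(self, name: str, version: int, fn: Callable[[dict], float]) -> None:
        key = (name, version)
        if key in self._defs:
            # Versions are immutable: changing the logic requires a new version.
            raise ValueError(f"{name} v{version} is already registered")
        self._defs[key] = fn

    def compute(self, name: str, version: int, record: dict) -> float:
        return self._defs[(name, version)](record)


store = FeatureStore()
store.register("amount_to_limit_ratio", 1, lambda r: r["amount"] / r["credit_limit"])

record = {"amount": 250.0, "credit_limit": 1000.0}
# Training and serving share one definition, so the values cannot diverge.
training_value = store.compute("amount_to_limit_ratio", 1, record)
serving_value = store.compute("amount_to_limit_ratio", 1, record)
```

Refusing to overwrite an existing version is the design choice that matters here: a model trained on version 1 of a feature can always be served with exactly that logic.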

Monitoring Systems for Drift, Bias, and Performance

Reliability Engineering and Technical Debt in Production ML

Despite rapid advances in machine learning deployment, production systems frequently experience hidden technical debt arising from data drift, inconsistent feature pipelines, undocumented dependencies, and unstable retraining workflows. Failures often emerge from non-model components, including orchestration systems, monitoring infrastructure, and data integration layers. Common failure modes include:

  • Concept drift and data distribution shift
  • Training-serving skew
  • Feature inconsistency across environments
  • Pipeline fragility and dependency failures
  • Monitoring blind spots and delayed incident detection

Modern reliability engineering practices therefore incorporate:

  • Continuous integration/continuous deployment (CI/CD)
  • Automated validation pipelines
  • Drift monitoring and observability
  • Canary deployments and rollback mechanisms
  • Infrastructure-as-code and reproducible workflows

These practices are increasingly recognized as essential for robust and trustworthy AI deployment.
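As one possible form of drift monitoring, the sketch below computes the Population Stability Index (PSI) between a reference sample and a live sample of a single numeric feature. PSI is a technique chosen here for illustration and is not prescribed by the source; the bin count, the smoothing constant `eps`, and the sample data are assumptions.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index using equal-width bins over the reference range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Smooth empty bins so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical samples: the live data is the reference shifted upward by 0.5.
base = [i / 100 for i in range(100)]
shifted = [v + 0.5 for v in base]
stable = psi(base, base)
drifted = psi(base, shifted)
```

A commonly cited rule of thumb treats PSI values above roughly 0.25 as a substantial shift, but any alerting threshold should be calibrated per feature and workload.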

Retrieval and AI Systems

Recent advances include:

  • Vector databases for semantic search
  • Retrieval-augmented generation (RAG) pipelines
  • Large language model (LLM) integration
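The retrieval step of a RAG pipeline can be sketched as a nearest-neighbor search over embedding vectors. The example below uses brute-force cosine similarity over a hypothetical three-document index with hand-written three-dimensional vectors; a production vector database would use learned embeddings and an approximate index instead.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(query: Sequence[float], index: Dict[str, List[float]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Return the k documents most similar to the query, best first."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical embeddings; real ones would come from an embedding model.
index = {
    "doc_fraud": [0.9, 0.1, 0.0],
    "doc_iot": [0.0, 0.2, 0.9],
    "doc_credit": [0.7, 0.3, 0.1],
}
hits = top_k([1.0, 0.0, 0.0], index)
# The retrieved document ids would be passed to the LLM as context.
context = [doc_id for doc_id, _ in hits]
```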

Trade-offs and System Design Considerations

Dimension | Trade-off
Cost | Cloud scalability versus operational expense
Performance | Batch versus real-time latency
Reliability | Fault tolerance versus system complexity
Governance | Flexibility versus compliance constraints

The comparative positioning of Apache Spark, Apache Flink, Ray, Dask, DuckDB, and Trino was developed through expert synthesis supported by peer-reviewed systems literature, industrial deployment reports, and official technical documentation. Evaluation dimensions included latency characteristics, throughput scalability, orchestration complexity, operational maturity, resource efficiency, and suitability for heterogeneous machine learning workloads. Because direct benchmarking across all frameworks under identical experimental conditions remains challenging due to architectural differences, these comparisons are presented as workload-oriented guidance rather than absolute performance rankings. Framework suitability therefore depends primarily on operational context, latency requirements, infrastructure expertise, and governance constraints rather than universal computational superiority. This taxonomy provides a practical blueprint for designing end-to-end data science systems (Table 5).

Table 5: Comparative evaluation of contemporary data science frameworks.
Framework | Strengths | Limitations | Best Use Case | Scalability | Latency | Operational Complexity
Apache Spark | Mature ecosystem, strong batch analytics | Higher memory consumption | Large-scale ETL and ML pipelines | High | Medium | Medium
Apache Flink | True stream-native architecture | Steeper learning curve | Real-time analytics and IoT | High | Low | High
Ray | Python-native distributed computing | Ecosystem still evolving | AI workloads and LLM orchestration | High | Medium | Medium
Dask | Lightweight and Python-friendly | Less optimized for massive clusters | Scientific computing | Medium | Medium | Low
DuckDB | Fast local analytical queries | Not intended for ultra-large distributed systems | Embedded analytics | Medium | Low | Low
Trino | High-performance federated querying | Requires infrastructure tuning | Lakehouse analytics | High | Medium | High

Comparative benchmarking studies indicate substantial architectural trade-offs among modern distributed analytics frameworks. Apache Flink consistently demonstrates lower end-to-end latency for event-driven stream processing workloads, whereas Apache Spark generally provides superior ecosystem maturity and optimized batch analytics performance for large-scale ETL pipelines. Ray and Dask offer improved flexibility for Python-native AI development, although their orchestration overhead may increase under highly heterogeneous distributed environments. DuckDB achieves exceptionally efficient local analytical query execution through vectorized processing, but is less suitable for ultra-large distributed deployments. These observations highlight that framework selection is strongly dependent on workload characteristics, latency requirements, operational expertise, and infrastructure constraints rather than raw computational performance alone.

Mini-Case Syntheses Across Domains

Healthcare Analytics

Healthcare systems increasingly rely on distributed machine learning pipelines for disease prediction, medical imaging, and patient risk stratification. Real-time analytics frameworks combined with federated learning improve predictive performance while preserving patient privacy. However, deployment challenges include data heterogeneity, regulatory compliance, and model interpretability. Federated learning deployments in healthcare environments have demonstrated improved collaborative model training across hospitals without direct patient-level data sharing, thereby reducing privacy risks while maintaining predictive utility in diagnostic imaging and clinical risk prediction tasks.

Financial Analytics

Financial institutions utilize streaming architectures and low-latency inference systems for fraud detection, credit scoring, and algorithmic trading. Kafka-based streaming pipelines and scalable feature stores enable rapid decision-making under strict latency constraints. Reliability engineering and continuous monitoring are critical because minor prediction failures can produce substantial financial risk. Real-time fraud detection systems deployed by financial institutions increasingly utilize Kafka-centered streaming pipelines capable of millisecond-scale inference, enabling rapid anomaly detection and adaptive risk scoring under high transaction throughput conditions.

Recommender Systems

Modern recommendation platforms integrate vector databases, embedding models, and retrieval-augmented generation (RAG) architectures to improve personalization. These systems require efficient retrieval pipelines, scalable inference infrastructure, and drift monitoring due to rapidly evolving user behavior. Large-scale recommender systems incorporating vector embeddings and retrieval-augmented generation architectures have demonstrated improved semantic personalization and contextual retrieval quality across dynamic user interaction environments.

Internet of Things (IoT)

IoT environments generate high-velocity streaming data from sensors and edge devices. Apache Flink and edge computing architectures support real-time anomaly detection and predictive maintenance. Key challenges include fault tolerance, synchronization, and energy-efficient processing. Industrial IoT deployments using stream-native architectures have shown measurable reductions in equipment downtime through predictive maintenance pipelines operating on continuous sensor telemetry. These domain-specific examples demonstrate that architectural choices are strongly influenced by latency requirements, governance constraints, scalability demands, and operational complexity.

Privacy Concerns

Responsible Data Science: Privacy, Fairness, and Governance

As data systems scale, ethical and regulatory considerations become central.

Privacy-Preserving Techniques

  • Differential privacy
  • Homomorphic encryption (HE)
  • Secure multi-party computation (MPC)
  • Federated learning
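Of the techniques listed above, differential privacy is the simplest to sketch. The example below applies the Laplace mechanism to a counting query, assuming each individual contributes at most one record (sensitivity 1). The function name, the epsilon value, and the fixed seed are illustrative assumptions.

```python
import random


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise calibrated to sensitivity 1."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    sensitivity = 1.0  # one person changes a counting query by at most 1
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a Laplace(0, scale) distribution.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
noisy = private_count(true_count=100, epsilon=1.0, rng=rng)
```

Smaller values of epsilon give stronger privacy at the cost of noisier answers, because the noise scale is inversely proportional to epsilon.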

Fairness and Bias: Bias can arise from:

  • Data imbalance
  • Historical inequities
  • Model design

Mitigation strategies include fairness-aware learning and bias auditing.

Security and Compliance: Modern systems must comply with regulations (e.g., GDPR-like frameworks) and incorporate:

  • Data encryption
  • Access control
  • Secure pipelines

Model Monitoring and Reliability: Continuous monitoring ensures:

  • Drift detection
  • Performance degradation tracking
  • Explainability over time

Fairness Metrics and Continuous Bias Monitoring

Modern responsible AI systems increasingly employ quantitative fairness metrics to evaluate model behavior across demographic groups. Commonly used metrics include:

  • Demographic parity
  • Equal opportunity difference
  • Equalized odds
  • Calibration fairness
  • Disparate impact ratio

Demographic parity evaluates whether prediction outcomes are statistically independent of protected attributes, whereas equal opportunity measures consistency in true positive rates across demographic groups. Equalized odds extends this principle by jointly evaluating true positive and false positive parity. Calibration fairness assesses whether predicted probabilities maintain equivalent interpretive meaning across subpopulations, while the disparate impact ratio quantifies proportional differences in decision outcomes among protected groups. Operational AI systems also require continuous fairness monitoring because model behavior may drift over time due to changing data distributions. Bias auditing frameworks and explainability tools are therefore integrated into MLOps pipelines to support transparent and accountable AI governance.
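The metric definitions above can be made concrete with a small sketch for binary predictions and two groups. The function names, the choice to report differences as protected minus reference, and the toy labels are assumptions made for illustration; calibration fairness is omitted because it requires predicted probabilities.

```python
from typing import Dict, Sequence, Tuple


def _rate(numerator: float, denominator: float) -> float:
    return numerator / denominator if denominator else 0.0


def group_rates(y_true: Sequence[int], y_pred: Sequence[int],
                group: Sequence[str], g: str) -> Tuple[float, float, float]:
    """Selection rate, true positive rate, and false positive rate for group g."""
    idx = [i for i, gi in enumerate(group) if gi == g]
    pos = [i for i in idx if y_true[i] == 1]
    neg = [i for i in idx if y_true[i] == 0]
    selection = _rate(sum(y_pred[i] for i in idx), len(idx))
    tpr = _rate(sum(y_pred[i] for i in pos), len(pos))
    fpr = _rate(sum(y_pred[i] for i in neg), len(neg))
    return selection, tpr, fpr


def fairness_report(y_true: Sequence[int], y_pred: Sequence[int], group: Sequence[str],
                    protected: str, reference: str) -> Dict[str, float]:
    sel_p, tpr_p, fpr_p = group_rates(y_true, y_pred, group, protected)
    sel_r, tpr_r, fpr_r = group_rates(y_true, y_pred, group, reference)
    return {
        "demographic_parity_diff": sel_p - sel_r,
        "equal_opportunity_diff": tpr_p - tpr_r,
        # Equalized odds looks at both error rates; report the larger gap.
        "equalized_odds_gap": max(abs(tpr_p - tpr_r), abs(fpr_p - fpr_r)),
        "disparate_impact_ratio": _rate(sel_p, sel_r),
    }


# Hypothetical labels and predictions for two groups of four people each.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
group = ["a", "a", "a", "a", "b", "b", "b", "b"]
report = fairness_report(y_true, y_pred, group, protected="b", reference="a")
```

In a monitoring pipeline, a report like this would be recomputed on each new batch of labeled outcomes so that fairness drift is detected alongside performance drift.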

Environmental Considerations (Green AI)

Large-scale models consume significant energy. Efficient architectures and model compression are essential to reduce environmental impact. This integrated perspective ensures responsible and sustainable deployment of data science systems. Recent studies have highlighted the substantial computational and environmental costs associated with large-scale foundation models and distributed AI infrastructure. Consequently, modern sustainable AI research emphasizes:

  • Energy-efficient model architectures
  • Quantization and pruning techniques
  • Sparse training methods
  • Carbon-aware scheduling
  • Efficient hardware acceleration

Environmental impact assessment is increasingly becoming an important component of AI governance frameworks and cloud infrastructure planning. To improve operational reproducibility, responsible AI evaluation should include standardized reporting practices for both fairness and environmental sustainability. Energy consumption may be quantified using GPU-hours, power utilization effectiveness (PUE), carbon intensity of compute regions, and estimated CO2-equivalent emissions per training or inference cycle.
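The quantities listed above combine into a simple back-of-the-envelope estimate: IT energy is GPU-hours multiplied by average power draw, facility energy scales that by PUE, and emissions scale facility energy by the grid's carbon intensity. The sketch below encodes this arithmetic; the function name and all input values are hypothetical, and the estimate ignores CPUs, networking, and embodied emissions.

```python
def estimate_co2e_kg(gpu_hours: float, avg_gpu_power_kw: float,
                     pue: float, grid_intensity_kg_per_kwh: float) -> float:
    """Rough CO2-equivalent estimate (kg) for one training or inference job."""
    if min(gpu_hours, avg_gpu_power_kw, grid_intensity_kg_per_kwh) < 0:
        raise ValueError("inputs must be non-negative")
    if pue < 1.0:
        # PUE is total facility energy divided by IT energy, so it is at least 1.
        raise ValueError("PUE cannot be below 1.0")
    it_energy_kwh = gpu_hours * avg_gpu_power_kw
    facility_energy_kwh = it_energy_kwh * pue
    return facility_energy_kwh * grid_intensity_kg_per_kwh


# Hypothetical job: 1000 GPU-hours at 0.3 kW per GPU, PUE 1.2, 0.4 kg CO2e/kWh.
emissions_kg = estimate_co2e_kg(1000, 0.3, 1.2, 0.4)
```

Reporting the four inputs alongside the result lets readers recompute the estimate under a different region or hardware assumption.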

Carbon-aware scheduling can further reduce environmental impact by shifting workloads toward lower-intensity energy windows or geographically cleaner energy regions. Fairness metric selection should also remain task-dependent; demographic parity may be suitable for access allocation problems, whereas equal opportunity and equalized odds are often more appropriate for healthcare diagnosis, lending decisions, and fraud detection systems where false positive and false negative imbalance carries major societal consequences.
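A minimal sketch of carbon-aware scheduling, assuming an hourly carbon-intensity forecast is available for the compute region: the job is started at the hour that minimizes total intensity over its duration. The forecast values and the function name are hypothetical.

```python
from typing import Sequence


def best_start_hour(forecast: Sequence[float], duration_hours: int) -> int:
    """Index of the start hour with the lowest summed carbon intensity."""
    if not 0 < duration_hours <= len(forecast):
        raise ValueError("duration must fit inside the forecast horizon")
    window_sums = [
        sum(forecast[start:start + duration_hours])
        for start in range(len(forecast) - duration_hours + 1)
    ]
    # min() returns the earliest window when several are tied.
    return min(range(len(window_sums)), key=window_sums.__getitem__)


# Hypothetical forecast in gCO2e/kWh for the next eight hours.
forecast = [450, 420, 300, 180, 160, 210, 380, 470]
start = best_start_hour(forecast, duration_hours=3)
```

The same selection could be applied across regions rather than hours by comparing the summed intensity of each candidate region.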

Implications

The amount of data is increasing globally at a rate of about 50% annually, or almost 40 times since 2001, according to a 2011 McKinsey industry report consistent with early large-scale information growth estimates reported by Lyman and Varian.18 Every day, millions of videos are published on the Internet, and hundreds of billions of messages are sent over social media. Businesses typically associate a positive option value with data: since it might prove helpful in ways not yet anticipated, why not just keep it? Consequently, much of it is stored once storage becomes nearly free (one sign of how cheap storage is these days is the ability to store all of the world’s music on a $500 device).

In the 1980s, it became feasible to make decisions using large-scale datasets. As relational database technology advanced and business procedures became more automated, the field of data mining flourished in the early 1990s. Early data mining books from the 1990s25,26 explained how different machine learning techniques may be used to solve a range of business issues. Software products designed to use transactional and behavioral data for prediction and explanation saw a commensurate increase. One key takeaway from the 1990s is that machine learning demonstrates robust predictive capability in the sense that these techniques can fairly easily identify subtle structure in data without requiring significant assumptions about linearity, monotonicity, or distribution characteristics. The drawback of these techniques is that they also detect data noise, frequently without being able to differentiate between signal and noise.

There are many benefits to approaches that do not require us to assume anything about the nature of the relationship between variables before we start our investigation, notwithstanding their shortcomings. This is not insignificant. The majority of us are taught to think that theories must come from the human mind based on earlier theories, with data obtained to prove the theories’ correctness. This process is reversed by machine learning. The computer teases us by stating, “If only you knew what question to ask me, I would give you some very interesting answers based on the data,” when presented with a vast amount of data.

We frequently don’t know what questions to ask; having such a capability is powerful. For instance, take a look at a database of people who have been utilizing the healthcare system for a long time. Of these, some have been diagnosed with Type 2 diabetes, and a portion of them have experienced difficulties. Knowing whether there are any trends in the problems and whether it is possible to forecast the likelihood of problems and take appropriate action could be very helpful. It is challenging to determine which precise question would disclose such trends, though.

The data coming from a health-care system, which is fundamentally made up of “transactions,” or points of contact across time between a patient and the system, can help to make this situation more tangible. Records may include notes and observations, as well as services provided by medical professionals or drugs administered on a specific date. The raw data for 10 individuals can be depicted as a “clean period” (history before diagnosis), a red bar (“diagnosis”), and an “outcome period” (costs and various outcomes, including complications). Each colored bar in the clean period symbolizes a medication: the first person was taking seven different drugs before being diagnosed, the second was taking nine, the third was taking six, and so on. The first three patients (shown by the upward-pointing green arrows) and the sixth and tenth patients were the most expensive to treat and experienced complications.

Even with such a small temporal database, it is not easy to extract intriguing patterns. Are the gray or yellow medications linked to complications? Without the blues, the yellows? Or is it more than three blues or three yellows? The list is endless. More importantly, might doctors, insurance companies, or policymakers forecast potential problems for individuals or groups if we developed “useful” features or aggregations from the raw data? One crucial creative stage in the process of discovering new information is feature development. Usually, the raw data from multiple people needs to be combined into a canonical format before useful patterns can be discovered. Assume, for instance, that we could approximate a person’s “health status” before diagnosis by counting the number of medicines they are taking, regardless of the details of each prescription. Such aggregation is typical of feature engineering, even though it overlooks the “severity” or other aspects of the individual drugs.
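The aggregation described above, approximating pre-diagnosis health status by counting distinct medications, can be sketched as follows. The record layout, patient identifiers, drug codes, and dates are all hypothetical.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List

# Hypothetical raw "transactions": one row per point of contact.
transactions: List[dict] = [
    {"patient_id": "p1", "date": date(2020, 1, 5), "kind": "drug", "code": "drug_A"},
    {"patient_id": "p1", "date": date(2020, 2, 9), "kind": "drug", "code": "drug_B"},
    {"patient_id": "p1", "date": date(2020, 3, 1), "kind": "drug", "code": "drug_A"},
    {"patient_id": "p1", "date": date(2020, 7, 1), "kind": "drug", "code": "drug_C"},
    {"patient_id": "p2", "date": date(2020, 1, 20), "kind": "visit", "code": "checkup"},
    {"patient_id": "p2", "date": date(2020, 4, 2), "kind": "drug", "code": "drug_B"},
]
diagnosis_dates: Dict[str, date] = {"p1": date(2020, 6, 1), "p2": date(2020, 5, 15)}


def medication_count_before_diagnosis(rows: List[dict],
                                      diagnoses: Dict[str, date]) -> Dict[str, int]:
    """Count distinct drugs per patient during the clean period."""
    drugs = defaultdict(set)
    for row in rows:
        diagnosed_on = diagnoses.get(row["patient_id"])
        if row["kind"] == "drug" and diagnosed_on is not None and row["date"] < diagnosed_on:
            drugs[row["patient_id"]].add(row["code"])
    # Patients with no drug records in the clean period get a count of 0.
    return {pid: len(drugs[pid]) for pid in diagnoses}


health_status_proxy = medication_count_before_diagnosis(transactions, diagnosis_dates)
```

As the text notes, this feature deliberately ignores which drugs were taken and how severe the underlying conditions were; it trades detail for a canonical, comparable value per patient.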

Assume, too, that a “complications database” would be created from the data, potentially containing demographic data (such as patient age and medical history); it might also include health status based on a count of current medications.26 The computer typically plays a major role in model development and decision-making when predicted accuracy is the key goal in areas containing vast volumes of data. In other words, it automates Popper’s criterion of predictive accuracy for evaluating models at a scale that was previously impractical. The computer itself may construct predictive models through an intelligent “generate and test” process, culminating in an assembled model that is the decision maker.

Can we say that “poor health status” causes difficulties if we take into account one of these patterns—that individuals with “poor health status” (proxied by the number of prescriptions) have a high incidence of complications? If this is the case, we might be able to change the course of events by limiting the quantity of drugs; “It depends” is the response. It is possible that our observed set of variables may not contain the true reason. There are techniques available to derive causal structure from data, depending on how the data was obtained, if we assume we have seen all pertinent variables that might be creating issues.

In particular, to determine if the notion of causation may and should be considered, even in theory, we still need to have a thorough grasp of the “story” underlying the facts. Was it true, for example, that patients with Type 2 diabetes over 36 who were on seven or more drugs were “inherently sicker” and would have experienced difficulties regardless? If this is the case, concluding that taking a lot of drugs leads to problems might be inaccurate. On the other hand, it might be possible to extract a causal model that could be used for intervention if the observational data followed a “natural experiment” in which treatments were randomly assigned to comparable individuals and sufficient data is available for computing the pertinent conditional probabilities.

Skill

As businesses traverse the deluge of data and attempt to create automated decision systems that rely on predictive accuracy, machine learning abilities are quickly becoming essential for data scientists.26 In today’s industry, foundational machine learning training is essential. Additionally, given the proliferation of text and other unstructured data in social networks, health-care systems, and other forums, understanding text processing and “text mining” is becoming crucial. Understanding markup languages, such as XML and its variations, is particularly crucial since they allow computers to automatically read text by tagging it.

The first set of abilities is statistics, particularly Bayesian statistics, which necessitates a working grasp of probability, distributions, hypotheses, and multivariate analysis. Data scientists’ understanding of machine learning must be built upon these foundational skills. Econometrics, which focuses on fitting reliable statistical models to economic data, frequently intersects with multivariate analysis. Multivariate analysis and econometrics generally concentrate on estimating the parameters of linear models where the relationship between the dependent and independent variables is expressed as a linear equation, in contrast to machine learning techniques that make few or no assumptions about the functional form of relationships among variables.

The second set of abilities is derived from computer science and concerns the internal representation and manipulation of data by computers. This entails a series of classes on systems, methods, and data structures, including databases, distributed computing, parallel computing, and fault-tolerant computing. Systems expertise and scripting languages (like Python and Perl) are the basic building blocks needed to work with datasets of a decent scale. However, traditional database systems based on the relational data model are severely limited in their ability to handle very big datasets. A new set of competencies for data scientists is indicated by the current shift toward cloud computing and nonrelational architectures for handling massive datasets in a reliable way.

The third class of abilities is fundamental to almost all data modeling exercises and necessitates an understanding of correlation and causality. We can be fortunate even if observational data usually restricts us to correlations. Natural randomized trials and the ability to compute conditional probabilities with reliability may occasionally be represented by abundant data, allowing for the identification of causal structure.27 In areas where one has a sufficient level of confidence regarding the stability and completeness of the developed model, or whether the causal model “generating” the observed data is stable, it is useful to build causal models. A data scientist should, at least, be able to distinguish between correlation and causality and determine which models are desirable, practicable, and viable in certain contexts.

The capacity to articulate problems in a way that leads to successful answers is the final skill set, which is the least standardized, somewhat elusive, and somewhat of a craft, but also a crucial differentiator to be a good data scientist. In terms of decision-making, we are entering a big data era when computers are intrinsically superior to people for a wide range of issues; “better” could be defined in terms of cost, accuracy, and scalability. In the field of data-intensive finance, where computers make most investment decisions—often in a matter of seconds—as new information becomes available, this change has already taken place.10 The same is true for online advertising, where millions of auctions are completed every day in milliseconds, air traffic control, package delivery routing, and numerous other planning tasks that call for simultaneous scale, speed, and accuracy—a trend that is expected to pick up speed in the coming years.28

Design Playbook for Modern Data Science Systems

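To support practical decision-making, Table 6 summarizes open issues and future research directions, and Table 7, which follows the worked example, summarizes architecture recommendations under different operational constraints.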

Table 6: Open issues, challenges, and future research directions in big data analytics.
Challenge | Problem Identified in Manuscript | Current Limitation | Future Research Direction
Privacy | Personal information leakage | Weak anonymization | Federated learning
Security | Unauthorized access | Data breaches | Blockchain security
Data quality | Noisy, incomplete data | Poor prediction accuracy | Automated data cleaning
Storage | Massive volume growth | Infrastructure overload | Scalable cloud systems
Fault tolerance | System failures | Processing interruptions | Self-healing architectures
Interpretability | Black-box ML models | Lack of trust | Explainable AI
Causality | Correlation confusion | Wrong interventions | Causal inference models
Scalability | Increasing computational cost | Performance bottlenecks | Quantum analytics

Worked Example: Real-Time Fraud Detection in Financial Systems

Consider a financial institution requiring fraud detection with sub-second inference latency, continuous transaction monitoring, regulatory compliance, and fairness auditing across customer groups. Under these constraints, Apache Flink combined with Kafka Streams supports low-latency event-driven stream processing, while centralized feature stores ensure consistency between model training and real-time inference. MLflow and Kubeflow provide model versioning, deployment orchestration, rollback capability, and reproducibility.
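One kind of feature that such a streaming pipeline might maintain is the number of transactions per card within a recent time window. The sketch below is framework-agnostic plain Python rather than Flink or Kafka Streams code, and it assumes events arrive in timestamp order for each card; the class name, window length, and timestamps are hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowCounter:
    """Per-key count of events seen in the last `window_seconds` seconds."""

    def __init__(self, window_seconds: float) -> None:
        self.window = window_seconds
        self.events = defaultdict(deque)

    def observe(self, card_id: str, timestamp: float) -> int:
        queue = self.events[card_id]
        queue.append(timestamp)
        # Evict events that have fallen out of the window.
        while queue and queue[0] <= timestamp - self.window:
            queue.popleft()
        return len(queue)


counter = SlidingWindowCounter(window_seconds=60)
# Hypothetical transaction times (seconds) for a single card.
counts = [counter.observe("card_1", t) for t in (0, 10, 20, 70, 75)]
```

A fraud model could consume this count as an input feature, or a rule could flag a card once the count exceeds a threshold; in a stream processor the same state would be kept per key and checkpointed for fault tolerance.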

Continuous monitoring includes drift detection, calibration assessment, bias auditing, and performance degradation tracking. Explainability requirements are addressed through SHAP-based interpretation, while immutable audit logs support compliance and governance obligations. Carbon-aware scheduling and efficient model compression further reduce infrastructure costs and environmental impact. This example demonstrates how infrastructure selection, deployment reliability, fairness monitoring, governance requirements, and sustainability considerations must be optimized jointly rather than independently (Table 7).

Table 7: Architecture decision framework for contemporary data science systems.
Constraint | Recommended Architecture
Low-latency real-time inference | Apache Flink + Kafka Streams
Large-scale batch analytics | Apache Spark + Delta Lake
Python-native distributed AI | Ray or Dask
Strong governance and compliance | Lakehouse + Federated learning
Cost-sensitive analytics | DuckDB + Object storage
Large-scale LLM applications | Vector database + RAG + GPU orchestration
Privacy-sensitive healthcare analytics | Federated learning + Differential privacy
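
Because real systems such as the fraud detection example face several constraints at once, Table 7 can be read as a lookup from constraints to candidate architectures. The sketch below encodes the table in that form. The dictionary keys and the `recommend` function are hypothetical names introduced for illustration; the values are copied from Table 7 unchanged.

```python
ARCHITECTURE_PLAYBOOK = {
    "low_latency_realtime_inference": "Apache Flink + Kafka Streams",
    "large_scale_batch_analytics": "Apache Spark + Delta Lake",
    "python_native_distributed_ai": "Ray or Dask",
    "strong_governance_and_compliance": "Lakehouse + Federated learning",
    "cost_sensitive_analytics": "DuckDB + Object storage",
    "large_scale_llm_applications": "Vector database + RAG + GPU orchestration",
    "privacy_sensitive_healthcare_analytics": "Federated learning + Differential privacy",
}


def recommend(constraints):
    """Return the Table 7 recommendation for each constraint, in input order."""
    unknown = [c for c in constraints if c not in ARCHITECTURE_PLAYBOOK]
    if unknown:
        raise KeyError(f"No Table 7 entry for: {unknown}")
    return {c: ARCHITECTURE_PLAYBOOK[c] for c in constraints}
```

For the fraud detection scenario, `recommend(["low_latency_realtime_inference", "strong_governance_and_compliance"])` returns both rows, which makes explicit that the resulting design has to reconcile a streaming stack with lakehouse-style governance rather than pick one in isolation.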

This design-oriented framework translates theoretical concepts into deployable architectural guidance for researchers and practitioners. Collectively, the proposed unified taxonomy and accompanying design playbook extend beyond conventional descriptive surveys by integrating infrastructure engineering, distributed analytics, AI operationalization, governance, observability, and sustainability considerations into a cohesive systems-level framework. This integration enables researchers and practitioners to evaluate architectural trade-offs across the entire data science lifecycle, thereby supporting the development of scalable, reliable, interpretable, and operationally sustainable data-driven systems.

Concluding Perspective

This review examined research on data analytics, from conventional data analysis to contemporary big data analytics. From a systems perspective, the KDD process that underlies this work comprises three components: input, analysis, and results. From the standpoint of big data analytics frameworks and platforms, the discussion centered on performance-oriented and results-oriented challenges. From the standpoint of data mining problems, the paper gave a brief overview of data mining and big data mining algorithms, including clustering, classification, and frequent pattern mining. We have benefited greatly from hypothesis-driven research and from methods for developing theories.

However, our surroundings now generate large volumes of data for which these conventional methods of identifying structure neither scale well nor take advantage of observations that would not occur under controlled circumstances. For instance, controlled experiments have helped identify many disease causes in medicine, but they may not reflect the full complexity of health.20,29 By some estimates, up to 80% of the circumstances in which a drug might be prescribed (for example, when a patient is taking numerous medications) are excluded from clinical trials. When randomized experiments can be run, big data makes it possible to identify the causal models that generate the data.

Big data makes it possible for a machine to ask and validate interesting questions that people might not think of, as demonstrated earlier in the diabetes-related health-care example. This capacity is the basis for predictive modeling, which is essential for making practical business decisions.30 Data offers a new opportunity for knowledge discovery and theory creation in many previously data-starved fields of study, particularly health care and the social, ecological, and earth sciences. The diversity and scope of data currently available in these areas are unprecedented.

This new environment demands the integrated skill set described here as crucial for early-career data scientists. Parts of this skill set are taught in computer science, engineering, and business management schools, but the integration of skills required to work as a data scientist, or to manage data scientists effectively, is not yet taught as a whole. Universities are rushing to fill the gaps and offer a more comprehensive skill set that includes fundamental knowledge of computer science, statistics, causal modeling, problem formulation, isomorphs, and computational thinking.

The business models of Internet-based, data-driven companies increasingly rely on predictive modeling and machine learning. PayPal, an early success, was able to capture and dominate consumer-to-consumer payments because it could anticipate the distribution of losses for each transaction and act accordingly. This data-driven ability stood in sharp contrast to the prevailing practice of treating all transactions identically from a risk standpoint. Google’s search engine and a number of its other products are likewise based on predictive modeling. IBM’s Watson, which relies heavily on learning and prediction in its problem-solving process, is probably the most visible early demonstration of this approach: it defeated human champions at “Jeopardy!” in 2011, although that achievement falls short of passing the Turing test.31
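
The contrast between per-transaction loss estimates and uniform risk treatment can be illustrated with a small expected-loss calculation. The sketch below is purely illustrative and does not describe PayPal's actual method: the function names, the discrete loss distributions, and the review cost are all hypothetical. It shows only the general idea that a transaction is escalated when its expected loss exceeds the cost of reviewing it.

```python
def expected_loss(loss_distribution):
    """Expected loss for a list of (probability, loss_amount) outcomes."""
    total_p = sum(p for p, _ in loss_distribution)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * loss for p, loss in loss_distribution)


def choose_action(loss_distribution, review_cost):
    """Escalate to review only when the expected loss exceeds the review cost."""
    el = expected_loss(loss_distribution)
    action = "review" if el > review_cost else "approve"
    return action, el


# Two hypothetical transactions of the same amount but different risk profiles.
low_risk = [(0.99, 0.0), (0.01, 200.0)]   # 1% chance of losing 200
high_risk = [(0.90, 0.0), (0.10, 200.0)]  # 10% chance of losing 200
```

With a hypothetical review cost of 5, the low-risk transaction (expected loss of about 2) is approved and the high-risk one (expected loss of about 20) is reviewed, whereas a uniform policy would have to treat both identically.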

In a game like “Jeopardy!,” where the domain is open-ended and nonstationary, and the question itself is frequently difficult to understand, it is impractical to succeed by exhaustively enumerating possibilities or by top-down theory development. The answer is to give a computer the capacity to train itself automatically on a vast number of examples. Watson also showed how the availability of excellent, human-curated data, such as that found on Wikipedia, significantly increases the capacity of machine learning. Combining machine learning with human knowledge is another trend that seems to be growing. To help the machine comprehend the entities that correspond to the deluge of strings it continuously processes, Google has ventured into the Knowledge Graph.32 Google seeks to comprehend “things,” not merely “strings.”28

Managers and organizations have a difficult time adjusting to the new data landscape. Many of their well-established intuitions can now be tested, experiments can be conducted correctly and affordably, and decisions can be made based on evidence. Taking advantage of this potential requires a fundamental transformation in organizational culture, which organizations that have embraced data for decision-making already exhibit.

Supplementary Appendix A maps all figures, architectural layers, comparative tables, and representative frameworks to their corresponding foundational and contemporary references to improve reproducibility, citation traceability, and technical interpretability.

Declarations

Ethics approval and consent to participate: This review was conducted in accordance with ethical standards for academic research and publication. The author confirms that no human participants or animals were involved in the creation of this review paper, and therefore, ethical approval was not required. All sources and references have been properly cited to acknowledge the contributions of other researchers. The author has adhered to best practices for transparency, integrity, and objectivity in the preparation and presentation of this review.

References
  1. Lyman P, Varian HR. How Much Information? University of California; 2003.
  2. Zaharia M, Chen A, Davidson A, et al. Delta Lake: high-performance ACID table storage over cloud object stores. Proc VLDB Endow. 2021;14(12):3411–3424. https://doi.org/10.14778/3476311.3476364
  3. Armbrust M, Ghodsi A, Xin RS, Zaharia M. Lakehouse: a new generation of open platforms that unify data warehousing and advanced analytics. CIDR; 2021.
  4. Laney D. 3D data management: controlling data volume, velocity and variety. META Group Research Note; 2001.
  5. Akidau T, Bradshaw R, Chambers C, et al. The dataflow model. Proc VLDB Endow. 2015;8(12):1792–1803. https://doi.org/10.14778/2824032.2824076
  6. Karau H, Warren R. High Performance Spark. O’Reilly Media; 2021.
  7. Zaharia M, Xin RS, Wendell P, et al. Apache Spark: a unified engine for big data processing. Commun ACM. 2016;59(11):56–65. https://doi.org/10.1145/2934664
  8. Breck E, Polyzotis N, Roy S, et al. Data validation for machine learning. MLSys. 2019.
  9. Fayyad U, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery in databases. AI Magazine. 1996;17(3):37–54. https://doi.org/10.1609/aimag.v17i3.1230
  10. Sculley D, Holt G, Golovin D, et al. Hidden technical debt in machine learning systems. NeurIPS. 2015;2:2503–2511.
  11. Bommasani R, Hudson DA, Adeli E, et al. On the opportunities and risks of foundation models. arXiv. Preprint posted online August 16, 2021. https://doi.org/10.48550/arXiv.2108.07258
  12. Lewis P, Perez E, Piktus A, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv. Preprint posted online May 22, 2020. https://doi.org/10.48550/arXiv.2005.11401
  13. Kairouz P, McMahan HB, Avent B, et al. Advances and open problems in federated learning. Found Trends Mach Learn. 2021;14(1–2):1–210. https://doi.org/10.1561/2200000083
  14. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. KDD. 2016;785–794. https://doi.org/10.1145/2939672.2939785
  15. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–2830.
  16. Abadi M, Barham P, Chen J, et al. TensorFlow: a system for large-scale machine learning. OSDI. 2016;265–283. https://doi.org/10.5555/3026877.3026899
  17. Meng X, Bradley J, Yavuz B, et al. MLlib: machine learning in Apache Spark. J Mach Learn Res. 2016;17:1235–1241.
  18. Chen M, Mao S, Liu Y. Big data: a survey. Mobile Netw Appl. 2014;19:171–209. https://doi.org/10.1007/s11036-013-0489-0
  19. Ribeiro MT, Singh S, Guestrin C. “Why should I trust you?” Explaining predictions of any classifier. KDD. 2016;1135–1144. https://doi.org/10.1145/2939672.2939778
  20. Imbens GW, Rubin DB. Causal inference in statistics, social, and biomedical sciences. Cambridge University Press; 2015. https://doi.org/10.1017/CBO9781139025751
  21. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. NeurIPS. 2017. https://doi.org/10.48550/arXiv.1705.07874
  22. Molnar C. Interpretable machine learning. 2nd ed. Shroff/Molnar; 2022.
  23. McMahan HB, Moore E, Ramage D, et al. Communication-efficient learning of deep networks from decentralized data. AISTATS. 2017. https://doi.org/10.48550/arXiv.1602.05629
  24. Bommasani R, Hudson DA, Adeli E, et al. On the opportunities and risks of foundation models. arXiv. Preprint posted online August 16, 2021. https://doi.org/10.48550/arXiv.2108.07258
  25. Brown T, Mann B, Ryder N, et al. Language models are few-shot learners. NeurIPS. 2020. https://doi.org/10.48550/arXiv.2005.14165
  26. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. NeurIPS. 2017. https://doi.org/10.48550/arXiv.1706.03762
  27. Breck E, Cai S, Nielsen E, et al. The ML test score: a rubric for ML production readiness. IEEE Big Data. 2017. https://doi.org/10.1109/BigData.2017.8258038
  28. Amershi S, Begel A, Bird C, et al. Software engineering for machine learning: a case study. ICSE; 2019. https://doi.org/10.1109/ICSE-SEIP.2019.00042
  29. Pearl J. Causality: models, reasoning and inference. 2nd ed. Cambridge University Press; 2009. https://doi.org/10.1017/CBO9780511803161
  30. Jordan MI, Mitchell TM. Machine learning: trends, perspectives, and prospects. Science. 2015;349(6245):255–260. https://doi.org/10.1126/science.aaa8415
  31. Goodfellow I, Bengio Y, Courville A. Deep learning. MIT Press; 2016. https://doi.org/10.7551/mitpress/10277.001.0001
  32. Gandomi A, Haider M. Beyond the hype: big data concepts, methods, and analytics. Int J Inf Manage. 2015;35(2):137–144. https://doi.org/10.1016/j.ijinfomgt.2014.10.007
Appendix
Supplementary Figure S1: PRISMA-inspired workflow illustrating the literature identification, screening, eligibility assessment, and inclusion process used in this narrative review of contemporary data science and big data analytics systems (2019–2026).

Supplementary Table S1: Included Studies Used in the Narrative Synthesis

Ref. No. | First Author (Year) | Thematic Category | Methodological Focus | Primary Contribution Area
1 | Lyman (2003) | Digital Information Growth | Global information measurement | Early digital data expansion analysis
2 | Zaharia (2021) | Data Infrastructure & Lakehouse Systems | ACID lakehouse architecture | Delta Lake cloud-native storage
3 | Armbrust (2021) | Unified Analytics Platforms | Lakehouse architecture integration | Unified warehousing and analytics
4 | Akidau (2015) | Distributed Stream Processing | Stream and batch processing | Dataflow computational framework
5 | Laney (2001) | Big Data Foundations | 3Vs conceptual framework | Volume, velocity, and variety
6 | Karau (2021) | Distributed Analytics Optimization | Spark performance engineering | High-performance Spark analytics
7 | Zaharia (2016) | Distributed Computing Frameworks | Cluster-scale analytics engine | Apache Spark unified engine
8 | Breck (2019) | MLOps & Reliability Engineering | ML data validation | Production-ready validation pipelines
9 | Fayyad (1996) | Knowledge Discovery & Data Mining | KDD process formalization | Knowledge discovery framework
10 | Sculley (2015) | ML Systems Engineering | Technical debt analysis | Reliability risks in ML systems
11 | Bommasani (2021) | Foundation Models & AI Governance | Foundation model assessment | Risks and opportunities of foundation models
12 | Lewis (2020) | Retrieval-Augmented AI | Retrieval-enhanced NLP pipelines | RAG architecture
13 | Kairouz (2021) | Privacy-Preserving AI | Federated learning survey | Open challenges in federated learning
14 | Chen T (2016) | Machine Learning Algorithms | Gradient boosting optimization | XGBoost scalable learning
15 | Pedregosa (2011) | Machine Learning Frameworks | Open-source ML toolkit | Scikit-learn ecosystem
16 | Abadi (2016) | Deep Learning Infrastructure | Large-scale neural computation | TensorFlow architecture
17 | Meng (2016) | Distributed Machine Learning | Parallelized ML computation | MLlib scalable learning framework
18 | Chen M (2014) | Big Data Systems Survey | Distributed analytics survey | Big data architecture overview
19 | Ribeiro (2016) | Explainable AI | Local model explanations | LIME interpretability framework
20 | Lundberg (2017) | Explainable AI | Feature attribution explainability | SHAP framework
21 | Molnar (2022) | Explainable AI | Interpretable machine learning | Explainability methodologies
22 | McMahan (2017) | Federated Learning | Communication-efficient optimization | Decentralized deep learning
23 | Kairouz (2021) | Federated Learning | Large-scale federated systems | Research directions and limitations
24 | Bommasani (2021) | Responsible AI & Governance | AI safety and societal impact | Foundation model governance
25 | Brown (2020) | Large Language Models | Transformer-based language learning | Few-shot LLM capabilities
26 | Vaswani (2017) | Deep Learning Architectures | Attention mechanisms | Transformer neural architecture
27 | Breck (2017) | ML Reliability Engineering | ML deployment readiness | ML production evaluation rubric
28 | Sculley (2015) | Production ML Engineering | Operational ML maintenance | Hidden technical debt analysis
29 | Amershi (2019) | Software Engineering for AI | Industrial ML deployment lifecycle | Enterprise ML engineering practices
30 | Pearl (2009) | Causal Inference | Structural causal reasoning | Probabilistic causality frameworks
31 | Imbens (2015) | Applied Causal Inference | Statistical causal modeling | Biomedical and social causal analysis
32 | Jordan (2015) | Machine Learning Research Trends | AI synthesis and forecasting | Future ML research directions
33 | Goodfellow (2016) | Deep Learning Foundations | Neural network methodologies | Foundational deep learning theory
34 | Gandomi (2015) | Big Data Analytics | Big data conceptual synthesis | Beyond-3Vs analytics framework

Appendix A. Reference Mapping for Figures, Tables, Frameworks, and Representative Technologies

Appendix A1. Figure-to-Reference Mapping

Figure | Description | Key Concepts / Technologies | Foundational References
Figure 1 | End-to-end modern data science lifecycle | KDD workflow, ingestion, preprocessing, analytics, deployment | Fayyad et al. (1996); Jordan & Mitchell (2015)
Figure 2 | Distributed analytics and AI infrastructure | Spark, TensorFlow, MLlib, stream processing | Zaharia et al. (2016); Abadi et al. (2016); Meng et al. (2016)
Figure 3 | Expanded characteristics of big data | 3Vs, veracity, value, validity, variability | Laney (2001); Gandomi & Haider (2015); Chen et al. (2014)
Figure 4 | Unified taxonomy for modern data science ecosystems | Lakehouse, MLOps, RAG, governance, observability | Armbrust et al. (2021); Zaharia et al. (2021); Bommasani et al. (2021); Lewis et al. (2020)

Appendix A2. Table-to-Reference Mapping

Table | Main Theme | Supporting References
Table 1 | Evolution of data science and big data | Fayyad et al.; Laney; Lyman & Varian
Table 2 | Distributed analytics frameworks | Spark, Flink, TensorFlow, MLlib
Table 3 | Machine learning algorithms and frameworks | XGBoost; Scikit-learn; TensorFlow
Table 4 | Comparative analytics framework evaluation | Spark; Dataflow; Ray; Dask literature
Table 5 | Explainable AI and responsible AI frameworks | Ribeiro; Lundberg; Molnar; Bommasani
Table 6 | MLOps and deployment engineering | Breck; Sculley; Amershi
Table 7 | Federated learning and privacy-preserving AI | McMahan; Kairouz
Table 8 | Future trends in AI systems and governance | Brown; Vaswani; Pearl; Jordan

Appendix A3. Framework and Technology Mapping

Technology / Framework | Functional Role | Representative References
Delta Lake | ACID lakehouse storage | Zaharia et al. (2021)
Lakehouse Architecture | Unified analytics platform | Armbrust et al. (2021)
Apache Spark | Distributed analytics engine | Zaharia et al. (2016)
TensorFlow | Deep learning infrastructure | Abadi et al. (2016)
MLlib | Distributed machine learning | Meng et al. (2016)
XGBoost | Scalable boosting algorithm | Chen & Guestrin (2016)
Scikit-learn | Classical machine learning toolkit | Pedregosa et al. (2011)
LIME | Explainable AI | Ribeiro et al. (2016)
SHAP | Explainable AI | Lundberg & Lee (2017)
Federated Learning | Privacy-preserving distributed AI | McMahan et al. (2017); Kairouz et al. (2021)
RAG | Retrieval-enhanced generation | Lewis et al. (2020)
Foundation Models | Generative AI systems | Bommasani et al. (2021)
Transformer Networks | Attention-based deep learning | Vaswani et al. (2017)

Appendix A4. Foundational Seminal Literature Integrated Throughout the Manuscript

Foundational Topic | Seminal Reference
Knowledge Discovery in Databases (KDD) | Fayyad et al. (1996)
Big Data 3Vs | Laney (2001)
Digital Information Explosion | Lyman & Varian (2003)
Big Data Analytics Concepts | Gandomi & Haider (2015)
Big Data Systems Survey | Chen et al. (2014)
Machine Learning Research Trends | Jordan & Mitchell (2015)
Deep Learning Foundations | Goodfellow et al. (2016)
Explainable AI | Ribeiro et al. (2016); Lundberg & Lee (2017)
Federated Learning | McMahan et al. (2017)
Foundation Models | Bommasani et al. (2021)
Retrieval-Augmented Generation | Lewis et al. (2020)
Causal Inference | Pearl (2009); Imbens & Rubin (2015)

