In Pursuit of Optimality
September 24, 2020 • 10 minute read


This blog is written courtesy of the Interactions R&D team.

Technological shifts in computing over the past three decades, including personal computers, hardware acceleration, and ubiquitous computing, complemented by a data revolution in collection, aggregation, storage, and dissemination made possible by ever-expanding communication network infrastructure, have had a profound and transformative impact on our daily lives. Scientific research in speech and language has not been immune to these transformations.

In the early years of what came to be known as the field of Artificial Intelligence (AI), scientists and researchers attempted to model machines that were designed to mimic human behavior through an analysis of human sensory and reasoning systems. Early speech recognition systems were based on the psychoacoustic aspects of human speech perception; models were developed through an understanding of the human ear. The first generation of speech synthesis, articulatory speech synthesis, was based on an analysis of the human speech production system. Natural language analysis and generation systems were grounded in a mathematical analysis of the syntax of human language. In later years, dialog systems were designed around knowledge representations in formal logic frameworks through a study of the cognitive aspects of reasoning. The underlying hypothesis was that models of AI systems would benefit from a deeper scientific understanding of human faculties.

While the sixties through the eighties saw some of the foundational work in computational frameworks for artificial intelligence, such as formal frameworks for language analysis, knowledge representation languages, and frameworks for formal reasoning, the diversity of ideas also presented a significant challenge to the field. Several proof-of-concept AI prototypes blossomed, each built to demonstrate a niche idea, but the breadth of their applicability, their robustness to real-world scenarios, and the interoperability of their ideas were seldom evaluated. Consequently, growing disillusionment with the ability of knowledge-driven AI to deliver on its promises ensued. Concurrently, fueled by low-cost computing and data storage that coincided with the advent of massively scaled computer networks (the Internet), a new paradigm of data-oriented AI started to take root.

By the mid-eighties, the field of Automatic Speech Recognition (ASR) had already abandoned the pursuit of modeling the human auditory system and had embraced an engineering approach: using audio and transcription data to fit the parameters of a generative model that captured the dynamics of acoustic sequences. With the strong requirements of DARPA programs to demonstrate progress objectively, and with the availability of the syntactically annotated Wall Street Journal corpus in the early nineties, the field of Natural Language Processing (NLP) followed the ASR community in adopting a probabilistic view of language analysis, ushering in the era of statistical NLP. With probabilistic estimation techniques applied to text and speech resources, and with evaluation metrics for various subtasks in speech and language processing, these fields saw an influx of new entrants from fields such as optimization and machine learning, resulting in unprecedented growth in the number of papers at conferences. The brittle, narrow-domain systems of the eighties gave way to scalable, broad-domain technologies trained on annotated data sets, even if shallow in their analyses. Computational grammars that encoded linguistic analyses in well-founded mathematical grammatical frameworks were overshadowed by statistical NLP systems trained to reproduce the annotations in test sets. With carefully designed objective metrics for many of the subtasks in speech and language processing, model performance in NLP systems became easy to measure quantitatively and progress easy to track systematically.
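
To make this generative, estimate-from-data mindset concrete, here is a minimal sketch (an illustration added for this post, not a description of any particular system) that fits a toy maximum-likelihood bigram language model from a handful of sentences; the corpus and function names are purely illustrative.

```python
from collections import Counter, defaultdict

def train_bigram_lm(sentences):
    """Maximum-likelihood bigram model: the parameters come entirely from counts."""
    context_counts = Counter()
    bigram_counts = defaultdict(Counter)
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, curr in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[prev][curr] += 1
    # P(curr | prev) = count(prev, curr) / count(prev)
    return {prev: {curr: n / context_counts[prev] for curr, n in nexts.items()}
            for prev, nexts in bigram_counts.items()}

# Toy corpus standing in for transcribed or annotated training data.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
lm = train_bigram_lm(corpus)
print(lm["the"])  # roughly {'cat': 0.667, 'dog': 0.333}
```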

While by the late nineties generative models, with their parameters estimated from training data, had become the mainstay of speech and language technologies, another revolution was around the corner to challenge the status quo once again. Since evaluating model efficacy on a test set using objective metrics had become the norm in the field (word accuracy for ASR, parse accuracy for syntactic analysis, F1 score for entity recognition, and BLEU score for machine translation), the new idea was to directly optimize the evaluation metric in order to maximize success. Discriminative classification tools from the sister field of machine learning had just come of age, and NLP tasks provided a rich set of problems for applying these tools to demonstrate their superiority over generative models, as measured by the objective metrics. The field transitioned to casting a variety of speech and language tasks as classification problems so as to leverage the rapidly evolving machine learning tools. A standard template for approaching a language problem emerged: attributes of speech and language tasks would be extracted from supervised corpora through feature extraction functions designed by subject matter experts, and the best way to combine these features to optimize the objective metric was relegated to classification algorithms. While the discriminative models outperformed generative models on the designed metrics and on the supervised corpora set up for tracking progress, these models were overfit to those specific corpora, and their accuracy on out-of-domain datasets was unexpectedly poor.
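
That template can be sketched in a few lines (an illustration added here, using scikit-learn; the toy task of tagging person names and the specific feature functions are invented for the example): the expert supplies the features, while the learning algorithm decides how to weight them, and progress is tracked with the task's objective metric.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def features(token, prev_token):
    # Hand-designed attributes that a subject matter expert might propose.
    return {
        "is_capitalized": token[0].isupper(),
        "prev_is_title": prev_token.lower() in {"mr.", "ms.", "dr."},
        "suffix2": token[-2:].lower(),
    }

# Tiny supervised corpus: (token, previous token, is-person-name label).
train = [("Dr.", "", 0), ("Smith", "Dr.", 1), ("visited", "Smith", 0),
         ("Paris", "visited", 0), ("Ms.", "visited", 0), ("Jones", "Ms.", 1)]
X_dicts = [features(tok, prev) for tok, prev, _ in train]
y = [label for _, _, label in train]

vec = DictVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(X_dicts), y)

# The learning algorithm, not the expert, decides how to weight the features;
# the designated objective metric (here F1, on the same toy data) tracks progress.
pred = clf.predict(vec.transform(X_dicts))
print(f1_score(y, pred))
```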

What started in the early nineties as the use of annotated corpora to serve the needs of evaluating language technologies had, by the turn of the century, become the dominant paradigm for building language processing systems: collect ever-larger corpora in the domain of interest, annotate them with the phenomena of interest, extract features from the annotated corpora, train supervised classification models, evaluate the models, and iterate if model accuracy is not satisfactory. The realm of scientific inquiry into language universals that was dominant fifty years prior had been overrun by an engineering paradigm, where linguistic knowledge was encoded descriptively by experts trained in the field, through the design of annotation style guides and features for the machine learning tools to digest and regurgitate. The scope of generalization statements had shrunk from universal linguistic principles of human language (e.g., agreement, argument structure, recursive and hierarchical structure) to statements about the predictive power of the theories and models induced by machine learning algorithms trained on specific sublanguages of task-specific annotated corpora. While the field's aspiration to generalize beyond the supervised corpora is still evident, the state-of-the-art tools used to this day have not succeeded in providing the out-of-domain generalizations that could be termed universal.

The first decade of the twenty-first century demonstrated the effectiveness of the discriminative classification paradigm through supervised training for language tasks, and the second decade saw the mild waves of neural networks of the fifties become a tsunami that eventually submerged not only language processing but all subfields of AI. ASR's decades-old traditions of Gaussian Mixture Models and Hidden Markov Models were swept aside in favor of hybrid DNN models and end-to-end DNN models, eclipsing the meager gains of past years with a quantum leap. The story was no different for neural machine translation, or for the other subtasks of NLP. The advent of GPU chips, initially designed for gaming algorithms with optimized matrix-based computing, served to rewrite the checkered history of (deep) neural networks (DNNs) and made them the tool of choice for many AI problems. Beyond the computing infrastructure, the pervasive use of DNNs can also be attributed to a few other factors. The limited computational power available for training the linear discriminative classifiers of the late nineties required a modeler to consult a domain expert, who would limit and shape the space of possible hypothesis models by providing predictive and usually interpretable features for the learning algorithm. In contrast, proponents of DNNs claimed that DNN modeling freed modelers from having to design such specialized features, thus obviating the need for highly skilled domain experts, and instead transferred the expertise to machine learning practitioners who set up training regimes, chose network topologies, and searched the hyperparameter space. DNNs would implicitly discover optimal features for the task, often outperforming the hand-crafted features provided by subject matter experts, even if at an increased computational cost. DNNs would not only optimize the objective metric by learning weights over features; they would also optimize the features themselves in service of that metric. With ever-faster GPUs, fueled by a culture of open-source DNN software frameworks created and supported by behemoth corporations and a culture of open AI tasks supported by open-sourced data, the field has moved from the leaderboards limited to DARPA participants of the nineties to global leaderboards, with entrants ever tuning DNNs with deeper networks and more complex algorithms in order to outpace the competition.
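
The contrast with the feature-engineering template can be sketched in a few lines of PyTorch (again an illustration added here; the tiny model, random data, and hyperparameters are arbitrary): an embedding layer and hidden layers learn task-specific representations directly from raw token ids, so the human effort shifts from designing features to choosing topologies and hyperparameters.

```python
import torch
import torch.nn as nn

class TokenClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64, num_labels=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # learned, not hand-crafted
        self.encoder = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_labels),
        )

    def forward(self, token_ids):
        return self.encoder(self.embed(token_ids))

model = TokenClassifier(vocab_size=1000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on random data; in practice the tuning effort
# goes into architectures and hyperparameters rather than feature definitions.
tokens = torch.randint(0, 1000, (8,))
labels = torch.randint(0, 2, (8,))
optimizer.zero_grad()
loss = loss_fn(model(tokens), labels)
loss.backward()
optimizer.step()
```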

The relentless pursuit of optimality on specific tasks, as defined by the particular data sets available for those tasks, has taken the field away from seeking optimal solutions in the broader context: generalizing beyond specific data sets and ultimately leveraging universal properties of language. The quest for such data-set- and domain-agnostic universals would not only reposition the field as a scientific discipline, but would also have engineering benefits, creating technologies that are less data-dependent and more domain-portable, with interpretability and transparency in their decisions. Data-driven AI has offered many benefits to the field, delivering robust systems for specific tasks with quantifiable metrics and an objective way to track progress. It has opened the coffers of large enterprises, bringing unprecedented funding to a field previously limited to DARPA and NSF support. However, AI as a field, and language technologies in particular, has become overly dependent on the availability of data, and more so on supervised data. Consequently, only those problems that have openly available data sets consume the imagination of the community. With data for meaningful industrial problems locked in enterprise vaults, the utility of models built on open data sets is relegated to benchmarking alternative modeling techniques. Moreover, new problems such as detecting and quantifying bias in our AI models are at the forefront of public conversation, because algorithms are woefully limited in distinguishing the vagaries of the data from the invariant features of the task. The compute-intensive, data-intensive paradigm of DNN-based modeling has given large multinational enterprises a distinct advantage, since they are well positioned not only to collect data but also to annotate it, serving as fodder for DNN algorithms.

This paradigm minimizes the role of subject matter expertise, turning most AI problems into data management, process management, and a trial-and-error exercise in DNN architectures and hyperparameter search that leaves a large carbon footprint.  

Backed by two and a half decades of data-driven technological advancements, AI is now permeating all layers of society. The awe and wonder of early prototypes that had given flight to our collective imagination have been replaced by real-world AI systems that must deliver accountable, transparent, and interpretable decisions to be trustworthy. Furthermore, with the increased attention to data privacy, the collection, annotation, and management of data are under significant scrutiny, which in turn might alter the course of data-driven AI. These developments might require the injection of prior knowledge and universals in order to wean AI away from excessive data dependence, and facilitate the introduction of a new and holistic objective function in the pursuit of optimality.