- Artificial Intelligence (AI)
- Backdrop: The Fourth Industrial Revolution
- A brief history of AI
- Feeding the machine
- Artificial Neural Networks
- Large Language Models
- Natural Language Programming (Prompt design)
Artificial Intelligence (AI)
The term AI gets thrown around a lot these days and has been applied to a wide range of things, especially in the business world where everyone wants it, even if they are not really sure what ‘it’ is. Perhaps the first point to note is that “intelligence” is a complex word, and it is unlikely that the algorithms we have in 2021 approach anything like human concepts of intelligence.
Secondly, “learning” is another complex word that in a human context would imply the existence of some level of consciousness, something that also seems unlikely to occur at any significant level in today’s machines. It is important to acknowledge the difficulties in nailing down definitions of intelligence and learning, as this impacts on discussions of agency, morality, and intentionality when we discuss AI in relation to humans and the natural world.
“For the present purpose the artificial intelligence problem is taken to be that of making a machine behave in ways that would be called intelligent if a human were so behaving.”
– McCarthy et al., 1955
At a top level, AI employs algorithms and applies them to sets of data. Often the algorithms are designed by humans (though, this is changing in some areas). Data is a key component of how AI functions and it is usually selected and curated by humans (though this too is changing). It is also (for now) humans who decide when, where, and how AI is deployed. And it is humans, and sometimes the natural world, that are most affected by poor AI design and deployment decisions. So as you can already see, most AI is deeply intertwined with humans and thus is often deeply steeped in our human values and perspectives.
“AI is the designing and building of intelligent agents that receive percepts from the environment and take actions that affect that environment.”
– Russell & Norvig (2020)
AI is a very broad term and one that has been used loosely by many different people and industries. Some believe that AI should mimic human cognition and some believe that it should exceed it. In reality, what marketing departments usually brandish as AI is far less glamorous than either of these aspirations.
At the outset it is important to understand that there is no single definition of AI that everyone would agree on. There are, however, clearly two main streams. Firstly, there is symbolic AI, or brute-force AI, or GOFAI (Good Old-Fashioned AI), which dominated the latter part of the 20th Century. Secondly, there is neural AI, which includes deep learning types of AI and is fast becoming the predominant type of AI we use. Both have their benefits and risks; however, my research focuses on the latter, neural AI.
“Early AI was mainly based on logic. You’re trying to make computers that reason like people. The second route is from biology: You’re trying to make computers that can perceive and act and adapt like animals.”
– Geoffrey Hinton
This page is a very brief overview of the background and technology of the AI that I research. The page is written to complement the other sections of my website that cover the philosophy and ethics that I focus on.
Backdrop: The Fourth Industrial Revolution (4IR)
The fourth industrial revolution (4IR) is driven by emerging technologies such as: artificial intelligence (AI), big data analytics, neural networks, cloud computing, 3D printing, blockchain, quantum computing, and advanced automation (Schwab, 2017). Notably, it is how these technologies interact and enhance each other that has driven the exponential changes of 4IR.
Along with adjustments for Industry 4.0, the ubiquity of new technologies in our daily lives, such as facial recognition, the Internet of Things (IoT), and mobile supercomputing, signals substantial changes in our social systems. Since the term Industry 4.0 was coined (Drath & Horch, 2014) and the Fourth Industrial Revolution was declared in 2015 (Schwab), there has been extensive discussion of both topics, with exact definitions shifting as one would expect of emerging fields. My research takes the view that Industry 4.0 is a subset of 4IR but recognises that some may dispute this view.
There is no doubt that 4IR is rapidly heralding so many emerging technologies that it has become difficult for our governments, social scientists, rule- and policy-makers, and others to keep up. The ‘move fast and break things’ era was just that: it moved very fast…and it broke a lot of things. On the Ethics page of this site, I have given some case studies that highlight specific examples.
The field of emerging technologies is far too broad to adequately describe here, so I will follow a path that will lead the reader to the exact technologies I am currently working on, Large Language Models.
A brief history
The field of AI rose from the work of cyberneticians in the 1940s: a key text, “Control and communication in the animal and the machine”, was published by Norbert Wiener (1948). Cybernetics is focused on feedback systems and took much inspiration from the field of biology. Early developers included Alan Turing, John von Neumann, and others inspired by the human brain and the way it thinks and reasons.
A major chronological landmark is the 1956 Dartmouth summer workshop, which included 20 founding AI theorists. It was the organiser, McCarthy, who coined the term Artificial Intelligence for the conference, in part as a way to form a distinction from the then dominant field of cybernetics. It is perhaps important to note that at this pivotal workshop that set the tone for the development of the field, there were no women; all attendees were white men from a small group of elite USA institutions such as Princeton, MIT, and IBM. There is no doubt that all of these scholars had excellent intentions, but as we now know, such homogeneous collaborative workshops almost always fail to consider the perspectives of excluded groups.
The 1960s saw a boom in the field. The explosion of research was particularly due to the US military research agency DARPA (the same one that funded the development of the Internet) pouring in large sums of money. Unfortunately, the speed of innovation could not keep pace with the hype. The expectations of the government funders outpaced the ability of the technology to realistically develop and an AI winter ensued from the early 1970s to at least the mid 1980s, or by some measures to the end of the 20th Century.
Perhaps the next most notable milestone was when IBM’s Deep Blue became the first computer to beat a world chess champion, beating Kasparov in 1997, although this machine win was still primarily based on brute computational force. Deep Blue is emblematic of expert or narrow AI or symbolic AI; that is, AI powered by decision-trees of conditional statements such as if…then, a process that is very deterministic.
The moment a machine beat the world’s best human chess master was significant, particularly as viewed from the Western eye. Deep Blue was no chess-playing Mechanical Turk of the 18th century, in which a man hid under the table of a completely unintelligent automaton guiding its every move. Deep Blue used algorithmic processes to compute the best moves. In 2016 we saw a machine beat the world’s best human Go master: another significant moment, particularly as viewed from the Eastern eye.
The AlphaGo victory was more astounding in that the game of Go is orders of magnitude more complex. A typical game of chess is around 80 moves with about 10^123 possible moves. A typical game of Go is around 150 moves with about 10^360 possible moves (Koch, 2016). For context, there are an estimated 10^78 to 10^82 atoms in the known, observable universe. The vast increase in possibilities in Go requires a different approach, one that is more intuitive. This sense of intuition leads many in Eastern cultures to see Go as a way of touching the universe.
It is important to note here that Go is not just more complex than chess, it is different. Often the progression of machines beating humans first at chess and then at Go is seen as a progression of difficulty, but it is more than that: it required a fundamentally different approach. The AI that beat the chess Grand Master in 1997 was an entirely different species from the AI that beat the Go Grand Master in 2016. The earlier machine, Deep Blue, was powered by symbolic AI, or brute-force AI, crunching possible combinations till it found the best next move. The later machine, AlphaGo, was driven by neural network AI. The scale of possibilities involved in Go means that brute-force computation is unfeasible. Patterns and relationships between patterns become more important in Go; human masters speak of ‘heavy’ and ‘light’ patterns and seem to intuit latent possibilities in the relationships of the stones. Pattern recognition is what deep learning excels at.
“It is ironic that the most powerful techniques for this (Go) fully deterministic game—in which every move is entirely determined based on earlier moves—are probabilistic.”
– Koch, 2016
AlphaGo was allowed to train itself in a process called reinforcement learning, a technique that enabled the neural net to explore the game in ways that no human ever had. Once again, this breakthrough technique was inspired by biology, specifically the psychology of behaviourism. It works in much the same way as training your dog: reinforce the correct behaviour until it associates that behaviour with reward (i.e. a winning game). IBM trained Deep Blue by labelling ‘good’ and ‘bad’ board positions; DeepMind let AlphaGo figure it out for itself. Due to this entirely different approach, AlphaGo was able to come up with what has been called an un-human, yet winning, move: move number 37.
To understand AlphaGo and Artificial Neural Networks (ANNs) we need to go back to those early days of the 1940s and 1950s, when there was a split between symbolic AI and connectionist (the ancestor of ANN) AI. Warren McCulloch and Walter Pitts published a model for neural networks in 1943 in a biophysics journal, showing the field’s early inspiration from our own human neural architecture. In 1949 Donald Hebb developed unsupervised learning (later dubbed Hebbian learning), the idea being that neural networks could learn through successive loops. Work in neural networks inspired Rosenblatt to create the Perceptron in 1958, which was able to do basic pattern recognition, though at the time machines simply didn’t have enough processing power to adequately progress this line of development.
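The perceptron’s learning rule is simple enough to sketch in a few lines of Python. This toy version (my own illustrative sketch, not Rosenblatt’s original implementation) learns the logical OR function from labelled examples:

```python
def step(x):
    """Threshold activation: fire (1) if the weighted input is non-negative."""
    return 1 if x >= 0 else 0

def train_perceptron(samples, epochs=10, lr=0.1):
    """Learn two weights and a bias from (inputs, target) pairs."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            pred = step(w[0] * x1 + w[1] * x2 + b)
            error = target - pred        # -1, 0, or +1
            w[0] += lr * error * x1      # nudge weights toward the target
            w[1] += lr * error * x2
            b += lr * error
    return w, b

# Truth table for OR as labelled training data
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w, b = train_perceptron(data)
predictions = [step(w[0] * x1 + w[1] * x2 + b) for (x1, x2), _ in data]
print(predictions)  # → [0, 1, 1, 1], matching the OR targets
```

The perceptron converges here because OR is linearly separable; as Minsky and Papert famously pointed out, a single perceptron cannot learn functions like XOR, which is part of why multi-layer networks became necessary.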
Over the next few decades, a dedicated community continued working on connectionist or neural network AI despite most funding and interest going to symbolic AI. Symbolic AI was producing results, and loud voices in that field (i.e. Marvin Minsky) had cast shade on the connectionists as if they were quacks. One notable connectionist development was a technique called backpropagation, developed throughout the 1970s and culminating in a paper by Rumelhart, Williams, and Hinton in 1986.
Though the term Machine Learning (ML) was coined in 1959 by Arthur Samuel (noting the term doesn’t appear in published work till the late 1960s), it was not until the 1990s that an important shift was made from knowledge-driven ML to data-driven ML. ML is a type of AI that uses probability theory. Generally, ML requires training data that has been labelled in order to provide predictions on unlabelled data. ML uses various levels of human supervision to recognise patterns in large datasets.
Deep Learning is a class of ML; the term was coined by Rina Dechter in 1986. Deep learning algorithms are usually based on artificial neural networks (ANNs), and the ‘deep’ refers to the number of layers in the ANN through which data is processed. LeCun, Bengio and Hinton published an important paper in the field of deep learning in 2015.
The 21st century has seen key developers including Geoffrey Hinton (University of Toronto, and Google), Yann LeCun (NYU and Facebook), Jeff Dean and Andrew Ng (Google Brain), and Yoshua Bengio (University of Montreal) make significant breakthroughs in neural networks. Some of the most exciting advancements in the last few years are in the field of language models. One of the first large language models, called BERT, was published by Google in 2018. Of course, many big tech companies have scrambled to catch up with and surpass Google in the language model race. In 2019 OpenAI, a company co-founded by Elon Musk, released their large language model (LLM) called GPT-2, then trumped themselves in 2020 with the release of GPT-3.
GPT-3 is a key focus of my work, having gained access to the model for research purposes in May 2021. GPT-3 is a language model AI designed on a transformer-based deep learning neural network architecture. At the time of its release in mid-2020 it was 10 times larger than the next largest language model, Microsoft’s Turing NLG. Training data for GPT-3 was essentially the Internet, Wikipedia, and other texts – inclusive of all the toxicity and prejudice contained in those sources. Its text generation capabilities are enormous, as can be enjoyed in the interview with the AI model below.
Of course, the story is far from over! This article from 13th June 2021 takes a look at the next successors to GPT-3. As of 1st June 2021, the world title is held by China and their Wu Dao 2.0 model.
Feeding the machine
AI is powered by data. The bigger the AI, the more data it requires. What that data may be is very broad. It can be simple and obvious such as figures on a person’s banking history or it can be more abstract such as when we click on social media posts that interest us. Data is not just the domain of the Sciences. Data is immensely important in social sciences, law, business, economics, and many other fields. We all produce data every day; researchers produce even more!
Data is rarely, if ever, neutral. Data is the product of abstraction decisions people make about which slice of all that goes on and exists to capture. Often data collection is flawed by societal bias (i.e. race and gender bias) baked in by research design. How datasets are created, selected, and abstracted from the larger reality can often be the result of unequal power structures in our societies.
AI systems analyse data and then make decisions or predictions based on computational functions (i.e. algorithms) that humans give the systems. Sometimes humans guide the analysis; in the case of deep learning, the AI agent learns the pattern recognition itself. Recognition of patterns in data is a key feature of AI systems. Usually, the most prized goal is high-quality predictions on new data presented to the AI agent. In essence, the goal of deep network AI is to scour data for patterns and clusters, using statistics to convert observations into generalised category descriptions.
In the case of standard supervised machine learning, the most common AI in 2021, pattern recognition is usually trained, or supervised. Humans label training data: for example, “this is a cat” or “this is a dog”. Of course, that is a very simplistic example, but the principle still stands. Contrary to shiny tech-utopian visions of Silicon Valley, much of this data labelling is done by humans in low-paid and poor conditions. These workers have been dubbed Ghost Workers (Gray, 2019) and represent the new tech version of the clothing and manufacturing sweatshops of the 20th century.
There are many organisations that arrange this kind of labour; perhaps one of the most infamous is Amazon’s Mechanical Turk. The Mechanical Turk operates as a gig-economy platform similar to Uber. A 2017 study of 2,676 workers on the Mechanical Turk showed the median wage of people engaged in this work to be $2 per hour, although this figure doesn’t account for the emotional toll, or likely PTSD, of viewing so many extreme images.
Amazon’s Mechanical Turk has also been the research resource for many published academic papers whose authors didn’t have the funds to otherwise complete their research more ethically. It is disturbing that not only are papers published that rely on ghost work, but that many academics laud the method without understanding the impact on both the workers and the quality of the research produced.
Much of the AI we use as of 2021 requires human input. Humans design the AI system, code the algorithms, create the data, decide on data abstraction boundaries, decide how to collect the data, tag the data, decide where and when to deploy the tool, and judge the outcomes. AI does not function in isolation from humans, but gains meaning and power from the way we deploy these tools in our sociotechnical systems.
Artificial Neural Networks
We have created Artificial Neural Networks (ANNs) in our own image – at least what we understand it to be! ANNs are modelled on our understanding of how the human brain works. Humans take in information from the external world, process it, reason about it, and then usually decide to output some kind of action. For example: my eyes tell me there is a big furry object coming towards me, from my memory this fits the description of a bear, bears sometimes eat people, I don’t want to be eaten so I need to get away, legs run! We modelled ANNs on the same principles.
From even the most fundamental building structures of ANNs, we have made them reflect how we model and process the perceptions of our world. As a human myself, this does seem eminently reasonable; however, I can’t help but think it is a very anthropocentric way of structuring an artificial reasoning model. For instance, I can imagine a superintelligence of ants that might model their ANNs on something more resembling a hive mind. The point is, we are reflecting our understanding of what we think sets us apart from the rest of the animal kingdom into our machine creations.
ANNs use multiple layers of processing units built on the structure of the artificial neuron to create connected levels of reasoning. Information/input/data comes into the network from the outside and is processed through numerous hidden layers by looping through sequences of probability weightings. Each connection is weighted, and the higher the weighting (from 0 to 1) the more influence it has on connected units. Eventually, the hidden layers produce an output that is (hopefully) usable. Additionally, similar to the human brain, which processes information hierarchically, each hidden layer in an ANN provides output information to the next, more senior, layer.
As mentioned above, ANNs often rely on training data: for instance, many pre-labelled images of “dogs”, usually labelled by humans. When an ANN comes up with a label it will compare its result to its training data, and if it is wrong it will use back-propagation to correct its weightings, going back through the multiple hidden layers.
Deep learning just refers to the number of layers in a neural net: any ANN with more than three layers is considered ‘deep’. Deep learning tends to require less human training than classical machine learning, and it can often look at unstructured data and identify new patterns itself (think AlphaGo). The cost is that deep learning requires a lot more data than classical machine learning.
Neural nets is a fast-moving and exciting field, and there is much speculation about what lies around the corner. As with any nascent field, there are many conflicting voices on which way the technology will trend and what uses it will have, though everyone seems to agree that the interest is growing. The major academic preprint repository arXiv has shown a six-fold increase in deep learning publications over five years (Takimoglu, 2021).
Some types of ANNs
There are many types of ANNs today. I will briefly touch on some of the most popular ANNs and those relevant to my research.
Convolutional neural networks (CNNs) are based on the principles of linear algebra to recognise patterns in images. These types of image processing ANNs were developed by Yann LeCun back in the 1980s. The network breaks down images into small pieces and analyses parts of those smaller images, for instance an edge or a shadow. Today, they are often used for facial surveillance technologies and computer vision, for instance in autonomous vehicles. They are also used in natural language processing (NLP).
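The core CNN operation of sliding a small kernel over an image to detect a local pattern can be sketched in plain Python. The tiny “image” and vertical-edge kernel below are invented for illustration:

```python
def convolve2d(image, kernel):
    """Valid-mode 2D convolution (strictly, cross-correlation, as in most CNN libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

# A 4x4 "image": dark left half (0), bright right half (1).
image = [[0, 0, 1, 1] for _ in range(4)]

# A classic vertical-edge kernel: responds where brightness changes left-to-right.
kernel = [[-1, 1],
          [-1, 1]]

print(convolve2d(image, kernel))
# → [[0, 2, 0], [0, 2, 0], [0, 2, 0]] — the strong middle column marks the edge
```

In a real CNN the kernel values are not hand-written like this; they are learned during training, so the network discovers for itself which edges, textures, and shapes matter.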
Recurrent neural networks (RNNs) rely heavily on feedback loops where each node is able to retain a memory from the previous step. It learns which predictions were wrong and improves itself. They are used for predictions about time-series data such as sales forecasts and text-to-speech applications.
Generative adversarial networks (GANs) are being used to create synthetic data. The synthetic data is then used for training which circumvents the need for humans to collect large amounts of real-world data. Ian Goodfellow pioneered GANs in 2014. Essentially, two neural nets play against each other to train so that new data can be ‘generated’. This approach has both pros and cons. Privacy can be better protected, but then we are using entirely artificial data to make models and predictions that may impact the very real world we live in. GANs have also given rise to deepfake images and videos.
Developed in 2017, transformer neural networks are particularly applied to natural language processing (NLP) tasks: though, it is likely this breakthrough technology will soon be applied to more tasks. They are a big step forward from CNNs and RNNs as they don’t just look for patterns but are able to identify (somewhat limited) context and generate relevant text. The shift from simply pattern recognition toward context recognition is a major milestone toward the development of Artificial General Intelligence (AGI).
To achieve context, this new technology uses something called “attention”, a concept that had previously been used in RNNs; however, researchers at Google (“Attention Is All You Need”, 2017) showed that transformer attention techniques alone were more efficient than RNNs with an attention component. The reason is that RNNs with attention, whilst good at analysing relationships between words in a sentence that are close together, start to struggle when key words are farther apart in the same sentence, or worse, in adjacent sentences. Attention uses mathematical modelling to provide weights for how strongly words in a sentence (or sentences) are connected.
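A single-head, illustrative sketch of that weighting, loosely following the scaled dot-product formulation from the Google paper (the three toy two-dimensional word vectors are invented):

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    d = len(keys[0])                      # key dimension, used for scaling
    outputs, all_weights = [], []
    for q in queries:
        # Compare this word's query with every word's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)         # one weight per word in the sequence
        all_weights.append(weights)
        # Blend the value vectors according to those weights.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs, all_weights

# Three toy word representations; in self-attention, Q = K = V.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
for row in weights:
    print([round(w, 2) for w in row])
```

Each row of weights says how strongly one word attends to every other word, and crucially the comparison is the same whether the words are adjacent or far apart, which is exactly where RNNs struggled.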
Very simplistically: CNNs are good at looking for patterns in images. RNNs are good at analysing linear streams such as text, but they get lost if the streams are too long. For a little while an approach called long short-term memory (LSTM) was used to patch these problems and increase feedback between neurons. LSTM improved algorithmic performance but only slightly improved text performance. Attention was added to RNNs to strengthen connections between words, but it was soon discovered that attention worked better on text without RNNs. Now, attention encodes the context of a datapoint and keeps that connected. You could say transformers look at the forest and the trees at the same time. The result is increased understanding and faster processing of text.
This type of NN looks both at what each word in a sequence means and at which words are more important. That means transformers perform better at understanding the relationships between words, as well as processing text faster, as they can focus on the more important parts.
A generative transformer model means that the AI can predict or “generate” the next word or “token” in a sentence or “sequence”. That means a generative model can complete a sentence or paragraph, or continue writing an essay or story, when given just a few opening lines.
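To make “generate the next token” concrete, here is a toy model that has only learned bigram statistics from a tiny invented corpus and extends a prompt by repeatedly predicting the most likely next word. Real LLMs do conceptually the same thing, just with vastly richer context and probability estimates:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat sat on the rug".split()

# Count which word follows which in the training text.
bigrams = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    bigrams[a][b] += 1

def generate(prompt, steps):
    """Extend the prompt by repeatedly taking the most likely next word."""
    tokens = prompt.split()
    for _ in range(steps):
        nxt = bigrams[tokens[-1]].most_common(1)
        if not nxt:          # nothing ever followed this word in training
            break
        tokens.append(nxt[0][0])
    return " ".join(tokens)

print(generate("the cat", 3))  # → "the cat sat on the"
```

The toy model can only parrot its training statistics; the leap transformers make is predicting the next token from the full weighted context of the sequence rather than just the previous word.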
Whilst my work explores the need for current Transformer NN models to greatly improve their ability to understand context in natural language, this technology is revolutionary. Google’s DeepMind division has been using the technology to model how amino acids fold to form proteins.
“‘It will change everything’: DeepMind’s AI makes gigantic leap in solving protein structures.”
– Callaway, 30 November 2020, Nature
Due to the enormous size and training requirements of transformer models, the technology remains inaccessible to the vast majority of the world. Thus, some large tech companies have created ‘pre-trained transformer models’. Though, even those are often restricted to a select few, or are limited in the amount of text they can handle and process at any one time. Additionally, there is significant bias built into these models by the nature of the large training data they are fed and how these models work; this is a core focus of my research. Natural language is messy. Human culture and meaning even more so. Further, the meanings and contexts humans ascribe to words change by group, nation, language, and time. That is not to suggest we shouldn’t take advantage of this amazing technology, just that we should do so with eyes wide open.
In a language model AI, words are encoded as numbers. Not in a simple way such as apple=1, bravo=2, charlie=3; that would be far too cumbersome and prohibit contextual understanding. Words are given vectors based on things they are related to: ‘king’, for example, is related to male, monarch, and ruler. Word vectors typically have hundreds of dimensions. Before transformers, words were encoded into vectors then handled by RNNs. The attention and context given by transformers helps AI language models deal with polysemes, i.e. all the different things that “it” could mean in a sentence.
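A hand-made sketch of the idea, with invented three-dimensional vectors (real word vectors are learned from text, not written by hand, and have hundreds of dimensions). The classic demonstration is that vector arithmetic captures relationships: king − man + woman lands nearest to queen:

```python
import math

# Invented dimensions, roughly: [royalty, maleness, femaleness].
vectors = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.1, 0.1],
}

def cosine(a, b):
    """Similarity of direction between two vectors (1.0 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# king - man + woman, computed component-wise.
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

best = max((word for word in vectors if word not in ("king", "man", "woman")),
           key=lambda word: cosine(target, vectors[word]))
print(best)  # → queen
```

Because I have hand-crafted the vectors, the arithmetic works out exactly; in a real learned embedding space the analogy holds only approximately, but the principle of meaning-as-direction is the same.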
A large language model (LLM) is a deep learning AI model that is trained on a very large corpus of text. A language model uses statistics to predict the next word in a sentence or next token in a sequence. LLMs use transformer AI technology and do not require human data labelling.
BERT – Google
BERT was published in 2018 by Jacob Devlin and his team at Google. It was pre-trained using a book corpus of 800 million words and an English Wikipedia corpus of 2,400 million words. BERT has been primarily used for search engines and SEO website tagging. BERT stands as an important step along the transformer-driven LLM path to improve contextual understanding of AI models.
GPT – OpenAI
GPT-3 was launched in July 2020 by OpenAI, a company initially set up by Elon Musk and Sam Altman in 2015 and now governed by a board that includes Greg Brockman (Chairman & CTO), Ilya Sutskever (Chief Scientist), and Sam Altman (CEO). In 2019 Microsoft invested $1 billion in the company, and in September 2020 Microsoft gained exclusive licensing rights over GPT-3.
“OpenAI’s mission is to ensure that artificial general intelligence (AGI)—by which we mean highly autonomous systems that outperform humans at most economically valuable work—benefits all of humanity.”
– OpenAI
GPT-3 has 175 billion parameters in its model and was trained on 570 GB of text including most of the internet, English-language Wikipedia, and two large corpuses of books (93% in English). GPT-3 can perform tasks it was not trained to do and exhibits meta-learning with few-shot examples. That is, it can deduce the linked meaning or intent between pairs of words.
Due to safety concerns of how this model could be used, OpenAI has restricted access at this stage. I was granted access to the model in early 2021 and the model has become a key site of my research.
Whilst the initial release of GPT-3 was accompanied by a paper titled “Language Models are Few-Shot Learners”, the model has also shown excellent capabilities with zero-shot prompting. That is, instead of providing a few examples to learn from, a well-designed prompt alone can achieve excellent results.
By using tropes that humans often use, such as bullet points, colons, Q&A formats, or even well known story openings, GPT-3 can be given the space to create.
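Prompt design is, in the end, careful text construction. Below is a sketch of the same sentiment task framed zero-shot and few-shot; the call to the model itself (e.g. via the OpenAI API) is omitted since access and endpoints vary, and the example reviews are invented:

```python
# Zero-shot: just an instruction and the trailing "Sentiment:" trope
# that invites a one-word completion.
zero_shot = (
    "Classify the sentiment of the following review as Positive or Negative.\n"
    "Review: The plot dragged, but the acting was superb.\n"
    "Sentiment:"
)

# Few-shot: the same task, but with two worked examples first so the
# model can deduce the pattern from the pairs.
few_shot = (
    "Review: I loved every minute of it.\n"
    "Sentiment: Positive\n\n"
    "Review: A complete waste of time.\n"
    "Sentiment: Negative\n\n"
    "Review: The plot dragged, but the acting was superb.\n"
    "Sentiment:"
)

print(few_shot.count("Review:"))  # → 3 (two examples plus the query)
```

Both prompts end mid-pattern, which is the point: the model is given the space to create by completing the structure the prompt has set up.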
For more GPT-3 generated text head to my blog Chats with GPT-3.
Other uses for Transformers
The usefulness of context afforded by attention in transformers can be applied to a wide variety of applications, as is currently being discovered. Transformer technology is being used on sequential data such as: code, proteins in a molecule, genes, playlists, and online behaviours. They can also pick up anomalies in datasets without needing to be trained, for instance in medical imaging or fraud detection.
Another model created by OpenAI, DALL-E, creates images from text descriptions. It uses natural language prompts just as GPT-3 does.
OpenAI also created CLIP, a model that produces a text description for an image.
The BigScience project hosted by Hugging Face is another group I am involved in. Hugging Face is an open-source provider of natural language processing (NLP) technologies, and BigScience is one project run on Hugging Face. The BigScience project is “A one-year long research workshop on large language models: the Summer of Language Models 21”.
The project aims to create, share, and evaluate a very large multilingual dataset and a very large language model as tools for research. The workshop aims to “foster discussions and reflections around the research questions surrounding large language models (capabilities, limitations, potential improvements, bias, ethics, environmental impact, role in the general AI/cognitive research landscape) as well as the challenges around creating and sharing such models and datasets for research purposes and among the research community.” The BigScience project is working on creating an LLM with 210 billion parameters and a more diverse training dataset.
The workshop is organised as a collection of working groups that connect with each other. One of the founders, Thomas Wolf, says he was inspired by the format of CERN and the Large Hadron Collider. There are 500 researchers working on this project and I am very proud to be one of them.
Natural Language Programming
Natural Language Programming (not to be confused with natural language processing) is a way of programming a computer in a natural language such as English. This has not been possible until the advent of transformer technology. The technique is also called prompt programming, prompt design, and even Software 3.0 (traditional coding being software 1.0 and neural networks software 2.0).
The technique is so nascent that most text written about it can only be found on blogs or in communities of engineers and enthusiasts working with large language models. Here are some resources:
- Methods of prompt programming
- Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm
- Software 3.0 — How Prompting Will Change the Rules of the Game
- Prompts As Programming
- The Power of Scale for Parameter-Efficient Prompt Tuning
- The Art to Start (Designing Prompts for GPT-3).
It is how we design prompts and how that interacts with the training data we have fed into LLMs that most interests me. By exploring the human memeplexes that can be uncovered in the training data using carefully designed prompts I believe we can use these LLMs to reflect back to us our hidden biases and perspectives.
It is at the intersection between personal biases encoded into prompts and the cultural biases encoded into training data that I believe we can find useful answers to some of the ethical problems of using LLMs. Additionally, I believe that we can use LLMs in this way to better understand ourselves, both as individuals and as societies and cultures.