We might not be able to control ChatGPT, but we can control the data it consumes
The UK should lead the way on regulating generative AI, and it should do so by focusing on the data that systems like ChatGPT consume. After all, if the data is biased, so is the final product, writes Sir Nigel Shadbolt
The rise of generative AI has been the story of the year in the technology world. ChatGPT has received the lion’s share of media attention, with Google Bard, Stable Diffusion and DALL-E close behind. These systems generate new outputs based on the data they have been trained on. Traditional AI is designed to recognise patterns and make predictions; generative AI creates new content in the form of text, code, images and audio.
These systems have exceeded expectations because of the sheer scale of the data they are trained on and the size of the resulting models. They’re often described as large language models or foundation models. These and other AI tools have the potential to increase global GDP by 14 per cent – $15.7tn – by 2030, according to PwC.
In the UK, the AI sector generated £3.7bn for the economy in 2022. In fact, the UK is home to one-third of Europe’s AI businesses.
When the power of these AI systems first became a major news story, there was shock and awe at their processing abilities, scale and speed. This was quickly followed by worries about how they might change the future of work as we know it. Then came concern about how easily we might be fooled by these AIs and the misinformation they could produce. All of this was neatly summed up by an AI-generated image of the Pope in a puffa jacket.
But data drives all of these AI systems. All of modern machine learning depends on vast amounts of data. Technology will undoubtedly advance, but data remains essential as we engineer more capable AI. Building open and trustworthy data ecosystems becomes increasingly important, an arena in which organisations like the Open Data Institute have a crucial role to play. After all, the economic potential of AI shows that much more is at stake than amusing images or homework shortcuts. We need to get the data end of this technology right or risk losing out on its potential.
Sometimes glaring errors or seeming falsehoods emerge even from systems like ChatGPT that have been refined with supervised fine-tuning and reinforcement learning from human feedback. If we rely on these tools to assist with matters as serious as medical procedures or building design, AIs need to work with facts rather than fiction. Yet obtaining, curating and processing data efficiently and ethically is expensive, and those costs create an uneven playing field between large tech companies and small developers.
We need more openness and understanding around these data sources. This also points to the need for data skills and knowledge across the population, so that people can understand where AI is at work, the data it is working with and how best to challenge what it produces. We know that existing datasets can contain implicit biases – be they based on gender, ethnicity, sexuality or geography – which means that systems trained on them can only reproduce and reinforce those biases.
Well-curated open data has an important role here too. Good quality data matters: Wikipedia, for example, makes up only a few per cent of the total data ingested by the largest language models, yet it has an outsized impact on the quality of their output. Open data is the best foundation for transparency, accountability and understanding.
We must ensure that communities and countries are not left behind or harmed by data’s inequitable use and the technologies it drives. This means being aware of – and addressing – power asymmetries, to give the under-represented a voice rather than allowing bias to erode it.
The UK government is aware of these issues and has backed Sir Patrick Vallance’s idea of a regulatory “sandbox” to develop new rules for working with these fast-moving technologies. How this sandbox will come to fruition remains to be seen.
When it comes, regulation will need to be coherent and cohesive: centrally resourced in the UK and aligned globally, to avoid duplication or loopholes. Arguments over regulation will continue to range from those who shout “too little, too late” to those who believe the technology should be allowed to develop unfettered.
The government has set out its key principles for the responsible development of AI: safety, transparency, fairness, accountability and contestability. It is not clear, however, what these principles mean in practice for the sector.
Our politicians have been quick to consider the economic and technological potential of AI. But the proof of their intention to capture the value of this technology and mitigate its harmful effects will lie in how seriously they are willing to back AI regulation, together with a robust, supporting data infrastructure.
The future of AI may look unnerving and uncertain when most of the focus is on the risks rather than the rewards. But there are enormous opportunities. To realise this promise, the data that powers our AI must be central to the conversation.