Artificial intelligence (AI) is advancing at an incredible pace, with recent developments such as ChatGPT and Bard quickly gaining popularity. Companies like Google and Microsoft are integrating generative AI into their products, while world leaders are embracing AI as a tool for economic growth. However, as we move beyond generic chatbots, AI systems are likely to become more specialized.
The key to improving AI chatbots lies in the data they are exposed to during training. Typically, AI systems absorb vast amounts of information from books and web pages to improve their ability to mimic human speech and provide useful answers. However, a more focused set of training data could make AI chatbots even more valuable for specific industries or geographical areas.
Data has immense value in the AI field. Companies like Meta and Google earn billions by selling advertisements targeted with user data, but the way data creates value is changing. Text that has little worth for targeting advertisements can still be crucial for developers like OpenAI, which aims to create AI models that can produce human-like language. For instance, billions of tweets, blog posts, and Wikipedia entries help train advanced language models like GPT-4.
As the demand for sophisticated AI models grows, accumulating training data becomes increasingly costly. Organizations like OpenAI, Meta, and Google have invested in AI research and development for years to harness their data resources. Companies like X (formerly Twitter) and Reddit have started charging third parties for API access, partly to offset the computing costs that large-scale data scraping imposes on their platforms.
Synthetic data presents a potential solution to the increasing costs of data acquisition. Synthetic data is generated by AI systems and is designed to serve the same purpose as real training data, but obstacles stand in the way of its effectiveness. Synthetic data needs to differ enough from the original data to teach a model something new, while still being accurate. And repeatedly training AI on its own synthetic output can degrade quality over successive generations, much as inbreeding weakened the Habsburg royal family.
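To see why this degradation can happen, consider a minimal numerical sketch (an illustration of the general effect, not a description of any particular system): a very simple statistical model is fitted to some data, a synthetic dataset is sampled from it, the model is refitted on that synthetic data, and the loop repeats. The Gaussian below merely stands in for a language model, and the sample sizes are arbitrary.

```python
# Toy sketch of training on successive generations of synthetic data.
# A Gaussian stands in for a language model; all numbers are illustrative.
import random
import statistics

random.seed(42)

def fit(data):
    """'Train' the model: estimate the mean and standard deviation."""
    return statistics.mean(data), statistics.pstdev(data)

def generate(mean, std, n):
    """Produce synthetic data by sampling from the fitted model."""
    return [random.gauss(mean, std) for _ in range(n)]

real_data = [random.gauss(0.0, 1.0) for _ in range(50)]  # the "real" training set
data = real_data
for generation in range(10):
    mean, std = fit(data)
    print(f"generation {generation:2d}: mean={mean:+.3f}  std={std:.3f}")
    data = generate(mean, std, n=50)  # the next generation sees only synthetic data
```

Run for enough generations, the fitted mean and spread tend to drift away from the original data, because every round of synthetic sampling adds noise and loses a little information. The analogous worry for language models is a gradual narrowing and distortion of what they can say.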
Currently, chatbots like ChatGPT rely on reinforcement learning from human feedback (RLHF) to improve accuracy: human reviewers rate or rank the model's responses, and the model is tuned toward the answers people prefer. However, if AI systems are trained on synthetic data containing inaccuracies, the demand for human feedback to correct those errors will grow. Technical or specialized inaccuracies are less likely to be caught through RLHF, since the reviewers are rarely domain experts, potentially leading to a decline in the quality of general-purpose language models.
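For a more concrete picture of the reward-modeling step at the heart of RLHF, the toy sketch below trains a tiny reward model from pairwise human preferences, so that answers people preferred score higher than answers they rejected. The hand-written features, example answers, and training loop are simplifications invented for illustration, not how ChatGPT is actually built.

```python
# Toy reward model trained from pairwise human preferences (a Bradley-Terry
# style setup). Features, data, and hyperparameters are all invented.
import math

def features(response):
    """Hypothetical hand-crafted features standing in for a neural encoder."""
    words = response.split()
    return [
        len(words) / 20.0,                    # response length
        sum(w.endswith("?") for w in words),  # hedging questions
        float("sorry" in response.lower()),   # apologies
    ]

def reward(weights, response):
    return sum(w * x for w, x in zip(weights, features(response)))

def train_reward_model(preferences, lr=0.1, epochs=200):
    """Nudge the weights so each preferred response outscores its rejected pair."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for chosen, rejected in preferences:
            diff = reward(weights, chosen) - reward(weights, rejected)
            # Update strength: large when the pair is ranked wrongly, near zero
            # once the preferred answer already scores clearly higher.
            grad_scale = 1.0 - 1.0 / (1.0 + math.exp(-diff))
            fc, fr = features(chosen), features(rejected)
            weights = [w + lr * grad_scale * (c - r)
                       for w, c, r in zip(weights, fc, fr)]
    return weights

# Invented human judgments: (preferred answer, rejected answer).
preferences = [
    ("Paris is the capital of France.", "Sorry, I am not sure what you mean?"),
    ("The treaty was signed in 1648.", "Maybe it was signed at some point? Sorry."),
]

weights = train_reward_model(preferences)
print(reward(weights, "Berlin is the capital of Germany."))      # scores higher
print(reward(weights, "Sorry, could you clarify the question?"))  # scores lower
```

In production systems the reward model is a large neural network, and the chatbot is then tuned with reinforcement learning to produce responses that score highly on it, which is why the quality and expertise of the human judgments matter so much.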
These challenges have resulted in emerging trends in AI. Third parties can now recreate large language models such as GPT-3 or Google’s LaMDA, which allows organizations to build their own AI systems using specialized data for specific objectives. For instance, the Japanese government plans to develop a Japan-centric version of ChatGPT, and companies like SAP are offering AI development capabilities to professional organizations.
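As a sketch of what this "build your own" route can look like in practice, the snippet below fine-tunes a small open language model on an organization's own documents using the widely used Hugging Face libraries. The file name, base model, and hyperparameters are placeholders; a real project would add data cleaning, evaluation, and safety checks.

```python
# Hedged sketch: fine-tune a small open model on an in-house text corpus.
# "company_docs.txt" is a hypothetical file with one document per line.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

base_model = "distilgpt2"                  # small open model as a starting point
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 style models have no pad token
model = AutoModelForCausalLM.from_pretrained(base_model)

dataset = load_dataset("text", data_files={"train": "company_docs.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="little-model", num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("little-model")
```

The same recipe scales down gracefully: a modest open model plus a carefully curated in-house corpus is essentially the kind of little language model discussed next.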
The future of AI may lie in many specific little language models rather than a handful of large ones. Little language models could be developed for specific purposes and benefit from valuable feedback from employees with expert knowledge of their organizations and objectives. While they would have less training data to draw on, that focused expert feedback could compensate for the limitation.
As AI continues to evolve, we may see a shift in the business model for data-rich organizations. Rather than relying solely on generic language models, companies and organizations are exploring the potential of specialized AI systems tailored to their needs. The development of little language models could be the next step in the AI revolution, driven by the need for specific solutions and the challenges faced by generic models.
In summary, AI is rapidly advancing, and specialized chatbots are likely to replace generic ones in the future. The value of data is changing, with companies investing in AI research and development to capitalize on their data resources. Synthetic data presents a potential solution to the growing costs of data acquisition, although it comes with its own challenges. As AI models become more specialized, the future of AI may involve many specific little language models, tailored to individual organizations and purposes.