The vast training sets that gobble up troves of data from the open web to inspire AI models may become smaller and more curated, bypassing copyright issues at the center of the legal debate over tools like ChatGPT.

OpenAI CEO Sam Altman says he believes future AI models will use less training data but of higher quality.

"One thing that I expect to start changing is these models will be able to take small amounts of higher quality data during their training process and think harder about it and learn more," Altman said in a panel interview at the World Economic Forum in Davos on Thursday.

Current generative AI models are trained on huge datasets. And since the data comes from the internet, the material includes copyrighted works from creators who never consented to their content being part of the training set. Some major publishers have agreed to license their content to AI companies. But others, including the New York Times, have sued AI companies for copyright infringement.

How courts will apply legacy copyright law to new AI technologies is perhaps the biggest challenge facing businesses like OpenAI.

Altman said during the interview that "any one particular training source doesn't move the needle for us that much." For OpenAI and others, a model in which publishers opt out of the training process, excluding their content from development, is one way for creators to protect themselves.

But Altman also envisions a new paradigm of smaller pools of data, one in which creators are compensated for helping to inspire AI tools. "We won't need the same massive amounts of training data," he said. "But what we want in any case is to find new economic models that work for the whole world, including content owners."

He added: "If you teach our models, if you help provide the human feedback, I'd love to find new models for you to get paid based off the success of that."