29 October 2024

Large language models are hard to trust

Can AI tools in the form of large language models be trained so that they become useful for Swedish defence? This question is being investigated by researchers at FOI. One difficulty is that there is no established method for evaluating how well they perform.

Large language models are incredibly demanding in terms of hardware resources. Image: Shutterstock/Boy Anthony.

A large language model (LLM) is an advanced AI system that uses deep learning and neural networks to learn a language’s patterns. The model is trained on large amounts of text data and can then calculate the probability of different word sequences. In this way, it can perform text-based tasks, such as analysing texts or generating its own, in a seemingly human-like manner.
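
As a concrete illustration of what “calculating the probability of different word sequences” means in practice, here is a minimal sketch using the Hugging Face transformers library with the small GPT-2 checkpoint as a stand-in; the models discussed later (Llama and Mistral) are queried in the same way but are far larger.

```python
# A minimal sketch: a trained causal language model assigns a probability to every
# possible next token. GPT-2 is used here purely as a small, openly available
# stand-in for the much larger models discussed in the article.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The Swedish Armed Forces are responsible for"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]      # scores for every candidate next token
probs = torch.softmax(logits, dim=-1)

# Print the five most probable continuations and their probabilities
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item())!r}: {p.item():.3f}")
```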

LLMs are considered a breakthrough in artificial intelligence. A research group at FOI, working on behalf of the Swedish Armed Forces, has investigated how LLMs can be trained to function within a Swedish defence domain and whether the necessary investments would be justified.

“There’s been an incredible amount of hype around language models. ChatGPT has become hugely popular, and now language models are showing up everywhere in the media. These systems seem capable of doing much of what AI has promised for decades, things that were pure science fiction just a few years ago. Everyone wants to see how they can be used,” says Björn Pelzer, senior scientist in FOI’s Cyber Defence and C2 Technology Division and one of the authors of the report, Large Language Models in Defense: Challenges and Opportunities.

Trained on FOI’s memos

The researchers tested how an LLM could assist a security-policy analyst by summarising text, answering questions on security policy, and generating a text based on a list of key phrases. They used two different language models: the Llama model from the US company Meta and Mistral from the French firm Mistral AI. By using two different models, the researchers aimed to see if there were any notable differences between them.
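
The report itself contains no code, but the three tasks can be pictured as prompt templates sent to each model in turn. The sketch below assumes the Hugging Face transformers library and illustrative instruct checkpoints; the article does not name the exact Llama and Mistral variants FOI used.

```python
# A minimal sketch of posing the three evaluation tasks to both models.
# The checkpoint names are assumptions, and the Llama checkpoint is gated on Hugging Face.
from transformers import AutoTokenizer, pipeline

MODELS = [
    "meta-llama/Llama-3.1-8B-Instruct",    # assumed Llama variant
    "mistralai/Mistral-7B-Instruct-v0.3",  # assumed Mistral variant
]

TASKS = {
    "summarise": "Summarise the following memo in three sentences:\n\n{doc}",
    "question":  "Based on the memo below, answer the question: {question}\n\n{doc}",
    "keywords":  "Write a short security-policy text covering these key phrases: {phrases}",
}

def run_task(model_name: str, prompt: str) -> str:
    """Send one task prompt to one model and return its generated answer."""
    tok = AutoTokenizer.from_pretrained(model_name)
    gen = pipeline("text-generation", model=model_name, tokenizer=tok, device_map="auto")
    chat = tok.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False, add_generation_prompt=True,
    )
    out = gen(chat, max_new_tokens=300, do_sample=False, return_full_text=False)
    return out[0]["generated_text"]

# Example: compare how both models summarise the same memo
# memo = open("memos/memo_001.txt", encoding="utf-8").read()
# for name in MODELS:
#     print(name, run_task(name, TASKS["summarise"].format(doc=memo)))
```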

“There’s so much happening right now, with new models coming out at least once a month, so we wanted to look at more than just one. The report isn’t a recommendation, but more of an attempt to see how you can even compare models and figure out which ones are better than others,” says Björn Pelzer.

The models were trained on 244 memos about security policy issued by FOI between 2010 and 2023.

“We wanted to train a language model on defence-related data to see how difficult it would be and whether there’s any benefit to doing so. So we used an example where we had relatively good data available. Our memos have a fairly consistent format: they always follow roughly the same structure, cover a similar topic area, and aren’t very long, which is helpful because language models have a limit to how much text they can process at once,” Björn Pelzer explains.
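
To give a sense of what such training can look like in practice, the sketch below fine-tunes a base model on a folder of cleaned memo texts using LoRA adapters, which keeps the hardware demands modest. This is an illustration, not FOI’s actual setup: the base checkpoint, file paths and hyperparameters are assumptions.

```python
# A minimal LoRA fine-tuning sketch on a folder of cleaned memo texts.
# Base checkpoint, paths and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE = "mistralai/Mistral-7B-v0.3"   # assumed base model
tok = AutoTokenizer.from_pretrained(BASE)
tok.pad_token = tok.eos_token

# One cleaned memo per .txt file (see the extraction step discussed further down)
data = load_dataset("text", data_files={"train": "memos/*.txt"}, sample_by="document")
data = data.map(lambda b: tok(b["text"], truncation=True, max_length=2048),
                batched=True, remove_columns=["text"])

model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")
model = get_peft_model(model, LoraConfig(task_type=TaskType.CAUSAL_LM, r=16, lora_alpha=32))

trainer = Trainer(
    model=model,
    args=TrainingArguments("memo-lora", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=3,
                           learning_rate=2e-4, logging_steps=10),
    train_dataset=data["train"],
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
model.save_pretrained("memo-lora")   # only the small adapter weights are saved
```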

The models hallucinate

The training texts were in both Swedish and English. Llama and Mistral are primarily trained in English, and the researchers wanted the LLMs to also write well in Swedish.

The experiment led to several conclusions. One was that the models did indeed improve their Swedish when trained on Swedish texts, according to Björn Pelzer.

“And they got better at writing consistently in Swedish. Untrained models can manage some Swedish, but after a while they switch to English. The trained models were better at sticking to Swedish and picking up the memo style.”

The other major conclusion was that comparing LLMs is extremely difficult.

“We probably need to develop a common defence benchmark, an objective way to test all these new models that keep coming out. There are similar benchmarks, but they’re for general systems, not specifically for defence-related issues,” says Björn Pelzer.

At present, it would basically require experts to ask the models a set of standardised questions and then assess their answers, he adds. It’s hard to trust the models.
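
In code, such a procedure could be as simple as a fixed question set and a loop that records an expert’s grade for each model’s answer. The questions and the 0–5 grading scale below are placeholders, not an existing defence benchmark.

```python
# A minimal sketch of a shared benchmark: standardised questions put to every
# candidate model, with each answer graded manually by a domain expert.
# The questions and grading scale are placeholders.
import json
from statistics import mean

QUESTIONS = [
    {"id": "q1", "prompt": "Summarise Sweden's current security-policy priorities in two sentences."},
    {"id": "q2", "prompt": "Which international organisations shape Sweden's defence cooperation?"},
]

def evaluate(model_name: str, ask) -> float:
    """Run every benchmark question through one model; ask(prompt) wraps the model under test."""
    grades = []
    for q in QUESTIONS:
        answer = ask(q["prompt"])
        print(json.dumps({"model": model_name, "id": q["id"], "answer": answer}, ensure_ascii=False))
        grades.append(float(input("Expert grade (0-5): ")))   # manual assessment, as described above
    return mean(grades)

# Example: evaluate("some-model", ask=lambda prompt: "model answer here")
```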

“They can do things that were completely impossible just a few years ago, but they also often make very silly mistakes. I don’t think you can say they understand what they’re writing. They string together words based on probabilities, but it can turn out completely wrong. This phenomenon is called hallucinating, and it’s unclear whether it can actually be fixed.”

“For instance, models might sometimes give a good answer to a question, but if the same question is rephrased just slightly, the answer can suddenly be completely wrong.”

Commercial interests driving the hype

Another conclusion is that gathering data to train LLMs is quite demanding. The FOI memos used were in PDF format, and for each memo, the text had to be extracted and manually cleaned of unnecessary elements such as page numbers and tables.
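
A rough idea of what that extraction step can look like, assuming the pypdf library: the automated rule below only drops bare page numbers, whereas the memos also required manual cleaning of tables and other layout elements.

```python
# A rough sketch of extracting text from the memo PDFs and stripping obvious noise.
# Only bare page numbers are filtered automatically; other cleaning was done by hand.
import re
from pathlib import Path

from pypdf import PdfReader

def extract_memo(pdf_path: Path) -> str:
    """Pull the raw text out of one memo PDF and drop lines that are only a page number."""
    kept = []
    for page in PdfReader(pdf_path).pages:
        for line in (page.extract_text() or "").splitlines():
            if re.fullmatch(r"\s*\d+\s*", line):   # bare page number
                continue
            if line.strip():
                kept.append(line.strip())
    return "\n".join(kept)

Path("memos").mkdir(exist_ok=True)
for pdf in Path("memos_pdf").glob("*.pdf"):
    Path("memos", pdf.stem + ".txt").write_text(extract_memo(pdf), encoding="utf-8")
```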

“At first, we thought we’d quickly train our models and then properly test them. But it turned out to be much harder than we expected to get the data the way we wanted. Another issue is how incredibly demanding they are in terms of hardware resources. We were working with relatively small language models. The larger the models, the better they get, but also the harder they are to actually manage,” says Björn Pelzer.
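
To give a sense of scale for the hardware point (a back-of-envelope illustration, not figures from the report): merely holding a model’s weights in 16-bit precision takes roughly two bytes per parameter, before any working memory for inference or training is counted.

```python
# Back-of-envelope memory estimate: ~2 bytes per parameter for 16-bit weights.
# Illustrative only; actual requirements depend on precision, context length, etc.
def weight_memory_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    return params_billion * 1e9 * bytes_per_param / 1024**3

for size in (7, 70, 400):
    print(f"{size:>4}B parameters ~ {weight_memory_gb(size):6.0f} GB of weights at 16-bit")
```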

Is it worthwhile for the Swedish Armed Forces to invest in the use of LLMs? That’s a complex question, according to Björn Pelzer. Due to their rapid development and the likelihood that potential adversaries are also working with them, it’s important to keep up. On the other hand, it’s crucial to remember that significant commercial interests are behind the hype surrounding large language models, he points out.

“One shouldn’t be too confident that things will turn out as well as these companies claim. The idea that we’ll have computers smarter than humans in just a few years — I’m very sceptical about that. Personally, I don’t believe language models will achieve that.”