Powerful large language models (LLMs) now play essential roles in many real-world applications. But as humans become increasingly dependent on LLMs, some are questioning whether or to what extent we can trust them to deliver the “truth.”
In the new paper Discovering Latent Knowledge in Language Models Without Supervision, a research team from UC Berkeley and Peking University presents Contrast-Consistent Search (CCS), an unsupervised approach for discovering latent knowledge in language models.
The research team argues there are several ways conventional LLMs can become “misaligned with the truth.” A model trained via imitation learning may simply reproduce its human demonstrators’ misconceptions and errors; a model whose outputs are rated by humans (reward optimization) may learn to produce text that is coherent and compelling while containing errors the human raters cannot detect.
To circumvent these issues, the team turns away from explicit truth labels and instead targets the implicit, internal “beliefs” or “knowledge” a model has learned. CCS is designed to detect and reveal this knowledge directly from the model’s internal representations.
CCS aims to find a direction in activation space that satisfies logical consistency properties — in particular, that a statement and its negation should have opposite truth values. The CCS workflow comprises four steps: 1) Answer each yes–no question both ways to form contrasting completions, 2) Compute the model’s representation of each answer, 3) Map the answer representations to probabilities of being true, and 4) Optimize that mapping so the probabilities are both consistent and confident.
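The four steps above can be sketched in code. This is a minimal illustration under stated assumptions, not the authors’ implementation: the “hidden states” here are random stand-ins for the LLM activations the paper uses, the probe and function names are my own, and the finite-difference optimizer merely keeps the sketch dependency-free where the paper would use standard autodiff.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ccs_objective(p_pos, p_neg):
    # Consistency: a statement and its negation should receive
    # probabilities that sum to 1 (opposite truth values).
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalize the degenerate "always output 0.5" solution.
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))

def probe_loss(params, h_pos, h_neg):
    # Linear probe mapping a hidden state to a probability of "true".
    w, b = params[:-1], params[-1]
    return ccs_objective(sigmoid(h_pos @ w + b), sigmoid(h_neg @ w + b))

# Toy stand-ins for the hidden states of the "yes"/"no" completions
# (illustrative only; real CCS extracts these from a language model).
dim = 8
h_pos = rng.normal(size=(64, dim))
h_neg = rng.normal(size=(64, dim))

params = rng.normal(scale=0.1, size=dim + 1)
initial_loss = probe_loss(params, h_pos, h_neg)

lr, eps = 0.2, 1e-5
for _ in range(150):
    # Finite-difference gradient descent on the CCS objective.
    base = probe_loss(params, h_pos, h_neg)
    grad = np.array([
        (probe_loss(params + eps * np.eye(dim + 1)[i], h_pos, h_neg) - base) / eps
        for i in range(dim + 1)
    ])
    params -= lr * grad

final_loss = probe_loss(params, h_pos, h_neg)
```

At inference time, the paper scores a statement by averaging the probe’s probability for the “yes” completion with one minus its probability for the “no” completion, which the consistency term drives toward agreement.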
In their empirical study, the team evaluated CCS across six models and ten question-answering datasets. The results show that CCS surpasses strong zero-shot baselines by an average of four percent and can cut prompt sensitivity in half while maintaining high accuracy — even when models are prompted to produce incorrect answers.
This work demonstrates the potential of unsupervised approaches for mitigating untruthful LLM outputs. The team sees their method as an initial step toward discovering latent knowledge when explicit ground-truth labels are unavailable.
The paper Discovering Latent Knowledge in Language Models Without Supervision is on arXiv.
Author: Hecate He | Editor: Michael Sarazen