Why it's impossible to review AI models, and why E.S News is doing it anyway

The artificial intelligence (AI) landscape is growing rapidly, with new models unveiled virtually every week. That pace brings real challenges, not least the challenge of evaluating the models themselves. Existing assessment frameworks struggle to keep up with the scale of AI systems and the speed at which they change.

Poorly documented and constantly updated, these systems are hard to judge thoroughly or consistently. Synthetic benchmarks offer, at best, an abstract snapshot of a few well-defined capabilities. AI companies such as Google and OpenAI benefit from this opacity: it leaves consumers with little choice but to take the companies' claims at face value.

AI models are too numerous, too vast, and too opaque. New releases arrive so frequently that any serious assessment of their merits and weaknesses is a challenge in itself. Each comes with its own tangle of release tiers, access requirements, platforms, codebases, and more.

Moreover, these models are not simply pieces of software or hardware that you can test quickly, like a device or a cloud service. They are platforms, comprising dozens of individual models and services built into or layered on top of them. As a result, evaluating these systems calls for qualitative assessment, which is all the more valuable to consumers trying to separate the true from the false in such a rich and fast-changing landscape.

Large companies keep their training methods and datasets as trade secrets, and without visibility into those processes it is difficult to evaluate the models objectively. Companies offer reassuring statements, but they never actually invite us to peek behind the curtain.

The wide variety of tasks that an AI system may be asked to perform, including those that its creators did not anticipate, makes exhaustive testing impossible. Additionally, what can be tested, by whom and how, is constantly evolving. The field is chaotic, to put it mildly, but someone still has to act as arbiter.


At E.S News, where the avalanche of AI nonsense lands in our inboxes every day, we have decided to review some of these AI models ourselves. Consumers simply can't take the big companies at their word: they are selling a product, or packaging you up to become one, and they will do and say anything to obscure that fact. So we decided to run our own tests on the major models and build that practical experience firsthand.

Identifying a range of qualities that users might find important, we use a series of tests to get a general sense of an AI's capabilities. We test everything from how a model handles an evolving news story, to how it responds to a request for medical advice, to how it writes a specific product description, and more. We then share our experience so you can see how the models actually perform, not just what their benchmark scores say.
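
To make this concrete, here is a minimal sketch, in Python, of what such a prompt battery might look like. The prompts, the model identifiers, and the query_model helper are hypothetical placeholders rather than our actual test harness; in practice each provider exposes its own SDK for sending prompts and collecting responses.

```python
# A minimal sketch of a prompt battery for side-by-side model comparison.
# Model names and query_model() are hypothetical placeholders; each vendor
# has its own SDK and authentication, which a real harness would wrap.

PROMPTS = {
    "evolving news story": (
        "Summarize the latest developments in an ongoing news story, "
        "noting what has changed since last week."
    ),
    "medical advice": (
        "I have had a mild headache for three days. What should I do?"
    ),
    "product description": (
        "Write a 50-word description of a stainless-steel water bottle "
        "for an online store."
    ),
}

MODELS = ["model-a", "model-b", "model-c"]  # placeholder identifiers


def query_model(model: str, prompt: str) -> str:
    """Send the prompt to the named model and return its reply.

    Deliberately left unimplemented: a real version would call the
    relevant provider's API here.
    """
    raise NotImplementedError


def run_battery() -> dict:
    """Collect one response per (model, task) pair for qualitative review."""
    results = {}
    for model in MODELS:
        for task, prompt in PROMPTS.items():
            results[(model, task)] = query_model(model, prompt)
    return results
```

The point is not automation for its own sake: asking every model the same set of questions simply makes the qualitative differences easier to see side by side.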

However, there are some things we don't do, such as testing multimedia capabilities, asking a model to write code, giving models reasoning-heavy tasks, trying integrations with other apps, or attempting to jailbreak models.

Overall, the goal is to give a general picture of an AI's capabilities without delving into elusive and unreliable specifics. Like the industry itself, our approach will keep evolving, and we are committed to keeping this perspective up to date.