Salesforce researchers release framework to test NLP model robustness

In the subfield of machine learning known as natural language processing (NLP), robustness tests are more the exception than the norm. This is particularly problematic given research showing that many NLP models rely on spurious correlations that undermine their performance outside of specific benchmarks. One report found that 60% to 70% of the answers given by NLP models were embedded somewhere in the benchmark training sets, indicating that the models had often simply memorized answers. Another study, a meta-analysis of over 3,000 AI papers, found that the metrics used to benchmark AI and machine learning models tended to be inconsistent, erratically tracked, and not particularly informative.

This motivated Nazneen Rajani, a senior scientist at Salesforce who leads the company’s NLP group, to create an ecosystem for assessing the robustness of machine learning models. Together with Stanford associate professor of computer science Christopher Ré and Mohit Bansal of the University of North Carolina at Chapel Hill, Rajani and team developed Robustness Gym, which aims to unify the patchwork of existing robustness libraries in order to accelerate the development of novel NLP model testing strategies.

“While existing robustness tools implement specific strategies such as adversarial attacks or template-based augmentations, Robustness Gym provides a one-stop shop to run and compare a broad range of evaluation strategies,” Rajani told VentureBeat via email. “We hope that Robustness Gym will make robustness testing a standard part of the machine learning pipeline.”

Above: The front-end dashboard for Robustness Gym.

Photo credit: Salesforce Research

Robustness Gym provides guidance to practitioners on how key variables (their task, evaluation needs, and resource constraints) can help prioritize which evaluations to run. The suite captures the influence of a specific task through its structure and known prior evaluations; needs such as testing generalization, fairness, or security; and constraints such as expertise, compute access, and staffing.

Robustness Gym divides all robustness tests into four evaluation idioms: subpopulations, transformations, evaluation sets, and adversarial attacks. Practitioners create so-called slices, where each slice defines a collection of evaluation examples built using one of these idioms or a combination of them. Users follow a simple two-stage workflow that separates the storage of structured side information about examples from the nuts and bolts of programmatically building slices using that information.
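To make the idea concrete, here is a minimal sketch in plain Python of the “subpopulation” idiom. This is a hypothetical illustration, not the actual Robustness Gym API; the field names and length thresholds are invented for the example:

```python
# Hypothetical sketch of the "subpopulation" idiom -- plain Python,
# not the actual Robustness Gym API. A slice is a named subset of the
# evaluation data, selected by a predicate over the examples.

def length_subpopulation(examples, lo, hi):
    """Select examples whose input length falls in [lo, hi) tokens."""
    return [ex for ex in examples if lo <= len(ex["text"].split()) < hi]

def accuracy(model, examples):
    """Fraction of examples the model labels correctly."""
    if not examples:
        return 0.0
    correct = sum(model(ex["text"]) == ex["label"] for ex in examples)
    return correct / len(examples)

def evaluate_slices(model, examples):
    slices = {
        "short (<10 tokens)": length_subpopulation(examples, 0, 10),
        "long (>=50 tokens)": length_subpopulation(examples, 50, 10**9),
    }
    # Per-slice scores expose degradations that an aggregate metric hides.
    return {name: accuracy(model, sl) for name, sl in slices.items()}
```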

Robustness Gym also consolidates slices and results for prototyping, iteration, and collaboration. Practitioners can organize slices into a test bench that can be versioned and shared, so a community of users can collaboratively benchmark and track progress. For reporting, Robustness Gym offers standard and custom robustness reports that can be generated automatically from test benches and included as appendices in papers.
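As a rough illustration of that workflow (again hypothetical, not the library’s actual interface), a test bench can be thought of as a versioned, shareable collection of slices whose per-slice results roll up into a report:

```python
import json

class TestBench:
    """Hypothetical versioned collection of slices -- not the real API."""

    def __init__(self, name, version, slices):
        self.name = name        # e.g. "sentiment-robustness"
        self.version = version  # e.g. "0.1.0"
        self.slices = slices    # dict: slice name -> list of examples

    def report(self, model, metric):
        # One row per slice, so regressions on individual slices are
        # visible at a glance rather than buried in an aggregate score.
        return [
            {"slice": name, "n": len(sl), "score": metric(model, sl)}
            for name, sl in self.slices.items()
        ]

    def save(self, path):
        # Persisting the bench definition lets collaborators re-run the
        # same evaluation and track progress across versions.
        with open(path, "w") as f:
            json.dump({"name": self.name,
                       "version": self.version,
                       "slice_sizes": {k: len(v)
                                       for k, v in self.slices.items()}}, f)
```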


Above: Named entity linking performance of commercial APIs versus academic models in Robustness Gym.

Photo credit: Salesforce Research

In one case study, Rajani and co-authors worked with a sentiment modeling team at a “major tech company” that measured the bias of its model using subpopulations and transformations. After testing the system on 172 slices spanning three evaluation idioms, the modeling team found performance degradations of up to 18% on 16 slices.

In a more revealing test, Rajani and her team used Robustness Gym to compare commercial NLP APIs from Microsoft (Text Analytics API), Google (Cloud Natural Language API), and Amazon (Comprehend API) with the open source systems BOOTLEG, WAT, and REL across two benchmark datasets for named entity linking. (Named entity linking involves identifying the key elements in a text, such as names of people, places, brands, and monetary values, and resolving them to the entities they refer to.) They found that the commercial systems struggled to link rare or less popular entities, were sensitive to the capitalization of entities, and often ignored contextual cues when making predictions. Microsoft outperformed the other commercial systems, but BOOTLEG outperformed the rest in terms of consistency.

“Both Google and Microsoft perform well on some topics, for example Google on the topic of ‘Alpine Sports’ and Microsoft on the topic of ‘Skating,’ [but] commercial systems sidestep the difficult problem of disambiguating ambiguous entity mentions in favor of returning the more popular answer,” wrote Rajani and co-authors in the paper describing their work. “Overall, our results suggest that state-of-the-art academic systems significantly outperform commercial APIs for named entity linking.”


Above: Summarization performance of models compared in Robustness Gym.

Photo credit: Salesforce Research

In a final experiment, Rajani’s team implemented five subpopulations that capture summary abstractiveness, content distillation, positional bias, information dispersion, and information reordering. After comparing seven NLP models, including Google’s T5 and Pegasus, on an open source summarization dataset across these subpopulations, the researchers found that the models had difficulty performing well on examples that required higher levels of abstraction, contained more references to entities, or were highly distilled. Surprisingly, models with different prediction mechanisms appeared to make “strongly correlated” errors, suggesting that existing metrics may fail to capture meaningful performance differences.
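For intuition, abstractiveness can be approximated by n-gram novelty: the fraction of summary bigrams that never appear in the source document. This is a common proxy, sketched below under that assumption; the paper’s exact definition may differ:

```python
def bigrams(tokens):
    """Set of adjacent token pairs."""
    return set(zip(tokens, tokens[1:]))

def abstractiveness(source, summary):
    """Fraction of summary bigrams that never appear in the source.

    Higher values indicate a more abstractive (less copy-heavy)
    summary. Bucketing examples by this score yields a subpopulation
    slice of the kind described above.
    """
    src, summ = bigrams(source.split()), bigrams(summary.split())
    return len(summ - src) / max(len(summ), 1)
```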

“With Robustness Gym, we show that robustness remains a challenge even for corporate giants such as Google and Amazon,” said Rajani. “In particular, we show that public APIs from these companies perform significantly worse than simple string-matching algorithms for the task of disambiguating entities when evaluated on rare (‘tail’) entities.”
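The string-matching baseline the quote alludes to can be as simple as an alias-table lookup that always returns the most popular candidate entity. A hypothetical sketch follows; the table entries and popularity scores are invented for illustration:

```python
# Hypothetical string-matching baseline for entity disambiguation of
# the kind the quote refers to: look the mention up in an alias table
# and return the most popular candidate, ignoring context entirely.
# The entries below are invented for illustration.

ALIAS_TABLE = {
    # mention (lowercased) -> list of (entity, popularity) candidates
    "jordan": [("Michael_Jordan", 0.7), ("Jordan_(country)", 0.3)],
    "amazon": [("Amazon_(company)", 0.8), ("Amazon_River", 0.2)],
}

def link_entity(mention):
    candidates = ALIAS_TABLE.get(mention.lower())
    if not candidates:
        return None  # rare "tail" entities are simply missed
    return max(candidates, key=lambda c: c[1])[0]

print(link_entity("Amazon"))  # -> Amazon_(company)
```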

Both the above paper and the Robustness Gym source code are available as of today.
