
ML Safety Newsletter #3

newsletter.mlsafety.org · by Dan Hendrycks · March 8, 2022

Transformer adversarial robustness, fractals, preference learning

Welcome to the 3rd issue of the ML Safety Newsletter. In this edition, we cover:

  • NeurIPS ML safety papers
  • experiments showing that Transformers have no edge for adversarial robustness and anomaly detection
  • a new method leveraging fractals to improve various reliability metrics
  • a preference learning benchmark
  • ... and much more.

This paper evaluates the distribution shift robustness and adversarial robustness of ConvNets and Vision Transformers (ViTs). Compared with previous papers, its evaluations are fairer and more careful.

After controlling for data augmentation, they find that Transformers exhibit greater distribution shift robustness. For adversarial robustness, findings are more nuanced. First, ViTs are far more difficult to adversarially train. When successfully adversarially trained, ViTs are more robust than off-the-shelf ConvNets. However, ViTs’ higher adversarial robustness is explained by their smooth activation function, the GELU. If ConvNets use GELUs, they obtain similar adversarial robustness. Consequently, Vision Transformers are more robust than ConvNets to distribution shift, but they are not intrinsically more adversarially robust.

Paper

Video
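The GELU finding can be illustrated numerically. ReLU's derivative jumps from 0 to 1 at the origin, while GELU's derivative passes smoothly through it; this smoothness is what the paper credits for making adversarial training easier. A minimal sketch using the standard tanh approximation of GELU (the one-sided derivative check is our own illustration, not the paper's experiment):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gelu(x):
    # tanh approximation of GELU(x) = x * Phi(x)
    c = np.sqrt(2.0 / np.pi)
    return 0.5 * x * (1.0 + np.tanh(c * (x + 0.044715 * x ** 3)))

def one_sided_grads(f, x0=0.0, eps=1e-5):
    """Left and right numeric derivatives of f at x0."""
    left = (f(x0) - f(x0 - eps)) / eps
    right = (f(x0 + eps) - f(x0)) / eps
    return left, right

# ReLU has a kink at 0 (derivative jumps by 1); GELU is smooth there.
relu_jump = abs(np.subtract(*one_sided_grads(relu)))
gelu_jump = abs(np.subtract(*one_sided_grads(gelu)))
print(relu_jump, gelu_jump)
```

Intuitively, gradient-based adversarial training is better behaved when the loss surface has no kinks from the activation function.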

PixMix improves both robustness (corruptions, adversaries, prediction consistency) and uncertainty estimation (calibration, anomaly detection).

PixMix is a data augmentation strategy that mixes training examples with fractals or feature visualizations; models then learn to classify these augmented examples. Whereas previous methods sacrifice performance on some reliability axes for improvements on others, this is the first to have no major reliability tradeoffs and is near Pareto-optimal.

Paper
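The core mixing loop can be sketched in a few lines. This is a loose, hypothetical sketch of the idea only: repeatedly blend a normalized image with a random "mixing picture" (a fractal or feature visualization) using additive or multiplicative operations. The real method also composes standard augmentations and uses its own specific operations and weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def pixmix(image, mixers, k=4, beta=3.0):
    """Loose sketch of PixMix-style augmentation: mix a [0, 1]-normalized
    image with randomly chosen mixing pictures up to k times."""
    mixed = image.copy()
    for _ in range(rng.integers(0, k + 1)):
        picture = mixers[rng.integers(len(mixers))]
        w = rng.beta(beta, beta)                         # mixing weight in (0, 1)
        if rng.random() < 0.5:
            mixed = w * mixed + (1 - w) * picture        # additive blend
        else:
            mixed = (mixed ** w) * (picture ** (1 - w))  # multiplicative blend
    return np.clip(mixed, 0.0, 1.0)

img = rng.random((32, 32, 3))
fractals = [rng.random((32, 32, 3)) for _ in range(4)]  # stand-ins for real fractals
out = pixmix(img, fractals)
```

The structural complexity of fractals is what makes the augmented examples useful for reliability; random noise pictures would not provide the same benefit.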

A new adversarial robustness state-of-the-art by finding a better way to leverage data augmentations.

A highly effective gradient-based adversarial attack for text-based models.

A new benchmark for detecting adversarial text attacks.

Adversarially attacking language models with bidirectional and large-scale unidirectional language models.

First works on certified robustness under distribution shift: [1], [2], [3].

A dataset where in-distribution accuracy is negatively correlated with out-of-distribution robustness.

Improving performance in tail events by augmenting prediction pipelines with retrieval.

A set of new, more realistic 3D common corruptions.

Multimodality can dramatically improve robustness.

The authors model the hidden feature representations of in-distribution examples as class-conditional Gaussians, and they sample virtual outliers from the low-likelihood region. The model is trained to separate in-distribution examples from virtual outliers.

A path towards better out-of-distribution (OOD) detection is through generating diverse and unusual examples. As a step in that direction, this paper proposes to generate hidden representations or “virtual” examples that are outliers, rather than generate raw inputs that are outliers. The method is evaluated on many object detection and classification tasks, and it works well. It is not evaluated on the more difficult setting where anomalies are held-out classes from similar data generating processes. If the authors evaluated their CIFAR-10 model’s ability to detect CIFAR-100 anomalies, then we would have more of a sense of its ability to detect more than just far-from-distribution examples. Assuming no access to extra real outlier data, this method appears to be the state-of-the-art for far-from-distribution anomaly detection.

Paper
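The sampling step above can be sketched concretely. This is a simplified, single-class illustration of the idea (fit a Gaussian to penultimate-layer features, sample candidates, keep only the lowest-likelihood fraction), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def virtual_outliers(features, n_candidates=10000, quantile=0.01):
    """Sketch of virtual-outlier sampling: fit a class-conditional Gaussian
    to feature vectors, sample candidates from it, and keep the
    lowest-density fraction as 'virtual' outliers."""
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    samples = rng.multivariate_normal(mu, cov, size=n_candidates)
    # Rank candidates by (unnormalized) Gaussian log-density; the
    # Mahalanobis distance suffices for ranking.
    diff = samples - mu
    maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    cutoff = np.quantile(-maha, quantile)   # low density = large distance
    return samples[-maha <= cutoff]

feats = rng.normal(size=(500, 8))   # stand-in for penultimate-layer features
outliers = virtual_outliers(feats)
```

Training a classifier to separate real features from these samples gives it a decision boundary around the in-distribution region without requiring any real outlier data.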

ML models can be “Trojans” and have hidden, controllable vulnerabilities. Trojan models behave correctly and benignly in almost all scenarios, but in particular circumstances (when a “trigger” is satisfied), they behave incorrectly. This paper demonstrates the simplicity of creating Trojan reinforcement learning agents that can be triggered to execute a secret, coherent, and undesirable procedure. They modify a small fraction of training observations without assuming any control over policy or reward. Future safety work could try to detect whether models are Trojans, detect whether a Trojan model is being triggered, or precisely reconstruct the trigger given the model.

Paper
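A toy sketch of the observation-poisoning step makes the attack surface concrete. This is our own illustration, not the paper's attack: stamp a small trigger pattern into a small fraction of training observations (the real attack also shapes the associated behavior so the trigger elicits a coherent procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

def poison_observations(obs, poison_frac=0.01, trigger_value=1.0):
    """Toy sketch of Trojan-style data poisoning: stamp a 2x2 trigger patch
    into a random fraction of image-like observations. Returns the
    poisoned copy and the poisoned indices."""
    obs = obs.copy()
    n = len(obs)
    idx = rng.choice(n, size=max(1, int(poison_frac * n)), replace=False)
    obs[idx, :2, :2] = trigger_value   # trigger patch in a corner
    return obs, idx

observations = rng.random((1000, 8, 8))
poisoned, poisoned_idx = poison_observations(observations)
```

Because only a tiny fraction of observations is touched, the agent's behavior looks normal in evaluation unless the trigger appears, which is what makes Trojans hard to detect.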

The Species dataset contains over 700,000 images covering over 1,000 anomalous species.

While previous papers claimed that Transformers are better at OOD detection than ConvNets, it turns out their test-time “anomalous examples” were similar to examples seen during pretraining. How can we properly assess OOD detection performance for models pretrained on broad datasets? This paper creates a biological anomaly dataset with organisms not seen in broad datasets including ImageNet-22K. The OOD dataset shows that Transformers have no marked edge over ConvNets at OOD detection, and there is substantial room for improvement.

Paper
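OOD detection of this kind is commonly scored with the maximum softmax probability (MSP) baseline and AUROC. A self-contained sketch of that evaluation recipe (not the paper's code; AUROC is computed via the rank-sum identity):

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability: a standard OOD-detection baseline
    (higher score = more in-distribution)."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return p.max(axis=1)

def auroc(in_scores, out_scores):
    """AUROC via the Mann-Whitney U statistic (assumes no tied scores)."""
    scores = np.concatenate([in_scores, out_scores])
    ranks = scores.argsort().argsort() + 1
    n_in, n_out = len(in_scores), len(out_scores)
    u = ranks[:n_in].sum() - n_in * (n_in + 1) / 2
    return u / (n_in * n_out)

# Confident in-distribution logits vs. flat OOD logits -> AUROC of 1.0.
in_logits = np.array([[10.0, 0.0, 0.0], [0.0, 9.0, 0.0]])
out_logits = np.array([[1.0, 1.0, 1.1], [0.5, 0.4, 0.6]])
result = auroc(msp_score(in_logits), msp_score(out_logits))
```

An AUROC of 0.5 is chance; the Species dataset's point is that reported AUROCs collapse when the "anomalies" were actually seen during pretraining.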

Detecting far-from-distribution examples by simply first clipping values in the penultimate layer.
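The clipping idea can be sketched as follows. This is a simplified illustration under our own assumptions (in practice the clipping threshold is estimated from in-distribution data, and the score is an energy-style logsumexp of the logits):

```python
import numpy as np

def react_score(penultimate, weights, bias, clip_percentile=90):
    """Sketch of rectified-activation OOD scoring: clip unusually large
    penultimate-layer activations, then compute an energy-style score
    (logsumexp of logits; higher = more in-distribution)."""
    c = np.percentile(penultimate, clip_percentile)  # threshold; ideally fit on in-dist data
    clipped = np.minimum(penultimate, c)
    logits = clipped @ weights + bias
    m = logits.max(axis=1)
    return m + np.log(np.exp(logits - m[:, None]).sum(axis=1))

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 16))      # stand-in penultimate activations
W, b = rng.normal(size=(16, 10)), np.zeros(10)
scores = react_score(feats, W, b)
```

The intuition is that OOD inputs tend to produce abnormally large activations, so truncating them disproportionately shrinks OOD confidence.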

A new OOD detection dataset with 224K classes.

A new metric advances the state-of-the-art for predicting a model’s performance on out-of-distribution data, assuming no access to ground truth labels.

A differentiable calibration loss sacrifices a small amount of accuracy for large calibration improvements.

In a thorough analysis of calibration, ConvNets are less calibrated than Transformers and MLP models, and more pretraining data has no consistent effect on calibration.
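Calibration in analyses like these is typically measured with expected calibration error (ECE): bin predictions by confidence and average the gap between each bin's accuracy and its mean confidence, weighted by bin size. A minimal sketch:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error over equal-width confidence bins."""
    ece = 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap   # weight by fraction of examples in bin
    return ece

# Perfectly calibrated: 80% confidence, 80% accuracy -> ECE of 0.
conf = np.full(10, 0.8)
correct = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
calibrated_ece = expected_calibration_error(conf, correct)

# Overconfident: 95% confidence, 50% accuracy -> ECE of 0.45.
overconfident_ece = expected_calibration_error(
    np.full(10, 0.95), np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0]))
```

A model with low ECE makes confidence scores that can be taken at face value, which matters for any downstream decision that thresholds on confidence.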

A dataset that can be used for detecting contradictions given long background contexts. Such a detector could be used for preventing models from stating falsehoods at odds with reality or their previous statements.

Factual knowledge in language models corresponds to a localized computation that can be directly edited.

Instead of assuming that the environment provides a (hand-engineered) reward, a teacher provides preferences between the agent’s behaviors, and the agent uses this feedback to learn the desired behavior.

Preference-based RL is a framework for teaching agents by providing preferences about their behavior. However, the research area lacks a commonly adopted benchmark. While access to human preferences would be ideal, this makes evaluation far more costly and slower, and it often requires navigating review board bureaucracies. This paper creates a standardized benchmark using simulated teachers. These simulated teachers have preferences, but they can exhibit various irrationalities. Some teachers skip queries, some exhibit no preference when demonstrations are only subtly different, some make random mistakes, and some overemphasize behavior at the end of the demonstration.

Paper
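The simulated-teacher idea can be sketched in a few lines. This is a hypothetical simplification of such a teacher, not the benchmark's implementation: it prefers the higher-return trajectory, skips near-ties, and makes random mistakes at some rate (the benchmark's teachers also model myopia, i.e. overweighting late-trajectory behavior, omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_teacher(ret_a, ret_b, skip_margin=0.1, error_rate=0.1):
    """Sketch of an imperfect simulated teacher for preference-based RL.
    Returns 0 (prefers trajectory A), 1 (prefers B), or None (skip)."""
    if abs(ret_a - ret_b) < skip_margin:
        return None                       # behaviors too similar: skip the query
    preference = 0 if ret_a > ret_b else 1
    if rng.random() < error_rate:
        preference = 1 - preference       # occasional random mistake
    return preference
```

An agent trained against such teachers can be stress-tested for robustness to each irrationality separately, which is exactly what human-subject evaluation makes slow and expensive.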

It is sometimes easier to identify preferences when decision problems are more uncertain.

Debate about “alignment” definitions: [1], [2], [3].

Optimal policies tend to seek power, a failure mode that will become more concerning with future advanced AI.

Using model look-ahead to avoid safety constraint violations.

This work proposes a policy editor to make policies comply with safety constraints; experiments are based on Safety Gym.

Benchmarking policies that adhere to constraints specified via natural language.

Apply to Fathom Radiant, which is working on hardware for safe machine intelligence.
