mimee.xyz

Mimee Xu

PhD Student at New York University

View My GitHub Profile

The Role of Data in Securing Our AI Future

Through my research, I hope to mitigate the erosion of individual autonomy perpetuated by a data-fueled, AI-powered economy.

The development of AI exacerbates and accelerates the loss of individual control over data, yet at the same time it requires vast amounts of new data in order to scale and maintain control, especially in the worst-case scenarios. Data rights can be an important piece of the checks and balances in our future.

Remark: an interesting framing is that I work on neartermist safety. Weaponizing AI to take away people's agency is already happening in some parts of the world -- certainly not a longtermist concern. While most of my peers work on scaling large models, I catastrophize large-scale societal harm. I am fine with this label, because it motivates me to take grounded approaches to the lofty subject that is AI safety.

Privacy, Security, and Machine Learning

In 2024, I am developing my PhD thesis on AI Privacy -- broadly, preserving data rights with security techniques -- in the computer science department of NYU's Courant Institute of Mathematical Sciences.

Privacy is not the villain of the story of positive AI outcomes. A path forward I see involves (1) a private data economy that affords healthy market environments, and (2) sustainable AI governance through individuals' rights over their data, starting with enabling the right to be forgotten.

I am advised by Prof. Leon Bottou.

I worked as an engineer at Google Chrome, Baidu Silicon Valley AI Lab, and UnifyID. During my PhD, I interned at Facebook AI Research and at TikTok's Applied ML recommendations team.


Machine Learning for Systems (MLforSystems)

Over the past 5 years, I have been co-fostering an interdisciplinary research community on machine learning and systems at NeurIPS. We believe machine learning is impactful not just through replacing existing systems, but by fundamentally changing the ways we build new ones.

🔗CALL FOR PAPERS🔗 🔗ACCEPTED PAPERS🔗


PhD Research

Netflix and Forget

Efficient and Exact Machine Unlearning from Bi-linear Recommendations. Paper draft.

My work on the Right To Be Forgotten for content recommenders (with ByteDance).
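
To give a flavor of what exact unlearning from a bilinear model can look like, here is a minimal sketch of my own (not the paper's algorithm): with item factors held fixed, each user's factor in a matrix-factorization recommender is a ridge-regression solve over that user's ratings, so a single rating can be forgotten exactly with a rank-one downdate instead of retraining.

```python
# A minimal illustration, not the Netflix-and-Forget algorithm itself.
# Assumes a matrix-factorization recommender with item factors V held fixed,
# so each user's factor solves (V^T V + lam*I) u = V^T r over their ratings.
import numpy as np

def user_factor(V_rated, ratings, lam=1.0):
    """Closed-form user factor from the items this user rated."""
    k = V_rated.shape[1]
    A = V_rated.T @ V_rated + lam * np.eye(k)
    return np.linalg.solve(A, V_rated.T @ ratings)

def forget_rating(A_inv, b, v_i, r_i):
    """Exactly remove one rating (v_i, r_i) via a Sherman-Morrison downdate,
    using the cached inverse A_inv = (V^T V + lam*I)^{-1} and b = V^T r."""
    Av = A_inv @ v_i
    A_inv_new = A_inv + np.outer(Av, Av) / (1.0 - v_i @ Av)
    return A_inv_new @ (b - r_i * v_i)
```

The downdated factor matches what re-solving on the remaining ratings would produce, which is the sense in which the deletion is exact rather than approximate.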

Data Appraisal Without Data Sharing 🔗🔗


Of Ongoing Interests

I am starting an organization for secure and private evaluations on safety-critical tasks. Stay tuned, and reach out if you are interested.

I'm working on integrating cutting-edge cryptography and security techniques into applications that may become future infrastructure. I have opinions on the techniques that afford better policy regarding our future relationship with AI.

I am additionally interested in both the generalizable security of machine learning and using machine learning to enhance systems security.

Large Model Evaluations

In 2020, I participated in what became the popular LLM benchmark Big-Bench (Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models). Together with Rowan Jacobs and James Koppel, I contributed the Conlang Translation task, which won a spotlight at the Workshop on Enormous Language Models.

Probing Pre-trained LLMs w/ Linguistic Puzzles (ICLR '21 Big-Bench spotlight, big-bench paper, our slides).

Private ML and Confidential Computing

In the past, I was part of a few exploratory workshops on privacy and machine learning.

Even though my research is on machine learning, I enjoy integrating cryptography into privacy-preserving machine learning to demo what can be done. Examples include on-device fitness analysis on smart mirrors and secure algorithms for fair healthcare triage. The underlying security techniques can enrich areas of machine learning that I deem likely to become infrastructural in our future, without sacrificing the bulk of economic opportunities and market competitiveness.

System Security: Learned Containment: Generated Sandbox by Repeated Fuzzing and Patching.
Policy x Causal Inference: Election with Causality, a Mean Field Model and Turnout Simulation.

Technical Aid To AI Policy

Though privacy is a focus of AI policy, solutions for privacy are lagging behind. I am interested in techniques beyond differential privacy. This includes defining contextual privacy computationally, systemic re-identification risk assessment, membership inference, and deleting training data (slides for a talk at RIKEN).

Security Through Generalization Evals

AI Security (with generalization): Understanding augmentation and simulation, theory of domain randomization, defining ‘similar’ environments, learning in the presence of adversaries (talk slides).

  1. 🔗⬇️🔗 Mean-field analysis for causality in social media amplification. This models social media recommenders as an influence on election outcomes.

  2. 🔗⬇️🔗 Towards secure transaction and fair pricing of training data. There are four parts to this work.

  3. Learned containment: generated sandbox by repeated fuzzing and patching.

Besides these, I enjoy thinking about systems and design problems in technology. I would also love to hear about your practical privacy challenges related to managing data and models.


A Model for Network Influence

Abstract: In November 2016, when all the polls predicted a Clinton presidency, America got Trump. Assuming that the pollsters had sampled real sentiments from real voters, what could have caused the shift in the election result? This is rather hard to untangle, since very few entities have data on social networks' topology. There is also the challenge of assessing cyclic causes over time, akin to studying a network-influence question: does the eating-disorder community on the internet cause eating disorders? This work explores different high-level aspects of network influence via a mean-field approach across average networks, controlling for the variance of election outcomes away from the fundamentals. The goal of this exploration is to offer a simulation and a theoretical model, a combination that clarifies qualitative phenomena and evaluates the impact of potential policy interventions. 🔗Project intro page.🔗
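
As a toy illustration of the kind of mean-field simulation this refers to (my own simplified dynamics and parameters, not the project's actual model), the sketch below nudges every voter's latent preference toward a single population-level signal and sweeps the amplification strength to see how far the outcome can drift from the fundamentals.

```python
# Toy mean-field amplification simulation; the dynamics and numbers here are
# illustrative assumptions only, not the model used in the project.
import numpy as np

def simulate_election(n=10_000, amplification=0.3, steps=50, seed=0):
    rng = np.random.default_rng(seed)
    pref = rng.normal(0.0, 1.0, n)           # latent preference; sign decides the vote
    for _ in range(steps):
        mean_field = np.tanh(pref).mean()    # the one averaged signal everyone sees
        pref += amplification * mean_field   # uniform nudge toward the amplified mean
    return np.sign(pref).mean()              # vote margin: >0 one candidate, <0 the other

# Sweep the amplification strength to see how the margin moves away from the
# balanced fundamentals encoded in the initial preferences.
for a in (0.0, 0.1, 0.3):
    print(a, simulate_election(amplification=a))
```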


Towards Fair Pricing and Secure Transaction of Training Data

This project has the broad goal of streamlining the incentive problems in data markets. There are three endeavors so far tackling the challenge:

Data Appraisal without Data Sharing

The goal is to evaluate the effect training data will have on a pre-trained model without retraining, implemented as a secure multi-party computation in CrypTen.

Main insight: a single step of natural gradient descent may be more desirable than a single (Euclidean) gradient descent step when used to evaluate the relative effect of various batches of data. If the ranking between datasets correlates with the test loss that results from training with the batch added, this information can be used to approximate the price of data without training on it. One application is private pricing before a transaction. If the model is small enough, this can be done privately without much numerical imprecision. (to appear)
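
A minimal sketch of this appraisal logic in plain PyTorch, under my own simplifications: a single Euclidean gradient step and no MPC. The CrypTen implementation runs comparable arithmetic under encryption, and the paper argues for a natural-gradient step instead; placeholder names such as `model`, `val_x`, and `val_y` are not from the paper's code.

```python
# Sketch only: rank candidate batches by the validation loss obtained after
# one gradient step on a copy of a small pre-trained model.
import copy
import torch
import torch.nn.functional as F

def appraise_batch(model, batch, val_x, val_y, lr=0.1):
    probe = copy.deepcopy(model)               # never touch the buyer's real model
    x, y = batch
    loss = F.cross_entropy(probe(x), y)
    grads = torch.autograd.grad(loss, probe.parameters())
    with torch.no_grad():
        for p, g in zip(probe.parameters(), grads):
            p -= lr * g                        # one Euclidean step (paper: natural gradient)
        return F.cross_entropy(probe(val_x), val_y).item()

# Lower post-step validation loss suggests more valuable data:
# ranked = sorted(candidate_batches, key=lambda b: appraise_batch(model, b, val_x, val_y))
```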

Data Efficacy

Implementing smart contracts for the effect of training data is extremely expensive.

  1. Studied extracted models as a potential auxiliary for enabling encrypted transactions: because of their poor accuracy they are not as sensitive as the original models, yet they may still be good enough to evaluate data acquisitions, enabling smart contracts.
  2. Studied using the gradients of compressed models, computed in private, to gauge data value (see the sketch below).
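
A sketch of the second direction, under my own simplifying assumptions (a tiny proxy standing in for the compressed model, plain PyTorch rather than a private computation): score a candidate batch by how well its gradient on the proxy aligns with a held-out validation gradient.

```python
# Illustration only; `proxy_model`, `batch`, and `val_batch` are placeholders.
import torch
import torch.nn.functional as F

def flat_grad(model, x, y):
    """Flatten the gradient of the loss on (x, y) into a single vector."""
    loss = F.cross_entropy(model(x), y)
    grads = torch.autograd.grad(loss, model.parameters())
    return torch.cat([g.reshape(-1) for g in grads])

def data_value(proxy_model, batch, val_batch):
    """Higher cosine similarity with the validation gradient ~ more useful data."""
    g_batch = flat_grad(proxy_model, *batch)
    g_val = flat_grad(proxy_model, *val_batch)
    return F.cosine_similarity(g_batch, g_val, dim=0).item()
```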

Mechanism design for acquiring and deleting training data (w/ Qingyun Sun)

Data markets are a crucial part of the future of our economy. Traditional market practices tend to be inefficient when applied to data as a commodity, because (1) data can be replicated at low cost, and (2) the information contained in data can be privacy-relevant.

On the other hand, the rise of AI applications whose utility depends on data quantity and quality means that privately held, high-quality data becomes visibly valuable. This gives rise to a data market where acquisitions and ownership disputes will be frequent.

We envision new mechanisms that allow model owners to acquire data of appropriate utility, while allowing data owners to manage, edit, and delete their data.

This work is ongoing.


Survey on Mean Field Methods In Deep Learning

I wrote a 🔗survey paper🔗 that helps me understand (1) mean field methods as applied to the theory of deep learning, and (2) physicist-style modeling, which I applied to qualitative political science.


A Worry-free Encryption Plan for Healthcare Data

🔗HEalth: Privately Computing on Shared Healthcare Data🔗

Abstract: Healthcare in the US is notoriously expensive and unsustainable. Many machine learning startups aim to simplify the process by bringing automated expertise to hospitals for radiology, brain imaging, or cancer detection. On the other hand, it is well known that hospitals cannot easily share data. Existing machine learning solutions tend to focus on training with data obtained through long-term collaboration, because the private computation requires that the clients have done appropriate pre-processing. Effectively, they work around the data-sharing challenge in training rather than tackling it. As a result, their inference model is typically static and does not adapt to changes continuously.

On the other hand, regulatory agencies need to uphold ethics standards for healthcare professionals, but can only access hospital records in a delayed fashion. Despite their mandate to audit individual doctors, assess response rates across the nation, and manage epidemics, regulatory bodies lack the reach to access the critical information needed to learn from real-time decision-making. In these scenarios, training privately from the get-go is the desirable outcome. This is where advances in homomorphic encryption come in. We present an actualized scenario with hospitals sharing records to compute anomalies and audit fairness with incentive-compatible deployment and operations. Three main contributions: (1) fairness auditing at scale without the need to approximate nonlinearity, (2) enabling continuous sharing of data without key refresh, which undercuts the need for pairwise contracts out of privacy concerns, even if new models are developed and applied, and (3) an instance of anomaly detection in security management applied to finding causes and correlations that aid medical research, particularly for rare diseases, epidemic management, and chronic illness research. We simulate such a system using Microsoft's SEAL library and argue that our combination of a novel key-sharing scheme and anomaly detection algorithms is well-suited for applying homomorphic encryption at scale.
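
For flavor, here is a deliberately simplified sketch of the data-sharing idea using additively homomorphic encryption via the python-paillier library, rather than the SEAL-based system described above; the hospitals, groups, and counts are made up, and the real system's key-sharing scheme and anomaly detection are not shown.

```python
# Toy sketch: hospitals encrypt per-group counts, an untrusted aggregator sums
# ciphertexts, and only the regulator decrypts the aggregate rates.
from phe import paillier  # pip install phe

# The regulator holds the decryption key; hospitals only need the public key.
pub, priv = paillier.generate_paillier_keypair()

# Each hospital reports per-group (positive outcomes, total cases). Made-up numbers.
hospital_reports = [
    {"group_a": (12, 40), "group_b": (30, 60)},   # hospital 1
    {"group_a": (8, 35), "group_b": (22, 55)},    # hospital 2
]
encrypted_reports = [
    {g: (pub.encrypt(pos), pub.encrypt(tot)) for g, (pos, tot) in report.items()}
    for report in hospital_reports
]

# The aggregator adds ciphertexts without learning any hospital's numbers.
totals = {g: (pub.encrypt(0), pub.encrypt(0)) for g in encrypted_reports[0]}
for report in encrypted_reports:
    for g, (pos, tot) in report.items():
        totals[g] = (totals[g][0] + pos, totals[g][1] + tot)

# Only the regulator decrypts aggregate rates, never per-hospital records.
for g, (pos, tot) in totals.items():
    print(g, priv.decrypt(pos) / priv.decrypt(tot))
```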


Private ML Demo

A “magic mirror” made with two-way mirrors using purely on-device computations, allowing deployment into the most private spaces (e.g. as a bathroom mirror). It monitors skin conditions, and counts your morning jumping jacks.

NeurIPS Private and On-Device AI workshop 2018 report and demo videos.


About Me

I work at the intersection of machine learning and security.

So far, this includes (1) applying machine learning to practical security and privacy problems that I witnessed in industry, and (2) applying security-management insights to similar challenges in deploying machine learning models. I am open to other approaches as well.

For inquiries, suggestions, and threats, please email mimee AT nyu.edu.