
Mimee Xu

PhD Student at New York University

View My GitHub Profile

AI Privacy, Security, and Machine Learning

In 2023, I am developing my PhD thesis on AI Privacy -- broadly, toward the data economy and the governance of machine learning and AI -- in the computer science department of NYU's Courant Institute of Mathematical Sciences.

I worked as an engineer at Google Chrome, Baidu Silicon Valley AI Lab, and UnifyID. During my PhD, I interned at Facebook AI Research and TikTok Recommendations.


Machine Learning for Systems

Over the past five years, I have been involved in fostering an interdisciplinary research community on machine learning and systems at NeurIPS.

🔗CALL FOR PAPERS🔗 🔗ACCEPTED PAPERS🔗


PhD Research

Netflix and Forget: Efficient and Exact Machine Unlearning from Bi-linear Recommendations (with ByteDance) 🔗🔗

Data Appraisal Without Data Sharing 🔗🔗


Past Projects

  1. 🔗⬇️🔗 Mean-field analysis for causality in social media amplification. This work abstracts social media recommenders as an influence on election outcomes.

  2. 🔗⬇️🔗 Towards secure transaction and fair pricing of training data. There are four parts to this work.

  3. Learned containment: a sandbox generated by repeated fuzzing and patching.

Besides these, I enjoy thinking about systems and design problems in technology. I would also love to hear about the practical privacy challenges you face in managing data and models.


A Model for Network Influence

Abstract In November 2016, when all the polls predicted a Clinton presidency, America got Trump. Assuming that the pollsters had sampled real sentiments from real voters, what could have caused the shift in the election result? This is hard to untangle, since very few entities have data on social networks' topology. There is also the challenge of assessing cyclic causes over time, akin to studying a network influence question: do online eating-disorder communities cause eating disorders? This work explores high-level aspects of network influence via a mean-field approach over average networks, controlling for the variance of election outcomes away from the fundamentals. The goal of this exploration is to offer a combined simulation and theoretical model that clarifies qualitative phenomena and evaluates the impact of potential policy interventions. 🔗Project intro page.🔗
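
To make the mean-field abstraction concrete, here is a minimal, hypothetical sketch (not the project's actual model): every voter's opinion relaxes toward a recommender-amplified population mean, and we track how far the mean drifts from the poll-measured fundamentals. All names and parameters (`amplification`, `susceptibility`, `noise`) are illustrative assumptions.

```python
import numpy as np

def mean_field_election(n_steps=200, n_voters=10_000, amplification=1.5,
                        susceptibility=0.1, noise=0.02, seed=0):
    """Toy mean-field dynamics: each voter's opinion drifts toward the
    recommender-amplified population mean; returns the mean opinion over time."""
    rng = np.random.default_rng(seed)
    opinions = rng.normal(loc=0.0, scale=1.0, size=n_voters)  # "fundamentals": mean 0
    mean_trajectory = []
    for _ in range(n_steps):
        m = opinions.mean()                       # mean-field summary of the network
        signal = amplification * m                # what the recommender shows, on average
        opinions += susceptibility * (signal - opinions)    # relax toward the amplified mean
        opinions += rng.normal(scale=noise, size=n_voters)  # idiosyncratic shocks
        mean_trajectory.append(opinions.mean())
    return np.array(mean_trajectory)

# With amplification > 1, the mean opinion drifts away from the fundamentals
# even though each individual voter is only weakly susceptible.
drift = mean_field_election()
print(drift[0], drift[-1])
```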


Towards Fair Pricing and Secure Transaction of Training Data

This project has the broad goal of streamlining the incentive problems in data markets. There are three endeavors so far in tackling this challenge:

Data Appraisal without Data Sharing

The goal is to evaluate the effect training data will have on a pre-trained model without retraining, implemented as a secure multi-party computation in CrypTen.
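
As a flavor of what this looks like in code, below is a minimal single-process CrypTen sketch, not the paper's actual protocol: a placeholder linear model and a candidate batch are secret-shared, and a test loss is computed on ciphertexts as an appraisal signal. Real deployments run with multiple parties; the model, data, and variable names here are purely illustrative.

```python
import torch
import crypten
import crypten.nn as cnn

crypten.init()  # single-process demo; real MPC would span multiple parties

# Hypothetical pre-trained model owned by the buyer (a tiny linear probe for illustration)
model = torch.nn.Linear(10, 1)
enc_model = cnn.from_pytorch(model, torch.randn(1, 10)).encrypt()

# Seller's candidate batch, secret-shared so neither party sees the other's inputs
x = torch.randn(32, 10)
y = torch.randn(32, 1)
x_enc = crypten.cryptensor(x)
y_enc = crypten.cryptensor(y)

# Evaluate the candidate data against the encrypted model without revealing either side
pred_enc = enc_model(x_enc)
diff = pred_enc - y_enc
loss_enc = (diff * diff).mean()          # encrypted MSE as the appraisal signal
print("appraisal signal:", loss_enc.get_plain_text().item())
```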

Main insight A single step of natural gradient descent may be more informative than a single (Euclidean) gradient descent step when used to evaluate the relative effect of different batches of data. If the ranking between datasets correlates with the test loss that results from adding each dataset to training, this information can be used to approximate the price of data without training on it. One application is private pricing before a transaction. If the model is small enough, this can be done privately without much numerical imprecision. (to appear)
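
In the clear (i.e., before wrapping anything in MPC), the insight might be sketched as follows. This is a hypothetical illustration with my own names and hyperparameters, not the paper's exact method: score a candidate batch by the drop in test loss after a single natural-gradient step, using a damped empirical Fisher, and compare against the plain gradient step.

```python
import torch

def one_step_appraisal(model, candidate, test, lr=0.1, damping=1e-3, natural=True):
    """Hypothetical appraisal score: the drop in test loss after one natural
    (or Euclidean) gradient step on a candidate batch, without retraining."""
    loss_fn = torch.nn.MSELoss()
    params = list(model.parameters())
    originals = [p.detach().clone() for p in params]

    x_c, y_c = candidate
    grads = torch.autograd.grad(loss_fn(model(x_c), y_c), params)
    g = torch.cat([gr.reshape(-1) for gr in grads])

    if natural:
        # Empirical Fisher from per-example gradients (only feasible for tiny models)
        rows = []
        for xi, yi in zip(x_c, y_c):
            gi = torch.autograd.grad(loss_fn(model(xi[None]), yi[None]), params)
            rows.append(torch.cat([r.reshape(-1) for r in gi]))
        G = torch.stack(rows)
        fisher = G.T @ G / len(G) + damping * torch.eye(g.numel())
        step = torch.linalg.solve(fisher, g)   # natural-gradient direction
    else:
        step = g                               # plain (Euclidean) gradient direction

    x_t, y_t = test
    with torch.no_grad():
        before = loss_fn(model(x_t), y_t).item()
        offset = 0
        for p in params:                       # apply the single step
            n = p.numel()
            p -= lr * step[offset:offset + n].reshape(p.shape)
            offset += n
        after = loss_fn(model(x_t), y_t).item()
        for p, orig in zip(params, originals): # restore the pre-trained weights
            p.copy_(orig)
    return before - after  # larger = candidate batch looks more valuable
```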

Data Efficacy

Implementing smart contracts that settle on the effect of training data is extremely expensive.

  1. Studied extracted models as a potential auxiliary function for enabling encrypted transactions: their poor accuracy makes them less sensitive than the original models, yet they can still be used to evaluate a data acquisition, enabling smart contracts.
  2. Studied the effect of using gradients of compressed models, computed privately, to gauge data value (see the sketch after this list).
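
A hypothetical sketch of the second point, with names of my own choosing: a small, compressed stand-in model scores a candidate batch by the gradient magnitude the batch induces, which is far cheaper to evaluate privately than a step on the full model.

```python
import torch

def cheap_value_proxy(compressed_model, batch):
    """Illustrative proxy only: gauge a candidate batch's value by the gradient
    magnitude it induces on a small, compressed stand-in model."""
    x, y = batch
    loss = torch.nn.functional.mse_loss(compressed_model(x), y)
    grads = torch.autograd.grad(loss, list(compressed_model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads]).norm().item()

# e.g., with a distilled linear stand-in and a made-up candidate batch:
small = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
print(cheap_value_proxy(small, (x, y)))
```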

Mechanism design for acquiring and deleting training data (w/ Qingyun Sun)

Data markets are a crucial part of the future of our economy. Traditional market practices tend to be inefficient when applied to data as a commodity, because (1) data can be replicated at low cost, and (2) the information contained in data can be privacy-relevant.

On the other hand, the rise of AI applications whose utility depends on data quantity and quality means that privately held, high-quality data is becoming evidently valuable. This gives rise to a data market in which acquisitions and ownership disputes will be frequent.

We envision new mechanisms that allow model owners to acquire data of appropriate utility, while allowing data owners to manage, edit, and delete their data.

This work is ongoing.


Survey on Mean Field Methods In Deep Learning

I wrote a 🔗survey paper🔗 that helped me understand 1) mean-field methods as applied to the theory of deep learning, and 2) physicist-style modeling, which I applied to qualitative political science.


A Worry-free Encryption Plan for Healthcare Data

🔗HEalth: Privately Computing on Shared Healthcare Data🔗

Abstract Healthcare in the US is notoriously expensive and unsustainable. Many machine learning startups aim to simplify the process by bringing automated expertise to hospitals for radiology, brain imaging, or cancer detection. On the other hand, it is well known that hospitals cannot easily share data. Existing machine learning solutions tend to focus on training with data obtained through long-term collaborations, because private computation requires that the clients have done appropriate pre-processing. Effectively, they work around the data-sharing challenge in training rather than tackling it. As a result, their inference models are typically static and do not adapt to changes continuously.

On the other hand, regulatory agencies need to uphold ethics standards for healthcare professionals, but can only access hospital records in a delayed fashion. Despite their mandate to audit individual doctors, assess response rates across the nation, and manage epidemics, regulatory bodies lack the reach to access the critical information needed to learn real-time decision-making. In these scenarios, training privately from the start is the desirable outcome, and this is where advances in homomorphic encryption come in. We present a realistic scenario in which hospitals share records to compute anomalies and audit fairness, with incentive-compatible deployment and operations. There are three main contributions: 1. fairness auditing at scale without the need to approximate nonlinearities, 2. continuous sharing of data without key refresh, which removes the need for pairwise contracts driven by privacy concerns, even as new models are developed and applied, and 3. an instance of anomaly detection from security management applied to finding causes and correlations that aid medical research, particularly for rare diseases, epidemic management, and chronic illness research. We simulate such a system using Microsoft's SEAL library and argue that our novel key-sharing scheme and anomaly detection algorithms are well suited to applying homomorphic encryption at scale.
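
The system itself was simulated with Microsoft's SEAL library in C++. Purely for illustration, the sketch below uses TenSEAL (a Python wrapper around SEAL) to show the kind of CKKS arithmetic the design relies on, e.g., computing a simple anomaly score on an encrypted hospital record. It does not reproduce the paper's key-sharing scheme, and the record and weights are made up.

```python
import tenseal as ts

# CKKS context: encryption parameters chosen for a small demo, not for production
context = ts.context(ts.SCHEME_TYPE.CKKS, poly_modulus_degree=8192,
                     coeff_mod_bit_sizes=[60, 40, 40, 60])
context.global_scale = 2 ** 40
context.generate_galois_keys()          # needed for rotations in dot products

record = [0.8, 1.2, 0.5, 2.0]           # a hospital's (toy) feature vector
weights = [0.3, -0.1, 0.7, 0.2]         # auditor's (toy) anomaly-scoring weights

enc_record = ts.ckks_vector(context, record)  # encrypted record leaves the hospital
enc_score = enc_record.dot(weights)           # ciphertext-plaintext dot product
print("anomaly score:", enc_score.decrypt())  # only the key holder(s) can decrypt
```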


Private ML Demo

A “magic mirror” made with two-way mirrors, using purely on-device computation, allowing deployment in the most private spaces (e.g., as a bathroom mirror). It monitors skin conditions and counts your morning jumping jacks.

NeurIPS Private and On-Device AI workshop 2018 report and demo videos.


About Me

I work at the intersection of machine learning and security.

So far, this includes (1) applying machine learning to practical security and privacy problems I witnessed in industry, and (2) applying security-management insights to similar challenges in deploying machine learning models, but I am open to other approaches.

For inquiries, suggestions, and threats, please email mimee AT nyu.edu.