Fellow at NYU ILI
Working title: Privacy of the Mind: Emotional Processing, Confidentiality, and the Role of Tech and Law in AI-mediated Self
Talk to us about your thoughts!
Having a chatbot mold its responses to you, ground you, fill in the blanks, or otherwise reflect back your thoughts and feelings may sound dreary to some privacy scholars, but it is an unstoppable force. People increasingly use ChatGPT for emotional processing. This self-introspective process externalizes the inner voice and outsources inner work, which until recently could not be done explicitly without traditional instruments: personal diaries, trusted friends, or modern psychotherapists. Such an intimate activity is not always an individualistic endeavor, however, as institutions where we share our intimate thoughts abound: confessions, sharing circles, AA meetings. What sets diaries, friends, and therapists apart is the absence of intrusion by the Other. We may expect the inner voice, when externalized, to still preserve the lack of judgment that allows emotional expression to flow.
The widespread use of ChatGPT for emotional processing may therefore hint at a privacy relevance not seen before. We ask: is this an emergent form of privacy in the age of AI that ought to be recognized and studied?
We lay out a philosophical and ethical basis for this intimate use case as an extension of or a witness to the self, or an interception of the becoming of the self; map it to modern psychology and technological design; and compare it with analogies like the diary. Lastly, we apply the law and regulation relevant to these analogies and attempt to contour the notion of Privacy of the Mind.
I am a research fellow at New York University's Information Law Institute, working with its wonderful legal scholars on innovation policy and information privacy law.
I am resetting and seeing pre-existing trains of thought through. In particular, I am expanding on the safety impact both in theory (optimization -> privacy theory -> legal theory) and in practice (computer science -> implementation -> societal impact).
I am also expanding my community; please reach out.
I am still working on auditing and regulatory possibilities for bringing meaningful oversight to AI. The technical projects can be seen on the rest of the website. I do meetings three days a week during the day; earlier is better.
Seeing that various issues labeled as "AI privacy" have become much less of a purely theoretical pursuit, I aim to bring the impact of privacy technologies further into both regulation and actual use cases.
Remark: an interesting framing is that I work on neartermist safety. AI's potential to take away people's agency is readily evident in many parts of the world. While most of my peers work on scaling large models, I catastrophize large-scale societal harm. I am somewhat ok with this label, because it motivates me to take grounded approaches to the lofty subject that is AI safety.
We notice that medical domains have a dataset combination problem -- when seemingly "in-domain" data produced at another hospital is combined with the home hospital's data, a machine learning model trained on this combination may not improve overall performance for the home hospital. In other domains, this problem may be worse: Meta MSL researchers found Scale AI's data to be of low quality for machine learning training, which may not have been obvious beforehand.
This inherent uncertainty in "how can we know anything about unseen data" makes collaborations rather risky.
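To make the risk concrete, here is a minimal sketch (not our paper's method) of the naive check a home hospital would have to run: train on the combined data and compare held-out performance, which already requires handing over the raw external data. The function and variable names are illustrative assumptions.

```python
# Naive baseline for "does external data help?" -- requires raw access to it.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def helps_home_hospital(X_home, y_home, X_ext, y_ext, X_val, y_val):
    """Return True if adding the external dataset improves home validation accuracy."""
    base = LogisticRegression(max_iter=1000).fit(X_home, y_home)
    combo = LogisticRegression(max_iter=1000).fit(
        np.vstack([X_home, X_ext]), np.concatenate([y_home, y_ext])
    )
    return accuracy_score(y_val, combo.predict(X_val)) > accuracy_score(
        y_val, base.predict(X_val)
    )
```

The point of the check is exactly what makes collaboration risky: you only learn whether the data was worth acquiring after you have already acquired it.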
We look forward to presenting our solution to this problem at AAAI-AIES this October. arXiv link to be updated.
Delighted that our paper on evaluating external datasets privately and accurately has been accepted to the AAAI/ACM Conference on AI, Ethics, and Society (AIES) 2025. I will see you in Spain in October!
My research focuses on AI privacy, specifically on developing secure techniques to preserve data rights. This research explores the concept of a private data economy, where individuals retain control over their data while a healthy market environment is fostered. By empowering individuals with the right to be forgotten and other data rights, we can establish sustainable AI governance for the future.
We observe that the ML community does not have sufficient access to privacy technology. People interested in safe ML topics such as model alignment often do not have a clear understanding of cryptography-based techniques as applied to ML.
On the other hand, cryptographers want to apply their knowledge, especially practical security management, to machine learning systems. Yet they face a completely different mindset from the machine learning researchers who currently drive the development of these systems.
These communities do not engage with each other, partly because they lack common ground. This tutorial aims to bridge the gap between cryptography and effective decentralized ML training and evaluation.
Join us at NeurIPS 2024 for a tutorial on Privacy ML: Meaningful-Privacy Preserving Machine Learning and How to Evaluate AI Privacy.
Slides and video: see our website https://privacyml.github.io/
I am grateful to have been invited to talk on Neartermist safety: incentive-compatible directions for large model oversight at Cryptography, Security and Multipolar Scenarios.
I had a wonderful time, especially reconnecting with Foresight's community, which focuses on technical solutions for a brighter future. It also helped me connect with like-minded researchers like Dima and Fazl, with whom I developed a tutorial at NeurIPS.
It seems that some of the words I used and (unfortunately) did not get to define were co-opted by other attendees, so I feel the need to clarify this:
While uncertainties abound in longtermism, societal attitudes towards data and AI are predictably shifting in the near term. We present a scenario of intense market competition and widespread antagonism towards AI development: open data has been depleted and new data is no longer free; people are guarding their data closely, and businesses are holding onto their training data and resulting models in secret. As AI scales up towards AGI, ad-hoc evaluations will cease to be useful. New knowledge testers will have little longevity and will become more and more costly to build. In those cases, how can knowledge still be shared, and how can governments and organizations evaluate frontier models, when collaboration becomes costly?
To this foreseeable problem, I present some underrated solutions (while dispelling common misconceptions) at the intersection of security and AI, with oversight as the goal. I use language understanding, privacy, and security as examples, as these are the areas facing predictable scrutiny in the near term.
Through my research, I hope to mitigate the erosion of individual autonomy perpetuated by a data-fueled, AI-powered economy.
The development of AI exacerbates the loss of individual control over data, yet at the same time it requires vast amounts of new data in order to scale and maintain control, especially in the worst-case scenarios. Data rights can be an important piece of the checks and balances in our future.
Remark: an interesting framing is that I work on neartermist safety. Weaponizing AI to take away people's agency is already happening in some parts of the world -- certainly not a longtermist concern. While most of my peers work on scaling large models, I catastrophize large-scale societal harm. I am ok with this label, because it motivates me to take grounded approaches to the lofty subject that is AI safety.
In 2024, I am developing my PhD thesis on AI Privacy -- broadly, towards preserving data rights with security techniques -- in the Computer Science department of NYU's Courant Institute of Mathematical Sciences.
Privacy is not the villain in the story of positive AI outcomes. A path forward I see involves 1. a private data economy that affords healthy market environments and 2. sustainable AI governance through individuals' rights over their data, starting with enabling the right to be forgotten.
I am advised by Prof. Léon Bottou.
I worked as an engineer at Google Chrome, Baidu Silicon Valley AI Lab, and UnifyID. I did PhD internships at Facebook AI Research and at TikTok (Applied ML, recommendations).
Over the past 5 years, I have been co-fostering an interdisciplinary research community on machine learning and systems at NeurIPS. We believe machine learning is impactful not just through replacing existing systems, but by fundamentally changing the ways we build new ones.
🔗CALL FOR PAPERS🔗 🔗ACCEPTED PAPERS🔗
Efficient and Exact Machine Unlearning from Bi-linear Recommendations. Paper draft.
My work on the Right To Be Forgotten for content recommenders (with ByteDance).
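To give a flavor of what exact unlearning can mean for a (bi)linear model, here is a hedged sketch, not the paper's algorithm: a ridge-regression scorer is fully determined by its sufficient statistics, so one user's interactions can be removed exactly by subtracting their contribution and re-solving, with no retraining over the full data. The class and method names are my own for illustration.

```python
# Exact unlearning sketch for a linear scorer via sufficient statistics.
import numpy as np

class RidgeWithForgetting:
    def __init__(self, dim, lam=1.0):
        self.A = lam * np.eye(dim)   # accumulates X^T X + lam * I
        self.b = np.zeros(dim)       # accumulates X^T y

    def add(self, X_u, y_u):
        """Register one user's interaction features X_u and targets y_u."""
        self.A += X_u.T @ X_u
        self.b += X_u.T @ y_u

    def forget(self, X_u, y_u):
        """Exact unlearning: undo exactly that user's contribution."""
        self.A -= X_u.T @ X_u
        self.b -= X_u.T @ y_u

    def weights(self):
        return np.linalg.solve(self.A, self.b)
```

The resulting weights after `forget` are identical to those of a model retrained from scratch without that user, which is the "exact" in exact unlearning.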
I am interested in secure and private evaluations on safety-critical tasks.
I'm working on integrating cutting-edge cryptography and security techniques into applications that may become future infrastructure. I have opinions on the techniques that afford better policy regarding our future relationship with AI.
I am additionally interested in both the generalizable security of machine learning and the use of machine learning to enhance systems security.
In 2020, I participated in what became the popular LLM benchmark Big-Bench (Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models). Together with Rowan Jacobs and James Koppel, I contributed the Conlang Translation task, which won a spotlight at the Workshop on Enormous Language Models.
Probing Pre-trained LLMs w/ Linguistic Puzzles (ICLR '21 Big-Bench spotlight, big-bench paper, our slides).
In the past, I was part of a few exploratory workshops on privacy and machine learning.
Even though my research is on machine learning, I enjoy integrating cryptography for privacy-preserving machine learning to demonstrate what can be done. Examples include on-device fitness analysis on smart mirrors and secure algorithms for fair healthcare triage. The underlying security techniques can enrich areas of machine learning that I deem likely to become infrastructural in our future, without sacrificing the bulk of economic opportunities and market competitiveness.
System Security: Learned Containment: Generated Sandbox by Repeated Fuzzing and Patching. Policy x Causal Inference: Election with Causality — A Mean Field Model and Turnout Simulation.
Though privacy is a focus of AI policy, solutions for privacy are lagging behind. I am interested in techniques beyond differential privacy, including defining contextual privacy computationally, systemic re-identification risk assessment, membership inference, and deleting training data (slides for my talk at RIKEN).
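As one concrete example among these techniques, below is a minimal sketch of the classic loss-threshold membership inference test (in the spirit of Yeom et al.): a sample whose loss falls below a calibrated threshold is guessed to be a training member. The PyTorch classifier, tensor shapes, and threshold are assumptions for illustration, not a specific system I audit.

```python
# Loss-threshold membership inference: low loss suggests "seen during training".
import torch
import torch.nn.functional as F

def membership_guess(model, x, y, threshold):
    """Return True if example (x, y) is guessed to have been in the training set."""
    model.eval()
    with torch.no_grad():
        loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
    return loss.item() < threshold
```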
AI Security (with generalization): Understanding augmentation and simulation, theory of domain randomization, defining ‘similar’ environments, learning in the presence of adversaries (talk slides).
🔗⬇️🔗 Mean-field analysis for causality in social media amplification. This abstracts social media recommenders as an influence on election outcomes.
🔗⬇️🔗 Towards secure transaction and fair pricing of training data. There are four parts to this work.
Learned containment: generated sandbox by repeated fuzzing and patching.
Besides these, I enjoy thinking about systems and design problems in technology. I would also love to hear your practical privacy challenges that relate to managing data and models.
Abstract In November 2016, when all the polls predicted a Clinton presidency, America got Trump. Assuming that the pollsters had sampled real sentiments from real voters, what could have caused the shift in the election result? This is rather hard to untangle, since very few entities have data on social networks' topology. There is also the challenge of assessing cyclic causes over time, akin to studying a network influence question: does the eating disorder community on the internet cause eating disorders? This work explores different high-level aspects of network influence via a mean-field approach across average networks, controlling for the variance of election outcomes away from the fundamentals. The goal of this exploration is to offer a simulation and a theoretical model, a combination that clarifies qualitative phenomena and evaluates potential policy interventions with impact. 🔗Project intro page.🔗
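A toy sketch of the mean-field flavor of this analysis (not the project's actual model): instead of tracking the full network topology, each voter interacts only with the population-average sentiment, nudged by a recommender-style amplification coefficient, and we track how the final vote share drifts relative to the fundamentals. All parameters are illustrative.

```python
# Mean-field toy: voters couple to the average sentiment, not to neighbors.
import numpy as np

def simulate(n_voters=10_000, steps=50, amplification=0.3, noise=0.05, seed=0):
    rng = np.random.default_rng(seed)
    s = rng.normal(0.0, 1.0, n_voters)          # initial sentiments ("fundamentals")
    fundamentals_share = (s > 0).mean()
    for _ in range(steps):
        mean_field = s.mean()                    # everyone sees the same average
        s += amplification * (mean_field - s) + rng.normal(0.0, noise, n_voters)
    final_share = (s > 0).mean()                 # fraction voting "for"
    return fundamentals_share, final_share
```

Sweeping the amplification coefficient in a toy like this is the kind of qualitative exercise the full model formalizes: stronger coupling to the mean field compresses individual variance and can flip marginal voters as a bloc.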
This project has the broad goal of streamlining the incentive problems in data markets. There are three endeavors so far in tackling the challenge:
The goal is to evaluate the effect that training data will have on a pre-trained model without retraining, implemented as a secure multi-party computation in CrypTen.
Main insight: a single step of natural gradient descent may be more desirable than a single (Euclidean) gradient descent step when used to evaluate the relative effect of various batches of data. If the ranking between datasets correlates with the test loss resulting from add-one-in training (training with that dataset added), the ranking can be used to approximate the price of data without training on it. One application is private pricing before a transaction. If the model is small enough, this can be done privately without much numerical imprecision. (to appear)
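A hedged sketch of the ranking idea in plain PyTorch (the actual evaluation runs under MPC with CrypTen, and uses its own estimator): take one preconditioned step on each candidate batch, using a crude diagonal empirical-Fisher approximation in place of the full natural gradient, and rank batches by the resulting validation loss. Only the relative ordering is meant to be meaningful; names and hyperparameters are illustrative.

```python
# Rank candidate data batches by validation loss after one natural-gradient-like step.
import copy
import torch

def rank_batches(model, loss_fn, batches, val_batch, lr=0.1, damping=1e-2):
    x_val, y_val = val_batch
    scores = []
    for x, y in batches:
        m = copy.deepcopy(model)                       # do not disturb the base model
        loss = loss_fn(m(x), y)
        grads = torch.autograd.grad(loss, list(m.parameters()))
        with torch.no_grad():
            for p, g in zip(m.parameters(), grads):
                p -= lr * g / (g * g + damping)        # diagonal-Fisher preconditioned step
            scores.append(loss_fn(m(x_val), y_val).item())
    # indices ordered from most to least helpful for the validation objective
    return sorted(range(len(batches)), key=lambda i: scores[i])
```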
Implementing smart contracts for the effect of training data is extremely expensive.
Data markets are a crucial part of the future of our economy. Traditional market practices tend to be inefficient when applied to data as a commodity, because 1. data can be replicated at low cost, and 2. the information contained in data can be privacy-relevant.
On the other hand, the rise of AI applications whose utility depends on data quantity and quality means that privately held, high-quality data becomes increasingly valuable. This gives rise to a data market where acquisitions and ownership disputes will be frequent.
We envision new mechanisms that allow model owners to acquire data of appropriate utility, while allowing data owners to manage, edit, and delete their data.
This work is ongoing.
I wrote a 🔗survey paper🔗 that helped me understand 1) Mean Field Methods as applied to the theory of deep learning, and 2) physicist-style modeling, which I applied to qualitative political science.
🔗HEalth: Privately Computing on Shared Healthcare Data🔗
Abstract Healthcare in the US is notoriously expensive and unsustainable. Many machine learning startups aim to simplify the process by bringing automated expertise to hospitals in radiology, brain imaging, or cancer detection. On the other hand, it is well known that hospitals cannot easily share data. Existing machine learning solutions tend to focus on training with data obtained through long-term collaboration, because the private computation requires that the clients have done appropriate pre-processing. Effectively, they work around the data-sharing challenge in training rather than tackling it. As a result, their inference model is typically static and does not adapt to changes continuously.
In the meantime, regulatory agencies need to uphold ethics standards for healthcare professionals, but can only access hospital records in a delayed fashion. Despite their mandate to audit individual doctors, assess response rates across the nation, and manage epidemics, regulatory bodies lack the reach to access the critical information needed to understand real-time decision-making. In these scenarios, training privately from the get-go is the desirable outcome. This is where advances in homomorphic encryption come in. We present an actualized scenario in which hospitals share records to compute anomalies and audit fairness, with incentive-compatible deployment and operations. Three main contributions: 1. fairness auditing at scale without the need to approximate nonlinearity; 2. enabling continuous sharing of data without key refresh, which removes the need for pairwise contracts driven by privacy concerns, even as new models are developed and applied; and 3. an instance of anomaly detection in security management applied to finding causes and correlations that aid medical research, particularly for rare diseases, epidemic management, and chronic illness research. We simulate such a system using Microsoft's SEAL library and argue that our combination of a novel key-sharing scheme and anomaly detection algorithms is well suited for applying homomorphic encryption at scale.
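Purely as an illustration of the encrypted-aggregation building block (the work itself builds directly on Microsoft's SEAL in C++), here is a minimal sketch using TenSEAL, a Python wrapper around SEAL: each hospital encrypts its per-group positive rates under CKKS, and an aggregator averages them without seeing the raw numbers. The two-hospital setup and the rates are made up, and in a real deployment decryption rights would follow the key-sharing scheme described above rather than a single local key.

```python
# CKKS encrypted aggregation across hospitals (illustrative sketch, not the paper's system).
import tenseal as ts

ctx = ts.context(ts.SCHEME_TYPE.CKKS, poly_modulus_degree=8192,
                 coeff_mod_bit_sizes=[60, 40, 40, 60])
ctx.global_scale = 2 ** 40

hospital_a = ts.ckks_vector(ctx, [0.82, 0.74])    # positive rates per patient group
hospital_b = ts.ckks_vector(ctx, [0.79, 0.70])

encrypted_avg = (hospital_a + hospital_b) * 0.5   # computed entirely on ciphertexts
print(encrypted_avg.decrypt())                    # only the secret-key holder can read this
```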
A “magic mirror” made with two-way mirrors using purely on-device computations, allowing deployment into the most private spaces (e.g. as a bathroom mirror). It monitors skin conditions, and counts your morning jumping jacks.
NeurIPS Private and On-Device AI workshop 2018 report and demo videos.