ΤΕΡΕΖΑΚΗΣ

1984, Caveat Publicus

Browser usage tells 3rd parties more than you realize

June 12, 2017 Peter Terezakis

Private traits and attributes are predictable from digital records of human behavior

Michal Kosinski, David Stillwell, and Thore Graepel

Edited by Kenneth Wachter, University of California, Berkeley, CA, and approved February 12, 2013 (received for review October 29, 2012)

Abstract

We show that easily accessible digital records of behavior, Facebook Likes, can be used to automatically and accurately predict a range of highly sensitive personal attributes including: sexual orientation, ethnicity, religious and political views, personality traits, intelligence, happiness, use of addictive substances, parental separation, age, and gender. The analysis presented is based on a dataset of over 58,000 volunteers who provided their Facebook Likes, detailed demographic profiles, and the results of several psychometric tests. The proposed model uses dimensionality reduction for preprocessing the Likes data, which are then entered into logistic/linear regression to predict individual psychodemographic profiles from Likes. The model correctly discriminates between homosexual and heterosexual men in 88% of cases, African Americans and Caucasian Americans in 95% of cases, and between Democrat and Republican in 85% of cases. For the personality trait “Openness,” prediction accuracy is close to the test–retest accuracy of a standard personality test. We give examples of associations between attributes and Likes and discuss implications for online personalization and privacy.

Keywords: social networks, computational social science, machine learning, big data, data mining, psychological assessment

A growing proportion of human activities, such as social interactions, entertainment, shopping, and gathering information, are now mediated by digital services and devices.
Such digitally mediated behaviors can easily be recorded and analyzed, fueling the emergence of computational social science (1) and new services such as personalized search engines, recommender systems (2), and targeted online marketing (3). However, the widespread availability of extensive records of individual behavior, together with the desire to learn more about customers and citizens, presents serious challenges related to privacy and data ownership (4, 5).

We distinguish between data that are actually recorded and information that can be statistically predicted from such records. People may choose not to reveal certain pieces of information about their lives, such as their sexual orientation or age, and yet this information might be predicted in a statistical sense from other aspects of their lives that they do reveal. For example, a major US retail network used customer shopping records to predict pregnancies of its female customers and send them well-timed and well-targeted offers (6). In some contexts, an unexpected flood of vouchers for prenatal vitamins and maternity clothing may be welcome, but it could also lead to a tragic outcome, e.g., by revealing (or incorrectly suggesting) a pregnancy of an unmarried woman to her family in a culture where this is unacceptable (7). As this example shows, predicting personal information to improve products, services, and targeting can also lead to dangerous invasions of privacy.

Predicting individual traits and attributes based on various cues, such as samples of written text (8), answers to a psychometric test (9), or the appearance of spaces people inhabit (10), has a long history. The migration of human activity to digital environments makes it possible to base such predictions on digital records of human behavior. It has been shown that age, gender, occupation, education level, and even personality can be predicted from people’s Web site browsing logs (11–15).
Similarly, it has been shown that personality can be predicted based on the contents of personal Web sites (16), music collections (17), properties of Facebook or Twitter profiles such as the number of friends or the density of friendship networks (18–21), or language used by their users (22). Furthermore, location within a friendship network at Facebook was shown to be predictive of sexual orientation (23).

This study demonstrates the degree to which relatively basic digital records of human behavior can be used to automatically and accurately estimate a wide range of personal attributes that people would typically assume to be private. The study is based on Facebook Likes, a mechanism used by Facebook users to express their positive association with (or “Like”) online content, such as photos, friends’ status updates, Facebook pages of products, sports, musicians, books, restaurants, or popular Web sites. Likes represent a very generic class of digital records, similar to Web search queries, Web browsing histories, and credit card purchases. For example, observing users’ Likes related to music provides similar information to observing records of songs listened to online, songs and artists searched for using a Web search engine, or subscriptions to related Twitter channels. In contrast to these other sources of information, Facebook Likes are unusual in that they are currently publicly available by default. However, those other digital records are still available to numerous parties (e.g., governments, developers of Web browsers, search engines, or Facebook applications), and, hence, similar predictions are unlikely to be limited to the Facebook environment.

The design of the study is presented in Fig. 1.
We selected traits and attributes that reveal how accurate and potentially intrusive such a predictive analysis can be, including “sexual orientation,” “ethnic origin,” “political views,” “religion,” “personality,” “intelligence,” “satisfaction with life” (SWL), substance use (“alcohol,” “drugs,” “cigarettes”), “whether an individual’s parents stayed together until the individual was 21 y old,” and basic demographic attributes such as “age,” “gender,” “relationship status,” and “size and density of the friendship network.” Five Factor Model (9) personality scores (n = 54,373) were established using the International Personality Item Pool (IPIP) questionnaire with 20 items (25). Intelligence (n = 1,350) was measured using Raven’s Standard Progressive Matrices (SPM) (26), and SWL (n = 2,340) was measured using the SWL Scale (27). Age (n = 52,700; average, µ = 25.6; SD = 10), gender (n = 57,505; 62% female), relationship status (“single”/“in relationship”; n = 46,027; 49% single), political views (“Liberal”/“Conservative”; n = 9,752; 65% Liberal), religion (“Muslim”/“Christian”; n = 18,833; 90% Christian), and the Facebook social network information [n = 17,601; median size, 204; interquartile range (IQR), 206; median density, 0.03; IQR, 0.03] were obtained from users’ Facebook profiles. Users’ consumption of alcohol (n = 1,196; 50% drink), drugs (n = 856; 21% take drugs), and cigarettes (n = 1,211; 30% smoke) and whether a user’s parents stayed together until the user was 21 y old (n = 766; 56% stayed together) were recorded using online surveys. Visual inspection of profile pictures was used to assign ethnic origin to a randomly selected subsample of users (n = 7,000; 73% Caucasian; 14% African American; 13% others).
Sexual orientation was assigned using the Facebook profile “Interested in” field; users interested only in others of the same sex were labeled as homosexual (4.3% males; 2.4% females), whereas those interested in users of the opposite gender were labeled as heterosexual.

Fig. 1. The study is based on a sample of 58,466 volunteers from the United States, obtained through the myPersonality Facebook application (www.mypersonality.org/wiki), which included their Facebook profile information, a list of their Likes (n = 170 Likes per person on average), psychometric test scores, and survey information. Users and their Likes were represented as a sparse user–Like matrix, the entries of which were set to 1 if there existed an association between a user and a Like and 0 otherwise. The dimensionality of the user–Like matrix was reduced using singular-value decomposition (SVD) (24). Numeric variables such as age or intelligence were predicted using a linear regression model, whereas dichotomous variables such as gender or sexual orientation were predicted using logistic regression. In both cases, we applied 10-fold cross-validation and used the k = 100 top SVD components. For sexual orientation, parents’ relationship status, and drug consumption only k = 30 top SVD components were used because of the smaller number of users for which this information was available.
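
The pipeline described in the Fig. 1 caption (user–Like matrix, SVD, then regression under 10-fold cross-validation) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: it assumes NumPy is available, uses small synthetic data in place of real Likes, stores the matrix densely rather than sparsely, computes the SVD once over all users, fits logistic regression with plain gradient descent, and keeps k = 5 components instead of 100. All function names and parameter values are hypothetical.

```python
import numpy as np

def build_user_like_matrix(likes_by_user, n_likes):
    """Entry (u, j) is 1 if user u Liked item j, and 0 otherwise."""
    X = np.zeros((len(likes_by_user), n_likes))
    for u, liked in enumerate(likes_by_user):
        cols = np.array(sorted(liked), dtype=int)
        X[u, cols] = 1.0
    return X

def svd_components(X, k):
    """Project each user onto the top-k right singular vectors of X."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T

def _sigmoid(t):
    # Clip logits so np.exp cannot overflow.
    return 1.0 / (1.0 + np.exp(-np.clip(t, -30.0, 30.0)))

def _with_bias(Z):
    return np.hstack([Z, np.ones((len(Z), 1))])

def fit_logistic(Z, y, lr=0.1, steps=500, l2=1e-3):
    """Logistic regression fitted by batch gradient descent (illustrative)."""
    Zb = _with_bias(Z)
    w = np.zeros(Zb.shape[1])
    for _ in range(steps):
        p = _sigmoid(Zb @ w)
        w -= lr * (Zb.T @ (p - y) / len(y) + l2 * w)
    return w

def predict_proba(Z, w):
    return _sigmoid(_with_bias(Z) @ w)

def cross_validated_scores(Z, y, n_folds=10, seed=0):
    """Each user's score comes from a model that never saw that user's label."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    scores = np.empty(len(y))
    for test_idx in np.array_split(order, n_folds):
        train_idx = np.setdiff1d(order, test_idx)
        w = fit_logistic(Z[train_idx], y[train_idx])
        scores[test_idx] = predict_proba(Z[test_idx], w)
    return scores

# Synthetic stand-in for real data: two groups with different Like habits.
rng = np.random.default_rng(42)
n_users, n_likes = 200, 40
y = rng.integers(0, 2, n_users)          # dichotomous attribute to predict
p = np.full((n_users, n_likes), 0.1)     # baseline chance of any Like
p[y == 1, :10] = 0.5                     # group 1 favors items 0..9
p[y == 0, 10:20] = 0.5                   # group 0 favors items 10..19
likes_by_user = [set(np.flatnonzero(rng.random(n_likes) < p[u]))
                 for u in range(n_users)]

X = build_user_like_matrix(likes_by_user, n_likes)
Z = svd_components(X, k=5)
scores = cross_validated_scores(Z, y.astype(float), n_folds=10)
accuracy = np.mean((scores > 0.5) == (y == 1))
print(f"cross-validated accuracy on synthetic data: {accuracy:.2f}")
```

For a numeric attribute such as age, the same structure would apply with the logistic fit swapped for ordinary least squares. With tens of thousands of users and Likes, a sparse matrix type and a truncated SVD routine would be the more practical choice than the dense version shown here.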
