This study investigates how well digital data predict the Big Five
personality traits. The data will be obtained by performing meta-analyses that
estimate the mean predictive value of digital data for each trait. The study
will also examine whether data from different types of social media platforms
lead to different results, or otherwise influence the accuracy of the
prediction.
To evaluate the accuracy of, and recommend the best method(s) for, predicting
personality traits from social media data, the existing literature needs to be
synthesized.
The data for the meta-analyses will be gathered from 15 papers reporting
relevant studies on the relationship between the Big Five traits and digital
data. The literature search will be conducted in databases such as Web of
Science, Google Scholar and Scopus, using groups of keywords decided
beforehand. The keywords are listed below by group:
Social media platforms: Facebook, Twitter, Instagram,
Snapchat, social media, YouTube, LinkedIn
Analytic: data mining, text mining, digital data,
Personality traits: Big Five, Big 5, personality,
traits, openness to experience, conscientiousness, extraversion, agreeableness,
First, the abstracts and keyword sections of the papers will be searched
for the keywords listed above.
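The keyword search can be sketched as follows. This is a minimal illustration, not the procedure of any particular database: it assumes that terms within a group are combined with OR and that the three groups are combined with AND, which the text does not specify. The names `KEYWORD_GROUPS`, `build_query` and `matches` are hypothetical, and the keyword lists contain only the terms given above.

```python
# Keyword groups as listed in the text (the lists may be incomplete).
KEYWORD_GROUPS = {
    "platform": ["facebook", "twitter", "instagram", "snapchat",
                 "social media", "youtube", "linkedin"],
    "analytic": ["data mining", "text mining", "digital data"],
    "personality": ["big five", "big 5", "personality", "traits",
                    "openness to experience", "conscientiousness",
                    "extraversion", "agreeableness"],
}


def build_query(groups):
    """Join terms with OR inside a group and groups with AND (an assumption)."""
    clauses = []
    for terms in groups.values():
        quoted = " OR ".join(f'"{term}"' for term in terms)
        clauses.append(f"({quoted})")
    return " AND ".join(clauses)


def matches(text, groups):
    """Return True if the text contains at least one term from every group."""
    lowered = text.lower()
    return all(any(term in lowered for term in terms)
               for terms in groups.values())


query = build_query(KEYWORD_GROUPS)
```

The `matches` helper applies the same logic to an abstract or keyword section that has already been retrieved, so the screening step can be reproduced outside the database interface.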
The resulting set of papers will be examined further by reading their
abstracts and judging them against criteria determined beforehand. First, the
digital data must have been collected automatically from the social media
platforms. Second, the Big Five personality traits must have been assessed
with a standardized self-report measure. Third, information on the accuracy
with which the personality traits are predicted from digital data needs to be
reported.
Studies with non-independent data, that is, studies that use overlapping
samples, will be excluded. The criteria for judging studies as
non-independent are:
1. each effect size is based on responses from overlapping subjects
2. digital data is retrieved from the same social media platform
3. the kind of digital data used for prediction is the same or overlapping
(at least to some extent).
These criteria follow recommendations from earlier studies [1, 2].
Because the studies are heterogeneous in the type of data they use and in
their research methods, each study needs to be coded according to the set of
digital data it includes, based on content. Studies that include the following
types of digital data will be included:
1. User demographics (e.g. gender, age, race)
2. User activity statistics (e.g. number of friends/posts/likes/etc.)
3. features (e.g. tweets, status updates, comments)
4. Pictures (e.g.
5. Multiple vs. single type of digital footprints
Factors that may affect the accuracy with which the Big Five personality
traits are predicted, such as the default privacy settings of a social media
platform (e.g. public or private), will be coded as grouping variables in
order to distinguish between different types of social media platforms.
Due to the relative novelty and multidisciplinary nature of the examined
research area, standard methodological procedures for coding study quality
have not yet been developed. For this reason, we could not refer to specific
guidelines to determine the scientific quality of published studies. As an
approximation, study quality was assessed by classifying studies based on the
rank of the sources they were published in (i.e., peer-reviewed journals and
conference proceedings) according to well-known ranking systems of scientific
value. In more detail, we used a procedure which differed for peer-reviewed
journals and conference proceedings. Concerning articles published in
peer-reviewed journals, we categorized papers into top, middle and low tiers
using the quartile that sources correspond to in the 2016 Scopus CiteScore;
quartile 1 was ranked as top tier or high quality, quartile 2 was ranked as
middle tier or medium quality, and quartiles 3, 4, and non-indexed studies
were ranked as low tier or low quality. In order to assess study quality of
proceedings from computer science conferences, we inspected conference
ranking as reported in the CORE 2017 and Microsoft Academic databases, which
provide rankings of conferences in computer science based on their scientific
impact. We considered proceedings as high-quality if at least one of the
databases rated the conference with an A (Excellent) score or higher,
proceedings with a score of B (Good) were ranked as medium quality, and those
with a score of C (ranked conferences meeting minimum standards) and unranked
conferences were marked as low quality.
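The tier rules described above can be written down as two small functions. This is a sketch under stated assumptions: the function names are hypothetical, quartiles are encoded as the integers 1 to 4 with `None` for non-indexed sources, and conference ratings are encoded as the letter grades "A*", "A", "B" and "C" with `None` for unranked conferences.

```python
def journal_tier(citescore_quartile):
    """Map a 2016 Scopus CiteScore quartile (1-4, None if not indexed) to a tier."""
    if citescore_quartile == 1:
        return "high"
    if citescore_quartile == 2:
        return "medium"
    return "low"  # quartiles 3 and 4, and non-indexed sources


def conference_tier(ratings):
    """Map the ratings of one conference in the two databases to a tier.

    `ratings` holds one letter grade per database, or None where the
    conference is unranked (this encoding is an assumption of the sketch).
    """
    grades = {rating for rating in ratings if rating is not None}
    if grades & {"A*", "A"}:  # at least one database rates it A or higher
        return "high"
    if "B" in grades:
        return "medium"
    return "low"  # C or unranked in both databases
```

For example, a conference rated B in one database and A in the other counts as high quality, because a single A rating is sufficient under the rule above.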
Finally, the papers whose abstracts meet the requirements will be read in
full.
To determine how accurately digital data predict the Big Five personality
traits, an effect size will be extracted from each study and expressed as
Pearson’s r. It is expected that not all papers used the same method to
investigate the relationship. Decisions will therefore have to be made about
which effect size to extract from each study and how to include it in the
meta-analysis.
If Pearson’s r is not reported in a paper, the reported effect size will be
converted to a correlation. If a paper provides neither an effect size nor
enough additional information to derive a correlation, the authors will be
contacted in an attempt to obtain the relevant information. If this yields no
results, the paper will be excluded from the study.
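Two common conversions to a correlation are sketched below as an illustration. The text does not say which effect sizes the included papers will report, so the choice of R² and Cohen's d is an assumption, and the function names are hypothetical. The formula for d is the standard conversion with a correction term for unequal group sizes.

```python
import math


def r_from_r_squared(r_squared):
    """Correlation between predicted and observed scores from a reported R^2."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R^2 must lie between 0 and 1")
    return math.sqrt(r_squared)


def r_from_cohens_d(d, n1, n2):
    """Convert a standardized mean difference d to a correlation.

    n1 and n2 are the two group sizes; with equal groups the
    correction term a equals 4.
    """
    a = (n1 + n2) ** 2 / (n1 * n2)
    return d / math.sqrt(d ** 2 + a)
```

Other reported statistics would need their own conversion, and papers for which no conversion is possible fall under the exclusion rule described above.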
For each of the Big Five personality traits, a separate meta-analysis will
be performed using a random-effects model, since the true effect size will
probably vary across the individual studies. Outliers will be identified with
Grubbs’ test.
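A random-effects pooling of correlations for one trait might look like the sketch below. It assumes that correlations are analysed on the Fisher z scale and that the between-study variance is estimated with the DerSimonian-Laird method of moments; neither choice is stated in the text, and the function name is hypothetical. It is meant to show the logic of the model, not to replace the meta-analysis software.

```python
import math


def random_effects_pool(rs, ns):
    """Pool correlations rs (with sample sizes ns) under a random-effects model."""
    if len(rs) != len(ns) or len(rs) < 2:
        raise ValueError("need at least two studies with matching sample sizes")
    zs = [math.atanh(r) for r in rs]        # Fisher z transform of each r
    vs = [1.0 / (n - 3) for n in ns]        # sampling variance of each z
    ws = [1.0 / v for v in vs]              # fixed-effect (inverse-variance) weights
    z_fixed = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    q = sum(w * (z - z_fixed) ** 2 for w, z in zip(ws, zs))
    df = len(rs) - 1
    c = sum(ws) - sum(w * w for w in ws) / sum(ws)
    tau2 = max(0.0, (q - df) / c)           # between-study variance estimate
    ws_re = [1.0 / (v + tau2) for v in vs]  # random-effects weights
    z_re = sum(w * z for w, z in zip(ws_re, zs)) / sum(ws_re)
    se = math.sqrt(1.0 / sum(ws_re))
    return {
        "r": math.tanh(z_re),               # pooled estimate, back on the r scale
        "ci95": (math.tanh(z_re - 1.96 * se), math.tanh(z_re + 1.96 * se)),
        "tau2": tau2,
    }
```

Calling `random_effects_pool` once per trait, with the correlations and sample sizes of the studies that report that trait, mirrors the plan of running five separate meta-analyses.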
To determine the heterogeneity of the study effect sizes, the chi-square Q
test of heterogeneity, the T² estimate of true variance, and the I² statistic
(the proportion of the observed variation in effects that is true variation)
will be computed.
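For reference, the three heterogeneity statistics can be written as follows, where y_i is the effect size of study i, v_i its sampling variance and k the number of studies. The expression for T² is the method-of-moments (DerSimonian-Laird) estimator, which is an assumption here since the text does not name an estimator.

```latex
\begin{align*}
w_i &= \frac{1}{v_i}, \qquad
\bar{y} = \frac{\sum_{i=1}^{k} w_i y_i}{\sum_{i=1}^{k} w_i} \\
Q &= \sum_{i=1}^{k} w_i \,(y_i - \bar{y})^2, \qquad df = k - 1 \\
T^2 &= \max\!\left(0,\ \frac{Q - df}{C}\right), \qquad
C = \sum_{i=1}^{k} w_i - \frac{\sum_{i=1}^{k} w_i^2}{\sum_{i=1}^{k} w_i} \\
I^2 &= \max\!\left(0,\ \frac{Q - df}{Q}\right) \times 100\%
\end{align*}
```

Under the null hypothesis of homogeneity, Q follows a chi-square distribution with k - 1 degrees of freedom, which is what the Q test evaluates.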
In the final phase, potential moderators will be analysed with
meta-regression models. The potential effects of moderators on study effect
sizes will be determined with univariate random-effects meta-regression,
using restricted maximum-likelihood estimation.
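A univariate random-effects meta-regression with a single moderator x_i can be written as the model below. This is the standard formulation, given here only as a sketch; the symbols are introduced for illustration and do not appear in the text.

```latex
\begin{align*}
y_i &= \beta_0 + \beta_1 x_i + u_i + \varepsilon_i, \\
u_i &\sim \mathcal{N}(0, \tau^2), \qquad
\varepsilon_i \sim \mathcal{N}(0, v_i)
\end{align*}
```

Here y_i is the effect size of study i, v_i its known sampling variance, and \tau^2 the residual between-study variance that restricted maximum-likelihood estimation would estimate jointly with the coefficients; \beta_1 captures the effect of the moderator.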
The analyses will be performed with the software package Comprehensive
Meta-Analysis.