All posts by dgs0323

Character Limits when Loading Wide Fixed-Width Files into Hive using SerDe*


SerDe (Serializer/Deserializer) provides the functionality to read fixed-width files into Hive. It is fairly simple to use once the syntax is mastered: you specify the width of each field in the fixed-width file (as a capture group in a regular expression), and the file is split into columns of the corresponding widths. However, SerDe has difficulty with very wide files, for instance 1,000 columns. The reason can be found in this ticket: https://issues.apache.org/jira/browse/HIVE-1364

Because of the various metastore databases Hive previously supported (e.g., MySQL, Oracle), the original character limit for a SerDe property string was 767. That ticket increased the limit to 4,000 characters. However, in my recent experience this was still not enough: the SerDe regular expression is truncated at 4,000 characters. (For example, if your column definitions span 5,000 characters, everything after character 4,000 is dropped.) The solution to this problem is simple, if you have admin privileges: update the datatype in the Hive metastore that stores the SerDe properties so it can hold more than 4,000 characters. Increasing the size of the datatype does not disturb the existing metastore contents, and every supported metastore database can store columns larger than 4,000 characters. In my case, 8,000 characters of regular expression to delimit the fixed-width files was sufficient.
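
To make the limit concrete, here is a minimal Python sketch (hypothetical table name, column names, and widths) that generates the kind of fixed-width input.regex that RegexSerDe expects and shows how quickly a wide layout blows past 4,000 characters. The commented ALTER statement at the end illustrates the sort of metastore change described above, assuming a MySQL metastore and the standard SERDE_PARAMS/PARAM_VALUE schema names; verify these against your own Hive version before running anything.

```python
# Hypothetical layout: 1,000 columns, each 8 characters wide.
widths = [8] * 1000
columns = [f"col_{i:04d} STRING" for i in range(len(widths))]

# One regex capture group per fixed-width field, e.g. (.{8})(.{8})...
input_regex = "".join(f"(.{{{w}}})" for w in widths)

ddl = f"""
CREATE EXTERNAL TABLE wide_fixed_width (
  {', '.join(columns)}
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "{input_regex}")
STORED AS TEXTFILE
LOCATION '/data/wide_fixed_width/';
"""

# 6,000 characters here -- well past the 4,000-character column that stores
# SERDEPROPERTIES values, so the stored pattern would be truncated.
print(len(input_regex))

# If you administer the metastore, widen the column that holds SERDEPROPERTIES
# values (SERDE_PARAMS.PARAM_VALUE in the standard schema). For a MySQL
# metastore, something along these lines -- check the table and column names
# for your Hive version first:
#
#   ALTER TABLE SERDE_PARAMS MODIFY PARAM_VALUE VARCHAR(8000);
```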

*The purpose of this blog post is to point people toward the solution to a nuanced problem that I had some difficulty solving. It is not meant to be a tutorial; in my case the remedy was simple once I figured out what the problem was. If you are one of the relatively small group of people who happen to run into this problem, I hope your Google Juju brought you here quickly.

Personality Psychology: CliffsNotes for the Water Cooler

The new city water-cooler-discussion ordinance goes into effect, and certain employees are banished to the balcony.

 

Perhaps more so than in other academic disciplines, people are interested in psychology. Most people possess a ‘lay’ understanding of what makes those around them tick. In fact, some researchers have even argued that as humans we are hardwired to innately understand social interactions (Spelke, 2007). Whatever the reason, it seems that even non-psychologists spend a good deal of time thinking about psychological questions. In contrast, I rarely hear people offer their interpretation of chemical bonds or how electricity works. Psychological questions, however, fascinate people: What made a person do that? What would someone do in this situation? In my opinion, this is great. It means we, as psychologists, are exploring topics that the general population finds interesting.

However, I am always a bit taken aback when people argue their opinions, intuitions, or anecdotes against decades of research. My frustration was recently exacerbated by two terribly misleading and uninformed articles published by a usually respectable news source (NPR; Personality change, Personality tests). This short post will hopefully orient people to some of the major findings in personality research.[1] Then people can have informed conversations about a topic that I know is of interest to many: how are people different from one another?

First, what are the important dimensions of personality? How do people differ from one another? There are thousands of adjectives describing individual differences in personality (Allport & Odbert, 1936), each capturing some nuanced characteristic of people. However, research has shown time and time again (Goldberg, 1993; Costa & McCrae, 1987) that a large proportion of the variation in personality (individual differences) can be described with just five traits: Extraversion (the extent to which one is social or dominant), Neuroticism (the extent to which one is emotional or sensitive), Conscientiousness (the extent to which one is responsible and dutiful), Agreeableness (the extent to which one is pleasant and non-confrontational), and Openness (the extent to which one is open to new experiences and interested in artistic and intellectual pursuits). Numerous studies have shown the replicability of these dimensions of personality and their ability to predict meaningful outcomes. Now, I am not asserting that these are the only dimensions of personality that matter, but any conversation about personality that neglects these dimensions is necessarily limited. Failing to measure them leaves researchers susceptible to the jingle-jangle fallacy (renaming a well-known construct and presenting it as a new discovery; e.g., grit, or, as psychologists have been calling it for decades, conscientiousness; article).

So we have established that if you want to talk personality, you need to at least include the Big Five in your discussion. But what about personality tests? Are they meaningful? A recent NPR article (Personality tests) critiques personality tests, implying they are not meaningful.[2] It is also not uncommon to hear people question the validity of tests: they did not understand an item, or did not know how to answer a question. Despite these criticisms and people’s intuitive distrust of personality tests, personality (as measured largely by self-report personality tests) predicts basically everything: health, financial and professional success, relationship satisfaction (Ozer & Benet-Martínez, 2005; Roberts et al., 2007), and a number of other outcomes. The utility of (good) personality tests is really not up for debate, no matter how someone feels about a given item or two. As an extreme example, the Minnesota Multiphasic Personality Inventory (MMPI) includes the item ‘I prefer showers to baths.’ The psychologists who developed this test clearly were not interested in one’s bathing preferences; rather, how one answers this item is an indicator of empathy. Whether or not the person taking the test understands why he or she is being asked certain questions is not the concern. What matters is the extent to which those items are indicators of personality.

Another common criticism of personality tests is their susceptibility to faking. This comes up especially when personality tests are used for some sort of screening procedure (e.g., employment). Oftentimes people assume they can ‘beat’ the tests. However, in most cases this is false. One study tested this possibility in a real-world setting: researchers told applicants they had failed a pre-employment personality test and gave them another opportunity to take it (Hogan & Hogan, 2007). It turns out that most participants (95%) scored the same. In fact, only 2.5% scored better, and 2.5% actually scored worse the second time. People are not as good at ‘faking’ as they think.

Another common misconception (propagated by the Invisibilia podcast on NPR; Personality change) is the belief that situations or the environment, not personality, determine behavior. In a 1968 book, Walter Mischel argued that because personality traits (much like the Big Five discussed earlier) could not strongly predict how people behave across situations, the situation must be what determines behavior. Since the publication of Mischel’s book, people have interpreted it as evidence that personality is not important or, in the most extreme cases, that personality does not exist.[3] Numerous studies since 1968 have shown that this simply is not the case (e.g., Sherman, Nave, & Funder, 2010). It turns out that although people behave differently at a party than in a classroom, there is a large degree of rank-order consistency: the most social person at a party is probably also one of the most social people in a classroom, despite the differing demands of these two situations. Further, over a large number of situations, people show a substantial degree of consistency, and the effect sizes dismissed as trivial by Mischel have meaningful implications (aggregation effects; Abelson, 1985; Epstein, 1979). Also, remember from two paragraphs ago that personality predicts nearly everything; it necessarily exists.

Having established that people do behave consistently across situations, there remains the question of whether personality changes throughout people’s lives. Everyone has an anecdote about someone who completely turned their life around for the better. The Invisibilia podcast (Personality change) tells the story of a prisoner who has allegedly done just that. Aside from the fact that this is one person and hardly the rule, he is also still in prison, so we just have to take his word for it. The idea that anyone can turn their life around makes for a nice story (the redemptive narrative; McAdams, 2004), and research suggests that people can change their personality in an effortful manner (Hudson & Fraley, 2015), but these stories are the exception rather than the rule. People do change in predictable ways over the course of their lives, often becoming more conscientious and more emotionally stable (Roberts, Walton, & Viechtbauer, 2006; Caspi & Roberts, 2001). However, these changes are generally rather small, and people tend to maintain rank-order consistency (i.e., the most conscientious people tend to stay fairly conscientious, and the least conscientious people only increase slightly).

So, to reiterate and hopefully clear up some all-too-common misconceptions about personality and people in general, let me summarize: 1) There are five really important dimensions of personality, and any discussion should include the Big Five. 2) Personality tests work; it doesn’t matter whether you understand why they work. 3) Faking personality tests is difficult, and you’re probably not as good at it as you think. 4) People behave differently in different situations, but across many situations people behave consistently. 5) Personality is relatively consistent throughout people’s lives; your personality can change, but it is usually a slow process. While intuition or anecdotal experience may suggest otherwise, these findings have been supported by decades of research and hundreds (if not thousands) of studies.

 

[1] I am by no means the only (and far from the most qualified) personality psychologist to rebut these two articles. An informative piece by Chris Soto was published by NPR (link), and social media was alive with outrage among personality psychologists. However, the target audience here is not the academic psychologist but anyone fascinated with personality (probably without a PhD).

[2] Indeed, the article was written by someone who has staked their career on demolishing personality tests and who wrote a book titled The Cult of Personality Testing. I’m not sure how much stock we can place in someone who stands to profit from bashing personality tests (i.e., by selling books).

[3] Though Mischel claims he never said that; see Most Cited, Least Read? (DOI: 10.1093/acprof:osobl/9780199778188.003.0001).

 

A means to an end


What is the difference between a computer scientist, a statistician, a research psychologist, and a data scientist? Certainly, there is some overlap between these disciplines, but each has a unique skill set. A lot of the difference comes down to the questions we ask. Statisticians tend to study questions like: How do we calculate degrees of freedom for mixed-effects models (link; link)? Computer scientists study questions like: How can we extract network-level properties without traversing an entire network (link; link)? My Ph.D. studies in Experimental Psychology trained me to ask certain questions: What are people like? How does this affect behavior? More broadly: How can we measure this construct? How can we use these data to predict or understand something meaningful?

 

I believe these different interests are simply a matter of paradigm: ‘tool creation’ vs. ‘tool use’. Although I have contributed to statistical software (link) and published methodological work on statistical artifacts of certain personality assessment tools (link), I tend to leave the development of statistical techniques and computer programs to much smarter people. Could I write a statistical package to implement a random forest regression model? Probably. It would take me a while, but eventually I could do it. Is that the best use of my time? Probably not. Someone else can do it faster and better, and, more importantly, it has already been done (link).
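
To illustrate the ‘tool use’ side, here is a minimal sketch using scikit-learn’s existing random forest implementation on synthetic data; the dataset and settings are purely illustrative and have nothing to do with the packages linked above.

```python
# 'Tool use': fit an off-the-shelf random forest regressor on synthetic data.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for any real prediction task.
X, y = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

print(f"Held-out R^2: {r2_score(y_test, model.predict(X_test)):.2f}")
```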

 

I prefer to apply these powerful machine learning techniques to genuinely useful problems: predicting who is going to default on a personal loan, or calculating the meaning of a Tweet without ever reading it (link). These application-type questions are the ones that excite me, personally. My PhD advisor, Ryne Sherman, describes this as the ability to operationalize business or theoretical questions as research questions, and in my experience it is a very useful skill to possess.

 

However, as counterintuitive as it may sound, in my experience the best ‘tool creators’ are not always the best ‘tool users’. Applying these different analytic and statistical techniques requires an understanding of the big-picture goals. It also requires a sound understanding of experimental design. It requires not losing sight of the forest for the trees.

 

[1] I would like to preface this by saying that, as a personality psychologist by training, I am of course biased in the opinions expressed here.

What is Twitter telling us in 140 Characters?


 

What is Twitter telling us?

140 characters is not a lot. 2,800,000,000 characters is a different story. In my recent research with Ryne Sherman, we analyzed 20,000,000 Tweets (i.e., 2,800,000,000 characters) to discover what people all over the U.S. experience in their everyday lives.

Frustrated with small-scale studies using undergraduate samples, we knew there had to be a better way. Researchers were starting to uncover the power of digital footprints, and these studies were groundbreaking. However, most of the studies at the time looked at what digital footprints could tell us about the individual. We wanted to know what they could tell us about people in general: What do people do? When are they happiest? Saddest? What is life like in the city compared to the country?

People have general ideas about the answers to these questions, but most of those ideas basically amount to intuitive guesses; no one had empirically answered them. The key insight for this study came when we realized that people share millions of experiences on Twitter, which meant we could assemble the largest collection of experiences anyone has analyzed using a comprehensive measure of psychological experience.

What did we find?

The situations that people share are, on average, more positive than negative. That was a pleasant finding. Further, people are happier on the weekends than during the week. No surprise there! People who are Tweeting in the late-night hours are usually not having a good time. (Ryne likes to say, “Nothing good happens after midnight.”) We also compared urban to rural areas, and it turns out there are very few psychological differences between life in the city and life in the country.

Finally, we found gender differences. Females experienced both more Negativity and more pOsitivity than males (Fig. 1): in the figure, females show more Negativity (red) than males (orange), and more pOsitivity (green) than males (blue).

Fig. 1. Situation experience by gender.

What does this mean?

Aside from being one of the most descriptive studies to date of what people actually experience in their daily lives, this work also introduced a new method for assessing textual data sources. By using machine learning to approximate human judgments, we assessed a corpus that would be completely inaccessible with traditional methods. By our calculations, it would have taken about 500,000 hours to rate these 20 million Tweets by hand; this method let us assess four thousand times as many experiences as we could have with traditional human coders. This study demonstrates the application of Big Data analytics to an obviously important psychological question, highlighting the potential of these methods for researchers in any number of domains.
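
For readers curious what “using machine learning to approximate human judgments” can look like, here is a generic sketch of the idea rather than our actual pipeline: fit a model to a small set of human-rated texts, then use it to score texts no human will ever read. The example tweets, ratings, feature choice, and model choice below are entirely made up.

```python
# Generic sketch: approximate human ratings of text with a simple model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# A handful of human-coded examples (text, hypothetical positivity rating 1-7).
rated_tweets = [
    ("best weekend ever with friends at the beach", 7),
    ("stuck in traffic again, late for work", 2),
    ("quiet night in, reading a good book", 5),
    ("lost my wallet and missed my flight", 1),
]
texts, ratings = zip(*rated_tweets)

# Bag-of-words features plus a regularized linear model, standing in for
# whichever judgments you want to approximate.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge(alpha=1.0))
model.fit(texts, ratings)

# Score a (hypothetically enormous) unrated corpus without reading it.
unrated = ["nothing good happens after midnight", "brunch with the family"]
print(model.predict(unrated))
```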

A lesson from R-fortunes: all science is not good science.


“It is becoming apparent that you do not know how to use the results from either system. The progress of science would be safer if you get some advice from a person that knows what they are doing.”

— David Winsemius (in response to a user who obtained different linear regression results in R and SPSS and wanted to know which one to use), R-help (July 2011)

I can always count on my fortunes R package for a good laugh (especially at the expense of SPSS users). However, this post raises an interesting point about the misuse of statistics.

First, let me digress. Before undergraduate coursework in psychology, I didn’t know much about the way people acted. After some undergraduate classes, I knew everything about the inner workings of the mind. I knew that priming people with words stereotypically associated with the elderly reduced their walking speed (Bargh, Chen, & Burrows, 1996), that the Implicit Association Test (IAT; Greenwald et al., 2002) measured meaningful unconscious attitudes, that narcissism was associated with using more first-person pronouns (Raskin & Shaw, 1988), and so on. It wasn’t until several years into graduate school, with advanced statistical training, some reading of meta-research, and a visit from the replication police, that I realized (a) the findings are never as clear-cut as they seem and (b) all of these findings have been called into question (priming: Doyen, Klein, Pichon, & Cleeremans, 2012; pronouns: Carey et al., 2015; IAT: Blanton et al., 2009). Further reading reveals p-hacking (Simonsohn, Nelson, & Simmons, 2014), incredibility indices (Schimmack, 2012), and the claim that half of all published findings may be false (Ioannidis, 2005).

I hope this digression illustrates the point that a little knowledge and a false sense of understanding can be dangerous. A novice statistician who runs participants until his or her hypotheses are statistically significant might not realize that this inflates the Type I error rate to roughly 20%, even though each individual test reports p < .05 (Sherman, 2014), but those findings get published.
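
To make that concrete, here is a toy simulation of the “run participants until p < .05” strategy (my own illustration, not Sherman’s phack function; the starting sample size, batch size, and cap are arbitrary). Under a true null effect, the stopping rule pushes the false-positive rate well above the nominal 5%.

```python
# Simulate optional stopping: keep adding participants and re-testing until
# p < .05 or a sample-size cap is reached, under a true null effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, n_start, n_max, batch = 10_000, 20, 60, 5
false_positives = 0

for _ in range(n_studies):
    data = rng.normal(0, 1, n_start)       # the true effect is exactly zero
    while True:
        p = stats.ttest_1samp(data, 0).pvalue
        if p < .05:                         # stop as soon as it's "significant"
            false_positives += 1
            break
        if len(data) >= n_max:              # give up at the sample-size cap
            break
        data = np.append(data, rng.normal(0, 1, batch))

print(f"False-positive rate: {false_positives / n_studies:.1%}")  # well above 5%
```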

This brings me back to the original (humorous) quote from my R fortunes package. Misuse and misunderstanding of analyses are among the reasons that so many findings across scientific disciplines fail to replicate (Freedman, Cockburn, & Simcoe, 2015). I think the takeaway from this fortune (and this blog post) is that statistics are often misused and abused, sometimes knowingly and other times unwittingly. The scientific process is slow and self-correcting, but not perfect. Published papers are not necessarily error-free. Interpret analyses cautiously. Interpret the research of others cautiously. Most importantly, use R, not SPSS.

References

Bargh, J. A., Chen, M., & Burrows, L. (1996). Automaticity of social behavior: Direct effects of trait construct and stereotype activation on action. Journal of Personality and Social Psychology, 71(2), 230.

Blanton, H., Jaccard, J., Klick, J., Mellers, B., Mitchell, G., & Tetlock, P. E. (2009). Strong claims and weak evidence: reassessing the predictive validity of the IAT. Journal of Applied Psychology, 94(3), 567.

Carey, A. L., Brucks, M. S., Küfner, A. C. P., Holtzman, N. S., große Deters, F., Back, M. D., Donnellan, M. B., Pennebaker, J. W., & Mehl, M. R. (2015, March 30). Narcissism and the Use of Personal Pronouns Revisited. Journal of Personality and Social Psychology. Advance online publication. http://dx.doi.org/10.1037/pspp0000029

Doyen, S., Klein, O., Pichon, C.-L., & Cleeremans, A. (2012). Behavioral priming: It’s all in the mind, but whose mind? PLoS ONE, 7(1), e29081. doi:10.1371/journal.pone.0029081

Freedman, L. P., Cockburn, I. M., & Simcoe, T. S. (2015). The economics of reproducibility in preclinical research. PLoS Biology, 13(6), e1002165. doi:10.1371/journal.pbio.1002165

Greenwald, A. G., Banaji, M. R., Rudman, L. A., Farnham, S. D., Nosek, B. A., & Mellott, D. S. (2002). A unified theory of implicit attitudes, stereotypes, self-esteem, and self-concept. Psychological Review, 109(1), 3.

Ioannidis, J. P. (2005). Why most published research findings are false. Chance, 18(4), 40-47.

Raskin, R., & Shaw, R. (1988). Narcissism and the use of personal pronouns. Journal of Personality, 56, 393–404. http://dx.doi.org/10.1111/j.1467-6494.1988.tb00892.x

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A key to the file-drawer. Journal of Experimental Psychology: General, 143(2), 534.

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551.

Sherman, R. A. (2014). phack: An R function for examining the effects of p-hacking. Retrieved from http://rynesherman.com/blog/phack-an-r-function-for-examining-the-effects-of-p-hacking/

It is all just numbers, really . . .

I work as a Data Scientist for a database marketing company, and I spend a great deal of time predicting responders to credit-based marketing offers, predicting defaults on loans, and performing analyses of that nature. However, my graduate training (and expected PhD) is in Experimental Psychology.[i] When people find this out, I often get a confused look and the question: How did you get into this business?

Whenever I find myself having this conversation, I am reminded of a scene from the movie Margin Call, which portrays an insider’s view of the financial crisis. Here is an exchange between one of the “quants” and a member of senior management.

  • What’s a specialty in propulsion, exactly?
  • My thesis was a study in the way that friction ratios affect steering outcomes in aeronautical use under reduced gravity loads.
  • So, you are a rocket scientist?
  • I was.
  • How did you end up here?
  • Well, it’s all just numbers, really. You’re just changing what you’re adding up . . . 

While I am not exactly a rocket scientist, I share his sentiment. A Support Vector Machine Regression model does not care if I am trying to predict a personality trait or the likelihood of defaulting on a loan. The numbers are the same. That is the beauty of math. It is universal. Techniques that I apply to large-scale analyses on social media can be adapted to study nearly anything else that I find interesting. So the next time someone asks me how I got into the database marketing business, I will tell them, “Well, it’s all just numbers, really. You’re just changing what you’re adding up…” Or, I might just point them toward this blog post.
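
As a purely illustrative aside (synthetic data and made-up variable names, nothing drawn from my actual work), the same support vector regression code fits whichever target you hand it:

```python
# The model does not care what the target represents; only the numbers change.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))                       # the same predictors
personality_trait = X @ rng.normal(size=10) + rng.normal(size=500)
loan_default_risk = X @ rng.normal(size=10) + rng.normal(size=500)

for name, y in [("personality trait", personality_trait),
                ("loan default risk", loan_default_risk)]:
    model = SVR(kernel="rbf", C=1.0).fit(X, y)       # identical model, new target
    print(name, round(model.score(X, y), 2))
```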

 

[i] This confusion is no doubt compounded by the common conflation of psychologists, counselors, psychiatrists, etc., with research psychologists, but that is a conversation for another blog post.

Is “big data” a passing trend?

It seems like these days everyone is trying to do “big data” research, and even more people are marketing themselves as “data scientists.” There are accelerated “data science” courses at nearly every online university. Everyone wants to get into the game.

“Big data” is sexy, but is it simply a trend? There is a saying among investors: When your taxi driver starts to tell you about a stock, it is time to start thinking about selling. By that reasoning, some may think that “big data” research has peaked.

Don’t worry: “big data” is here to stay. Unlike stock-price speculation, data science is not a zero-sum game. It fills a growing need. Whether that need is academic or industrial, insights can be gained from analyzing large datasets. Despite the trendiness, relatively few people have the skills necessary to unlock these possibilities.

These skills are not limited to computer programming or statistics, although those competencies are definitely a requirement. Even more than specific skills, however, I believe a certain mindset is important for data scientists. This mindset is what allows a researcher to make the transition from “I am interested in studying this phenomenon” to “Here is HOW I can study this phenomenon.”

These skills make a good data scientist extremely valuable in industry or academia. With the amount of data being produced daily increasing exponentially, the possibilities available from “big data” will only get larger, and these skills are only going to increase in demand.