Tuesday, September 12, 2017

Regression with categorical variables: why have intercepts?

Or, "I understand Degrees of Freedom a little more now."
(This is kind of basic, so if you're good at regression, please bear with me. Also, apologies for trying to use Blogger to display data, yes I should probably use a table or like a Jupyter notebook, but... well, bear with me again.)

Ok, first imagine you have 1 categorical variable that predicts something, like your score on some test. Say the categorical variable is "do you like Star Wars or Star Trek better", and your data looks like this:

  • SW, score 3
  • ST, score 4
  • ST, score 4
  • SW, score 3

(I mean, I grew up on Star Wars myself. But it's hard to argue that it's a smarter show :P)
You can do this regression a couple ways:

  • x1 = 1 if they like Star Wars better, 0 otherwise
  • x2 = 1 if they like Star Trek better, 0 otherwise
  • Regression equation: Score = 3*x1 + 4*x2


  • x0 = 1 always (call this the "intercept")
  • x1 = 1 if they like Star Wars better, 0 otherwise
  • Regression equation: Score = 4*x0 + (-1)*x1

You cannot do this:

  • x0 = 1 always
  • x1 = 1 if they like Star Wars better, 0 otherwise
  • x2 = 1 if they like Star Trek better, 0 otherwise

That's because you run into "nonidentifiability": x1 + x2 always equals x0, so your predictors are collinear with the intercept. I mean, try it - try to fit a regression equation. Should it be one of these:
Score = 3*x0 + 0*x1 + 1*x2
Score = 0*x0 + 3*x1 + 4*x2
Score = 1000*x0 + (-997)*x1 + (-996)*x2
? All these fit the data perfectly well. You've got too many predictors. Another way of saying this is, you've got too many degrees of freedom.
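If you'd rather check that than take my word for it, here's a minimal Python sketch. The function names (predict, fits_perfectly) are just made up for illustration; the data and the three candidate equations are the ones above.

```python
# Toy data from above: (favorite show, score)
data = [("SW", 3), ("ST", 4), ("ST", 4), ("SW", 3)]

def predict(coefs, fan):
    """Score = b0*x0 + b1*x1 + b2*x2, where x0 = 1 always."""
    b0, b1, b2 = coefs
    x1 = 1 if fan == "SW" else 0  # Star Wars dummy
    x2 = 1 if fan == "ST" else 0  # Star Trek dummy
    return b0 * 1 + b1 * x1 + b2 * x2

def fits_perfectly(coefs):
    """True if the equation reproduces every score exactly."""
    return all(predict(coefs, fan) == score for fan, score in data)

# The three candidate equations: (x0, x1, x2) coefficients
candidates = [(3, 0, 1), (0, 3, 4), (1000, -997, -996)]

for coefs in candidates:
    print(coefs, fits_perfectly(coefs))
```

All three print True, which is the whole problem: the data can't tell them apart.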

But the question still remains, which two variables do you use? Like, the first way (with x1 and x2) seems really appealing, because you straight-up get the answer of how important each predictor is. But if you get used to using all levels of your categorical variables and having no intercepts, well, you may fall into a trap! Let's see how...

Now imagine you have 2 categorical variables that predict something, like, I dunno, score on some test. Say the variables are "do you like Star Wars or Star Trek better" and favorite kind of small fish (sardine, mackerel, sprat, or herring; there are only 4 fish in this world). And imagine our data looks like this:

  • SW, mackerel, score 5
  • ST, mackerel, score 6
  • ST, herring, score 8
  • ST, mackerel, score 6
  • SW, sprat, score 6
  • SW, sardine, score 4

You might be tempted to code them like this:

  • x1 = 1 if they like Star Wars better, 0 otherwise
  • x2 = 1 if they like Star Trek better, 0 otherwise
  • x3 = 1 if they like sardines best, 0 otherwise
  • x4 = 1 if they like mackerel best, 0 otherwise
  • x5 = 1 if they like sprats best, 0 otherwise
  • x6 = 1 if they like herring best, 0 otherwise

But then you get the same problem. Is it:
score = 3*x1 + 4*x2 + 1*x3 + 2*x4 + 3*x5 + 4*x6
score = 4*x1 + 5*x2 + 0*x3 + 1*x4 + 2*x5 + 3*x6
score = 1004*x1 + 1005*x2 + (-1000)*x3 + (-999)*x4 + (-998)*x5 + (-997)*x6
? Again, these all fit the data perfectly.
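Here's a rough sketch of why there are infinitely many of these. Every person has exactly one show dummy and exactly one fish dummy set to 1, so you can add any constant c to both show coefficients and subtract it from all four fish coefficients without changing a single prediction. The helper names (shift, fits_perfectly) are hypothetical, just for this example.

```python
# Toy data from above: (favorite show, favorite fish, score)
rows = [
    ("SW", "mackerel", 5),
    ("ST", "mackerel", 6),
    ("ST", "herring", 8),
    ("ST", "mackerel", 6),
    ("SW", "sprat", 6),
    ("SW", "sardine", 4),
]
SHOWS = ["SW", "ST"]                               # x1, x2
FISH = ["sardine", "mackerel", "sprat", "herring"]  # x3, x4, x5, x6

def predict(coefs, show, fish):
    """No intercept: the score is one show coefficient plus one fish coefficient."""
    show_coefs = dict(zip(SHOWS, coefs[:2]))
    fish_coefs = dict(zip(FISH, coefs[2:]))
    return show_coefs[show] + fish_coefs[fish]

def shift(coefs, c):
    """Add c to the show coefficients and subtract c from the fish coefficients."""
    return tuple(b + c for b in coefs[:2]) + tuple(b - c for b in coefs[2:])

def fits_perfectly(coefs):
    return all(predict(coefs, show, fish) == score for show, fish, score in rows)

base = (3, 4, 1, 2, 3, 4)
for c in (0, 1, 1001):
    shifted = shift(base, c)
    print(shifted, fits_perfectly(shifted))
```

Any c works, so there is no single "right" set of coefficients to report.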

Instead, you can do this:

  • x1 = 1 if they like Star Wars better, 0 otherwise
  • x3 = 1 if they like sardines best, 0 otherwise
  • x4 = 1 if they like mackerel best, 0 otherwise
  • x5 = 1 if they like sprats best, 0 otherwise

And then you just know that, if x1=0, you've got a Star Trek fan, and if x3 x4 and x5 are all 0, you've got a herring eater.

But it doesn't quite fit the data. You can kinda tell that Star Trek gives you a 1-point boost over Star Wars, and you kinda know that herring > sprats > mackerel > sardines, but you can't model the fact that you just always have a baseline score. Or rather, imagine a Star Trek-loving herring eater; all the variables would be 0, so you have to predict their score is 0. Obviously that is not the case.

But to solve this, we just have to throw in another intercept:
x0 = 1 always.

Then we can fit exactly one regression equation that perfectly fits the data (or, minimizes the sum of squared error):
Score = 8*x0 + (-1)*x1 + (-3)*x3 + (-2)*x4 + (-1)*x5
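If you want a computer to find that equation, here's a minimal sketch in plain Python that solves the least-squares normal equations, (X^T X) b = X^T y, with exact fractions. In real life you'd use a stats package; the function names here (design_row, solve_normal_equations) are made up for illustration.

```python
from fractions import Fraction

# Toy data from above: (favorite show, favorite fish, score)
rows = [
    ("SW", "mackerel", 5),
    ("ST", "mackerel", 6),
    ("ST", "herring", 8),
    ("ST", "mackerel", 6),
    ("SW", "sprat", 6),
    ("SW", "sardine", 4),
]

def design_row(show, fish):
    """Columns: x0 (intercept), x1 (Star Wars), x3 (sardine), x4 (mackerel), x5 (sprat)."""
    return [1, int(show == "SW"), int(fish == "sardine"),
            int(fish == "mackerel"), int(fish == "sprat")]

def solve_normal_equations(X, y):
    """Solve (X^T X) b = X^T y by Gauss-Jordan elimination on exact fractions."""
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[Fraction(sum(r[i] * r[j] for r in X)) for j in range(k)]
         + [Fraction(sum(r[i] * yi for r, yi in zip(X, y)))]
         for i in range(k)]
    for col in range(k):
        # No nonzero pivot left means the columns are collinear
        pivot = next((r for r in range(col, k) if A[r][col] != 0), None)
        if pivot is None:
            raise ValueError("collinear predictors: no unique solution")
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

X = [design_row(show, fish) for show, fish, _ in rows]
y = [score for _, _, score in rows]
coefs = solve_normal_equations(X, y)
print([int(b) for b in coefs])  # [8, -1, -3, -2, -1]
```

If you build X with every level of a variable plus the intercept, the same function raises the ValueError instead, which is the nonidentifiability problem showing up in the arithmetic.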

To put it another way: for each categorical variable, we have N levels, but we only get N-1 degrees of freedom. So each variable should add N-1 terms to our final equation. And you always start out with one DF (for the intercept). In our first example (with just Star Wars/Trek), we had one variable with 2 levels, so the number of DF we get is 1 (intercept) + 1 (2 levels minus 1). So our regression equation should have 2 terms. In the second example, we had one variable with 2 levels and one with 4, so we should have 1+(2-1)+(4-1)=5 terms. And doing the trick where one level of your variable is the "baseline" is the by-the-book way to do that.
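That counting rule is short enough to write down as a sketch (n_terms is a made-up name):

```python
def n_terms(levels_per_variable):
    """One term for the intercept, plus (levels - 1) for each categorical variable."""
    return 1 + sum(levels - 1 for levels in levels_per_variable)

print(n_terms([2]))     # Star Wars/Trek only
print(n_terms([2, 4]))  # Star Wars/Trek plus four kinds of fish
```

These come out to 2 and 5, matching the two examples.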

So, if you only have one categorical variable, you can ignore the intercept and use all levels, that's ok. But it falls apart pretty quick as your regression gets bigger, and you've got to do the by-the-book way where you use one level as reference for each variable and add an intercept.

Friday, May 19, 2017

State of the Geotags: Motivations and Recent Changes

... is the title of our recent paper at ICWSM 2017, "our" being me, Zichen Liu, Alex Sciuto, and our kind advisor Jason Hong.

The paper, in one sentence: think of geotagged social media posts (tweets, instagrams, etc) as postcards, not as ticket stubs; as conscious choices, not unconscious byproducts.

In three bullet points:
- when people are posting something to Twitter/Flickr/etc, they usually consciously choose to add their location; it's not a "set it and forget it" situation. (we found this out by analyzing how often people toggle between geotagging and not.)
- when people post their location, they usually do it from unusual or faraway places. They usually don't do it at home or in their neighborhood, and they usually don't do it from places that they go to regularly. (this is from surveys.)
- people are posting their specific location less than they used to. Some of this might be privacy; a lot of it is because Twitter changed the defaults on how your location is posted.

In more detail:
Paper, Slides

Friday, March 31, 2017

Getting from Zero to What I Do Most Of The Time With Data

We've been getting a lot of undergrads and master's students coming on board in our lab, with pretty vastly different levels of experience. That is good! The more diversity, the better, I say.

However, it tends to be hardest for those with the least experience. Often I'll say something like "just ssh in to the server, connect to our postgres database, and get all the tweets in this area." And they'll be like "oops I guess I was supposed to know what ssh and postgres are, but I don't, so now I'm either trying to bluff or googling frantically." Which is too bad! I think what they should do here is ask me for advice, but they don't know that. They might think that I'm either a jerk who would make fun of them for not knowing that much, or that I'm a person with wayyy too many responsibilities to possibly give them the time they need (i.e. a professor).

I do appreciate their concern for my time, though, and it's probably more fun for them to learn things themselves (go at your own pace, etc), so I've put together this list of useful guides.

Unix Computing Basics
How to Unix (mac) - work through Conquering the Command Line (chapter 1)
To get started with this, open up the program "Terminal" on your mac. You can do that by going to the magnifying glass in the top right and typing "terminal."
How to get a Unix-ish prompt on Windows - I don't actually know. Someone suggest me a tutorial for this.
How to Unix (Linux) - despite this being the year of the Linux desktop, few people have one. If you do, open up a terminal however you do. I used to run Ubuntu and it made that pretty easy.
SSH (to connect to a remote server and navigate around there) - the "basic syntax" part is fine. You probably won't need the "keys" bit but it might be fun if you want to look around later.
SFTP (if you want to download a file from a remote server)
git: try Software Carpentry's git novice course. (parts 1-9 especially.) All of Software Carpentry's stuff seems pretty good.

Vim and other Text Editing
You should probably know at least the basics of Vim, because it's installed on every computer ever, and you always need to edit text files. Also, sometimes you'll end up in vim for some reason and it's good to be able to quit. Learn Enough's text editor class (at least chapter 1, Vim) seems like a good place to start.

Python
Software Carpentry has a good lesson here too, specialized for research computing.
For more general python, or if you are starting from zero programming, you might have more luck with Learn Python The Hard Way (for everything; long, but you can breeze through the parts you already know.)
pip and virtualenv. Generally you should make a virtualenv for each project. Stuff may work without it, but then you've got global dependencies (so if you need module A to be v3.0, but you update it to 4.0 for some other project, your old thing that's still expecting it to be 3.0 may stop working. virtualenv gives you a separate copy of module A for each project that needs it). You might see some places recommending you use conda instead; it's fine too, I have less experience with it, but it'll probably get you where you need to go.
The csv module is particularly useful; here is a guide for it. So is the argparse module; tutorial here.
If you need to send out HTTP requests, use the requests module.
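To give a flavor of csv and argparse together, here's a minimal sketch of a script that averages one column of a CSV file. The column names ("fan", "score") and the function names are made up for the example; only the csv and argparse calls are the real standard library.

```python
import argparse
import csv
import io

def mean_of_column(csv_file, column):
    """Average a numeric column of an open CSV file that has a header row."""
    reader = csv.DictReader(csv_file)
    values = [float(row[column]) for row in reader]
    return sum(values) / len(values)

def build_parser():
    parser = argparse.ArgumentParser(description="Average one column of a CSV file.")
    parser.add_argument("path", help="CSV file to read")
    parser.add_argument("--column", default="score", help="column to average")
    return parser

def main(argv=None):
    """Entry point: parse the command line, open the file, print the mean."""
    args = build_parser().parse_args(argv)
    with open(args.path, newline="") as f:
        print(mean_of_column(f, args.column))

# Quick demo on an in-memory file, so this runs without any file on disk.
sample = io.StringIO("fan,score\nSW,3\nST,4\n")
print(mean_of_column(sample, "score"))  # 3.5
```

In a real script you'd call main() at the bottom and run it as something like "python mean.py data.csv --column score" (mean.py and data.csv being whatever your files are called).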

PostgreSQL - this is an ok tutorial. Part VII may be more complicated than you need. I'd love to see a better tutorial too. SQLBolt may be this better tutorial.
A bit about PostGIS - PostGIS is a library that lets you use geo data in your PostgreSQL database somewhat sanely. You can probably skip most of this. SELECT * FROM tweet_pgh WHERE ST_MakeEnvelope(-79.9, 40.44, -79.899, 40.441, 4326) ~ coordinates; is probably what you need.
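If you end up running that query from Python, here's a rough sketch of a helper that builds it for any bounding box. It only builds the SQL string and the parameter tuple; it assumes you'll hand them to a driver that uses %s placeholders (psycopg2 does), and the function name is made up. ST_MakeEnvelope takes (xmin, ymin, xmax, ymax, srid), so longitude comes before latitude.

```python
import re

def bounding_box_query(table, west, south, east, north, srid=4326):
    """Return (sql, params) selecting rows whose `coordinates` fall inside a lon/lat box."""
    # Table names can't be passed as query parameters, so check the name is plain.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError("unexpected table name: %r" % table)
    sql = ("SELECT * FROM " + table +
           " WHERE ST_MakeEnvelope(%s, %s, %s, %s, %s) ~ coordinates;")
    return sql, (west, south, east, north, srid)

sql, params = bounding_box_query("tweet_pgh", -79.9, 40.44, -79.899, 40.441)
print(sql)
print(params)
```

With a database cursor in hand, that would be cursor.execute(sql, params).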

There's probably a lot more useful stuff I could put here! Let me know if you've got anything I should add. Also, tell me if you have feedback, good or bad, on any of these.

Monday, October 31, 2016

Will it hamper counter-protests if I check in at Standing Rock?

Meme going around on Facebook: you should check in at Standing Rock, ND, in order to help protesters. This will help them because the sheriff's office is using Facebook to target protesters. If they see a flood of Standing Rock checkins, they won't know who's actually there. Snopes has a decent rundown.

Is it useful for randos to check in at Standing Rock? Depends on what the sheriff's office is doing, but probably not.

If they're real dumb guys: one of them has an idea like "hey let's check facebook to see who's protesting here!" And they go to https://www.facebook.com/pages/Standing-Rock-Indian-Reservation/109268902425837, and then shrug and say "welp, none of these people are actually in Standing Rock :P"

If they're average randos: they do the above plan. And then look at other local organizations or places that are not the exact Standing Rock facebook page, but are around the area. That ought to give them a decent list of protesters. If they want to find all the protesters, sure, they won't be able to, but they couldn't anyway. If they want to find at least a few protesters to lock them up and make an example, that'll work. But then, they can probably do that already.

If they've got a good programmer on staff: they'll maybe be able to use the Facebook API to look back in time at who's checked in before today, and find some real protesters. (OR if they've got a secret pipeline to the NSA all-of-facebook, which I'd guess is unlikely but not impossible these days.)

In any case, flooding the Standing Rock facebook page won't help unless the local police are real dumb guys AND they decide to do something illegal. You can't round up and lock people up just because you know they're at a protest. (Of course, sometimes cops do it anyway. It's possible they're real dumb guys and they decide to step around the law.)

So, another chapter in the continuing saga of "geotags? ¯\_(ツ)_/¯".

Thursday, September 15, 2016

Paper Writing, and how to do it better next time (spoiler alert: no insights, just ideas)

Paper writing is the #1 inherent drag, for me, in the academic thing. There are other drags that are just accidental - like if the office printer stops working or whatever - but you can't really have academia without some kind of papers. (at least academia in its present form, hedge hedge disclaimer disclaimer) So here I'm wildly introspecting, coming up with some thoughts.

Why is paper writing hard? Because you have to keep a thing in your mind that is bigger than you can keep in your mind. A 10 page paper is bigger than you can keep in your mind. By the time you're ready to write a paper, you've done a few mostly-related studies, and you've come up with a kinda-coherent story, but it's not really coherent, so you've got to do all kinds of mental gymnastics to remember it all.

What am I thinking next time? Do the work in this order:
-0.25. Read a little bit
0. Studies
0.5. Read a lot
1. Story
2. Graphs
3. Everything else

-0.25. Read a little bit. Make sure you're not wasting your time with the studies you're going to do.

0. The studies. This is just most of your work. You usually have an idea why you're doing these, but you're never really sure, until you're done. So, do the studies, gather the data, figure out your results, and just kind of putz around with them for a long time until they've seeped into your brain. Ask yourself lots of questions about your data and write python scripts to prove or disprove them.

0.5 Read a lot. Test out possible stories you can tell based on your data and read papers until you can tell if that is a decent idea or not. Maybe you should do this before your studies? But that 

1. The story. This is #1 because this is the time (usually 2 weeks before a deadline in a mild panic) when you sit down and say "ok! gonna write a paper!" You should probably get all the coauthors in a room and lock the door and you can't come out until you have a story. This sounds painful, and it probably is. Maybe bring beer? Do as the ancient Germans did and make decisions mildly drunk and then confirm them the next day sober? I am quite serious about this.

2. The graphs. These are your evidence. After you've figured out what you want to prove, figure out the graphs that will make your point. (Sub out "graphs" for "math" or "photos" or "collections of anecdotes from interviewees" as appropriate.) Make those graphs. I guess they can be kind of rough, you can visually polish later, but they should be able to make your point. (not sure about this; maybe you should polish them right now.)

3. Then the rest of it should be easy. I mean, all the putting down words. Nobody reads them anyway. Just reference your graphs a lot. (I am being sarcastic, it is never easy.)

I have no idea if this is a good list. What I've done for the papers I'm writing now is basically studies first, then everything else in a big old blender of worries, and it's not pleasant.

Tuesday, May 24, 2016

Mailing Archived Emails as Postcards: Probing the Value of Virtual Possessions

our (w/ David Gerritsen, Jennifer Olsen, Tatiana Vlahovic, Rebecca Gulotta, Will Odom, Jason Wiese, and John Zimmerman) CHI 2016 paper, in hopefully plain English

Ok, Gmail archives all your emails forever, right? There's probably some good stuff in there! Emails from people you care about, memories of good times, photos, conversations. But people don't see it as meaningful at all. OTOH, they do store a bunch of old physical photos and postcards. Why are those things valuable while emails aren't? More generally, why are physical possessions considered so much more valuable than virtual possessions?

That's why we set out on the study. Nope, hold up, that's not true. We set out on it because we (Dave, Jenny, Tati, and I) were young grad students doing a class project and we were given visions of turning it into a CHI paper, which would be a nice gold star to have on our resumes. Personally, I dove in because I was the biggest coder in the group, and I thought it'd be a fun little engineering challenge. (and it was! but that was about 1% of the project.) And a way to impress Tati with my skills. Like Napoleon Dynamite.

So. Virtual possessions, physical possessions. How about if we take those virtual things, those old emails, and turn them into physical things, like postcards? We can automatically sift through all our participants' old emails, pick out particularly meaningful (we think) snippets, print them on postcards, and mail them to them. Then they will probably think "oh yeah, that was a great old email" and rethink how valuable their old emails are. So we did that, over a 3 month period, and interviewed each participant 3 times, and our conclusion was:

Nope! They didn't really care at all. Most of the postcards, they just threw away. Oh well. But in talking to them, we realized a couple key things:

1. virtual possessions often lose value because they lack context. If you have an old photo, it's probably in your old photo album, next to other old photos. Or maybe a scrapbook, or a book of old good stuff you've saved. Your old emails? They're in a list with other (probably useless) emails. You can't really recall the whole memory just from a few words, and you've got nothing else around to help it.

2. virtual possessions are often useless because even the good ones are lost in a pile of trash! Your old emails are 1% wonderful conversations and 99% receipts from Amazon and ads from Bed Bath and Beyond. So you do this cognitive simplification by just considering it all junk. (It'd be a pain to try to remember or keep links to all the valuable stuff!) And even if you do find a couple good old emails, well... they're still there, if you need them, so what's the point of attaching any value to them?
Physical things don't have this problem: you throw out all your junk mail, so it no longer adds to clutter. But the blessing and the curse of email is that you can keep it all, forever.

So these insights might help you design more valuable virtual things, maybe!

Or maybe not! Maybe just say "eh we're saving your old emails for purely utilitarian reasons; we're not trying to replace your phone book too." Maybe virtual things will accrue value in completely different ways from physical things, and we just have to deal with it.

For more waffling and a fuller account of our adventures: our paper.

Getting Users' Attention in Web Apps in Likeable, Minimally Annoying Ways

our (w/ Josh Hailpern and Anupriya Ankolekar) CHI 2016 paper in hopefully plain English

Why are there still so many pop-ups? Even if you're sure that Your Website Dot Com really desperately needs to show me an ad for lawnmower fuel or a notification that you've updated some incoherent legalese in your terms of service, do you have to blot out my whole screen? Couldn't you do something that people will hate a little less?

So we ran a simple study where we had Mechanical Turk workers play the game Set while we tried 15 different ways to get their attention, and then asked them how well they liked each one.

We found, based on survey questions after they finish the game, that they find some of the attention grabbers more annoying than others, and some of them more noticeable than others. These usually correlate (more noticeable = more annoying) but you can sometimes get a little more noticeable without getting annoying, or vice versa.

We didn't find that certain attention grabbers make people better or worse at Set, or that they make them respond faster, or that they make them remember things better, or that they change the overall usability of the system, or the overall immersion in the game. Probably other things we didn't find too, see the paper.

Based on what we found, it looks like glowing shadows are a little better on average than popups (better = equally noticeable but less annoying), that your popup could be less annoying if it doesn't cover the screen behind it, and that the little message icon with a badge (like on Facebook or Twitter, showing you how many notifications you have) is good for low-interruption needs.

My confidence in the results: low! These are small effects. And you could poke a lot of holes in the study (why these 15 attention grabbers? how will this respond in a real-world situation? did people just kinda like our glowing shadows because they're prettier than some of the other options?) - we point out some of these in the paper. But it's something, and science hopefully progresses by a bunch of tiny steps.

Here's the paper!

Saturday, May 21, 2016

Story arcs

A lot of people have a "story arc" that they use for a lot of their papers/stories.

Some I can think of:
"What is it so what": "what is it", "is it so", "so what" - John Zimmerman
"ABT": And, but, therefore - Randy Olson c/o Better Posters blog
"What's the problem, and who cares?" (to start) - what I've gathered from talking with Jason Hong
Heilmeier questions - from George H. Heilmeier - this is more about questions you ask yourself before starting a project, not a talk you give about it when you're done, but it wouldn't be terrible if you gave a talk answering all of them.

I'd love to hear others, if you have them. Ideally I hope to collect a bunch and then learn what's in common between all of them.

ICWSM 2016 neat things

Back from ICWSM! It was the first time I'd been there. Felt more like Ubicomp than anything else I'd been to. Lots of people finding some correlation with p<.00001 and r^2=0.3, so what does it mean? I mean, it definitely means something, but I'm kind of frustrated by how difficult it would be to turn it into an application. I think the social scientists were frustrated too, by people's lack of social science training. I think the computer scientists were all keenly aware of the weaknesses in their data and methods... but they still found something. (and it still got published.) Lots of interesting data, a lot of scraping things, and a lotttttt of Twitter. Less polite than CHI or CSCW, which is a double edged sword: on one hand, I was kind of taken aback by some blunt questions. On the other hand, if we're not disagreeing, how are we getting anywhere? I had two conversations end with "ok, let's agree to disagree," and they didn't feel great, but I'm open to the possibility that that's a sign of intellectual diversity or something else good.

Lots of shared data sets!

Neat things applicable to cities:
City Dashboard - kind of overwhelming, but a start!

LikeWays - recommend the most interesting path to a thing, not just the quickest. Someone with an iPhone, try this out and tell me what it's like.

"Will check-in for badges", Gang Wang - basically, Foursquare doesn't represent real mobility (of course); it's really only good for applications that don't really matter if you get them wrong (like recommending restaurants).

Emotions, demographics, and sociability in Twitter interactions, Kristina Lerman - I had wanted to do a study like this: correlate a ton of stuff in different geo areas, see what comes out. People in higher income places have more weak ties. (there's a lot going on there, though; it's kind of hard to interpret, or know why that would be.)

Other neat papers:

Identifying platform effects in social media data, Momin Malik - uses regression discontinuity to understand sudden things that happen on social media, which are because of a thing the platform did, not because of real effects. For example, Netflix changed the labels on their reviews (something like "I somewhat like it" to "I sort of like it"), which caused review scores to jump suddenly.

When a movement becomes a party, Pablo Aragon - there was a bunch of grassroots talk around elections in Spain, so they followed one party, Barcelona en Comú, to see if they stayed all grassroots and decentralized, or if they evolved into a hierarchical organization. They found two groups: one for the movement (which stayed decentralized) and one for the party (which got hierarchical).

"Blissfully Happy" or "Ready to Fight", Hannah Miller - you've probably seen this on the news, it's super popular. Some emojis look different on different platforms. I use :D a lot but I guess on an iphone it looks angry. Some emoji are hard to interpret even within platform. (those raised hands! what does that mean!) This can be a problem.

Other useful tools:
Bot or Not - is this twitter account a bot?
(another quick heuristic: if # of followers/# you follow < .1, it's likely you're a bot)
Face++: face recognition tools
Gender detector - Is this name male or female? (python) (a different one in ruby)
IBM Watson Personality Insights Service - give it text, it gives you Big 5 personality scores
Complex Contagion models: models a thing where you have to be exposed to something N times before you get it too.
CommonCrawl - if you ever need a huge crawl of the web.
Want to find a set of ppl with known ages on Twitter? Just search for tweeters wishing each other "Happy (N)th Birthday!" Similarly, want to know what time ppl wake up (to track daylight savings or something), just search for people saying "Good morning!" Twitter is big, and there are at least a few people who say almost anything.
For what people say more in a place than others: probability that it appears there minus probability it appears at all. From a paper about #foodporn.
I finally learned what a tensor is: an n-dimensional matrix. And there are tools like PARAFAC decomposition, which is similar to matrix factorization, which is useful in some cases.
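Two of the quick heuristics in that list are simple enough to sketch in a few lines of Python. The function names and the made-up counts are mine; the 0.1 threshold and the "probability there minus probability overall" idea are the ones described above.

```python
def looks_like_bot(followers, following, threshold=0.1):
    """Follower/following ratio heuristic: a very low ratio suggests a bot."""
    if following == 0:
        return False  # ratio is undefined, so the heuristic says nothing here
    return followers / following < threshold

def place_distinctiveness(count_here, total_here, count_everywhere, total_everywhere):
    """P(term appears in this place) minus P(term appears at all)."""
    return count_here / total_here - count_everywhere / total_everywhere

print(looks_like_bot(followers=5, following=1000))
print(looks_like_bot(followers=500, following=1000))
# A term in 50 of 100 local posts, but only 250 of 1000 posts overall:
print(place_distinctiveness(50, 100, 250, 1000))
```

A positive distinctiveness score means the term shows up more in that place than in general; both heuristics are rough and easy to fool.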

Friday, May 20, 2016

How You Ought To Be Networking, for younger PhD Students

Here's one set of suggestions, from Jean Yang. Here's another list of suggestions, from Xiang Anthony Chen.

If I had to make a list, it would be one item:
1. Don't worry, it's okay.

Look, you're a young PhD student. Of course you're nervous and second guessing everything you do. In a conference environment, you'll hit all kinds of extra stressors: people asking about your work, people you're supposed to know, people you know even though you're not supposed to, weird group social dynamics, parties, talks you're supposed to be attending, talks you're supposed to be understanding, sleep deprivation, travel issues, big rooms of 500 people. I'm not going to tack on more anxiety by telling you more things you should and shouldn't do.

Because this is a blog and I can yammer on a bit, here are some more thoughts.

You can sit with the same people twice!
It's okay! My old advisor, Anind Dey, used to complain about CMU students all sticking together, so I got this anti-CMU itch, like I gotta go *network* more and avoid hanging out with my friends. This feels unnatural. Having a couple friends makes you way more confident. And what usually ends up happening is it's two people I know and three people I don't, and I make three strong connections instead of one weak one. And even if you don't meet any new people in a given half hour, you're probably strengthening older relationships instead. Every interaction with another person either builds your quantity or quality of relationships; both are good.

You don't need to seek people out and prepare talking points and questions.
This is always awkward, in my experience. Like, I'm a new kid, how am I going to have brilliant questions (or even valid questions) for your work? In the worst case, you run the risk of getting all fanboi. By all means, don't feel afraid to approach anyone, but don't feel like you have to go scavenger hunting and ticking off boxes.

It's useful to ambush randos at smaller conferences, not at CHI. Just find someone and start talking to them. I agree with it at a smaller conference (a few hundred people). I disagree at CHI. Most randos at CHI do something totally different from you and you'll never see them again. Anyway, if you're not great at ambushing randos, that is fine too; you'll meet people via friends of friends and other ways anyway.

Relatedly: Student Volunteer at smaller conferences, and not at CHI. At smaller conferences, it's great, gives you something to do when you don't have many friends yet, makes you some friends, and saves you a few hundred bucks. But at CHI, you meet a bunch of randos and it sucks up your entire week (including staying late or getting up early, which adds physical stress to an already-stressful week). If your advisor really wants you to SV, tell them you'd rather pay the conference fee yourself. (of course, this is only if you can afford to do so. btw, go to grad school in Pittsburgh so you can afford to do so :)

Jean and Anthony have a lot of good points! Including these: wear comfortable shoes, carry a notebook, always wear your nametag (shorten the cord a bit so it's easy for people to see your name while talking with you), carry a jacket (even if it's warm; conference rooms are often cold), skip sessions (especially the early morning ones! you can't do anything if you don't sleep!), eat healthy.

Make friends, talk to them a lot, they'll be your colleagues forever. (from Jeff Bigham on Twitter (1, 2, 3))
don't worry too much about talking to the famous ppl, your friends will be famous soon! so, be merry w/ them
…and, don't take it too badly if the famous person is off hanging w/ their friends, you'll know/be them soon enough

Our House, in the Middle of Our Tweets: A summary in plain English

... I hope! Tell me if this is not actually as plain English as I hope it is. For the tl;dr, just read the headings.

1. We did a pretty good job of finding where people live, if they've posted geotagged tweets.

By "geotagged tweets", we mean "tweets with a lat/lon point." This is rare: about 1% of tweets have this. When you use Twitter, your tweets are not geotagged by default; you have to go in and select "yeah, add my location." (now, as of a few months ago, you even have to click another button that says "share precise location", so not many people do it.) But some people like to do it, to show that they are somewhere, or to remember it, or who knows why.

We tried to tell where they live at the neighborhood level. We could find about 79% of users' homes within 1km. (56% within 100m, 88% within 5km).

How do we know we found their homes? We collected 195 people's addresses in Pittsburgh by asking them in an online survey. (We asked the 4119 most frequent geotagged tweeters in Pittsburgh; 195 responded, after filtering out spam etc. We paid them with a $5 Amazon gift card.)

2. It's not that hard: remove daytime tweets and social cross-posts, and use grid search.

If you're trying to find someone's home, first take out all the tweets during the day (6am-8pm). Then take out all the social cross-posts from Foursquare and Instagram and all other social apps. In both of these cases, you lose a little bit of signal and a lot of noise. Like, your daytime tweets are sometimes at home and sometimes away, but your nighttime tweets are way more often at home.
Then use grid search. Bin all tweets into 1-degree lat-lon squares, pick the square that has the most tweets, and throw out the rest. Then bin those tweets into 0.1-degree squares, pick the square that has the most tweets, and throw out the rest. Do the same at 0.01-degree and 0.001-degree. The center of that last square is our guess at their address.
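Here's a minimal Python sketch of those steps. It's not the code from the paper, just my reading of the description above. The tweet fields (lat, lon, hour, source), the assumption that hour is already local time, the source labels, and the sample coordinates are all made up for the example.

```python
import math
from collections import Counter

CROSS_POST_SOURCES = {"Foursquare", "Instagram"}  # assumed labels for cross-posting apps

def keep_for_home_search(tweet):
    """Drop daytime tweets (6am-8pm, assuming local time) and social cross-posts."""
    is_daytime = 6 <= tweet["hour"] < 20
    return not is_daytime and tweet["source"] not in CROSS_POST_SOURCES

def grid_search_home(tweets):
    """Return the (lat, lon) center of the busiest 0.001-degree square, found coarse to fine."""
    points = [(t["lat"], t["lon"]) for t in tweets if keep_for_home_search(t)]
    if not points:
        return None
    # 1, 0.1, 0.01, and 0.001 degree squares, in that order
    for cells_per_degree in (1, 10, 100, 1000):
        def cell(p):
            return (math.floor(p[0] * cells_per_degree),
                    math.floor(p[1] * cells_per_degree))
        best_cell, _ = Counter(cell(p) for p in points).most_common(1)[0]
        points = [p for p in points if cell(p) == best_cell]  # throw out the rest
    return ((best_cell[0] + 0.5) / 1000, (best_cell[1] + 0.5) / 1000)

sample_tweets = (
    [{"lat": 40.4441, "lon": -79.9432, "hour": 23, "source": "Twitter"}] * 3    # nights at home
    + [{"lat": 40.4406, "lon": -79.9959, "hour": 13, "source": "Twitter"}] * 5  # days at work
    + [{"lat": 40.4570, "lon": -79.9250, "hour": 22, "source": "Instagram"}] * 2
)
print(grid_search_home(sample_tweets))
```

In the sample, the daytime and Instagram tweets outnumber the home tweets, but they get filtered out first, so the search lands on the square containing the night-time tweets.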

This might seem like a simple algorithm, and it is! We tried a bunch of more complicated things (see paper for details) and they didn't work as well.

3. However, this turns out to be more useful to learn things about places than about people.

Ok, pretty neat result, but sort of not awesome, for two reasons. First, 79% isn't that great - you can't really build that into a product if it fails 1/5 times. And there's good reasons we can't get much better - maybe 85% but probably not higher (see the paper). Second, as I just explained, almost nobody geotags their tweets! What good is a "learning about people" algorithm if it can only learn about 0.01% of the population?

Here's what it might be good for: learning about neighborhoods. If we can figure out where a bunch of people live, then we can put together a set of people who live in your neighborhood, and figure out what they're saying. That's what we're currently thinking.

More: read the paper!

Thursday, May 12, 2016

CHI 2016 good stuff

Hi! I am heading home from CHI 2016 right now. A, I'm tired, B, it was fun, C, I finally gave a talk at a conference.

Here's some things I liked. ("et al"s implied.)

Atlas of Me, Yea-Seul Kim: puts distances and areas in terms you can understand.

Folk theories of social feeds, Motahhare Eslami: really collect and understand the numerous folk theories people have of why some things get shown on their news feed and some things don't.

Geography and importance of localness in geotagged social media, Isaac Johnson: how much of geotagged Twitter comes from locals? Depends how you count, but about 75%.

Evaluating the IoT Through Craft, Jessa Lingel: "Internet of Things" often trades off control, agency, autonomy, and privacy for convenience. This would be really distressing for a craftsperson. Some good stories about how

PowerShake, Paul Worgan: transfer some battery from my smartphone to yours. Surprisingly, there are human challenges around it (people wouldn't transfer energy to strangers, for example).

Journeys and Notes, Justin Cranshaw: "check in" to your commute, not to a place. Build community around the "non-places". I tried this, and never really connected with anyone else, but I really wanted it to be great; he was pretty candid about the start-up issues necessary (and sorta unfulfilled) here, but also had some things they learned too.

From Research Prototype to Research Product, Will Odom: see, I want to do this kind of research a lot. Build things, see how people react and use them, learn from it. It's not about making a prototype better next time. I think this is what a lot of people do and then they say "it's a technology probe" even though it doesn't really do what the Technology Probe original authors were on about. So instead of trying to wedge it into the "technology probe" space, we can call it a "research product."

Ritual Machines, David Kirk: in the same vein as Research Products. Example of learning things about designing objects by designing a couple of bespoke things for some families who travel for work sometimes.

The art exhibition opening right near the convention center: super cool interactive stuff!
Voltaire Coffee Shop and Vero's Coffee Bar - two great shops within 2 blocks of the convention center
Getting a year older and knowing more and more people every CHI - this continues to be fun

Saturday, April 30, 2016

Thesis proposed!

One step closer to graduation, and also a step closer to making a really neat and useful thing.

In a noun phrase: neighborhood guides, built from people's public social media posts.

In a sentence: I'll build guides to city neighborhoods out of people's public social media posts, to help people traveling find places to stay and places to hang out.

In a presentation: here! (11MB PDF) If you would rather read a more boring document, you can do that too I guess, here.
Usually I think a presentation to give and a presentation to read should be two different things. The presentation should be more visual, with fewer words. But, you can't trust a roomful of academics to listen to you, so you have to put the words on the slides too so they can read them and tune out. And the presentation PDF linked here includes my speaker notes, so it might be somewhat comprehensible.

If you don't want to read a long document or talk, here's a summary:

Tourism's changed over the years - people used to all want to relax ("sun and sand"), then some of them wanted to see sights ("cultural tourism"), and now some of them want to be more active in guiding their own experience and discovering a place themselves ("creative tourism"). Guidebooks are mostly aimed at cultural tourists ("here are the sights to see, here are the top N hotels to stay in, etc") while creative tourists want to know more about the neighborhoods.

I'm developing a model of what creative tourists want based on 24 interviews (so far). It looks like they want something like this:
  • Aesthetic appeal
  • The "Ideal Everyday" - a picture of everyday life but focused on when you're relaxed and can explore at your leisure
  • Authenticity - ... whatever this means to you

So based on those six dimensions, I'm going to mash together crime statistics, census data, Walkscores, Flickr photo autotags, Yelp Third Place reviews, and Tweets to show you a guide of the neighborhood.

To help you narrow down your search (there are a lot of neighborhoods out there!), I'll start off with a comparison to neighborhoods you already know. So "I live in Pittsburgh, I'm going to San Francisco, show me a neighborhood that's like Bloomfield." And then it will show you the top N most similar neighborhoods, and why they're similar.

More details in the paper and talk, but that's the idea.

Monday, March 7, 2016

Some neat things from CSCW 2016

Standard disclaimer: I saw only a slice of this conference, and probably remembered a slice of that. That said, I thought this stuff was cool:

Campus-Scale Mobile Crowd-Tasking: Deployment & Behavioral Insights by Thivya Kandappu et al
They deployed a system around their campus that would let people answer questions to help out the facilities people - is this restroom clean? is this vending machine stocked? etc. They tried out a couple different ways to group tasks. Here's what I thought was most exciting:
- well, first, that they did it at all, had 80 ppl do 800 tasks
- second, that when it came to "push" (buzz you when there's a task nearby) vs "pull" (are there any tasks here?), the "super-agents" (25% of ppl who did 80% of the work) were less efficient in the pull case, but equally efficient in the push case.

On the bias: Self-esteem biases across communication channels during romantic couple conflict by Lauren Scissors and Darren Gergle
People who have lower self-esteem are likely to use technology to talk with romantic partners during conflicts, but that tends to make them assume the worst. I mean, I suspected this, but had no real reason to think it was true - this is cool evidence.

You get who you pay for: Impact of incentives on participation bias by Gary Hsieh and Rafal Kocielnik
Lottery rewards get people who are more open-to-change. Charity rewards get people who are more self-transcendence oriented. (though usually they're less effective in getting people than fixed rewards.) Higher fixed reward: people might not care about the task as much.

"Constantly Connected" panel - Alex Pang, Gordon Bell, Melissa Mazmanian, and Mary Czerwinski talking about all the issues about being constantly-connected, for better and worse. This is a tough topic because it means a million things to a million people - and indeed, Gordon Bell seems to have been talking about something different than the other three. But Pang, Mazmanian, and Czerwinski had really interesting takes:
Pang: there's focus/concentration, then there's mind-wandering/rest. We should make space for both. Maybe our phones are eroding our capacity for focus, but maybe they're even eroding our mind-wandering.
Mazmanian: first, it's not an individual problem: "you're too stressed", "you should take a break from your phone", etc. Second, instead of "phones are good" or "phones are bad", look at the role that the smartphone is allowing you to play, and decide whether you want to be that person.
Czerwinski: we've done all this research with interruptions and context, when is it ok to interrupt something etc, but why isn't anyone using that?

Closing keynote by Mike Krieger of Instagram - just a series of straight-up things they learned building Instagram from zero to today.
- multiple identities per person -> interesting "finstas" (fake instagrams) and flexibility to express yourself in different ways
- not much follow-back pressure, really make it interest based
- require square photos because they look good and force a crop. Later relax it.
- The Future: explore the world through Instagram. That sounds fun.

Thursday, February 18, 2016

The N questions you always need to answer for any research project, all the time

Especially when it's a new project, people will always ask you a lot of questions:

- What's your Research Question? (similarly, what's your Hypothesis?)
- What's your Contribution?
- Why is it Research?
- Who will it help? (or, who cares?)
- What are you doing?
- What problem are you addressing?
- Is that a real/important problem?
- Why will this solve that problem?
- How is your work different than (any one of 1000 tangentially-related things)?
- How is it done today, and what's wrong with that?
- How will you do it?
- How will you evaluate it?

It really helps if you can answer them all, all the time. You will get instant cred and people will let you do your thing. Unfortunately, it's kind of like air bubbles in a plastic sheet: when you squeeze some of them out, then some of them reappear. Like, if you nail down "how will you do it?" then people will ask "well if you know how to do it, then why is it research?" And if you narrow "Who cares?" down to a small subset, then people will ask "Is that an important problem?"

I'm going to order them from most to least important, in my mind: (note! I am not a grant funder.)

1. What are you doing? (Please, be concise. You get one sentence. Now try it in three sentences.)
2. What problem are you addressing? -- only if it's not obvious.
3. How is your work different than (100 closely-related things)? This is an important question if someone is actually bringing up something that they think is the same thing. This is not an important question if someone is just trying to sound clever.
4. Is this problem a real problem? -- downgraded because in HCI we solve lots of non-real-problems. And it's hard to tell what's a "real problem." If you mean "is it malaria?" then no, we're not solving malaria. You can always play problem-one-upmanship, and it's usually not a fun game to play.
5. Why will your work solve this problem? -- only worth asking if it's not obvious.
6. How is it done today, and what's wrong with that? -- downgraded because usually the answer is "it's not done today."
7. How will you evaluate it? -- again, sometimes it's hard to know until you do it.

These are sometimes not worth asking but people will anyway:
... 10. What's your Research Question or Hypothesis? -- This is valid for some kinds of research, like psychology. This is less valid in the more inventor-ish types of research. People will still ask it anyway.
11. How is your work different than (900 not-really-related things)? -- Sometimes people will ask this to try to sound smart.
12. How will you do it? -- if I knew, it wouldn't be research, would it? Still, people will ask this, and it helps to be able to wave your arms.

These are often not worth asking but people will anyway:
... 100. Why is it Research? -- Ugh. Academics love to ask this. Basically, why aren't you starting a company and doing this? And "because it's goddamn hard to start a company" or "this should be done but nobody wants to pay for it" don't count.
101. What's your Contribution? -- This is a thinly veiled version of "Why Is It Research?"

But yeah, I guess if you want to be good at research, answer all of them all the time.

Monday, February 8, 2016

Welcome to Domo

EDIT: the Domo server is now shut down. If you want access to any of this data, ask me to put you in touch with Shuguan Yang and Sean Qian who are running a server that has this data now. Or, ask me about the S3 bucket that it's all stored in.

I've told this to a lot of people so I've decided to store it all in one place. This guide will range from super-basic to kinda-complicated, so apologies if it's obvious in parts, and apologies if you get lost in parts. ALSO, if you're reading this and you're not new to our group and/or server, then you may have some advice for me, and I'd appreciate it!

Domo is our Amazon Web Services server. It's named after this guy:

On Domo, we have coordinate-geotagged tweets in some cities: (all stored in PostgreSQL database "tweet")
Pittsburgh: since Jan 22, 2014. (table tweet_pgh)
SF, NY: since June 13, 2014. (tweet_sf, tweet_ny)
Houston, Cleveland, Seattle, Miami, Detroit, Chicago, London: since November 7, 2014
Minneapolis: since March 18, 2015
San Antonio, Austin, and Dallas: since June 15, 2015
(everything after SF and NY is stored in tweet_(cityname), where cityname is lowercase, all one word)
We also have tweets in Pittsburgh beyond just coordinate-geotagged tweets, in table tweet_pgh_all.
We also have Instagrams in Pittsburgh from fall 2014 to May 2016 (when Instagram shut off access to public geotagged Instagrams.) - table instagram_pgh
And some flickr photos and other misc data sets. (not in PostgreSQL; in /data/datasets/)

We really only interact with Domo via terminal windows, so if that's not your forte, you may have some difficulty. To log in, use "ssh (your username on Domo)@(Domo's hostname)"
If you want to make it easier, you can open ~/.ssh/config and add an entry:

Host domo
Hostname (Domo's hostname)
User (your username on Domo)

We store the tweets in PostgreSQL. If you've used other SQLs, it's pretty similar, but not the same. Things to know about Postgres and our DB in particular:
  • psql tweet to connect to our database (which is called "tweet").
  • \d to list all relations (aka tables, kinda)
  • \d tablename to get more info about a certain table.
  • The tweets go in basically direct from the Twitter 1% public feed (using this script). They're all stored as text and integers except for some things that are "hstores" - basically key-value sets - and the "coordinates", which are stored using PostGIS as Points.
  • To access those Points, use some of the PostGIS functions. For example, SELECT ST_AsText(coordinates) FROM tweet_pgh; to get them in a semi-readable format. ST_AsGeoJSON has been the most useful for me.
  • To query all tweets within an area: SELECT * FROM tweet_pgh WHERE ST_MakeEnvelope(-79.9, 40.44, -79.899, 40.441, 4326) ~ coordinates; 
    • (that "4326" is, for current purposes, a magic number. It means EPSG 4326/WGS-84 which is pretty much a standard for everything I do. So I always just leave it as 4326, and if you don't know better, I suggest you do too.)
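
If you'd rather build that bounding-box query from Python, here's a minimal sketch. The function name and the half_width_deg parameter are made up for illustration, and it assumes a DB-API driver that uses %s placeholders (psycopg2 does). The one thing that's easy to get wrong: ST_MakeEnvelope takes x before y, so longitude comes before latitude.

```python
import re


def tweets_near_query(table, lat, lon, half_width_deg=0.0005, srid=4326):
    """Build (sql, params) selecting tweets in a small box around (lat, lon).

    The table name cannot be passed as a query parameter, so it is checked
    against a simple pattern instead (hypothetical rule: tweet_ + lowercase).
    """
    if not re.fullmatch(r"tweet_[a-z_]+", table):
        raise ValueError("unexpected table name: %r" % table)
    sql = (
        "SELECT * FROM " + table
        + " WHERE ST_MakeEnvelope(%s, %s, %s, %s, %s) ~ coordinates;"
    )
    # Order is xmin, ymin, xmax, ymax, srid: longitude first, then latitude.
    params = (
        lon - half_width_deg,
        lat - half_width_deg,
        lon + half_width_deg,
        lat + half_width_deg,
        srid,
    )
    return sql, params
```

Then, with an open cursor, `cursor.execute(*tweets_near_query("tweet_pgh", 40.4405, -79.8995))` should run about the same query as the example above.
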
Things to know about Domo:
  • Change your password right away. Do this by typing "passwd" after you SSH in.
  • Don't store things in your homedir! Our whole homedir partition only has about 8GB. Obviously, that fills up fast. Store anything you can in /data - that has 1TB. I might bug you sometimes to clean up your homedir if you end up using a lot of space.
When I add you to Domo, I'll tell you:
  • your username on Domo
  • your temporary password (change this as soon as possible)
  • Domo's hostname (not shown here so we get attacked as little as possible)
You should tell me:
  • if you want a username that's different than your email address, tell me ASAP and I'll create that and delete your old one.
  • your github username so I can add you to our github organization.
Dan's note to himself:
  • give the new person an account with sudo adduser username
  • give them a postgresql account (CREATE USER username;) and give them permission to read all the tables (GRANT SELECT ON ALL TABLES IN SCHEMA public TO username)
  • get their github username and give them access to the CMUChimpsLab organization too.

Tuesday, January 19, 2016

Emojis (and words) of Pittsburgh on the SUDS blog!

Hey, check it out. An article about work that Jennifer Chou and I did, on the Students for Urban Data Systems (CMU org) blog!

Saturday, December 19, 2015

First paper(s) accepted, and thesis proposal proposal proposed!

It's an exciting time here for this mid-level PhD Student.

First! I've had a paper accepted to CHI, the biggest Human-Computer Interaction conference. Paper publishing is nice. It means that people can read about what you're doing, it means four people think your work is worthwhile and well done, and it's a tangible mark of success.

It's called "Getting Users' Attention in Web Apps in Likable, Minimally Annoying Ways", the title is pretty self explanatory, and it is but one more bit of sand in the large pile of research that's been done on notification systems. Even so, web sites are still not very good at this, and our paper offers one potential piece of a solution, so for that reason I am pretty happy to have done it. Baby steps.

I have Anupriya Ankolekar and especially Josh Hailpern to thank for the opportunity to intern at HP Labs, the guidance in running the study and writing the paper, and the moral support throughout the tumultuous process. Thanks, you two, for helping out a research neophyte.

Second! I've had another paper, that I helped with, accepted to CHI, and this one with so many good friends and my fiancee Tatiana, so that's cool. "Mailing Archived Emails As Postcards: Probing the Value of Virtual Collections." This one's almost four years in the making, so thanks to Tati, Dave, and Jenny for gathering and analyzing data with me; Jason, Will, and John for advice along the way; and Beka, Will, and Dave for doing the writing heavy lifting. This was pretty epic, and I'm excited we can talk about it. Finally.

(I'll post both of the papers when stuff's more officially published.)

Third! Sorry if you are feeling down about your work and this feels like rubbing it in; I have been that student very many times (and will probably continue to be in the future) and I feel ya. I don't really mean to humblebrag, or even straight-up brag, so I've resisted facebooking or twittering any of this, but I'm really pretty relieved and excited to hit this milestone, and nobody reads this blog anyway. Plus, another data point for "hang in there, keep resubmitting your drafts, you'll win this game someday" I guess?

Fourth! I'm going to graduate someday. I've got my thesis proposal proposal done (meaning, I haven't done the proposal, but I've talked with most of my committee and figured out what my thesis proposal will be). Now all I have to do is run a study or two to plan the thing, do some background research, build a giant web app, do a lot more studies, etc etc etc, but you know, it's on the right track. I've got some great profs behind me, I'm stoked about the work, and I can do most of it while I'm with Tati. So.

Saturday, May 2, 2015

Roads Greenery Buildings

What is your neighborhood made of?

We don't interact with zoning or construction in our everyday lives. We just know that some places are more pleasant than others. We don't really see the effects of dedicating half our space to parking lots and roads. We sort of know that New York is denser than suburban Ohio, but how dense is it?

More pragmatically, you may be looking for a place to live in a new city. You like your neighborhood now, so you wouldn't mind a place that "feels like" it. Obviously, midtown Manhattan won't feel like Squirrel Hill, Pittsburgh, but what neighborhood would?

Roads Greenery Buildings is an attempt to partially answer that question.

Give it an address, and it will look up the place on Google Maps and Google Earth, and tell you the approximate amount of that place's nearby area that's taken up with roads, green space, and buildings. You can look up a few places to compare. Here we see that my neighborhood (the third one) has more roads and buildings than Carnegie Mellon (the first one) - which makes sense; CMU is a college campus with some big lawns. My neighborhood is also a little greener than nearby Oakland.

Here's a comparison of some neighborhoods in San Francisco, based on some coffee shops I like. Haus Coffee (the first) is in the greenest area (24th St. in the Mission is full of trees) but greenery is in short supply all around. This is to be expected; it's a big city. I was surprised to find the Ritual Hayes Valley branch (#4) to have so many roads nearby, but on reflection, there are a couple of big boulevards right there. Meanwhile, the areas around Four Barrel (#2) and Saint Frank (#5) look the densest in terms of buildings.

This doesn't tell you everything, of course. The space calculations are imperfect, and there's no description of what the green space is (a highway median is less good than a nice park) or what the buildings are (a parking garage, a house, and an office skyscraper all get the same weight). But it's a start. I think of this (or, you know, the platonic ideal of this) as a peer to Walkscore: by no means the only tool that helps you understand a place, but one of many.

What's good? Depends on you, I guess, but I think this tool shows how places with more buildings tend to be more approachable and interesting, while green space often just makes things farther apart.

Try it out! (disclaimer: link worked as of May 2015; apologies if it's rotted since then.)

Hat tip to Andrew Alexander Price for the blog post that inspired this work. (More details.)

Monday, April 27, 2015

Thinking about metrics

Reading about effective altruism, the Open Philanthropy Project, GiveWell, etc, and thinking "good lord, how can they possibly hope to put a number on what's The Best Thing to do with your money?" It feels like they're taking a (to use the one design concept I sort of understand) wicked problem and trying to make it tame. Usually this doesn't go well; as the Vox article above hinted, you often end up only representing a couple of viewpoints, or making it worse by playing whack-a-mole by iteratively solving whatever problem you're thinking about at the moment.

But, I've got reason to assume, based on what GiveWell's done so far, that at least some of the Open Phil people are thinking about it in this broad sense.

And metrics aren't all bad! I think about WalkScore, which is limited and flawed, but is still a pretty solid and useful indicator of how nice it is to live in a place. And really, the thing is, we're often making decisions by metrics anyway, and often those metrics are suuuper flawed. Like GDP. So if I'm reading about a Social Progress Index on TED, sure, it might be a TED blowhard with another half-assed idea, but it doesn't have to be that good. I'd love to start talking about SPI instead of GDP, not because it's great, but because it's better.