Friday, May 19, 2017

State of the Geotags: Motivations and Recent Changes

... is the title of our recent paper at ICWSM 2017, "our" being me, Zichen Liu, Alex Sciuto, and our kind advisor, Jason Hong.

The paper, in one sentence: think of geotagged social media posts (tweets, instagrams, etc.) as postcards, not as ticket stubs; as conscious choices, not unconscious byproducts.

In three bullet points:
- when people are posting something to Twitter/Flickr/etc, they usually consciously choose to add their location; it's not a "set it and forget it" situation. (we found this out by analyzing how often people toggle between geotagging and not.)
- when people post their location, they usually do it from unusual or faraway places. They usually don't do it at home or in their neighborhood, and they usually don't do it from places that they go to regularly. (this is from surveys.)
- people are posting their specific location less than they used to. Some of this might be privacy; a lot of it is because Twitter changed the defaults on how your location is posted.
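
To make "toggling" in the first bullet concrete, here is a minimal sketch of one way to count it. This is only an illustration, not necessarily the measure used in the paper: the function name and the list-of-booleans input (one flag per post, in time order, True if the post was geotagged) are assumptions made up for this example.

```python
def count_toggles(geotag_flags):
    """Count how often consecutive posts switch between geotagged and not.

    geotag_flags is a hypothetical input: one boolean per post, oldest first.
    """
    pairs = zip(geotag_flags, geotag_flags[1:])
    return sum(1 for previous, current in pairs if previous != current)


# A made-up user who turns geotagging off, on, and off again.
posts = [True, True, False, True, False, False]
print(count_toggles(posts))
```

A "set it and forget it" user would score zero or close to it here; someone choosing post by post would score much higher.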

In more detail:
Paper Slides

Friday, March 31, 2017

Getting from Zero to What I Do Most Of The Time With Data

We've been getting a lot of undergrads and master's students coming on board in our lab, with pretty vastly different levels of experience. That is good! The more diversity, the better, I say.

However, it tends to be hardest for those with the least experience. Often I'll say something like "just ssh in to the server, connect to our postgres database, and get all the tweets in this area." And they'll be like "oops I guess I was supposed to know what ssh and postgres are, but I don't, so now I'm either trying to bluff or googling frantically." Which is too bad! I think what they should do here is ask me for advice, but they don't know that. They might think that I'm either a jerk who would make fun of them for not knowing that much, or that I'm a person with wayyy too many responsibilities to possibly give them the time they need (i.e. a professor).

I do appreciate their concern for my time, though, and it's probably more fun for them to learn things themselves (go at your own pace, etc), so I've put together this list of useful guides.

Unix Computing Basics
How to Unix (mac) - work through Conquering the Command Line (chapter 1)
To get started with this, open up the program "Terminal" on your mac. You can do that by going to the magnifying glass in the top right and typing "terminal."
How to get a Unix-ish prompt on Windows - I don't actually know. Someone please suggest a tutorial for this.
How to Unix (Linux) - despite this being the year of the Linux desktop, few people have one. If you do, open up a terminal however you do. I used to run Ubuntu and it made that pretty easy.
SSH (to connect to a remote server and navigate around there) - the "basic syntax" part is fine. You probably won't need the "keys" bit but it might be fun if you want to look around later.
SFTP (if you want to download a file from a remote server)
git: try Software Carpentry's git novice course. (parts 1-9 especially.) All of Software Carpentry's stuff seems pretty good.

Vim and other Text Editing
You should probably know at least the basics of Vim, because it's installed on every computer ever, and you always need to edit text files. Also, sometimes you'll end up in vim for some reason and it's good to be able to quit. Learn Enough's text editor class (at least chapter 1, Vim) seems like a good place to start.

Python
Software Carpentry has a good lesson here too, specialized for research computing.
For more general python, or if you are starting from zero programming, you might have more luck with Learn Python The Hard Way (for everything; long, but you can breeze through the parts you already know.)
pip and virtualenv. Generally you should make a virtualenv for each project. Stuff may work without it, but then you've got global dependencies: if you need module A to be v3.0, but you update it to 4.0 for some other project, your old thing that's still expecting 3.0 may stop working. virtualenv gives you a separate copy of module A for each project that needs it. You might see some places recommending you use conda instead; that's fine too. I have less experience with it, but it'll probably get you where you need to go.
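
To see what "a separate copy for each project" means on disk, here is a minimal sketch. It uses Python 3's built-in venv module rather than the virtualenv tool itself (a swap made so the example needs nothing installed), and the temporary project directory and the "env" folder name are assumptions for illustration.

```python
import os
import tempfile
import venv


def make_env(project_dir):
    """Create an isolated environment in <project_dir>/env and return its path."""
    env_dir = os.path.join(project_dir, "env")
    # with_pip=False keeps this sketch fast and offline; for real work you
    # would normally want pip inside the environment (with_pip=True).
    venv.create(env_dir, with_pip=False)
    return env_dir


# Stand-in for a real project folder (hypothetical, created just for the demo).
project = tempfile.mkdtemp()
env_dir = make_env(project)

# Each environment gets its own config file and its own package directory,
# so what you install for one project does not touch another.
print(os.path.isfile(os.path.join(env_dir, "pyvenv.cfg")))
```

In day-to-day use you would more likely do this from the terminal: python3 -m venv env, then source env/bin/activate, then pip install whatever the project needs.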
The csv module is particularly useful; here is a guide for it. So is the argparse module; tutorial here.
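
As a taste of how csv and argparse fit together, here is a small sketch of a script that counts rows per value of one column. The column names ("user", "text") and the sample data are made up for this example; only the csv and argparse calls come from the standard library.

```python
import argparse
import csv
import io


def build_parser():
    """Command-line interface: a CSV path plus an optional column name."""
    parser = argparse.ArgumentParser(description="Count rows per value of one CSV column.")
    parser.add_argument("csv_file", help="path to a CSV file with a header row")
    parser.add_argument("--column", default="user", help="column to count (default: user)")
    return parser


def count_by_column(lines, column):
    """Read CSV rows (with a header) and count how often each value appears."""
    counts = {}
    for row in csv.DictReader(lines):
        counts[row[column]] = counts.get(row[column], 0) + 1
    return counts


def main(argv=None):
    """Entry point for command-line use; argv=None means read sys.argv."""
    args = build_parser().parse_args(argv)
    with open(args.csv_file, newline="") as f:
        for value, n in sorted(count_by_column(f, args.column).items()):
            print(value, n)


# Demo on in-memory data, so no file is needed (hypothetical sample rows).
sample = io.StringIO("user,text\nalice,hi\nbob,yo\nalice,bye\n")
print(count_by_column(sample, "user"))
```

To use it as a real script, call main() at the bottom of the file and run something like python count.py tweets.csv --column user (both file names are placeholders).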
If you need to send out HTTP requests, use the requests module.
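
Here is a minimal sketch of what using requests looks like. It assumes requests is installed (pip install requests, ideally inside your virtualenv). The URL and query parameters are placeholders, not a real API, and the helper names are made up for this example.

```python
import requests


def build_request(url, params):
    """Prepare a GET request without sending it, to inspect the final URL."""
    return requests.Request("GET", url, params=params).prepare()


def fetch_json(url, params=None, timeout=10):
    """Send a GET request and return the decoded JSON body."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.json()


# Placeholder URL and parameters; nothing is sent over the network here.
prepared = build_request("https://example.com/search", {"q": "pittsburgh", "count": 10})
print(prepared.url)
```

fetch_json is the part you would actually call against a real API. Always pass a timeout, because otherwise a hung server can hang your script indefinitely.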

PostgreSQL - this is an ok tutorial. Part VII may be more complicated than you need. I'd love to see a better tutorial too. SQLBolt may be that better tutorial.
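A bit about PostGIS - PostGIS is a library that lets you use geo data in your PostgreSQL database somewhat sanely. You can probably skip most of this. SELECT * FROM tweet_pgh WHERE ST_MakeEnvelope(-79.9, 40.44, -79.899, 40.441, 4326) ~ coordinates; is probably what you need: it builds a box from (min longitude, min latitude, max longitude, max latitude, SRID) and keeps the rows whose coordinates the box contains.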
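
If you end up running that query from Python, a sketch like the following keeps the box corners as parameters instead of pasting numbers into the SQL string. The function name is made up for this example. The %s placeholders follow the style used by the psycopg2 driver, and whether you use psycopg2 at all is an assumption on my part.

```python
def bbox_query(west, south, east, north, srid=4326):
    """Return (sql, params) selecting tweets inside a lon/lat bounding box."""
    sql = (
        "SELECT * FROM tweet_pgh "
        "WHERE ST_MakeEnvelope(%s, %s, %s, %s, %s) ~ coordinates;"
    )
    # ST_MakeEnvelope takes xmin, ymin, xmax, ymax, srid: longitude before latitude.
    return sql, (west, south, east, north, srid)


# Same tiny box as in the query above.
sql, params = bbox_query(-79.9, 40.44, -79.899, 40.441)
print(sql)
print(params)
```

With psycopg2 you would then pass both pieces to cursor.execute(sql, params) and let the driver fill in the values safely.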

There's probably a lot more useful stuff I could put here! Let me know if you've got anything I should add. Also, tell me if you have feedback, good or bad, on any of these.