Data Vis Tutorial: Open data queries using SODA and SoQL.

Background

 

What is Socrata?

Socrata is a Seattle-based company, originally founded as Blist in 2007, that has engineered the platform for a number of open government databases, including that of New York, Chicago, Baltimore, the White House, and a number of federal agencies. Socrata catalogs all its open datasets here.
It also hosts community hackathon sites here.
SODA = Socrata Open Data API
SoQL = Socrata Query Language

Why are we learning Socrata tools?

Socrata platforms are the gateway to a number of government datasets you may want to use. I also like that the platforms are well-structured, relatively well-documented, and offer an SQL-like query language that can be a good preview to using MySQL and other database packages.

Do I always have to use SoQL to get Socrata data?

Not at all! Sometimes a dataset is small enough that you can download it through the Socrata web UI. But in the case that you are dealing with massive datasets like the NYC 311 callbase, you will need to be able to use SoQL to get exactly what you want (unless you want to wait for hours and days to download massive datasets).

Getting Started: NYC Open Data

Please peruse the Socrata Open Data API Docs to understand the following NYC Open Data queries. I wanted to query this month’s 311 noise complaints in in my neighborhood (zip code 11231), so I obtained the dataset’s API endpoint, then I modified the format extension to output CSVs instead of JSON, and finally I added filter and query parameters to obtain this month’s noise complaints in 11231.

Try pasting the following queries as URLs in your browser; for each query, a small a CSV will download to your computer.

  • First, try using a simple filter to get some 311 calls in 11231. The default number of records is 1000, and the default start date is in 2010.
https://data.cityofnewyork.us/resource/erm2-nwe9.csv?incident_zip=11231
  • I want to start making more complex queries to obtain more recent records just from my neighborhood, so I changed the syntax to the SoQL format.
https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$where=incident_zip='11231'
  • Then I formed a query to increase the output to 10,000 records:
https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$where=incident_zip='11231'&$limit=10000
  • And a query for records created on or after September 1 2015:
https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$where=created_date >='2015-09-01T00:00:00'
  • And a query for noise complaints only:
https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$where=starts_with(complaint_type,'Noise')
  • Finally, I combined all the previous queries to get exactly what I wanted.
https://data.cityofnewyork.us/resource/erm2-nwe9.csv?$where=starts_with(complaint_type,'Noise') AND created_date >='2015-08-01T00:00:00' AND incident_zip='11231'

 

Now Try It!

Expand the NYC Open Data exercise by looking at December 2014, the month of protests around Eric Garner’s death. Form a query that outputs a CSV with the following attributes:

  • data source: NYC Open Data
  • time period: December 2014
  • CSV size: smaller than 1 MB

Start Visualizing

If you have some visualization experience, you can use the Socrata examples to help you turn your new dataset into a visualization. This will be due in two weeks. If you aren’t there yet, give it a try. I’ll post a brief tutorial on visualizing your data next week.

And a Handy Thing

You will probably be using Excel, Google Sheets, Numbers, or another spreadsheet tool to view your data. It can be very handy–especially when you’re annotating your vis or writing a report on it–to know how to put together basic formulas using spreadsheet functions. Since we all have access to Google via our NYU addresses, here are some basic how-tos on putting together a formula using Google Sheet functions: