6 min read

Learning jq

A grabbag of commands

Some notes about learning **jq**, the lightweight and flexible command-line JSON processor. I'll extend and improve this blog post in the next months, in Q1/2018. This page is a cheatsheet for my own use; it has not the purpose to teach anyone.

I need to be fluent with the JSON parsing tool jq. Very often I have to interact with JSON data: When I interact with APIs, when parsing scraped Twitter data, when doing SPARQL queries…

Here I’ve saved some of my own mini-scripts and shortcuts that I have developed for my own needs. Most of it is already available in similar form, on the internet, but it is not available in a form that I found useful. By writing it down, extending and re-writing it I can better commit this code to my own long-term memory. So I hope.

Snippets represented as Gist

I am not sure yet what will be the best textual format or representation for saving my codesnippets. I have created a gist on github.com where I’ve saved some jq commands:

Note to self: Maybe there is a better way to store codesnippets? Also better convert the snippets below to a gist? (But you do not really put script output into a gist...)

Snippets not yet stored as gist

The following text snippets are a bit longer. They emerged by entering commands on the Unix command line. This was a trial and error process. Some script output is also given. .

The format of the snippets below is a pipe of shell commands.

<infile.json jq-cmd1 | jq-cmd2 | shell-cmd1 | shell-cmd2 ...

The infile.json contains Twitter data that I’ve saved beforehand using a small python script.

The infile contains one Tweet per line. The file as a whole is not a valid JSON file. For this reason I need the first jq command, jq-cmd1. It converts the collection of JSON fragments into an array of JSON objects and applies a transformation, or does some filtering. This creates another sequence of JSON fragments.

The second jq command converts the transformed JSON into a csv file. This jq-cmd is so complex that I’ve factored it out into a small file in the same directory, called to_csv.jq (see at end of this post).

NOTE/TODO If the python programs in the csvkit are installed, you can also use:
 < simplefile.json in2csv > csv.out 
csvkit (code) is a collection of commandline tools that contains
csv2arff   csvcut     csvgrep    csvjson    csvpy      
csvsql     csvstat    csvclean   csvformat  csvjoin    
csvlook    csvsort    csvstack   in2csv    sql2csv

Note to self: Get this list with these bash commands

sudo apt install apt-file
apt-file update
apt-file list python3-csvkit

jq Module to_csv.jq looks like this:

### (insert Gist or scriptcode here)

Instead of the complex to_csv … TODO: (explain difference of to_csv.jq and jq’s built-in @csv)

NOTE end TODO

Example 1 - extract tweet text

This snippet extracts the text from a file that I created by saving data from the Twitter streaming API. I’ve filtered for the “#rstats” hashtag which includes tweets about the R language:

< raw_tweets/stream__AGU17.json jq -s '[.[] | {text: .text}] ' \
   | jq  -f -r to_csv.jq  | grep -i "rstats" | sort -u

Example 2 - extract more than 1 field from Tweets

This is a variant of Example 1. Extract 5 fields from different levels of the tweet into a csv file with brief column names.

<raw_tweets/stream_rstats.json jq  -s  -S  '[.[] | \
   {name:  .user.name, sn: .user.screen_name, loc: .user.location, \
   url:.user.url, desc:.user.description, _fol: .user.followers_count }]'   | \
   jq  -f -r to_csv.jq

Helper function to_csv()

This is from a Stackoverflow Answer to the question How to convert arbirtrary simple JSON to CSV using jq?


def tocsv($x):
    $x
    |(map(keys)
        |add
        |unique
        |sort
    ) as $cols
    |map(. as $row
        |$cols
        |map($row[.]|tostring)
    ) as $rows
    |$cols,$rows[]
    | @csv;

tocsv(.)

# call: 
# <tweets.json jq  -s  -S  '[.[] | {name:  .user.name, sn: .user.screen_name, loc: .user.location, url:.user.url, desc:.user.description, _fol: .user.followers_count }]'  | jq  -f -r to_csv.jq

The other answers are worth a look, too.

Further processing

Further processing on the command line can be done with the csvkit suite of tools.

This is an example. (csvsort, csvcut, csvlook correspond to shell-cmd1 mentioned above) :

<raw_tweets/stream_rstats.json jq  -s  -S \
'[.[] | {name:  .user.name, sn: .user.screen_name, \
loc: .user.location, \
url:.user.url, \
desc:.user.description, \
_fol: .user.followers_count }]'   | \
jq  -f -r to_csv.jq    | \
 csvsort -H -c 1    | csvcut -c 5,2 -l | csvlook --max-column-width 30 --max-rows 10

It produces this output:

(This table is not very informative, but that’s intentional. The content of the tweets is irrelevant here).

| line_number | f                              | c              |
| ----------- | ------------------------------ | -------------- |
|           1 |                                | 日本 東京      |
|           2 |                                | Milan          |
|           3 |                                | Grand Ledge    |
|           4 |                                | Iowa City, IA  |
|           5 |                                | Pennsylvania   |
|           6 |                                | Columbus, OH   |
|           7 |                                | Columbus, OH   |
|           8 | http://es.linkedin.com/in/j... | Madrid - Spain |
|           9 |                                |                |
|          10 | http://personal.tcu.edu/kyl... | Fort Worth, TX |
|         ... | ...                            | ...            |

Example 3 - Extract follower-count, username, description, location

Extract .user.name, .user.screen_name, .user.location, url:.user.url, .user.description, .user.followers_count, but display only follower-count, name, description, location sorted by followercount descending

<raw_tweets/stream_rstats.json jq  -s  -S  '[.[] | {name:  .user.name, sn: .user.screen_name, loc: .user.location, url:.user.url, desc:.user.description, _fol: .user.followers_count }]'   | jq  -f -r to_csv.jq     | csvcut -c 1,4,2,3 | csvsort -c 1 -r  | uniq | more 

Output:

_fol,name,desc,loc
387997,"Craig Brown, Ph.D.","Technology & #Business #Consultant (#techpreneur), Dad, #Entrepreneur, #STEM, #Philanthropist, #CancerSurvivor + #BigData SME http://www.craigbrownphd.com","Houston, TX"
83310,Kirk Borne,"The Principal Data Scientist at @BoozAllen, PhD Astrophysicist, ♡ Data Science, Top Big Data Influencer. Ex-Professor http://rocketdatascience.org/",Booz Allen Hamilton
83296,Kirk Borne,"The Principal Data Scientist at @BoozAllen, PhD Astrophysicist, ♡ Data Science, Top Big Data Influencer. Ex-Professor http://rocketdatascience.org/",Booz Allen Hamilton
33681,R-bloggers,Twitting blog posts from the R blogosphere,
33671,R-bloggers,Twitting blog posts from the R blogosphere,
30097,Hadley Wickham,"R, data, visualisation.","Houston, TX"
30096,Hadley Wickham,"R, data, visualisation.","Houston, TX"
30094,Hadley Wickham,"R, data, visualisation.","Houston, TX"
30074,Hadley Wickham,"R, data, visualisation.","Houston, TX"
30072,Hadley Wickham,"R, data, visualisation.","Houston, TX"
20738,BIconnections,Networking | connecting | people | technology | #process | #Business #Intelligence | #Information #Management | #data | learning discussions | #Community,Belgium
20731,BIconnections,Networking | connecting | people | technology | #process | #Business #Intelligence | #Information #Management | #data | learning discussions | #Community,Belgium
20254,Bill Nigh,Citizen/US Army Vet/Con

These are the people with the most followers. Their follower count has increased while I was scraping the data from the Twitter Streaming API.

Example 4 - Extract Users and those they retweeted or mentioned in text

This example was a preprocessing step for this blogpost: Constructing a tiny social-network visualization from the #Rstats hashtag.

This command gives a tabular output to the question “Who mentions whom?”. The data can be used to create a social-network visualisation.

<raw_tweets/stream_rstats.json jq  -s    '[.[] | { _name:  .user.name, _screen_name: .user.screen_name, _id:.user.id, mentions_id: .entities.user_mentions[].id, mentions_name: .entities.user_mentions[].name, mentions_screen_name: .entities.user_mentions[].screen_name}]' | jq -f -r to_csv.jq |  grep -P  '^"' | head
## bash: raw_tweets/stream_rstats.json: No such file or directory
## jq: Could not open to_csv.jq: No such file or directory

Output:

(Only ten rows are shown)

"_id","_name","_screen_name","mentions_id","mentions_name","mentions_screen_name"
"622262193","ReactJS News","ReactJS_News","245217900","timelyportfolio","timelyportfolio"
"151588786","Alex Bertram","bedatadriven","552767357","Maarten-Jan Kallen","mj_kallen"
"1553414406","Lovedeep Gondara","LovedeepSG","144592995","R-bloggers","Rbloggers"
"4277617239","Javascript Bot","JavascriptBot_","622262193","ReactJS News","ReactJS_News"
"4277617239","Javascript Bot","JavascriptBot_","622262193","ReactJS News","timelyportfolio"
"4277617239","Javascript Bot","JavascriptBot_","622262193","timelyportfolio","ReactJS_News"
"4277617239","Javascript Bot","JavascriptBot_","622262193","timelyportfolio","timelyportfolio"
"4277617239","Javascript Bot","JavascriptBot_","245217900","ReactJS News","ReactJS_News"
"4277617239","Javascript Bot","JavascriptBot_","245217900","ReactJS News","timelyportfolio"