A grabbag of commands
I need to be fluent with the JSON parsing tool jq. Very often I have to interact with JSON data: When I interact with APIs, when parsing scraped Twitter data, when doing SPARQL queries…
Here I’ve saved some of my own mini-scripts and shortcuts that I have developed for my own needs. Most of it is already available in similar form, on the internet, but it is not available in a form that I found useful. By writing it down, extending and re-writing it I can better commit this code to my own long-term memory. So I hope.
Snippets represented as Gist
I am not sure yet what will be the best textual format or representation for saving my codesnippets. I have created a gist on github.com where I’ve saved some jq
commands:
Snippets not yet stored as gist
The following text snippets are a bit longer. They emerged by entering commands on the Unix command line. This was a trial and error process. Some script output is also given. .
The format of the snippets below is a pipe of shell commands.
<infile.json jq-cmd1 | jq-cmd2 | shell-cmd1 | shell-cmd2 ...
The infile.json
contains Twitter data that I’ve saved beforehand using a small python script.
The infile contains one Tweet per line. The file as a whole is not a valid JSON file. For this reason I need the first jq
command, jq-cmd1
. It converts the collection of JSON fragments into an array of JSON objects and applies a transformation, or does some filtering. This creates another sequence of JSON fragments.
The second jq
command converts the transformed JSON into a csv file. This jq-cmd is so complex that I’ve factored it out into a small file in the same directory, called to_csv.jq
(see at end of this post).
< simplefile.json in2csv > csv.outcsvkit (code) is a collection of commandline tools that contains
csv2arff csvcut csvgrep csvjson csvpy csvsql csvstat csvclean csvformat csvjoin csvlook csvsort csvstack in2csv sql2csv
Note to self: Get this list with these bash commands
sudo apt install apt-file
apt-file update
apt-file list python3-csvkit
jq Module to_csv.jq
looks like this:
### (insert Gist or scriptcode here)
Instead of the complex to_csv … TODO: (explain difference of to_csv.jq
and jq’s built-in @csv
)
Example 1 - extract tweet text
This snippet extracts the text from a file that I created by saving data from the Twitter streaming API. I’ve filtered for the “#rstats
” hashtag which includes tweets about the R language:
< raw_tweets/stream__AGU17.json jq -s '[.[] | {text: .text}] ' \
| jq -f -r to_csv.jq | grep -i "rstats" | sort -u
Example 2 - extract more than 1 field from Tweets
This is a variant of Example 1. Extract 5 fields from different levels of the tweet into a csv file with brief column names.
<raw_tweets/stream_rstats.json jq -s -S '[.[] | \
{name: .user.name, sn: .user.screen_name, loc: .user.location, \
url:.user.url, desc:.user.description, _fol: .user.followers_count }]' | \
jq -f -r to_csv.jq
Helper function to_csv()
This is from a Stackoverflow Answer to the question How to convert arbirtrary simple JSON to CSV using jq?
def tocsv($x):
$x
|(map(keys)
|add
|unique
|sort
) as $cols
|map(. as $row
|$cols
|map($row[.]|tostring)
) as $rows
|$cols,$rows[]
| @csv;
tocsv(.)
# call:
# <tweets.json jq -s -S '[.[] | {name: .user.name, sn: .user.screen_name, loc: .user.location, url:.user.url, desc:.user.description, _fol: .user.followers_count }]' | jq -f -r to_csv.jq
The other answers are worth a look, too.
Further processing
Further processing on the command line can be done with the csvkit suite of tools.
This is an example. (csvsort, csvcut, csvlook correspond to shell-cmd1
mentioned above) :
<raw_tweets/stream_rstats.json jq -s -S \
'[.[] | {name: .user.name, sn: .user.screen_name, \
loc: .user.location, \
url:.user.url, \
desc:.user.description, \
_fol: .user.followers_count }]' | \
jq -f -r to_csv.jq | \
csvsort -H -c 1 | csvcut -c 5,2 -l | csvlook --max-column-width 30 --max-rows 10
It produces this output:
(This table is not very informative, but that’s intentional. The content of the tweets is irrelevant here).
| line_number | f | c |
| ----------- | ------------------------------ | -------------- |
| 1 | | 日本 東京 |
| 2 | | Milan |
| 3 | | Grand Ledge |
| 4 | | Iowa City, IA |
| 5 | | Pennsylvania |
| 6 | | Columbus, OH |
| 7 | | Columbus, OH |
| 8 | http://es.linkedin.com/in/j... | Madrid - Spain |
| 9 | | |
| 10 | http://personal.tcu.edu/kyl... | Fort Worth, TX |
| ... | ... | ... |
Example 3 - Extract follower-count, username, description, location
Extract .user.name, .user.screen_name, .user.location, url:.user.url, .user.description, .user.followers_count
, but display only follower-count, name, description, location
sorted by followercount descending
<raw_tweets/stream_rstats.json jq -s -S '[.[] | {name: .user.name, sn: .user.screen_name, loc: .user.location, url:.user.url, desc:.user.description, _fol: .user.followers_count }]' | jq -f -r to_csv.jq | csvcut -c 1,4,2,3 | csvsort -c 1 -r | uniq | more
Output:
_fol,name,desc,loc
387997,"Craig Brown, Ph.D.","Technology & #Business #Consultant (#techpreneur), Dad, #Entrepreneur, #STEM, #Philanthropist, #CancerSurvivor + #BigData SME http://www.craigbrownphd.com","Houston, TX"
83310,Kirk Borne,"The Principal Data Scientist at @BoozAllen, PhD Astrophysicist, ♡ Data Science, Top Big Data Influencer. Ex-Professor http://rocketdatascience.org/",Booz Allen Hamilton
83296,Kirk Borne,"The Principal Data Scientist at @BoozAllen, PhD Astrophysicist, ♡ Data Science, Top Big Data Influencer. Ex-Professor http://rocketdatascience.org/",Booz Allen Hamilton
33681,R-bloggers,Twitting blog posts from the R blogosphere,
33671,R-bloggers,Twitting blog posts from the R blogosphere,
30097,Hadley Wickham,"R, data, visualisation.","Houston, TX"
30096,Hadley Wickham,"R, data, visualisation.","Houston, TX"
30094,Hadley Wickham,"R, data, visualisation.","Houston, TX"
30074,Hadley Wickham,"R, data, visualisation.","Houston, TX"
30072,Hadley Wickham,"R, data, visualisation.","Houston, TX"
20738,BIconnections,Networking | connecting | people | technology | #process | #Business #Intelligence | #Information #Management | #data | learning discussions | #Community,Belgium
20731,BIconnections,Networking | connecting | people | technology | #process | #Business #Intelligence | #Information #Management | #data | learning discussions | #Community,Belgium
20254,Bill Nigh,Citizen/US Army Vet/Con
These are the people with the most followers. Their follower count has increased while I was scraping the data from the Twitter Streaming API.
Example 4 - Extract Users and those they retweeted or mentioned in text
This example was a preprocessing step for this blogpost: Constructing a tiny social-network visualization from the #Rstats hashtag.
This command gives a tabular output to the question “Who mentions whom?”. The data can be used to create a social-network visualisation.
<raw_tweets/stream_rstats.json jq -s '[.[] | { _name: .user.name, _screen_name: .user.screen_name, _id:.user.id, mentions_id: .entities.user_mentions[].id, mentions_name: .entities.user_mentions[].name, mentions_screen_name: .entities.user_mentions[].screen_name}]' | jq -f -r to_csv.jq | grep -P '^"' | head
## bash: raw_tweets/stream_rstats.json: No such file or directory
## jq: Could not open to_csv.jq: No such file or directory
Output:
(Only ten rows are shown)
"_id","_name","_screen_name","mentions_id","mentions_name","mentions_screen_name"
"622262193","ReactJS News","ReactJS_News","245217900","timelyportfolio","timelyportfolio"
"151588786","Alex Bertram","bedatadriven","552767357","Maarten-Jan Kallen","mj_kallen"
"1553414406","Lovedeep Gondara","LovedeepSG","144592995","R-bloggers","Rbloggers"
"4277617239","Javascript Bot","JavascriptBot_","622262193","ReactJS News","ReactJS_News"
"4277617239","Javascript Bot","JavascriptBot_","622262193","ReactJS News","timelyportfolio"
"4277617239","Javascript Bot","JavascriptBot_","622262193","timelyportfolio","ReactJS_News"
"4277617239","Javascript Bot","JavascriptBot_","622262193","timelyportfolio","timelyportfolio"
"4277617239","Javascript Bot","JavascriptBot_","245217900","ReactJS News","ReactJS_News"
"4277617239","Javascript Bot","JavascriptBot_","245217900","ReactJS News","timelyportfolio"