##About

This is a tool for generating dictionaries (word lists) from Twitter searches, inspired by this blog.

Compared to the original blog, this version has a few differences:

  1. Uses Jansson and jshon for JSON parsing
  2. Sorts output by number of occurrences, so the list can be cropped with `head`
  3. Takes all search terms in one go
  4. Filters the output against a dictionary of ‘stop-words’ (currently a Swedish one)
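
As a rough illustration of point 2, the snippet below ranks a small hand-written word list by frequency. It is only a sketch: `rank_words` is a hypothetical helper name, and it strips the count column with `awk` rather than the fixed-width `cut -c9-` used in the script further down.

```shell
#!/bin/bash
# Hypothetical helper: read one word per line on stdin and print the
# distinct words, most frequent first.
rank_words() {
  # sort groups identical words, uniq -c counts each group,
  # sort -nr puts the highest counts first, awk drops the count column.
  sort | uniq -c | sort -nr | awk '{print $2}'
}

# Sample input (made up for illustration); keep only the top two words.
printf '%s\n' riksdag svpol riksdag regeringen riksdag svpol | rank_words | head -n2
```

Because the most frequent words come first, piping through `head -nN` keeps the N most common ones.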

##Example usage

$ ./tweetsearch.sh "sveriges riksdag" "svenska regeringen" | head -n20
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 77594  100 77594    0     0  64952      0  0:00:01  0:00:01 --:--:-- 79665
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 41749  100 41749    0     0  45874      0 --:--:-- --:--:-- --:--:-- 61758
sveriges
riksdag
svpol
bjöd
regeringen
svenska
kritiserad
judehatare
kaplan
mehmet
riksdagsislamist
yvonne
ridley
ytterligare
mehmetkaplan
miljöpartiet
jerlerup
värre
soppan
riksdagen

##Requirements

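  1. jshon
  2. Jansson (the JSON library that jshon is built on)
  3. curl
  4. A stop-word file named `swedish_stopwords` in the working directory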
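
A quick way to check that the command-line tools the script calls are installed could look like the sketch below. `missing_tools` is a hypothetical helper, not part of the script, and it only checks for executables on the `PATH`; Jansson is a library, so it is not covered by this check.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every command in the argument
# list that cannot be found on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# The script calls curl and jshon, so those are the ones checked here.
missing=$(missing_tools curl jshon)
if [ -n "$missing" ]; then
  echo "Missing required tools: $(echo $missing)" >&2
fi
```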

##Code

It’s not very advanced or special in any way:

#!/bin/bash

if [ $# -eq 0 ]
then
  echo "Usage: $(basename "$0") term1 term2"
  exit 65;
fi
tmpfile=$(mktemp)
resultsfile=$(mktemp)
# Remove the temporary files when the script exits
trap 'rm -f "$tmpfile" "$resultsfile"' EXIT
#echo "Searching for the following terms:"
for term in "$@"; do
#for i in $*; do
	#echo "Performing search for $term"
	curl -G --data-urlencode "q=$term" --data-urlencode "rpp=500" "http://search.twitter.com/search.json" > "$tmpfile"

	# Handle the data
	# jshon -e results -a -e text < "$tmpfile"	- extract the tweet content
	# cut -d'"' -f2		- keep the text inside the first pair of quotes
	# tr ' ' '\n'			- convert spaces to line breaks
	# sed 's/^#//'			- remove the #-char from the beginning of a line
	# tr 'A-Z' 'a-z'		- make lowercase (ASCII letters only)
	# sed 's/^@//'			- remove the leading @ from Twitter usernames
	# grep -v '^http://'	- remove links
	# tr -d ':,.!?'			- remove some other chars
	jshon -e results -a -e text < "$tmpfile" | cut -d'"' -f2 | tr ' ' '\n' | sed 's/^#//' | tr 'A-Z' 'a-z' | sed 's/^@//' | grep -v '^http://' | tr -d ':,.!?' >> "$resultsfile"
done
#echo "Sorting, uniquifying and showing results"
#	'sort'		group identical words so uniq can count them
#	'uniq -c'	count the occurrences of each word
#	'sort -nr'	sort by number of occurrences, highest first
#	'cut -c9-'	remove the occurrence count
#	'grep -vwx --file=swedish_stopwords'	remove stop-words
#	'grep -v "^$\|^.$\|^..$"'	remove empty lines, one-letter words and two-letter words
sort "$resultsfile" | uniq -c | sort -nr | cut -c9- | grep -vwx --file=swedish_stopwords | grep -v "^$\|^.$\|^..$"
#jshon -e results -a -e text <  search.json
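
To try the cleaning and stop-word steps without calling the search API, something like the following sketch can be run on a hand-written line of text. The function names `normalize_words` and `drop_stopwords`, the sample sentence and the three-word stop list are all made up for illustration; the filters themselves mirror the ones in the script above.

```shell
#!/bin/bash
# Hypothetical helper: split text on stdin into cleaned, lowercase words.
normalize_words() {
  tr ' ' '\n' |            # one word per line
    sed 's/^[#@]//' |      # strip a leading # or @
    tr 'A-Z' 'a-z' |       # lowercase (ASCII letters only)
    grep -v '^http://' |   # drop links
    tr -d ':,.!?'          # drop some punctuation
}

# Hypothetical helper: drop words listed in the file "$1" (one per line),
# plus empty lines and words of one or two characters.
drop_stopwords() {
  grep -vwx -f "$1" | grep -v '^.\{0,2\}$'
}

# Illustrative stop list and sample text, not real data.
stopwords=$(mktemp)
printf '%s\n' och att det > "$stopwords"
echo "Och #svpol kritiserar @Riksdagen: det blir debatt! http://example.com/x" |
  normalize_words | drop_stopwords "$stopwords"
rm -f "$stopwords"
```

Keeping the two stages in separate functions makes it easy to swap in a stop-word file for another language.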

2013-04-13
