Perl Script to Insert DBpedia Infobox Data into a MySQL Database

I’m in the middle of a project involving Wikipedia and ResearchCyc, and I needed to get the data contained in DBpedia’s Infobox RDF Triple file into a MySQL database so I could use it in conjunction with the database created by the excellent Wikipedia Miner.

This script parses out the Wikipedia page, DBpedia Infobox Predicate and Infobox subject, and inserts them into a MySQL table. I thought I’d share it with The Internet in case someone else wanted to work with DBpedia infobox data in the same way.

One ends up with a table filled with entries like:

+----------+-----------------+---------------+------------------------------------------+
| id       | predicate       | subject       | page                                     |
+----------+-----------------+---------------+------------------------------------------+
| 22378459 | foundationDate  | 1918-10-18    | Fortaleza_Esporte_Clube                  | 
| 21460098 | areaCode        | 07404         | Bösingen,_Baden-Württemberg              | 
| 21536062 | demonym         | Caronelli     | Carona_(BG)                              | 
| 23913919 | draftyear       | 1991          | Mike_Pritchard                           | 
| 22278400 | producer        | Bow_Wow       | Face_Off_(Bow_Wow_&_Omarion_album)       | 
| 22320735 | bodyStyle       | Coup          | Fiat_1200                                | 
| 22245790 | classis         | Magnoliopsida | Euphorbia_atropurpurea                   | 
| 23710420 | order           | Sapindales    | Malleastrum_leroyi                       | 
| 25463380 | starring        | Lena_Headey   | The_Contractor                           | 
| 26021756 | latitudeMinutes | 27            | Walnut_Township,_Atchison_County,_Kansas | 
+----------+-----------------+---------------+------------------------------------------+

And a table containing the frequencies of each predicate:

+-------------------------+-----------+
| predicate               | frequency |
+-------------------------+-----------+
| background              |     30863 | 
| homeTown                |     49083 | 
| genre                   |    245669 | 
| label                   |    132028 | 
| associatedBand          |     35371 | 
| associatedMusicalArtist |     35371 | 
| artist                  |     98762 | 
| producer                |     96997 | 
| reviews                 |     35207 | 
| recordDate              |     24784 | 
+-------------------------+-----------+

I used the DBpedia infobox data from this post: “DBpedia – Rethinking Wikipedia infobox extraction“, kindly provided by Georgi Kobilarov. The data linked to in that post is much more suitable for my requirements than previously available DBpedia infobox data – instead of multiple predicates for birth date – birthDate, dateBirth, dateOfBirth, birth…, they’ve been mapped to dbpedia:Person#birthdate. Wonderful!

To handle the occasional /uXXXX, I used Unicode::String and Unicode::Escape, by Gisle Aas and Hitoshi Amano respectively.

The script could be improved: though the Wikipedia Pages and DBpedia Infobox Predicates are fine, some of the subjects are rather … interesting. As I’m currently only interested in the 200 most frequently occuring predicates, I haven’t put more time into smoothing out some of the more interesting subject data. If anyone makes changes to the script, please let me know and I’ll update this post.

Here it is:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
#!/usr/bin/perl
 
###################################################################################
#
#	Parse out the Wikipedia page title, the DBpedia Infobox Relation and the subject from the infobox.nt file.
#
#	Inserts data into a MySQL database tables with the structures:
#	
#	infoboxes
#
#	+-----------+--------------+------+-----+---------+----------------+
#	| Field     | Type         | Null | Key | Default | Extra          |
#	+-----------+--------------+------+-----+---------+----------------+
#	| id        | int(11)      | NO   | PRI | NULL    | auto_increment | 
#	| predicate | varchar(255) | YES  |     | NULL    |                | 
#	| subject   | varchar(255) | YES  |     | NULL    |                | 
#	| page      | varchar(255) | YES  |     | NULL    |                | 
#	| type      | int(1)       | YES  |     | NULL    |                | 
#	+-----------+--------------+------+-----+---------+----------------+
#
#	predicateFrequency
#
#	+-----------+--------------+------+-----+---------+-------+
#	| Field     | Type         | Null | Key | Default | Extra |
#	+-----------+--------------+------+-----+---------+-------+
#	| predicate | varchar(255) | NO   | PRI |         |       | 
#	| frequency | int(6)       | YES  |     | NULL    |       | 
#	+-----------+--------------+------+-----+---------+-------+
#
#	
#	To output to a CSV file, uncomment lines beginning with "#@#"
#	
#	DBpedia infobox.nt downloaded from: http://blog.georgikobilarov.com/2008/10/dbpedia-rethinking-wikipedia-infobox-extraction/
#
#	Script written by Michael Robinson - faceleg - http://pagesofinterest.net/
#
#	Contact: mike@pagesofinterest.net
#
####################################################################################
 
use strict;
use warnings;
use DBI;
use DBD::mysql;
 
####################################################################################
#
#	Unicode: 		http://search.cpan.org/~gaas/Unicode-String-2.09/String.pm
#	Unicode::Escape:	http://search.cpan.org/~itwarrior/Unicode-Escape-0.0.2/lib/Unicode/Escape.pm
#
####################################################################################
use Unicode::Escape qw(escape unescape);
 
#Database details
my $username = '';
my $pass = '';
my $db = 'wikipedia';
my $querystring;
 
#connect to database
my $dbh = DBI->connect( "dbi:mysql:$db", $username, $pass, { 'PrintError' => 1, 'RaiseError' => 1 } );
 
my $lost = 0;	# amount of entries that are "lost" due to crazy formatting.
 
#File & directory declarations
my $dir = "/path/to/":
my $DBpedia = "infobox.nt";
#@#my $CSV = "cleanedRDF.csv";
 
###################################################################################
#
#	Decode URL encodings	
#
#	thanks to: http://glennf.com/writing/hexadecimal.url.encoding.html
#
###################################################################################
sub URLDecode {
	my $theURL = $_[0];
	$theURL =~ tr/+/ /;
	$theURL =~ s/%([a-fA-F0-9]{2,2})/chr(hex($1))/eg;
	$theURL =~ s/<!–(.|\n)*–>//g;
	return $theURL;
}
 
###################################################################################
#
#	If a string begins with 0 or more white spaces and ", and ends with " and 0 or more white spaces, remove the white spaces and ".	
#
###################################################################################
sub deQuote {
	my $deQuoted = $_[0];
	if($deQuoted =~ /^\s*".*"\s*$/){
		$deQuoted =~ /\s*"(.*)"\s*$/;
		return $1;
	}
	else{ return $deQuoted; }
}
###################################################################################
#
#	Inserts data into MySQL table
#
###################################################################################
sub doInsert {
	if($3 && $3 !~ /^\s*$/ && $3 ne ""){	#make sure the subject isn't empty... in theory - doesn't seem to work for radio station: frequency...?!
		#insert triple into infoboxes table
		$querystring = "insert into infoboxes (page,predicate,subject) values(";
		$querystring.= $dbh->quote(URLDecode($1)).",".$dbh->quote(deQuote(URLDecode($2))).",".$dbh->quote(deQuote(unescape(URLDecode($3)))).")";
		$dbh->do($querystring);
 
		#insert or update data in predicateFrequency table
		my $predicate = $dbh->quote(deQuote(URLDecode($2)));
 
		$querystring = "INSERT INTO predicateFrequency(predicate,frequency) VALUES($predicate,0) ON duplicate KEY UPDATE frequency=frequency+1";
		$dbh->do($querystring);
	}		
}
# go through dbpedia infobox relations
open (IB, $dir.$DBpedia);
#@#open (CSV,'<'.$dir.$DBpedia);
 
while(<IB>) {
 
	chomp;	#offer a byte sized sacrifice to the great compiler
 
	###################################################################################	
	# 
	#	Wikipedia page title should not be dequoted!
	#	
	#	Split into elements 
	# 	$1 = Wikipedia Page
	# 	$2 = Predicate
	# 	$3 = Object
	#
	###################################################################################
	#
	#
	#	standard triple, no extra screwing about required.
	#
	###################################################################################
   	if($_ =~ /<http:\/\/dbpedia\.org\/resource\/(.+)>\s<http:\/\/dbpedia.org\/ontology\/.*#(.+)>\s<http:\/\/dbpedia.org\/resource\/(.*)>.*/){ 
		#print URLDecode($1).'	'.URLDecode($2).'	'.unescape(URLDecode($3));
		#@#print CSV '"'.URLDecode($1).'","'.URLDecode($2).'","'.URLDecode($3).'"';
		doInsert($1,$2,$3);
	}	
	###################################################################################
	#
	#	standard triple with date, remove trailing declaration.
	#
	###################################################################################	
	elsif($_ =~ /<http:\/\/dbpedia\.org\/resource\/(.+)>\s<http:\/\/dbpedia.org\/ontology\/.*#(.+)>(.+)\^\^<http:\/\/www.w3.org\/2001\/XMLSchema#date>\s\./){ 
		#print URLDecode($1).'	'.URLDecode($2).'	'.unescape(URLDecode($3));
		#@#print CSV '"'.URLDecode($1).'","'.URLDecode($2).'","'.URLDecode($3).'"';
		doInsert($1,$2,$3);
	}
	###################################################################################
	#
	#	"Differently" presented data
	#
	###################################################################################
	elsif($_ =~ /<http:\/\/dbpedia\.org\/resource\/(.+)>\s<http:\/\/dbpedia.org\/ontology\/.*#(.+)>(.+)\s\./){ 	
		#print URLDecode($1).'	'.URLDecode($2).'	'.unescape(URLDecode($3));
		#@#print CSV '"'.URLDecode($1).'","'.URLDecode($2).'","'.URLDecode($3).'"';
		doInsert($1,$2,$3);
	}
	else{
		$lost++;
	}
	#@#print CSV "\n"; 
}
 
close(IB);
#@#close(CSV); 
$dbh->disconnect;
 
#let the world know how many we dropped
print $lost;

Like this post? Move it on along with:

email Email | delicious delicious | digg Digg | Tweet this post Tweet | reddit Reddit | newsvine Newsvine | furl Furl | google Google | StumbleUpon Stumble | Hao Hao HaoHao


Trackback:

Comments: 0 | Comments Feed


Scroll to top

Related posts:

  1. Chinese Idiom Database Updates I hit a brick wall with my plugin development, so I spent some time updating the Chinese Idiom Database....
  2. Strip /uXXXX From String and Replace it With the Correct Unicode Character I discovered '/uXXXX' appearing where pretty unicode characters should be within my strings. The strings were to be compared to ... other strings, which would have the proper unicode characters, so I had to replace the '/uXXXX' in my strings. I couldn't find a class to do this, but found enough information to understand what needed to be done. The...
  3. Science Needs Your Brain! We need volunteers to help us evaluate our data. The evaluation will only take about 25 minutes of your time, is straightforward, and quite interesting!...
  4. JS-Kit Recent Comments PHP Script This script reads recent posts from your site's comment rss feed and prints n comments out in a nicely formatted div. Using it is as simple as changing one line, uploading the file, and pasting some code wherever you want the recent comments to be displayed....
  5. Rescuing Settings From Broken JDownloader Installation in Ubuntu Somehow I broke my JDownloader installation, by meddling with the network settings. As I had quite a lot of download packages queued, I wanted to rescue them and use them with the latest version of JDownloader. I managed to do this successfully, here's how:...

No commentsTrackback

Comments are closed.