Perl Script to Insert DBpedia Infobox Data into a MySQL Database
I’m in the middle of a project involving Wikipedia and ResearchCyc, and I needed to get the data contained in DBpedia’s Infobox RDF Triple file into a MySQL database so I could use it in conjunction with the database created by the excellent Wikipedia Miner.
This script parses out the Wikipedia page, DBpedia Infobox Predicate and Infobox subject, and inserts them into a MySQL table. I thought I’d share it with The Internet in case someone else wanted to work with DBpedia infobox data in the same way.
One ends up with a table filled with entries like:
+----------+-----------------+---------------+------------------------------------------+ | id | predicate | subject | page | +----------+-----------------+---------------+------------------------------------------+ | 22378459 | foundationDate | 1918-10-18 | Fortaleza_Esporte_Clube | | 21460098 | areaCode | 07404 | Bösingen,_Baden-Württemberg | | 21536062 | demonym | Caronelli | Carona_(BG) | | 23913919 | draftyear | 1991 | Mike_Pritchard | | 22278400 | producer | Bow_Wow | Face_Off_(Bow_Wow_&_Omarion_album) | | 22320735 | bodyStyle | Coup | Fiat_1200 | | 22245790 | classis | Magnoliopsida | Euphorbia_atropurpurea | | 23710420 | order | Sapindales | Malleastrum_leroyi | | 25463380 | starring | Lena_Headey | The_Contractor | | 26021756 | latitudeMinutes | 27 | Walnut_Township,_Atchison_County,_Kansas | +----------+-----------------+---------------+------------------------------------------+
And a table containing the frequencies of each predicate:
+-------------------------+-----------+ | predicate | frequency | +-------------------------+-----------+ | background | 30863 | | homeTown | 49083 | | genre | 245669 | | label | 132028 | | associatedBand | 35371 | | associatedMusicalArtist | 35371 | | artist | 98762 | | producer | 96997 | | reviews | 35207 | | recordDate | 24784 | +-------------------------+-----------+
I used the DBpedia infobox data from this post: “DBpedia – Rethinking Wikipedia infobox extraction“, kindly provided by Georgi Kobilarov. The data linked to in that post is much more suitable for my requirements than previously available DBpedia infobox data – instead of multiple predicates for birth date – birthDate, dateBirth, dateOfBirth, birth…, they’ve been mapped to dbpedia:Person#birthdate. Wonderful!
To handle the occasional /uXXXX, I used Unicode::String and Unicode::Escape, by Gisle Aas and Hitoshi Amano respectively.
The script could be improved: though the Wikipedia Pages and DBpedia Infobox Predicates are fine, some of the subjects are rather … interesting. As I’m currently only interested in the 200 most frequently occuring predicates, I haven’t put more time into smoothing out some of the more interesting subject data. If anyone makes changes to the script, please let me know and I’ll update this post.
Here it is:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 | #!/usr/bin/perl ################################################################################### # # Parse out the Wikipedia page title, the DBpedia Infobox Relation and the subject from the infobox.nt file. # # Inserts data into a MySQL database tables with the structures: # # infoboxes # # +-----------+--------------+------+-----+---------+----------------+ # | Field | Type | Null | Key | Default | Extra | # +-----------+--------------+------+-----+---------+----------------+ # | id | int(11) | NO | PRI | NULL | auto_increment | # | predicate | varchar(255) | YES | | NULL | | # | subject | varchar(255) | YES | | NULL | | # | page | varchar(255) | YES | | NULL | | # | type | int(1) | YES | | NULL | | # +-----------+--------------+------+-----+---------+----------------+ # # predicateFrequency # # +-----------+--------------+------+-----+---------+-------+ # | Field | Type | Null | Key | Default | Extra | # +-----------+--------------+------+-----+---------+-------+ # | predicate | varchar(255) | NO | PRI | | | # | frequency | int(6) | YES | | NULL | | # +-----------+--------------+------+-----+---------+-------+ # # # To output to a CSV file, uncomment lines beginning with "#@#" # # DBpedia infobox.nt downloaded from: http://blog.georgikobilarov.com/2008/10/dbpedia-rethinking-wikipedia-infobox-extraction/ # # Script written by Michael Robinson - faceleg - http://pagesofinterest.net/ # # Contact: mike@pagesofinterest.net # #################################################################################### use strict; use warnings; use DBI; use DBD::mysql; #################################################################################### # # Unicode: http://search.cpan.org/~gaas/Unicode-String-2.09/String.pm # Unicode::Escape: http://search.cpan.org/~itwarrior/Unicode-Escape-0.0.2/lib/Unicode/Escape.pm # #################################################################################### use Unicode::Escape qw(escape unescape); #Database details my $username = ''; my $pass = ''; my $db = 'wikipedia'; my $querystring; #connect to database my $dbh = DBI->connect( "dbi:mysql:$db", $username, $pass, { 'PrintError' => 1, 'RaiseError' => 1 } ); my $lost = 0; # amount of entries that are "lost" due to crazy formatting. #File & directory declarations my $dir = "/path/to/": my $DBpedia = "infobox.nt"; #@#my $CSV = "cleanedRDF.csv"; ################################################################################### # # Decode URL encodings # # thanks to: http://glennf.com/writing/hexadecimal.url.encoding.html # ################################################################################### sub URLDecode { my $theURL = $_[0]; $theURL =~ tr/+/ /; $theURL =~ s/%([a-fA-F0-9]{2,2})/chr(hex($1))/eg; $theURL =~ s/<!–(.|\n)*–>//g; return $theURL; } ################################################################################### # # If a string begins with 0 or more white spaces and ", and ends with " and 0 or more white spaces, remove the white spaces and ". # ################################################################################### sub deQuote { my $deQuoted = $_[0]; if($deQuoted =~ /^\s*".*"\s*$/){ $deQuoted =~ /\s*"(.*)"\s*$/; return $1; } else{ return $deQuoted; } } ################################################################################### # # Inserts data into MySQL table # ################################################################################### sub doInsert { if($3 && $3 !~ /^\s*$/ && $3 ne ""){ #make sure the subject isn't empty... in theory - doesn't seem to work for radio station: frequency...?! #insert triple into infoboxes table $querystring = "insert into infoboxes (page,predicate,subject) values("; $querystring.= $dbh->quote(URLDecode($1)).",".$dbh->quote(deQuote(URLDecode($2))).",".$dbh->quote(deQuote(unescape(URLDecode($3)))).")"; $dbh->do($querystring); #insert or update data in predicateFrequency table my $predicate = $dbh->quote(deQuote(URLDecode($2))); $querystring = "INSERT INTO predicateFrequency(predicate,frequency) VALUES($predicate,0) ON duplicate KEY UPDATE frequency=frequency+1"; $dbh->do($querystring); } } # go through dbpedia infobox relations open (IB, $dir.$DBpedia); #@#open (CSV,'<'.$dir.$DBpedia); while(<IB>) { chomp; #offer a byte sized sacrifice to the great compiler ################################################################################### # # Wikipedia page title should not be dequoted! # # Split into elements # $1 = Wikipedia Page # $2 = Predicate # $3 = Object # ################################################################################### # # # standard triple, no extra screwing about required. # ################################################################################### if($_ =~ /<http:\/\/dbpedia\.org\/resource\/(.+)>\s<http:\/\/dbpedia.org\/ontology\/.*#(.+)>\s<http:\/\/dbpedia.org\/resource\/(.*)>.*/){ #print URLDecode($1).' '.URLDecode($2).' '.unescape(URLDecode($3)); #@#print CSV '"'.URLDecode($1).'","'.URLDecode($2).'","'.URLDecode($3).'"'; doInsert($1,$2,$3); } ################################################################################### # # standard triple with date, remove trailing declaration. # ################################################################################### elsif($_ =~ /<http:\/\/dbpedia\.org\/resource\/(.+)>\s<http:\/\/dbpedia.org\/ontology\/.*#(.+)>(.+)\^\^<http:\/\/www.w3.org\/2001\/XMLSchema#date>\s\./){ #print URLDecode($1).' '.URLDecode($2).' '.unescape(URLDecode($3)); #@#print CSV '"'.URLDecode($1).'","'.URLDecode($2).'","'.URLDecode($3).'"'; doInsert($1,$2,$3); } ################################################################################### # # "Differently" presented data # ################################################################################### elsif($_ =~ /<http:\/\/dbpedia\.org\/resource\/(.+)>\s<http:\/\/dbpedia.org\/ontology\/.*#(.+)>(.+)\s\./){ #print URLDecode($1).' '.URLDecode($2).' '.unescape(URLDecode($3)); #@#print CSV '"'.URLDecode($1).'","'.URLDecode($2).'","'.URLDecode($3).'"'; doInsert($1,$2,$3); } else{ $lost++; } #@#print CSV "\n"; } close(IB); #@#close(CSV); $dbh->disconnect; #let the world know how many we dropped print $lost; |
Like this post? Move it on along with:
Email |
delicious |
Digg |
Tweet |
Reddit |
Newsvine |
Furl |
Google |
Stumble |
HaoHao
| Trackback: |
Scroll to post title
Chinese Idiom Database Updates
I hit a brick wall with my plugin development, so I spent some time updating the Chinese Idiom Database.
A long time ago a kind person emailed me with some suggestions for the database. I thought all of his ideas were wonderful, and have just finished implementing them all, plus some minor improvements to the code, which will make further updates less painful. As this database was my first attempt at PHP programming, the code was … messy.
I also implemented a few things I wanted to do a long time ago, but couldn’t because of the state the code was in.
These are the changes:
There is now an “All Idioms” page that displays all the idioms’ English meaning, which will make it possible for people just starting to learn Chinese to browse idioms.
Both the English and the Chinese “All Idioms” pages’ links open in a Greybox above the page, which avoids a full page load.
One may move forward or backwards through the idioms within the Greybox, using the “Previous, Next” links.
Idioms may be given thumbs-up/down ratings.
Each time an idiom is returned as a search result or viewed by following a link from one of the “All Idioms” pages, a count is incremented. This will allow us to see which idioms have been viewed the most, which will give me some idea of how many people actually use the database (if any at all).
The hardest part was making the Greybox pages show the “Next/Previous” links, but that was only hard because I am a still pretty new to MySQL. During the page generation, I needed to be able to pull the idiom/English translation from the next and previous entries in the list of all idioms. After much irritation, I managed to do it like this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 | $result = mysql_query("SELECT * FROM idioms WHERE zh_idiom!=''") or die(mysql_error());
$result_copy = mysql_query("SELECT * FROM idioms WHERE zh_idiom!=''") or die(mysql_error());
$prev = mysql_result($result_copy, mysql_num_rows($result_copy)-1 ,'zh_idiom' );
$next = mysql_result($result_copy, 1 ,'zh_idiom');
$count = 0;
while($row = mysql_fetch_array($result)) { //make one page for each idiom
$file_name = $row['zh_idiom'] . ".php";
$file = fopen($file_name,"w");
$update_count = "<?php ".'$'."dbhost = 'localhost';
".'$'."dbuser = 'pagesofi_student';
".'$'."dbpass = '".'$'."a|Qt;jXg+Nf';
".'$'."dbname = 'pagesofi_idioms';
".'$'."conn = mysql_connect(".'$'."dbhost, ".'$'."dbuser, ".'$'."dbpass) or die('Error connecting to mysql');
mysql_select_db('pagesofi_idioms');
mysql_query(\"UPDATE idioms SET times_accessed=times_accessed+1 WHERE zh_idiom='".$row['zh_idiom']."'\") or die(mysql_error());
".'$'."result = mysql_query(\"SELECT * FROM idioms WHERE zh_idiom='".$row['zh_idiom']."'\");
".'$'."count = mysql_fetch_array(".'$'."result);
?>";
$top = "<META http-equiv=Content-Type content='text/html; charset=UTF-8'><head><link rel='stylesheet' type='text/css' media='screen' href='idioms_individual.css' /></head>
".$update_count ."
<title>".$row['zh_idiom']."</title>
<html><body><p></p>
<p>".$row['zh_idiom']."</p><br>".$row['pinyin']."<br><br>". "<p1>".stripslashes($row['en_translation'])."</p1><br><br><p5>" . $row['example'] . "</p5><br/><br/><p2>Keywords: " .$row['key_word']."<br/>Viewed <?php echo " .'$'. "count['times_accessed'] ?> times. <br/>Entered by ".$row['creator']." on ".$row['date_created']."</p2><br/><div class='js-kit-rating' view='score' title='".$row['zh_idiom']."' path='".$row['zh_idiom']."' permalink='http://pagesofinterest.net/idiom/search_results.php?zhongwen=".$row['zh_idiom']."'></div>
<div id='nextPrev'><a id='prev' href='".$prev.".php'>Previous</a><a id='next' href='".$next.".php'>Next</a></div>
<script src='http://js-kit.com/ratings.js'></script>
</body>
</html>";
fwrite($file, $top); |
I’m sure that my way is not the best way, but it works. If the database ever becomes massive, or I can’t find anything better to do, I’ll look at changing it.
As this script is only run when an idiom is added, so a little inefficiency is OK, right?
Shhh… I’ll make it better later, promise!
Like this post? Move it on along with:
Email |
delicious |
Digg |
Tweet |
Reddit |
Newsvine |
Furl |
Google |
Stumble |
HaoHao
| Trackback: |
Scroll to post title
























Recent Comments
Js Kit Comments Correct Usage Of The Permalink And Path Attributes
http://www.tanmifen.com
Tue, 26 Jan 2010 08:07:30 +0000
Great Wall
LinkedIn
Thu, 14 Jan 2010 12:54:50 +0000
Tag Cloud
LinkedIn
Thu, 14 Jan 2010 12:54:50 +0000
The Best Photographer In Shanghai
When we pick them up :( Don't know when that will be yet!
Chose another 49 today
Wed, 13 Jan 2010 14:13:21 +0000
The Best Photographer In Shanghai
When do we get to see them!!!!!
Tue, 12 Jan 2010 19:08:51 +0000
Installing Jdownloader In Ubuntu
Thank You!!!!
It works like a charm!!!!!!!!!!!!
Tue, 05 Jan 2010 13:02:35 +0000