wikipedia ダンプデータをデータベースにインポートする手順

node/1175

1 . データダウンロード

URLのリスト

$ lynx -listonly -dump "https://dumps.wikimedia.org/jawiki/latest/" | grep 'gz$\|z$\|bz2$' | grep -v history | cut -c 7- | tee jawiki.download.lst
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-abstract.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-abstract1.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-abstract2.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-abstract3.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-abstract4.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-abstract5.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-abstract6.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-all-titles-in-ns0.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-all-titles.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-category.sql.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-categorylinks.sql.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-change_tag.sql.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-externallinks.sql.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-geo_tags.sql.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-image.sql.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-imagelinks.sql.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-iwlinks.sql.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-langlinks.sql.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-page.sql.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-page_props.sql.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-page_restrictions.sql.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pagelinks.sql.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles-multistream-index.txt.bz2
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles-multistream.xml.bz2
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles1.xml-p1p106175.bz2
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles2.xml-p106179p350785.bz2
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles3.xml-p350788p803163.bz2
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles4.xml-p803168p1541169.bz2
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles5.xml-p1541175p2534192.bz2
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles6.xml-p2534193p3869446.bz2
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles6.xml-p2534193p3873913.bz2
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-logging.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-logging1.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-logging2.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-logging3.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-logging4.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-logging5.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-logging6.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-meta-current.xml.bz2
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-meta-current1.xml-p1p106178.bz2
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-meta-current2.xml-p106179p350785.bz2
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-meta-current3.xml-p350788p803167.bz2
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-meta-current4.xml-p803168p1541170.bz2
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-meta-current5.xml-p1541171p2534192.bz2
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-meta-current6.xml-p2534193p3869446.bz2
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-meta-current6.xml-p2534193p3873913.bz2
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-protected_titles.sql.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-redirect.sql.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-site_stats.sql.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-siteinfo-namespaces.json.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-sites.sql.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-articles.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-articles1.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-articles2.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-articles3.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-articles4.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-articles5.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-articles6.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-meta-current.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-meta-current1.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-meta-current2.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-meta-current3.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-meta-current4.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-meta-current5.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-meta-current6.xml.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-templatelinks.sql.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-user_groups.sql.gz
https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-wbc_entity_usage.sql.gz

本文とカテゴリをダウンロード

$ wget -c https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2 https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-category.sql.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-categorylinks.sql.gz

2 . テーブルスキーマ

tables.sql

$ wget -c https://releases.wikimedia.org/mediawiki/1.30/mediawiki-1.30.1.tar.gz
$ tar tzvf mediawiki-1.30.1.tar.gz  | grep tables.sql
-rw-rw-r-- reedy/reedy    2123 2018-06-05 20:42 mediawiki-1.30.1/maintenance/archives/patch-linktables.sql
-rw-rw-r-- reedy/reedy   51060 2018-09-20 18:41 mediawiki-1.30.1/maintenance/mssql/tables.sql
-rw-rw-r-- reedy/reedy   50391 2018-09-20 18:41 mediawiki-1.30.1/maintenance/oracle/tables.sql
-rw-rw-r-- reedy/reedy   33929 2018-09-20 18:41 mediawiki-1.30.1/maintenance/postgres/tables.sql
-rw-rw-r-- reedy/reedy   69248 2018-09-20 18:41 mediawiki-1.30.1/maintenance/tables.sql
$ gunzip -c mediawiki-1.30.1.tar.gz | tar xvf - mediawiki-1.30.1/maintenance/tables.sql
mediawiki-1.30.1/maintenance/tables.sql

3 . MWDumper

MWDumperは、早いが内部リンクテーブルを更新しない。コンテンツだけが必要ならば、これで十分。

$ git clone https://phabricator.wikimedia.org/diffusion/MWDU/mwdumper.git
$ cd mwdumper
$ sudo apt install maven
$ mvn compile
$ mvn package
$ file target/mwdumper-1.25.jar
target/mwdumper-1.25.jar: Java archive data (JAR)
$ cd ..

4 . jarファイル

$ wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.47.tar.gz
$ tar tvzf mysql-connector-java-5.1.47.tar.gz | grep 'bin.jar$'
-rw-r--r-- 0/0         1007505 2018-08-07 06:59 mysql-connector-java-5.1.47/mysql-connector-java-5.1.47-bin.jar
$ gunzip -c mysql-connector-java-5.1.47.tar.gz | tar xvf - mysql-connector-java-5.1.47/mysql-connector-java-5.1.47-bin.jar
mysql-connector-java-5.1.47/mysql-connector-java-5.1.47-bin.jar

5 . データベース設定確認と変更

$ sudo mysql -u root -e "show variables like 'chara%';"
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8mb4                    |
| character_set_connection | utf8mb4                    |
| character_set_database   | utf8mb4                    |
| character_set_filesystem | binary                     |
| character_set_results    | utf8mb4                    |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
$ sudo mysql -u root -e "show global variables like 'innodb_lar%';"
+---------------------+-------+
| Variable_name       | Value |
+---------------------+-------+
| innodb_large_prefix | ON    |
+---------------------+-------+
$ sudo mysql -u root -e "show global variables like 'innodb_file_format';"
+--------------------+-----------+
| Variable_name      | Value     |
+--------------------+-----------+
| innodb_file_format | Barracuda |
+--------------------+-----------+
$ sudo mysql -u root -e "show global variables like 'innodb_default_row_format';"
+---------------------------+---------+
| Variable_name             | Value   |
+---------------------------+---------+
| innodb_default_row_format | compact |
+---------------------------+---------+
$ sudo mysql -u root -e "set global innodb_default_row_format = DYNAMIC;"
$ sudo mysql -u root -e "show global variables like 'innodb_default_row_format';"
+---------------------------+---------+
| Variable_name             | Value   |
+---------------------------+---------+
| innodb_default_row_format | dynamic |
+---------------------------+---------+

6 . 準備したファイル

$ ls --size -S1 jawiki-latest-pages-articles.xml.bz2 jawiki-latest-category.sql.gz jawiki-latest-categorylinks.sql.gz mediawiki-1.30.1/maintenance/tables.sql mwdumper/target/mwdumper-1.25.jar mysql-connector-java-5.1.47/mysql-connector-java-5.1.47-bin.jar
2.6G jawiki-latest-pages-articles.xml.bz2
182M jawiki-latest-categorylinks.sql.gz
6.1M mwdumper/target/mwdumper-1.25.jar
2.9M jawiki-latest-category.sql.gz
984K mysql-connector-java-5.1.47/mysql-connector-java-5.1.47-bin.jar
 68K mediawiki-1.30.1/maintenance/tables.sql

7 . データベース作成とインポート

$ sudo mysql -u root -e "CREATE DATABASE jawiki;"

テーブル作成

$ sudo mysql -u root jawiki < mediawiki-1.30.1/maintenance/tables.sql

本文データインポート 10時間ぐらい。

$ java -jar mwdumper/target/mwdumper-1.25.jar --format=sql:1.25 ~/downloads/jawiki-latest-pages-articles.xml.bz2 | sudo mysql jawiki
...
2,324,000 pages (59.933/sec), 2,324,000 revs (59.933/sec)
2,325,000 pages (59.925/sec), 2,325,000 revs (59.925/sec)
2,326,000 pages (59.924/sec), 2,326,000 revs (59.924/sec)
2,327,000 pages (59.923/sec), 2,327,000 revs (59.923/sec)
2,327,898 pages (59.907/sec), 2,327,898 revs (59.907/sec)

カテゴリ。categorylinksは数時間。

$ gunzip -c jawiki-latest-category.sql.gz | sudo mysql -u root jawiki --
$ gunzip -c jawiki-latest-categorylinks.sql.gz | sudo mysql -u root jawiki --

8 . データ確認

MariaDB [jawiki]> select table_name, table_rows from information_schema.TABLES where table_schema = 'jawiki' order by table_rows desc limit 10;
+---------------+------------+
| table_name    | table_rows |
+---------------+------------+
| categorylinks |    7514235 |
| text          |    2686111 |
| page          |    2302155 |
| revision      |    2092671 |
| redirect      |     690266 |
| category      |     233091 |
| valid_tag     |          0 |
| log_search    |          0 |
| ip_changes    |          0 |
| uploadstash   |          0 |
+---------------+------------+
10 rows in set (0.00 sec)
MariaDB [jawiki]> 
SELECT
    p.page_id AS "page_id",
    CAST(p.page_title AS CHAR(10000) CHARACTER SET utf8) AS "page_title",
    r.rev_text_id AS "revision_id",
    t.old_id AS "text_id",
    left(t.old_text,50)
FROM
    page p
        INNER JOIN revision r
            ON p.page_latest = r.rev_id
        INNER JOIN text t
            ON r.rev_text_id = t.old_id
limit 5 offset 2000;
+---------+-----------------------------+-------------+----------+----------------------------------------------------+
| page_id | page_title                  | revision_id | text_id  | left(t.old_text,50)                                |
+---------+-----------------------------+-------------+----------+----------------------------------------------------+
|    2610 | 2010年代                    |    70706915 | 70706915 | {{Otheruses||日本ローカルの事柄|2010年▒             |
|    2612 | 定型詩                      |    59312594 | 59312594 | '''定型詩'''(ていけいし、{{lang-en|ryth           |
|    2614 | 解決済みのバグ報告          |    18446452 | 18446452 | {{Archives|Wikipedia:バグの報告}}
ここに▒           |
|    2616 | 田沢湖                      |    70736019 | 70736019 | {{Infobox 湖
|名称 = 田沢湖
|画像 =[[File:         |
|    2617 | 活火山                      |    70040564 | 70040564 | '''活火山'''(かつかざん、かっかざ▒                 |
+---------+-----------------------------+-------------+----------+----------------------------------------------------+
5 rows in set (0.00 sec)

カテゴリのリストアップ、記事タイトルとその属するカテゴリ。

9 . DBサイズ

DB最適化後

$ sudo ncdu /var/lib/mysql
   16.6 GiB [##########] /jawiki

DB最適化前

   19.8 GiB [##########] /jawiki

10 . バックアップとリストア

$ sudo mysqldump jawiki > dump_jawiki.sql
$ sudo mysql jawiki < dump_jawiki.sql