wikipedia ダンプデータをデータベースにインポートする手順
Nid: 1175
1 . データダウンロード
URLのリスト
$ lynx -listonly -dump "https://dumps.wikimedia.org/jawiki/latest/" | grep 'gz$\|z$\|bz2$' | grep -v history | cut -c 7- | tee jawiki.download.lst https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-abstract.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-abstract1.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-abstract2.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-abstract3.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-abstract4.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-abstract5.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-abstract6.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-all-titles-in-ns0.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-all-titles.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-category.sql.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-categorylinks.sql.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-change_tag.sql.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-externallinks.sql.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-geo_tags.sql.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-image.sql.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-imagelinks.sql.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-iwlinks.sql.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-langlinks.sql.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-page.sql.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-page_props.sql.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-page_restrictions.sql.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pagelinks.sql.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles-multistream-index.txt.bz2 https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles-multistream.xml.bz2 https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2 https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles1.xml-p1p106175.bz2 https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles2.xml-p106179p350785.bz2 https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles3.xml-p350788p803163.bz2 https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles4.xml-p803168p1541169.bz2 https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles5.xml-p1541175p2534192.bz2 https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles6.xml-p2534193p3869446.bz2 https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles6.xml-p2534193p3873913.bz2 https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-logging.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-logging1.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-logging2.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-logging3.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-logging4.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-logging5.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-logging6.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-meta-current.xml.bz2 https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-meta-current1.xml-p1p106178.bz2 https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-meta-current2.xml-p106179p350785.bz2 https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-meta-current3.xml-p350788p803167.bz2 https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-meta-current4.xml-p803168p1541170.bz2 https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-meta-current5.xml-p1541171p2534192.bz2 https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-meta-current6.xml-p2534193p3869446.bz2 https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-meta-current6.xml-p2534193p3873913.bz2 https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-protected_titles.sql.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-redirect.sql.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-site_stats.sql.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-siteinfo-namespaces.json.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-sites.sql.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-articles.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-articles1.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-articles2.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-articles3.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-articles4.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-articles5.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-articles6.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-meta-current.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-meta-current1.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-meta-current2.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-meta-current3.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-meta-current4.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-meta-current5.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-stub-meta-current6.xml.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-templatelinks.sql.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-user_groups.sql.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-wbc_entity_usage.sql.gz
本文とカテゴリをダウンロード
$ wget -c https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2 https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-category.sql.gz https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-categorylinks.sql.gz
2 . テーブルスキーマ
tables.sql
$ wget -c https://releases.wikimedia.org/mediawiki/1.30/mediawiki-1.30.1.tar.gz $ tar tzvf mediawiki-1.30.1.tar.gz | grep tables.sql -rw-rw-r-- reedy/reedy 2123 2018-06-05 20:42 mediawiki-1.30.1/maintenance/archives/patch-linktables.sql -rw-rw-r-- reedy/reedy 51060 2018-09-20 18:41 mediawiki-1.30.1/maintenance/mssql/tables.sql -rw-rw-r-- reedy/reedy 50391 2018-09-20 18:41 mediawiki-1.30.1/maintenance/oracle/tables.sql -rw-rw-r-- reedy/reedy 33929 2018-09-20 18:41 mediawiki-1.30.1/maintenance/postgres/tables.sql -rw-rw-r-- reedy/reedy 69248 2018-09-20 18:41 mediawiki-1.30.1/maintenance/tables.sql $ gunzip -c mediawiki-1.30.1.tar.gz | tar xvf - mediawiki-1.30.1/maintenance/tables.sql mediawiki-1.30.1/maintenance/tables.sql
3 . MWDumper
MWDumperは、早いが内部リンクテーブルを更新しない。コンテンツだけが必要ならば、これで十分。
$ git clone https://phabricator.wikimedia.org/diffusion/MWDU/mwdumper.git $ cd mwdumper $ sudo apt install maven $ mvn compile $ mvn package $ file target/mwdumper-1.25.jar target/mwdumper-1.25.jar: Java archive data (JAR) $ cd ..
4 . jarファイル
$ wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.47.tar.gz $ tar tvzf mysql-connector-java-5.1.47.tar.gz | grep 'bin.jar$' -rw-r--r-- 0/0 1007505 2018-08-07 06:59 mysql-connector-java-5.1.47/mysql-connector-java-5.1.47-bin.jar $ gunzip -c mysql-connector-java-5.1.47.tar.gz | tar xvf - mysql-connector-java-5.1.47/mysql-connector-java-5.1.47-bin.jar mysql-connector-java-5.1.47/mysql-connector-java-5.1.47-bin.jar
5 . データベース設定確認と変更
$ sudo mysql -u root -e "show variables like 'chara%';" +--------------------------+----------------------------+ | Variable_name | Value | +--------------------------+----------------------------+ | character_set_client | utf8mb4 | | character_set_connection | utf8mb4 | | character_set_database | utf8mb4 | | character_set_filesystem | binary | | character_set_results | utf8mb4 | | character_set_server | utf8mb4 | | character_set_system | utf8 | | character_sets_dir | /usr/share/mysql/charsets/ | +--------------------------+----------------------------+
$ sudo mysql -u root -e "show global variables like 'innodb_lar%';" +---------------------+-------+ | Variable_name | Value | +---------------------+-------+ | innodb_large_prefix | ON | +---------------------+-------+
$ sudo mysql -u root -e "show global variables like 'innodb_file_format';" +--------------------+-----------+ | Variable_name | Value | +--------------------+-----------+ | innodb_file_format | Barracuda | +--------------------+-----------+
$ sudo mysql -u root -e "show global variables like 'innodb_default_row_format';" +---------------------------+---------+ | Variable_name | Value | +---------------------------+---------+ | innodb_default_row_format | compact | +---------------------------+---------+ $ sudo mysql -u root -e "set global innodb_default_row_format = DYNAMIC;" $ sudo mysql -u root -e "show global variables like 'innodb_default_row_format';" +---------------------------+---------+ | Variable_name | Value | +---------------------------+---------+ | innodb_default_row_format | dynamic | +---------------------------+---------+
6 . 準備したファイル
$ ls --size -S1 jawiki-latest-pages-articles.xml.bz2 jawiki-latest-category.sql.gz jawiki-latest-categorylinks.sql.gz mediawiki-1.30.1/maintenance/tables.sql mwdumper/target/mwdumper-1.25.jar mysql-connector-java-5.1.47/mysql-connector-java-5.1.47-bin.jar 2.6G jawiki-latest-pages-articles.xml.bz2 182M jawiki-latest-categorylinks.sql.gz 6.1M mwdumper/target/mwdumper-1.25.jar 2.9M jawiki-latest-category.sql.gz 984K mysql-connector-java-5.1.47/mysql-connector-java-5.1.47-bin.jar 68K mediawiki-1.30.1/maintenance/tables.sql
7 . データベース作成とインポート
$ sudo mysql -u root -e "CREATE DATABASE jawiki;"
テーブル作成
$ sudo mysql -u root jawiki < mediawiki-1.30.1/maintenance/tables.sql
本文データインポート 10時間ぐらい。
$ java -jar mwdumper/target/mwdumper-1.25.jar --format=sql:1.25 ~/downloads/jawiki-latest-pages-articles.xml.bz2 | sudo mysql jawiki ... 2,324,000 pages (59.933/sec), 2,324,000 revs (59.933/sec) 2,325,000 pages (59.925/sec), 2,325,000 revs (59.925/sec) 2,326,000 pages (59.924/sec), 2,326,000 revs (59.924/sec) 2,327,000 pages (59.923/sec), 2,327,000 revs (59.923/sec) 2,327,898 pages (59.907/sec), 2,327,898 revs (59.907/sec)
カテゴリ。categorylinksは数時間。
$ gunzip -c jawiki-latest-category.sql.gz | sudo mysql -u root jawiki -- $ gunzip -c jawiki-latest-categorylinks.sql.gz | sudo mysql -u root jawiki --
8 . データ確認
MariaDB [jawiki]> select table_name, table_rows from information_schema.TABLES where table_schema = 'jawiki' order by table_rows desc limit 10; +---------------+------------+ | table_name | table_rows | +---------------+------------+ | categorylinks | 7514235 | | text | 2686111 | | page | 2302155 | | revision | 2092671 | | redirect | 690266 | | category | 233091 | | valid_tag | 0 | | log_search | 0 | | ip_changes | 0 | | uploadstash | 0 | +---------------+------------+ 10 rows in set (0.00 sec)
MariaDB [jawiki]> SELECT p.page_id AS "page_id", CAST(p.page_title AS CHAR(10000) CHARACTER SET utf8) AS "page_title", r.rev_text_id AS "revision_id", t.old_id AS "text_id", left(t.old_text,50) FROM page p INNER JOIN revision r ON p.page_latest = r.rev_id INNER JOIN text t ON r.rev_text_id = t.old_id limit 5 offset 2000; +---------+-----------------------------+-------------+----------+----------------------------------------------------+ | page_id | page_title | revision_id | text_id | left(t.old_text,50) | +---------+-----------------------------+-------------+----------+----------------------------------------------------+ | 2610 | 2010年代 | 70706915 | 70706915 | {{Otheruses||日本ローカルの事柄|2010年▒ | | 2612 | 定型詩 | 59312594 | 59312594 | '''定型詩'''(ていけいし、{{lang-en|ryth | | 2614 | 解決済みのバグ報告 | 18446452 | 18446452 | {{Archives|Wikipedia:バグの報告}} ここに▒ | | 2616 | 田沢湖 | 70736019 | 70736019 | {{Infobox 湖 |名称 = 田沢湖 |画像 =[[File: | | 2617 | 活火山 | 70040564 | 70040564 | '''活火山'''(かつかざん、かっかざ▒ | +---------+-----------------------------+-------------+----------+----------------------------------------------------+ 5 rows in set (0.00 sec)
カテゴリのリストアップ、記事タイトルとその属するカテゴリ。
9 . DBサイズ
DB最適化後
$ sudo ncdu /var/lib/mysql 16.6 GiB [##########] /jawiki
DB最適化前
19.8 GiB [##########] /jawiki
10 . バックアップとリストア
$ sudo mysqldump jawiki > dump_jawiki.sql $ sudo mysql jawiki < dump_jawiki.sql
Link: