The STRAND Bilingual Databases



STRAND (Structural Translation Recognition for Acquiring Natural Data) is a system for automatically acquiring pairs of documents in parallel translation on the World Wide Web. It is very accurate: evaluation of the most recent version of the system suggests that, on average, no more than 1 in 20 document pairs will be deemed to be translations when in fact they are not (Resnik and Smith, 2002).

We would very much like to distribute the parallel corpora acquired by STRAND directly, but because material on the Web is subject to copyright restrictions, this is unfortunately not possible. Instead, this page provides the next best thing: databases of URL pairs acquired by STRAND, which let you download the documents yourself for personal use.

For the older databases (prior to July 2002), it is likely that more and more of the pages will have become unavailable, or that the underlying content will have changed. Use at your own risk! Beginning with the July 2002 English-Arabic bilingual database, however, we are generally providing persistent URLs: they point to pages at the Internet Archive, which provides permanently retrievable time-stamped snapshots of Web pages.
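
To illustrate what such a persistent URL carries: Internet Archive snapshot URLs generally take the form `http://web.archive.org/web/<timestamp>/<original URL>`, where the timestamp is 14 digits (YYYYMMDDhhmmss). The Python sketch below splits one into its parts. The function name and the example URL are hypothetical, and the URLs in a given database may deviate from this pattern.

```python
import re
from datetime import datetime

# Snapshot URL pattern: a 14-digit timestamp followed by the original URL.
_SNAPSHOT = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def split_snapshot_url(url):
    """Return (capture_time, original_url), or None if url is not a snapshot URL."""
    match = _SNAPSHOT.match(url)
    if match is None:
        return None
    stamp, original = match.groups()
    # Interpret the 14 digits as year, month, day, hour, minute, second.
    return datetime.strptime(stamp, "%Y%m%d%H%M%S"), original
```

For example, `split_snapshot_url("http://web.archive.org/web/20020715120000/http://www.example.com/index.html")` would separate the capture time from the address of the page that was archived.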

You are welcome to use any of the STRAND Bilingual Databases for any purpose, research or commercial, so long as you visibly acknowledge its use in any product or marketing literature and cite the following in any publications:

Philip Resnik and Noah A. Smith, The Web as a Parallel Corpus, Computational Linguistics, Volume 29, Issue 3 (September 2003), pages 349-380.
That article gives technical details and performance assessments, and includes a discussion of related work on bitext mining by other researchers. It is a revised/extended version of

For earlier papers on STRAND, see

By downloading the database and/or associated scripts or programs, you agree to the following:

			      NO WARRANTY

    WE PROVIDE ABSOLUTELY NO WARRANTY OF ANY KIND EITHER EXPRESSED OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
    MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.  THE ENTIRE RISK
    AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM OR DATABASE IS WITH
    YOU.  SHOULD THIS PROGRAM OR DATABASE PROVE DEFECTIVE, YOU ASSUME THE
    COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

    IN NO EVENT SHALL THE AUTHOR BE LIABLE TO YOU FOR DAMAGES, INCLUDING
    ANY LOST PROFITS, LOST MONIES, OR OTHER SPECIAL, INCIDENTAL OR
    CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE
    (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA OR ITS ANALYSIS
    BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY THIRD PARTIES) THE
    PROGRAM OR DATABASE.

    (Above NO WARRANTY modified from the GNU NO WARRANTY statement.)

Please report any bugs or problems to Philip Resnik, resnik@umiacs.umd.edu.


Obtaining and Using the STRAND Bilingual Databases

  1. Obtain the Wget package. Wget is a free network utility for retrieving files from the World Wide Web, available under the terms of the GNU General Public License. If you don't have it, information on obtaining it can be found at http://www.gnu.org/software/wget/wget.html. (Wget is very easy to obtain and install!)

  2. Download retrieve_database.pl, and replace the path to the Perl executable in the second line (e.g. /usr/imports/bin/perl5) with the fully qualified executable path for Perl on your system.

  3. Select one of the databases below and download it.

  4. Run retrieve_database.pl database dest-dir. The first argument is the name of the database file you just downloaded, and the second is the directory where you want to put the downloaded documents. The script outputs one line per pair, indicating whether retrieval succeeded or failed (e.g., because one or both pages are no longer available).
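
The internals of retrieve_database.pl are not shown on this page, but the retrieval loop described in step 4 can be sketched roughly as follows. This is a minimal Python sketch, not the actual script. It assumes a hypothetical database format of one pair per line with the two URLs separated by whitespace, and the output file names (`00001.a.html`, `00001.b.html`) and the `OK`/`FAILED` status lines are illustrative only. It relies only on Wget's `-q` (quiet) and `-O` (output file) options.

```python
import subprocess
from pathlib import Path


def parse_pairs(lines):
    """Yield (url_a, url_b) from lines holding two whitespace-separated URLs.

    ASSUMPTION: the real database format may differ; adjust as needed.
    Blank lines and lines without exactly two fields are skipped.
    """
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            yield fields[0], fields[1]


def fetch(url, dest):
    """Fetch url into dest with wget; return True if wget exits with status 0."""
    result = subprocess.run(["wget", "-q", "-O", str(dest), url])
    return result.returncode == 0


def retrieve(database, dest_dir, fetch=fetch):
    """Download both pages of every pair and return one status line per pair."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    report = []
    with open(database, encoding="utf-8") as handle:
        for n, (url_a, url_b) in enumerate(parse_pairs(handle), start=1):
            # Attempt both downloads even if the first one fails.
            ok_a = fetch(url_a, dest / f"{n:05d}.a.html")
            ok_b = fetch(url_b, dest / f"{n:05d}.b.html")
            status = "OK" if ok_a and ok_b else "FAILED"
            report.append(f"{status}\t{url_a}\t{url_b}")
    return report
```

Calling `retrieve("database.txt", "dest-dir")` and printing the returned lines would mirror the one-line-per-pair report described above. The `fetch` parameter exists so the download step can be swapped out, for instance to try the loop without network access.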

Last modified July 19, 2002