Joshua
open source statistical hierarchical phrase-based machine translation system
joshua.subsample.Subsampler Class Reference


Public Member Functions

 Subsampler (String[] testFiles, int maxN, int targetCount) throws IOException
void subsample (String filelist, float targetFtoERatio, String extf, String exte, String fpath, String epath, String output) throws IOException

Static Public Member Functions

static void main (String[] args)

Protected Member Functions

void subsample (String filelist, float targetFtoERatio, PhraseWriter out, BiCorpusFactory bcFactory) throws IOException

Protected Attributes

Map< Phrase, Integer > ngramCounts
int maxN
int targetCount
int maxSubsample = 1500000

Static Protected Attributes

static final int MAX_SENTENCE_LENGTH = 100
static final int MIN_RATIO_LENGTH = 10

Private Member Functions

HashMap< Phrase, Integer > loadNgrams (String[] files) throws IOException
void subsample (HashMap< PhrasePair, PhrasePair > set, BiCorpus bc, int minLength, int maxLength, float targetFtoERatio)

Detailed Description

A class for subsampling a large, sentence-aligned (F,E) parallel corpus to produce a smaller corpus whose N-grams are relevant to some seed corpus. The idea of subsampling is due to Kishore Papineni.
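The selection idea described above can be sketched roughly as follows. This is a minimal illustration, not the Joshua implementation: the class `NgramSubsampleSketch` and its methods are hypothetical, sentences are plain whitespace-tokenized strings rather than `Phrase` objects, and the rule "keep a sentence while one of its seed N-grams has been selected fewer than `targetCount` times" is an assumption suggested by the `ngramCounts`, `maxN`, and `targetCount` members.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; names and selection rule are assumptions. */
final class NgramSubsampleSketch {

    /** Collect every n-gram of order 1..maxN from a whitespace-tokenized sentence. */
    static List<String> ngrams(String sentence, int maxN) {
        List<String> out = new ArrayList<>();
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= tokens.length; i++) {
                out.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
            }
        }
        return out;
    }

    /** Register every seed n-gram with a count of zero selected occurrences. */
    static Map<String, Integer> loadSeedNgrams(List<String> seed, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sentence : seed) {
            for (String gram : ngrams(sentence, maxN)) {
                counts.put(gram, 0);
            }
        }
        return counts;
    }

    /**
     * Keep a candidate if it contains at least one seed n-gram that has been
     * selected fewer than targetCount times; then update the counts in place.
     */
    static List<String> select(List<String> candidates, Map<String, Integer> counts,
                               int maxN, int targetCount) {
        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            List<String> grams = ngrams(candidate, maxN);
            boolean useful = false;
            for (String gram : grams) {
                Integer seen = counts.get(gram);
                if (seen != null && seen < targetCount) {
                    useful = true;
                    break;
                }
            }
            if (!useful) {
                continue;
            }
            kept.add(candidate);
            // Only seed n-grams are tracked; other n-grams are ignored.
            for (String gram : grams) {
                counts.computeIfPresent(gram, (k, v) -> v + 1);
            }
        }
        return kept;
    }
}
```

In this sketch a larger `targetCount` admits more sentences per seed N-gram, so the output grows with it; the real class additionally caps the output through `maxSubsample`.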

Author:
UMD (Jimmy Lin, Chris Dyer, et al.)
wren ng thornton <wren@users.sourceforge.net>
Version:
$LastChangedDate$

Constructor & Destructor Documentation

joshua.subsample.Subsampler.Subsampler ( String[]  testFiles,
int  maxN,
int  targetCount 
) throws IOException



Member Function Documentation

HashMap<Phrase, Integer> joshua.subsample.Subsampler.loadNgrams ( String[]  files) throws IOException [private]


static void joshua.subsample.Subsampler.main ( String[]  args) [static]

Reimplemented in joshua.subsample.AlignedSubsampler.


void joshua.subsample.Subsampler.subsample ( String  filelist,
float  targetFtoERatio,
String  extf,
String  exte,
String  fpath,
String  epath,
String  output 
) throws IOException

The general subsampler function for external use.

Parameters:
filelist         list of source files to subsample from
targetFtoERatio  goal for ratio of output F length to output E length
extf             extension of F files
exte             extension of E files
fpath            path to source F files
epath            path to source E files
output           basename for output files (will append extensions)
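One plausible reading of these parameters is that each entry in filelist is a basename, and the F and E files are found by joining the directory, the basename, and the extension. The sketch below shows only that joining step. The class `CorpusPathSketch`, the `resolve` helper, and the `"."` separator are assumptions made for illustration; the page does not document how Subsampler actually builds its paths.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Illustrative sketch only; the path layout is an assumption. */
final class CorpusPathSketch {

    /** Join a directory, a basename, and an extension as dir/basename.extension. */
    static Path resolve(String dir, String basename, String extension) {
        return Paths.get(dir).resolve(basename + "." + extension);
    }
}
```

Under this assumption, a filelist entry `news` with `fpath` = `data/f` and `extf` = `zh` would map to `data/f/news.zh`, and the same helper applied to `epath` and `exte` would give the matching E file.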


void joshua.subsample.Subsampler.subsample ( String  filelist,
float  targetFtoERatio,
PhraseWriter  out,
BiCorpusFactory  bcFactory 
) throws IOException [protected]

The main wrapper for the subsample worker. Closes the PhraseWriter before exiting.


void joshua.subsample.Subsampler.subsample ( HashMap< PhrasePair, PhrasePair >  set,
BiCorpus  bc,
int  minLength,
int  maxLength,
float  targetFtoERatio 
) [private]

The worker function for subsampling.

Parameters:
set              The set to put selected sentences into
bc               The sentence-aligned corpus to read from
minLength        The minimum F sentence length
maxLength        The maximum F sentence length
targetFtoERatio  The desired ratio of F length to E length
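The length and ratio parameters suggest a per-sentence-pair filter along the following lines. This is a hedged sketch, not the worker's actual logic: the class `LengthRatioFilterSketch`, the `accept` method, and the `tolerance` parameter are hypothetical, and the page does not say how far a pair's ratio may deviate from targetFtoERatio or how MIN_RATIO_LENGTH enters the check.

```java
/** Illustrative sketch only; the tolerance rule is an assumption. */
final class LengthRatioFilterSketch {

    /**
     * Accept a sentence pair when the F length lies in [minLength, maxLength]
     * and the F/E length ratio is within a multiplicative tolerance of the target.
     */
    static boolean accept(int fLength, int eLength, int minLength, int maxLength,
                          float targetFtoERatio, float tolerance) {
        if (fLength < minLength || fLength > maxLength) {
            return false;
        }
        if (eLength == 0) {
            return false; // avoid division by zero on an empty E side
        }
        float ratio = (float) fLength / eLength;
        // The ratio may exceed or fall short of the target by at most the tolerance factor.
        return ratio <= targetFtoERatio * tolerance
            && ratio * tolerance >= targetFtoERatio;
    }
}
```

With a target of 1.0 and a hypothetical tolerance of 1.3, a pair of lengths (12, 10) passes while (20, 10) and (10, 20) are rejected as too unbalanced.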



Member Data Documentation

final int joshua.subsample.Subsampler.MAX_SENTENCE_LENGTH = 100 [static, protected]
int joshua.subsample.Subsampler.maxSubsample = 1500000 [protected]
final int joshua.subsample.Subsampler.MIN_RATIO_LENGTH = 10 [static, protected]