Joshua
open source statistical hierarchical phrase-based machine translation system
Public Member Functions
| Subsampler (String[] testFiles, int maxN, int targetCount) throws IOException |
| void | subsample (String filelist, float targetFtoERatio, String extf, String exte, String fpath, String epath, String output) throws IOException |
Static Public Member Functions
| static void | main (String[] args) |
Protected Member Functions
| void | subsample (String filelist, float targetFtoERatio, PhraseWriter out, BiCorpusFactory bcFactory) throws IOException |
Protected Attributes
| Map< Phrase, Integer > | ngramCounts |
| int | maxN |
| int | targetCount |
| int | maxSubsample = 1500000 |
Static Protected Attributes
| static final int | MAX_SENTENCE_LENGTH = 100 |
| static final int | MIN_RATIO_LENGTH = 10 |
Private Member Functions
| HashMap< Phrase, Integer > | loadNgrams (String[] files) throws IOException |
| void | subsample (HashMap< PhrasePair, PhrasePair > set, BiCorpus bc, int minLength, int maxLength, float targetFtoERatio) |
A class for subsampling a large sentence-aligned (F,E) parallel corpus to produce a smaller corpus whose N-grams are relevant to some seed corpus. The idea of subsampling is due to Kishore Papineni.
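The selection idea described above can be illustrated with a minimal, self-contained sketch. It is not Joshua's implementation: the class `NgramSubsampleSketch` and its methods are hypothetical, sentences are plain whitespace-tokenized strings rather than `Phrase` objects, and only one side of the corpus is modeled. It assumes that `maxN` bounds the N-gram order and that `targetCount` caps how often each seed N-gram may be picked up, which is one plausible reading of the member names listed above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the Joshua code base.
class NgramSubsampleSketch {

    /** Returns every n-gram of order 1..maxN, with tokens joined by single spaces. */
    static List<String> ngrams(String[] tokens, int maxN) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            StringBuilder sb = new StringBuilder();
            for (int n = 1; n <= maxN && start + n <= tokens.length; n++) {
                if (n > 1) sb.append(' ');
                sb.append(tokens[start + n - 1]);
                out.add(sb.toString());
            }
        }
        return out;
    }

    /** Registers each seed n-gram with zero selected occurrences so far. */
    static Map<String, Integer> loadSeedNgrams(List<String> seedSentences, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (String s : seedSentences) {
            for (String ng : ngrams(s.trim().split("\\s+"), maxN)) {
                counts.put(ng, 0);
            }
        }
        return counts;
    }

    /**
     * Keeps a sentence when it contains at least one seed n-gram that has been
     * selected fewer than targetCount times, and updates the counts as it goes.
     */
    static List<String> select(List<String> corpus, Map<String, Integer> counts,
                               int maxN, int targetCount) {
        List<String> kept = new ArrayList<>();
        for (String s : corpus) {
            boolean use = false;
            for (String ng : ngrams(s.trim().split("\\s+"), maxN)) {
                Integer c = counts.get(ng);
                if (c != null && c < targetCount) {
                    counts.put(ng, c + 1);
                    use = true;
                }
            }
            if (use) kept.add(s);
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = loadSeedNgrams(Arrays.asList("the cat"), 2);
        List<String> corpus = Arrays.asList(
                "the cat sat", "a dog ran", "the cat slept", "the cat ate");
        // Prints [the cat sat, the cat slept]: the last sentence adds nothing new.
        System.out.println(select(corpus, counts, 2, 2));
    }
}
```

In this sketch a sentence stops being useful once all of its seed N-grams have reached the cap, which is what keeps the output small relative to the input corpus.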
joshua.subsample.Subsampler.Subsampler (String[] testFiles, int maxN, int targetCount) throws IOException
HashMap<Phrase, Integer> joshua.subsample.Subsampler.loadNgrams (String[] files) throws IOException [private]
static void joshua.subsample.Subsampler.main (String[] args) [static]
void joshua.subsample.Subsampler.subsample (String filelist, float targetFtoERatio, String extf, String exte, String fpath, String epath, String output) throws IOException
The general subsampler function for external use.
| filelist | list of source files to subsample from |
| targetFtoERatio | goal for ratio of output F length to output E length |
| extf | extension of F files |
| exte | extension of E files |
| fpath | path to source F files |
| epath | path to source E files |
| output | basename for output files (will append extensions) |
void joshua.subsample.Subsampler.subsample (String filelist, float targetFtoERatio, PhraseWriter out, BiCorpusFactory bcFactory) throws IOException [protected]
The main wrapper for the subsample worker. Closes the PhraseWriter before exiting.
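Because the wrapper closes the writer it is given, callers should not expect to write through it afterwards. The following sketch shows that ownership pattern in isolation. The class `ClosingWrapperSketch`, its `worker` and `wrapper` methods, and the use of `java.io.PrintWriter` in place of `PhraseWriter` are all assumptions made for illustration; the real method may be structured differently.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of the Joshua code base.
class ClosingWrapperSketch {

    /** Stand-in worker: writes each selected sentence on its own line. */
    static void worker(List<String> selected, PrintWriter out) {
        for (String s : selected) {
            out.print(s);
            out.print('\n');
        }
    }

    /** Wrapper: delegates to the worker and closes the writer on every exit path. */
    static void wrapper(List<String> selected, PrintWriter out) {
        try {
            worker(selected, out);
        } finally {
            out.close();
        }
    }

    public static void main(String[] args) {
        StringWriter buffer = new StringWriter();
        wrapper(Arrays.asList("a b", "c d"), new PrintWriter(buffer));
        System.out.print(buffer); // the two sentences, one per line
    }
}
```

Closing in a `finally` block means the output is flushed and released even if the worker throws.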
void joshua.subsample.Subsampler.subsample (HashMap< PhrasePair, PhrasePair > set, BiCorpus bc, int minLength, int maxLength, float targetFtoERatio) [private]
The worker function for subsampling.
| set | The set to put selected sentences into |
| bc | The sentence-aligned corpus to read from |
| minLength | The minimum F sentence length |
| maxLength | The maximum F sentence length |
| targetFtoERatio | The desired ratio of F length to E length |
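The length and ratio parameters suggest a per-sentence filter applied before any N-gram test. A minimal sketch of such a filter follows. The constant values 100 and 10 are the ones listed in this class reference; everything else is an assumption: the class name `LengthRatioFilterSketch`, the `accept` method, the 1.3 tolerance factor, the reading of `MIN_RATIO_LENGTH` as the shortest F length at which the ratio is checked, and the convention that a non-positive target ratio disables the check.

```java
// Hypothetical illustration only; not part of the Joshua code base.
class LengthRatioFilterSketch {
    static final int MAX_SENTENCE_LENGTH = 100; // value listed in the class reference
    static final int MIN_RATIO_LENGTH = 10;     // value listed in the class reference
    static final float RATIO_TOLERANCE = 1.3f;  // assumed tolerance, chosen for illustration

    /**
     * Returns true when a sentence pair with the given F and E token counts
     * passes the length window and, for long enough F sentences, the ratio test.
     */
    static boolean accept(int fLength, int eLength,
                          int minLength, int maxLength, float targetFtoERatio) {
        // Reject empty or overlong E sentences.
        if (eLength == 0 || eLength > MAX_SENTENCE_LENGTH) return false;
        // Reject F sentences outside the requested window or over the global cap.
        if (fLength == 0 || fLength < minLength || fLength > maxLength
                || fLength > MAX_SENTENCE_LENGTH) return false;
        // Very short sentences give noisy ratios, so only test longer ones.
        if (fLength >= MIN_RATIO_LENGTH && targetFtoERatio > 0f) {
            float ratio = (float) fLength / eLength;
            if (ratio > targetFtoERatio * RATIO_TOLERANCE
                    || ratio * RATIO_TOLERANCE < targetFtoERatio) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(accept(20, 20, 1, 100, 1.0f)); // true: ratio matches the target
        System.out.println(accept(20, 10, 1, 100, 1.0f)); // false: ratio 2.0 is too far off
        System.out.println(accept(5, 1, 1, 100, 1.0f));   // true: too short for the ratio test
    }
}
```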
final int joshua.subsample.Subsampler.MAX_SENTENCE_LENGTH = 100 [static, protected]
int joshua.subsample.Subsampler.maxN [protected]
int joshua.subsample.Subsampler.maxSubsample = 1500000 [protected]
final int joshua.subsample.Subsampler.MIN_RATIO_LENGTH = 10 [static, protected]
Map<Phrase, Integer> joshua.subsample.Subsampler.ngramCounts [protected]
int joshua.subsample.Subsampler.targetCount [protected]