[Inheritance and collaboration diagrams for joshua.subsample.Subsampler omitted]

Public Member Functions
	Subsampler (String[] testFiles, int maxN, int targetCount) throws IOException
void	subsample (String filelist, float targetFtoERatio, String extf, String exte, String fpath, String epath, String output) throws IOException
Static Public Member Functions
static void	main (String[] args)
Protected Member Functions
void	subsample (String filelist, float targetFtoERatio, PhraseWriter out, BiCorpusFactory bcFactory) throws IOException
Protected Attributes
Map< Phrase, Integer >	ngramCounts
int	maxN
int	targetCount
int	maxSubsample = 1500000
Static Protected Attributes
static final int	MAX_SENTENCE_LENGTH = 100
static final int	MIN_RATIO_LENGTH = 10
Private Member Functions
HashMap< Phrase, Integer >	loadNgrams (String[] files) throws IOException
void	subsample (HashMap< PhrasePair, PhrasePair > set, BiCorpus bc, int minLength, int maxLength, float targetFtoERatio)

Detailed Description

A class for subsampling a large (F,E)-parallel, sentence-aligned corpus to generate a smaller corpus whose N-grams are relevant to some seed corpus. The idea of subsampling is due to Kishore Papineni.
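
The description above can be made concrete with a small sketch. This page does not show the selection logic itself, so the following is only one plausible reading, assuming that a sentence is kept when it contains a seed N-gram that has been selected fewer than targetCount times so far. The class NgramSubsampleSketch and its methods are hypothetical names, plain strings stand in for Phrase objects, and only the F side of the corpus is modelled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; this is NOT the joshua.subsample.Subsampler source. */
class NgramSubsampleSketch {

    /** Lists every n-gram of length 1..maxN in a whitespace-tokenized sentence. */
    static List<String> ngramsOf(String sentence, int maxN) {
        List<String> result = new ArrayList<>();
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int start = 0; start < tokens.length; start++) {
            StringBuilder sb = new StringBuilder();
            for (int n = 1; n <= maxN && start + n <= tokens.length; n++) {
                if (n > 1) {
                    sb.append(' ');
                }
                sb.append(tokens[start + n - 1]);
                result.add(sb.toString());
            }
        }
        return result;
    }

    /** Registers every n-gram of the seed sentences with an initial count of 0. */
    static Map<String, Integer> seedNgrams(List<String> seedSentences, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sentence : seedSentences) {
            for (String ngram : ngramsOf(sentence, maxN)) {
                counts.put(ngram, 0);
            }
        }
        return counts;
    }

    /**
     * Assumed selection rule: keep a sentence if it contains at least one seed
     * n-gram whose count is still below targetCount, and increment the count of
     * every such n-gram. Mutates the counts map.
     */
    static List<String> select(List<String> corpusF, Map<String, Integer> counts,
                               int maxN, int targetCount) {
        List<String> kept = new ArrayList<>();
        for (String sentence : corpusF) {
            boolean useful = false;
            for (String ngram : ngramsOf(sentence, maxN)) {
                Integer seen = counts.get(ngram);
                if (seen != null && seen < targetCount) {
                    counts.put(ngram, seen + 1);
                    useful = true;
                }
            }
            if (useful) {
                kept.add(sentence);
            }
        }
        return kept;
    }
}
```

With the seed sentence "the cat", maxN = 2 and targetCount = 2, the corpus "the dog", "a bird", "the cat sat", "the end" yields "the dog" and "the cat sat" under this rule: "a bird" shares no N-gram with the seed, and by the time "the end" is reached the unigram "the" has already hit its target count.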

Author: UMD (Jimmy Lin, Chris Dyer, et al.); wren ng thornton <wren@users.sourceforge.net>

Version: $LastChangedDate$

Constructor & Destructor Documentation

joshua.subsample.Subsampler.Subsampler	(	String[]	testFiles,
		int	maxN,
		int	targetCount
	)		throws IOException

Member Function Documentation

HashMap<Phrase, Integer> joshua.subsample.Subsampler.loadNgrams ( String[] files ) throws IOException [private]

static void joshua.subsample.Subsampler.main ( String[] args ) [static]

Reimplemented in joshua.subsample.AlignedSubsampler.

void joshua.subsample.Subsampler.subsample	(	String	filelist,
		float	targetFtoERatio,
		String	extf,
		String	exte,
		String	fpath,
		String	epath,
		String	output
	)		throws IOException

The general subsampler function for external use.

Parameters:

filelist	list of source files to subsample from
targetFtoERatio	goal for ratio of output F length to output E length
extf	extension of F files
exte	extension of E files
fpath	path to source F files
epath	path to source E files
output	basename for output files (will append extensions)

void joshua.subsample.Subsampler.subsample	(	String	filelist,
		float	targetFtoERatio,
		PhraseWriter	out,
		BiCorpusFactory	bcFactory
	)		throws IOException [protected]

The main wrapper for the subsample worker. Closes the PhraseWriter before exiting.

void joshua.subsample.Subsampler.subsample	(	HashMap< PhrasePair, PhrasePair >	set,
		BiCorpus	bc,
		int	minLength,
		int	maxLength,
		float	targetFtoERatio
	)		[private]

The worker function for subsampling.

Parameters:

set	The set to put selected sentences into
bc	The sentence-aligned corpus to read from
minLength	The minimum F sentence length
maxLength	The maximum F sentence length
targetFtoERatio	The desired ratio of F length to E length
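
The length and ratio parameters above suggest a per-sentence filter. The sketch below shows how such a filter might look; it is not taken from the Joshua source. The class LengthRatioFilterSketch, the inclusive length bounds, the TOLERANCE factor, and the way MIN_RATIO_LENGTH exempts short sentences from the ratio test are all assumptions made for illustration. Only the value 10 for MIN_RATIO_LENGTH comes from this page.

```java
/** Illustrative sketch only; this is NOT the joshua.subsample.Subsampler source. */
class LengthRatioFilterSketch {

    /** Value listed on this page; how it is used here is an assumption. */
    static final int MIN_RATIO_LENGTH = 10;

    /** Hypothetical slack allowed around the target F/E ratio. */
    static final float TOLERANCE = 1.3f;

    /**
     * Returns true if a sentence pair with the given token lengths passes the
     * assumed length and ratio checks.
     */
    static boolean accept(int fLength, int eLength,
                          int minLength, int maxLength, float targetFtoERatio) {
        // Length bounds apply to the F side, as the parameter list describes.
        if (fLength < minLength || fLength > maxLength) {
            return false;
        }
        // An empty E side has no meaningful ratio.
        if (eLength == 0) {
            return false;
        }
        // Assumption: for very short pairs the ratio is too noisy to test.
        if (fLength < MIN_RATIO_LENGTH && eLength < MIN_RATIO_LENGTH) {
            return true;
        }
        float ratio = (float) fLength / eLength;
        // Accept only ratios within TOLERANCE of the target in either direction.
        return ratio <= targetFtoERatio * TOLERANCE
            && ratio * TOLERANCE >= targetFtoERatio;
    }
}
```

Under these assumptions, with a target ratio of 1.0 and F lengths limited to 1..100, a 20/20 pair passes, a 40/20 pair is rejected for being too F-heavy, and a 3/1 pair passes because both sides are shorter than MIN_RATIO_LENGTH.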

Member Data Documentation

final int joshua.subsample.Subsampler.MAX_SENTENCE_LENGTH = 100 [static, protected]

int joshua.subsample.Subsampler.maxN [protected]

int joshua.subsample.Subsampler.maxSubsample = 1500000 [protected]

final int joshua.subsample.Subsampler.MIN_RATIO_LENGTH = 10 [static, protected]

Map<Phrase, Integer> joshua.subsample.Subsampler.ngramCounts [protected]

int joshua.subsample.Subsampler.targetCount [protected]
