Learning Internal Representations (COLT 1995)

Baxter, Jonathan

doi:10.1145/225298.225336

Full-text links:

Download:

Current browse context:

cs.LG

< prev | next >

new | recent | 1911

Computer Science > Machine Learning

Title: Learning Internal Representations (COLT 1995)

Authors: Jonathan Baxter

(Submitted on 13 Nov 2019 (v1), last revised 19 Dec 2019 (this version, v3))

Abstract: Probably the most important problem in machine learning is the preliminary biasing of a learner's hypothesis space so that it is small enough to ensure good generalisation from reasonable training sets, yet large enough that it contains a good solution to the problem being learnt. In this paper a mechanism for {\em automatically} learning or biasing the learner's hypothesis space is introduced. It works by first learning an appropriate {\em internal representation} for a learning environment and then using that representation to bias the learner's hypothesis space for the learning of future tasks drawn from the same environment.
An internal representation must be learnt by sampling from {\em many similar tasks}, not just a single task as occurs in ordinary machine learning. It is proved that the number of examples $m$ {\em per task} required to ensure good generalisation from a representation learner obeys $m = O(a+b/n)$ where $n$ is the number of tasks being learnt and $a$ and $b$ are constants. If the tasks are learnt independently ({\em i.e.} without a common representation) then $m=O(a+b)$. It is argued that for learning environments such as speech and character recognition $b\gg a$ and hence representation learning in these environments can potentially yield a drastic reduction in the number of examples required per task. It is also proved that if $n = O(b)$ (with $m=O(a+b/n)$) then the representation learnt will be good for learning novel tasks from the same environment, and that the number of examples required to generalise well on a novel task will be reduced to $O(a)$ (as opposed to $O(a+b)$ if no representation is used).
It is shown that gradient descent can be used to train neural network representations and experiment results are reported providing strong qualitative support for the theoretical results.

Subjects:	Machine Learning (cs.LG); Machine Learning (stat.ML)
Journal reference:	COLT '95 Proceedings of the eighth annual conference on Computational learning theory (1995) 311-320
DOI:	10.1145/225298.225336
Cite as:	arXiv:1911.05781 [cs.LG]
	(or arXiv:1911.05781v3 [cs.LG] for this version)

Submission history

From: Jonathan Baxter [view email]
[v1] Wed, 13 Nov 2019 19:48:11 GMT (31kb)
[v2] Wed, 20 Nov 2019 00:53:48 GMT (32kb)
[v3] Thu, 19 Dec 2019 13:45:56 GMT (32kb)

Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?)

Link back to: arXiv, form interface, contact.

> cs > arXiv:1911.05781

Download:

Current browse context:

Change to browse by:

References & Citations

DBLP - CS Bibliography

Bookmark

Computer Science > Machine Learning

Title: Learning Internal Representations (COLT 1995)

Submission history