CU-SeeMe Desktop VideoConferencing Software
by Tim Dorcey, Cornell University
From Connexions, Volume 9, No. 3, March 1995
Introduction

CU-SeeMe is a desktop videoconferencing system designed for use on the Internet or other IP networks. It runs on Macintosh and Windows platforms and requires no special equipment for video reception beyond a network connection and a gray-scale monitor. Video transmission requires a camera and digitizer, a combined version of which can be purchased for the Macintosh for under $100. CU-SeeMe video is represented on a 16-level gray scale and is provided in either of two resolutions: 320 x 240 pixels (half the linear resolution of NTSC television) or 160 x 120 pixels. At this writing, audio is available in the Macintosh version only (ed. note: audio for Windows was released in August 1995), with audio processing adapted from the "Maven" program written by Charlie Kline at the University of Illinois. When network conditions or equipment deficiencies prohibit reliable audio, ordinary telephone connections can be employed. In addition to basic audio/video services, CU-SeeMe offers crude whiteboard capabilities in the form of a "slide window" that transmits full-size 8-bit gray-scale still images and allows for remote pointer control. A plug-in architecture has also been developed to allow third parties to write binary modules that extend the capabilities of CU-SeeMe. Two-party CU-SeeMe conferences can be initiated by one participant connecting directly to the other, whereas larger conferences require that each participant connect to a "CU-SeeMe reflector," a UNIX computer running CU-SeeMe reflector software that replicates and redistributes the packet streams.

The main objective in the development of CU-SeeMe was to produce an inexpensive videoconferencing tool that would be usable today. As well as providing direct benefit to its users, we expected that valuable lessons could be learned about how videoconferencing actually works in practice, how the experience should be organized, what features are necessary to support multi-party conferencing, and so on. While others worked to advance the state of the art in video compression, high-speed networking, and other low-level technologies necessary to support high-quality videoconferencing, we hoped to facilitate the accumulation of experience that would provide impetus for those efforts and guide their direction. Similar efforts have focused on UNIX workstations, for which several tools are currently available, including "nv" [1], "ivs" [2], and "vic" [3]. In fact, it was Paul Milazzo's [4] demonstration of such a tool in 1991 that inspired development of CU-SeeMe. However, it is our belief that the value of a communication tool is largely determined by the number of people that can be reached with it. We sought to increase the accessibility of videoconferencing by focusing on low-end, widely available computing platforms. Currently, CU-SeeMe can be found in places ranging from grade schools to national laboratories, often with a connection between them. It has appeared in over 40 countries [5] and on every continent, including Antarctica [6].

This article presents a brief overview of two central components of the CU-SeeMe software: conference control and video encoding.

Conference Control

Each participant in a CU-SeeMe conference periodically transmits a single packet that advertises their interests with respect to all of the other participants.
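The detailed contents of these advertisements are described next. Purely as an illustration of the kind of state they carry (the structures and field names below are hypothetical and do not reproduce CU-SeeMe's actual wire format), such a packet might be organized as follows:

    #include <stdint.h>

    /* Hypothetical sketch only; the real CU-SeeMe wire format is not
     * reproduced here.  The shape follows the text: a header common to
     * all CU-SeeMe packets, general information about the sender, then
     * one interest entry per other participant the sender knows of.    */

    typedef struct {
        uint16_t version;
        uint16_t packet_type;        /* e.g., an OpenContinue type code  */
        uint32_t sender_id;
    } CuHeader;

    typedef struct {
        char     name[32];           /* sender's display name            */
        uint8_t  sends_video;
        uint8_t  sends_audio;
        uint16_t entry_count;        /* number of interest entries       */
    } SenderInfo;

    typedef struct {
        uint32_t participant_id;     /* whom these interests refer to    */
        uint8_t  want_video;
        uint8_t  want_audio;
        uint8_t  want_slides;
        uint8_t  reserved;
    } InterestEntry;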
These advertisements are termed "OpenContinue" packets, in recognition of the fact that in a connectionless protocol the information necessary to open a connection is no different from that used to continue it. The OpenContinue packet consists of a standard header that is common to all CU-SeeMe packets, followed by a section of general information about the issuing participant. Then, for each other participant that the sender is aware of, follows a collection of variables that express the sender's interests with respect to that participant (e.g., I want their video, I want their audio, I want their slides, etc.). Reflectors examine OpenContinue packets to develop source-specific distribution lists for the various media involved in a conference and then forward them to all other participants. Because the protocol requires each participant to process dynamic status information for every other conference participant, it does not scale well to large conferences, say larger than about 30 participants. However, it does provide considerable control, in a robust fashion, over the details of smaller conferences. Furthermore, various possibilities, beyond the scope of this discussion, exist for extending the protocol to larger conferences.

The primary motivation for developing the reflector software was the absence of multicast capabilities on the Macintosh. We have therefore been careful about extending its role beyond the replication and distribution of packets, allowing it to add value where it can, but avoiding dependence on it. We have, however, increasingly come to appreciate the degree to which the reflector architecture allows for fine tuning of the data streams sent to each recipient, and we do not expect this to become any less important when multicast becomes more widely available.

Video Encoding

The predominant objective here was to devise algorithms that would be fast enough for real-time videoconferencing on the typical Macintosh platforms available in mid-1992, which mainly consisted of 68020- and 68030-based machines. The decoding algorithm, in particular, needed to be extremely efficient in order to support multiple incoming video streams. The main technique for achieving these goals was to begin with a massive decimation of the input video signal, and then to process what remained in a manner that took maximal advantage of the capabilities of the target processors, as described below. For simplicity, discussion will focus on the smaller video format, which has become the most popular in practice.

Video processing proceeds in three basic steps: 1) decimation, 2) change detection, and 3) spatial compression. The first step is to decimate the captured 640 x 480 pixel video frame down to 160 x 120, with each pixel represented on a 4-bit gray scale. In comparison to full-size, 16-bit color, this represents a 64:1 reduction in the amount of data to be handled by subsequent processing. The user is provided with brightness and contrast controls to adjust the mapping of input intensities to the 16-level gray scale. With proper adjustment and reasonable lighting conditions, surprisingly good picture quality can be achieved. Next, the video frame is subdivided into 8 x 8 pixel squares, and a square is selected for transmission if it differs sufficiently from the version of it that was transmitted most recently.
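The following is a minimal sketch of the first two steps, under assumptions the article does not spell out: simple subsampling for the decimation, an illustrative color-to-gray and brightness/contrast mapping, and a plain sum-of-absolute-differences threshold in place of the penalized index described next.

    #include <stdint.h>
    #include <stdlib.h>

    #define IN_W   640
    #define IN_H   480
    #define OUT_W  160
    #define OUT_H  120
    #define SQ       8               /* squares are 8 x 8 pixels         */

    /* Decimate a 640 x 480, 16-bit frame to 160 x 120, 4-bit gray by
     * sampling every 4th pixel; the 5-5-5 color-to-luminance step and
     * the brightness/contrast mapping are illustrative assumptions.    */
    void decimate(const uint16_t *in, uint8_t *out,
                  int brightness, int contrast)
    {
        for (int y = 0; y < OUT_H; y++) {
            for (int x = 0; x < OUT_W; x++) {
                uint16_t p = in[(y * 4) * IN_W + (x * 4)];
                int r = (p >> 10) & 0x1f, g = (p >> 5) & 0x1f, b = p & 0x1f;
                int lum = (r * 30 + g * 59 + b * 11) / 100;    /* 0..31 */
                int v = ((lum - 16) * contrast) / 16 + brightness;
                if (v < 0)  v = 0;               /* clamp to 16 levels  */
                if (v > 15) v = 15;
                out[y * OUT_W + x] = (uint8_t)v;
            }
        }
    }

    /* Decide whether the 8 x 8 square at (sx, sy) differs enough from
     * the version most recently transmitted to be sent again.  A plain
     * sum of absolute differences is used here; the actual index adds
     * a penalty for differences that occur near one another.           */
    int square_changed(const uint8_t *cur, const uint8_t *sent,
                       int sx, int sy, int threshold)
    {
        int diff = 0;
        for (int y = 0; y < SQ; y++)
            for (int x = 0; x < SQ; x++) {
                int i = (sy * SQ + y) * OUT_W + (sx * SQ + x);
                diff += abs((int)cur[i] - (int)sent[i]);
            }
        return diff > threshold;
    }

In the encoder loop, a square flagged by square_changed(), or one whose refresh interval has expired as described below, would then be handed to the spatial compression step.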
The index used to measure square similarity is the sum of the absolute values of all 64 pixel differences, with an additional multiplicative penalty for pixel differences that occur near one another. Inclusion of the multiplicative penalty was based on the assumption that changes in adjacent pixels are more visually significant than isolated changes. Its exact form was dictated by computational convenience, devised so as not to introduce any additional computational burden except during the initialization of a look-up table at program load time. To account for the possibility that updates may be lost in transit, transmission is also triggered if a square has not been transmitted for a specified number of frames (the refresh interval). This ensures that a lost update will not corrupt the image into the indefinite future.

Once a square has been selected for transmission, a locally developed lossless compression algorithm is applied (sketched below). The most interesting feature of this algorithm is the degree of parallelism it achieves by manipulating rows of eight 4-bit pixels as 32-bit words. This allows for high-speed performance on a 32-bit processor, but also complicates exposition of the algorithm. The basic idea is that a square row is often similar to the row above it, and, when it is different, it is likely to be different in a consistent way across columns. Letting r[i] represent a 32-bit word containing the ith row of pixels in a square, compression is based upon the representation r[i] = r[i-1] + d[i], where d[i] is constructed from either a 4-, 12-, 20-, or 36-bit code. If d[i] is thought of as being composed of eight 4-bit pixel differences, then spatial redundancy in the vertical direction suggests that the differences will all be near 0, whereas correlation in the horizontal direction suggests that they will be near in value to each other. Under those assumptions, the sorts of d[i] that are most likely to occur can be predicted, and a scheme devised to represent them using a relatively small number of bits. Roughly speaking, for each d[i], a 4-bit code is used to specify a) a common component of all pixel differences (restricted to the range [-2,2]) and b) whether there are 0, 8, 16, or 32 bits of additional data to represent deviations around that common component. In reality, of course, d[i] is not composed of individual pixel differences, since carry bits can occur in the 32-bit arithmetic, but the technique still seems to work reasonably well, achieving around 40% compression (compressed size is approximately 60% of the original). Although a 40% compression ratio may not appear impressive, recall that this is for images that have already undergone a 64:1 decimation from the original, and that the information discarded at the outset was that most suitable for compression.

The CU-SeeMe video encoding has proven to be surprisingly robust against packet loss when the subject matter is a typical talking head. Often, the only observable effect of high packet loss is a reduction in frame rate. This can be explained as follows. First, after decimation and compression, it is almost always the case that the information required to update a frame will fit within a single (less than 1500 byte) packet; hence a lost packet corresponds to a lost frame update rather than a partial frame update. Second, when the subject is a talking head, most squares are either changing every frame or not at all, so a square update that is lost was likely to be replaced in the next frame anyway; only when a square changes, but then does not change in the next frame, will corruption occur.
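Returning to the compression step, here is a minimal sketch of the decoding side of the row-differencing scheme described above. The article does not give the exact bit layout of the 4-, 12-, 20- and 36-bit codes, so the RowCode structure below stands in for an already-parsed code; what the sketch does show is the word-parallel reconstruction r[i] = r[i-1] + d[i], in which a single 32-bit addition updates all eight 4-bit pixels of a row at once.

    #include <stdint.h>

    /* Each row of an 8 x 8 square is eight 4-bit pixels packed into one
     * 32-bit word, and rows are reconstructed as r[i] = r[i-1] + d[i].
     * RowCode is an illustrative stand-in for an already-parsed 4-, 12-,
     * 20- or 36-bit code; the exact bit layout is not given in the text. */
    typedef struct {
        int      common;   /* common component of the 8 differences, -2..2 */
        uint32_t extra;    /* per-pixel deviations, unpacked from the 0, 8,
                              16 or 32 extra bits (0 if none were sent)    */
    } RowCode;

    void decode_square(uint32_t rows[8], const RowCode codes[8])
    {
        /* Assume rows[0] has already been reconstructed (e.g., sent intact). */
        for (int i = 1; i < 8; i++) {
            /* d[i] applies the common component to every pixel at once
             * (the 32-bit multiply achieves this arithmetically, even
             * for negative values) and then adds the per-pixel
             * deviations.  Carries between nibbles can occur but are
             * simply tolerated, as noted in the text.                  */
            uint32_t d = (uint32_t)codes[i].common * 0x11111111u
                         + codes[i].extra;
            rows[i] = rows[i - 1] + d;  /* updates all eight pixels at once */
        }
    }

On the encoding side, presumably the smallest of the four code sizes able to represent a row's deviations would be chosen for each row.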
The observation that most squares change either every frame or not at all also suggests a simple method for embedding several frame rates within a single video stream. Say, every third frame, transmit a square if it is different from the preceding frame OR if it is different from the frame three frames earlier. Recipients who desire a slower frame rate could accept every third update and still get a clean video image. The observations regarding packet loss suggest that this would not introduce a great deal of additional traffic, i.e., a square that differs from three frames ago is also likely to differ from the preceding frame. Some variation on this scheme will be implemented in an upcoming version of CU-SeeMe to better support conferences involving participants with differing network capacity.

Conclusions

CU-SeeMe employs a conference control protocol that has proven to be quite robust and allows for the expression of detailed state regarding the relations of each conference participant to each other participant. In conjunction with the reflector software, it allows for customized distribution of conference media, so that nothing is transmitted unless it will be used. The protocol is limited in the size of conference it can serve, but it can be extended. CU-SeeMe video is encoded in an ad hoc format that was designed for a particular family of desktop machines that was widespread in mid-1992. What it lacks in mathematical elegance, it makes up for in quickness. As computing power increases, it will eventually become obsolete. Nonetheless, it played an essential role in making CU-SeeMe interesting enough to warrant further work, and it, or its derivatives, will continue to play an important role for some time. CU-SeeMe can be obtained from ftp://cu-seeme.cornell.edu/pub/video.

Notes and References

[1] Frederick, Ron. "Experiences with real-time software video compression." In Proceedings of the Sixth International Workshop on Packet Video, Portland, 1994.
[2] Turletti, Thierry. "The INRIA Videoconferencing System (IVS)." Connexions, Vol. 8, No. 10, 1994.
[3] McCanne, Steve, and Van Jacobson. vic manual page. Available at ftp://ftp.ee.lbl.com/conferencing/vic as of December 1994.
[4] Milazzo, Paul. Informal demonstration of dvc at the December 1991 meeting of the IETF in Santa Fe.
[5] This estimate is based on subscribers to the CU-SeeMe mailing list as of January 1995.
[6] "Tomorrow's TV Today." Time, October 10, 1994, p. 24.