September 21, 2015
Sampling Rate for Computing the Expression Threshold in a Gene Expression Signal
Gene expression data are generated by expensive assays such as DNA microarrays, quantitative polymerase chain reaction (qPCR), or next-generation sequencing experiments. The cost of such assays limits the number of data points available for studies of gene expression, so analysis approaches often suffer from an insufficient number of samples. One fundamental operation on gene expression data is binarization, which is the basis for building Boolean networks, which in turn are vital for many biological studies. This research aims to find the number of samples needed specifically for binarization. We take an experimental approach in which the data are binarized several times: first on the original data obtained from the assays, then on virtual data obtained by interpolating the original data at several levels. We expect the thresholds produced by the binarization algorithm to converge beyond a certain level of interpolation; the number of samples above which the thresholds converge defines an efficient sampling rate. In parallel, a supporting mathematical approach treats the gene expression data as signals and applies the Nyquist sampling theorem to validate the resulting sampling rate.
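The experimental procedure above can be sketched in a few lines. This is only an illustration: the particular threshold rule (an iterated midpoint between the class means, Otsu-like) and the use of linear interpolation are assumptions, not the algorithm used in the study.

```python
import numpy as np

def binarization_threshold(samples):
    """Simple two-class threshold: iterate the midpoint between the means
    of the values below and above the current threshold until it stabilizes.
    (Illustrative stand-in for the study's binarization algorithm.)"""
    t = samples.mean()
    for _ in range(50):
        lo, hi = samples[samples <= t], samples[samples > t]
        new_t = 0.5 * (lo.mean() + hi.mean())
        if abs(new_t - t) < 1e-9:
            break
        t = new_t
    return t

def thresholds_by_interpolation(samples, levels):
    """Re-binarize after upsampling the signal by linear interpolation,
    one threshold per interpolation level."""
    x = np.arange(len(samples))
    out = []
    for k in levels:
        xi = np.linspace(0, len(samples) - 1, k * len(samples))
        out.append(binarization_threshold(np.interp(xi, x, samples)))
    return out

# toy expression profile; watch the thresholds settle as sampling densifies
expr = np.array([0.1, 0.2, 1.8, 2.1, 0.3, 1.9, 2.2, 0.2])
print(thresholds_by_interpolation(expr, [1, 2, 4, 8]))
```

When the thresholds stop changing past some level k, k times the original sampling density is taken as a sufficient rate.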
September 28, 2015
SmartStorage: A Deduplicated and Distributed Datastore
Datastores contain redundant copies of information, such as identical files or sub-file regions, possibly stored on a single host, on a shared storage cluster, or backed up to secondary storage. Deduplication storage systems take advantage of this redundancy to reduce the space needed to hold the file system. The purpose of this work is to present the design and implementation of a distributed datastore that uses deduplication. Scalable, highly reliable distributed systems supporting data deduplication have recently become popular for storing backup and archival data. The concept of a filerecipe is central to this approach. The filerecipe for a file is a synopsis containing a list of data block identifiers, where each identifier is a cryptographic hash over the contents of its block. Once the data blocks identified in a recipe have been obtained, they can be combined as prescribed by the recipe to reconstruct the file, so files can be replaced by their filerecipes. Data deduplication systems discover redundancies between data blocks. Our approach divides data into 8KB chunks and identifies redundancies via fingerprints. This improves storage capacity by increasing the storage efficiency ratio (bytes of actual file / bytes used to store). We will reexamine traditional choices, explore new design points, and discuss aspects of the design and implementation.
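The chunk/fingerprint/filerecipe pipeline described above can be sketched as follows. This is a minimal single-host sketch with fixed-size 8KB chunks and an in-memory block store; a real distributed system would shard the store across hosts and might use content-defined chunking instead.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # 8KB fixed-size chunks, as in the abstract

store = {}  # fingerprint -> chunk bytes (the deduplicated block store)

def write_file(data: bytes):
    """Split data into chunks, store each unique chunk once, and return
    the filerecipe: the ordered list of chunk fingerprints."""
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()  # cryptographic fingerprint
        store.setdefault(fp, chunk)             # deduplicate: store once
        recipe.append(fp)
    return recipe

def read_file(recipe):
    """Reconstruct the file by concatenating the chunks named in the recipe."""
    return b"".join(store[fp] for fp in recipe)

# three identical chunks plus one distinct chunk: 4 recipe entries, 2 stored
data = b"A" * CHUNK_SIZE * 3 + b"B" * CHUNK_SIZE
recipe = write_file(data)
assert read_file(recipe) == data
print(len(recipe), "chunks in recipe,", len(store), "chunks stored")
```

Here the storage efficiency ratio is 4:2, since the three identical chunks occupy a single stored block.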
October 5, 2015
Unsupervised Outlier Detection using Bayesian Networks and Probabilistic Association Rules
Databases are constantly increasing in size and complexity due to new data acquisition technologies. One challenge is how to detect outliers in these large data sets.
Outlier detection is crucial for extracting valid information from data, since dirty data can lead to wrong conclusions (garbage in, garbage out). A novel approach combining Bayesian networks and probabilistic association rules is applied to this task in an unsupervised setting. The Bayesian network organizes the information so as to capture both correlation and causality in the feature space, while probabilistic association rules conditioned on the Bayesian network provide mechanisms to detect and explain outliers, focusing on two types of rules: (i) low support and high confidence, and (ii) high support and low confidence. To speed up learning the Bayesian network and reporting outliers based on probabilistic association rules in large data sets, this talk explores the application of distributed computing through the Map-Reduce paradigm. Results on benchmark data sets demonstrate that data points satisfying either of the two rule types conditioned on the Bayesian network are strong candidates to be outliers.
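The two rule patterns can be made concrete with the standard support and confidence measures over a set of transactions. The thresholds below, and the choice of which support to test (joint support for pattern (i), antecedent support for pattern (ii)), are illustrative assumptions rather than the talk's definitions.

```python
def support(transactions, items):
    """Fraction of transactions containing all of `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """conf(A -> B) = supp(A u B) / supp(A)."""
    sa = support(transactions, antecedent)
    sab = support(transactions, set(antecedent) | set(consequent))
    return sab / sa if sa else 0.0

def rule_pattern(transactions, antecedent, consequent, lo=0.1, hi=0.8):
    """Classify a rule A -> B into the two outlier-indicating patterns
    named in the abstract (thresholds lo/hi are illustrative)."""
    s_joint = support(transactions, set(antecedent) | set(consequent))
    s_ante = support(transactions, antecedent)
    c = confidence(transactions, antecedent, consequent)
    if s_joint < lo and c > hi:
        return "low support & high confidence"
    if s_ante > hi and c < lo:
        return "high support & low confidence"
    return None

# a rare but perfectly reliable co-occurrence: pattern (i)
tx = [{"a", "b"}] + [{"c"}] * 19
print(rule_pattern(tx, {"a"}, {"b"}))
```

New data points that match such a rule (conditioned on the network structure) are then reported as outlier candidates.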
October 19, 2015
A Survey of Select Efficiency Enhancing Techniques in Microprocessor Design
Consumer demand for increased computational throughput and battery life in mobile devices is insatiable. In an effort to meet this demand, computer architects apply techniques such as power-gating, dynamic cache relocation and branch prediction. Computer architects also leverage functional units and accelerators designed to perform specialized tasks with minimal power. Nevertheless, studies show that optimization at the architectural level alone cannot sustainably meet the demand since circuitry and software choices play a significant role in overall power consumption. To continually meet consumer demand, an interdisciplinary approach at each design layer is recommended. In this talk, we will discuss recent energy-first design techniques (power-gating, dynamic cache relocation and branch prediction) and specialized units used to meet the demand for increased computational throughput and battery life.
October 26, 2015
Preliminary Results to the Application of Classical Pattern Recognition Techniques to Hyperspectral Images of Non-melanoma Skin Cancer Tissue Samples
Non-melanoma skin cancers are the most common human cancers. However, since the introduction of Mohs Micrographic Surgery (MMS), they have a high cure rate by means of excision. The procedure requires the patient to undergo skin excision and wait until the skin sample is frozen, sliced, and stained before the surgeon can determine microscopically (by histopathology) whether another stage of excision is required to remove the complete mass. The procedure can become lengthy and uncomfortable, since the patient is often left waiting while the histopathology is prepared. By analyzing the spectral characteristics of thick slabs of tissue, we propose an alternative to histopathology: fresh tissue samples are stained and analyzed by a polarized hyperspectral imaging system as a means of detecting malignant tissue. The detectors are the classical Euclidean distance classifier, the maximum likelihood classifier, the Tanimoto distance classifier, and a C-means variant of K-means that incorporates covariance. Two separate experiments were performed. First, a number of tissue samples were labeled via histopathology and used to generate a sample base for two different histopathological stains: Toluidine Blue and Methylene Blue. Second, a number of tissue samples were freshly imaged and processed via unsupervised methods; this was done for samples stained with Tetracycline and for unstained samples. The experiments yielded encouraging results. A significant portion of all cancerous pixels were correctly classified in the stained samples, and the unsupervised classification showed significant correlation with the histopathology. These results encourage further study in this area, as well as the development of more expedient ways to capture, process, and display the results so as to further assist the surgeon in this procedure.
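The simplest of the detectors named above, the Euclidean distance classifier, assigns each pixel spectrum to the class whose mean (training) spectrum is nearest. A minimal sketch on toy 3-band "spectra" (the data and labels are invented for illustration):

```python
import numpy as np

def train_means(X, y):
    """Per-class mean spectra from labeled pixel spectra (rows of X)."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def euclidean_classify(means, pixels):
    """Assign each pixel spectrum to the class with the nearest mean."""
    classes = list(means)
    d = np.stack([np.linalg.norm(pixels - means[c], axis=1) for c in classes])
    return np.array(classes)[d.argmin(axis=0)]

# toy 3-band spectra: class 0 = dark tissue, class 1 = bright tissue
X = np.array([[0.1, 0.2, 0.1], [0.2, 0.1, 0.2],
              [0.9, 0.8, 0.9], [0.8, 0.9, 0.8]])
y = np.array([0, 0, 1, 1])
means = train_means(X, y)
print(euclidean_classify(means, np.array([[0.15, 0.15, 0.15],
                                          [0.85, 0.85, 0.85]])))
```

The maximum likelihood classifier refines this by weighting distances with per-class covariance matrices; the unsupervised C-means variant estimates the class means from the data itself rather than from histopathology labels.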
November 2, 2015
Copula Bayesian Network to Estimate Rainfall Using Radar and Satellite Data
Measurements of the precipitation phenomenon from different types of instruments may help in understanding the dependence among the variables used in rain prediction. The measurements come from a Doppler radar and from meteorological satellite channels, which can be of active or passive type. This talk explores the application and adaptation of the Copula Bayesian Network (CBN) to develop models of the spatial and temporal behavior of rain through a joint probability function over variables such as radar reflectivity and cloud-top temperature. One of the difficulties of such prediction is the low temporal resolution of passive satellites (two passes per day), whereas the radar takes measurements approximately every 15 minutes. This problem can be addressed by a computational model based on copula Bayesian networks, fed with radar data and satellite channel signals. Rain predictions from this type of model can help prevent environmental disasters, structural damage, and loss of human life.
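The key idea behind a copula model is that the dependence structure between variables is specified separately from their marginal distributions. A minimal sketch of a two-variable Gaussian copula, with the second uniform margin mapped through an exponential inverse CDF as a stand-in "rain rate" (the correlation, marginal, and parameter values are illustrative assumptions, not the talk's model):

```python
import math
import random

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gaussian_copula_pair(rho, n=1000, seed=0):
    """Draw n dependent (u, v) pairs on [0,1]^2 whose dependence is a
    Gaussian copula with correlation rho. Any marginals (e.g. for
    reflectivity and rain rate) can then be attached via inverse CDFs."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho * rho) * rng.gauss(0, 1)
        pairs.append((norm_cdf(z1), norm_cdf(z2)))
    return pairs

pairs = gaussian_copula_pair(0.8)
# attach an exponential marginal (mean 2 mm/h) to the second variable
rain = [-2.0 * math.log(1.0 - v) for _, v in pairs]
```

In the CBN, each edge of the network carries such a copula, so the radar variables (frequent) and satellite variables (sparse) can be tied together in one joint distribution despite their different temporal resolutions.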
November 9, 2015
Fully Homomorphic Encryption
One of the principal concerns of a potential client of cloud computing is data security. What are the risks involved when you trust your data to the cloud? Cloud data is typically stored in a massive data store, such as Google's Bigtable, and mingled with the data of other clients. To keep your data private, you must encrypt it. But if the data is to be processed by the cloud, how can this take place without decrypting the data? This is a difficult problem that has been considered by many researchers; the challenge is how data can be manipulated or queried without being revealed. Various solutions have been offered for various special cases. A major breakthrough occurred in June 2009, when IBM researcher Craig Gentry discovered a fully "homomorphic" encryption scheme. An encryption E is homomorphic with respect to an operation ∘ if E(x ∘ y) = E(x) • E(y) for some operation • on the encrypted images. IBM's new scheme is fully homomorphic in the sense that one can evaluate arbitrary circuits over encrypted data without decrypting. If this idea can be implemented in a practical way, it will completely solve the problem of processing encrypted data. However, the new scheme requires a huge amount of computation, and there are doubts about whether it can be made practical.
Encryption systems currently in common use are homomorphic in some, but not all, operations. Craig Gentry presented an encryption scheme that is not only homomorphic in both addition and multiplication, but also secure. In this talk we describe the steps in constructing a fully homomorphic encryption system over the integers, discuss its security, which is based on an "approximate" version of the greatest common divisor (GCD) problem, and discuss the challenges in reducing its computational cost.
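As a concrete instance of the property E(x ∘ y) = E(x) • E(y), textbook (unpadded) RSA is homomorphic in multiplication: multiplying two ciphertexts yields the encryption of the product of the plaintexts. This illustrates the definition only; it is not Gentry's scheme, and unpadded RSA with toy parameters is completely insecure.

```python
# Textbook RSA: E(m) = m^e mod n, so
# E(a) * E(b) mod n = (a^e)(b^e) mod n = (a*b)^e mod n = E(a*b mod n).
p, q, e = 61, 53, 17            # toy primes, illustration only
n = p * q                       # 3233
phi = (p - 1) * (q - 1)         # 3120
d = pow(e, -1, phi)             # private exponent (modular inverse)

def E(m):
    """Encrypt: m^e mod n."""
    return pow(m, e, n)

def D(c):
    """Decrypt: c^d mod n."""
    return pow(c, d, n)

a, b = 7, 9
# multiply the ciphertexts, decrypt, and recover the product 63
assert D(E(a) * E(b) % n) == (a * b) % n
```

A fully homomorphic scheme must additionally support addition on ciphertexts (and hence arbitrary circuits), which is exactly what is hard to achieve while keeping the scheme secure and efficient.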
November 16, 2015
A block algebraic multilevel preconditioner for discontinuous Galerkin discretization
Jesus M. Cajigas
Algebraic multilevel preconditioners are a common and effective way to accelerate the convergence of linear solvers. A block algebraic multilevel technique for discontinuous Galerkin discretizations is discussed. The proposed preconditioner is tested with symmetric positive definite matrices obtained from the Local Discontinuous Galerkin spatial discretization applied to elliptic partial differential equations in three-dimensional domains. Numerical experiments include matrices from high-order approximations and domains with rough coefficients. Results show a drastic reduction in the iteration count as the number of cycles and smoothing sweeps is increased.
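A basic building block of such methods is a block preconditioner applied inside a Krylov solver. The sketch below runs preconditioned conjugate gradients with a block-Jacobi preconditioner on an SPD model matrix (a 1D Laplacian standing in for an LDG stiffness matrix, where each diagonal block would hold one element's unknowns). This is only the single-level ingredient, not the multilevel method of the talk.

```python
import numpy as np

def block_jacobi_inverse(A, bs):
    """Precompute inverses of the diagonal blocks of A (block size bs)."""
    n = A.shape[0]
    return [np.linalg.inv(A[i:min(i + bs, n), i:min(i + bs, n)])
            for i in range(0, n, bs)]

def apply_prec(blocks, r, bs):
    """z = M^{-1} r with M = block diagonal of A."""
    z = np.empty_like(r)
    for k, B in enumerate(blocks):
        i = k * bs
        z[i:i + B.shape[0]] = B @ r[i:i + B.shape[0]]
    return z

def pcg(A, b, bs, tol=1e-10, maxit=500):
    """Preconditioned conjugate gradients with block-Jacobi preconditioning."""
    blocks = block_jacobi_inverse(A, bs)
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_prec(blocks, r, bs)
    p, rz = z.copy(), r @ z
    for it in range(maxit):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            return x, it + 1
        z = apply_prec(blocks, r, bs)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, maxit

# SPD model problem: 1D Laplacian with 64 unknowns
n = 64
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x, iters = pcg(A, b, bs=4)
print(iters, "iterations, residual", np.linalg.norm(A @ x - b))
```

A multilevel method would complement this fine-level smoothing with coarse-level corrections (the "cycles" of the abstract), which is what drives the iteration count down further.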