Embedding-based logistic regression: neural-network logistic regression in TensorFlow

Keywords: Python Session network encoding

Inspiration - recently we have been working on RNN-based NLP, where no matter which cell is used (LSTM, GRU, or even CNN), the input is an embedding representation of the words. Word embedding represents each word as a vector and then trains the values of those vectors through backpropagation. It is a wonderful idea, so I tried to apply it to logistic regression.

Question - In logistic regression many features are categorical. What if a feature has 1,000 categories? Do we convert it into a 1000-dimensional one-hot vector? Method: embedding. Each category is given, say, a 10-dimensional vector, and the traditional regression or neural-network machinery is then applied on top of those vectors.
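To make this concrete, here is a minimal sketch of an embedding lookup for a single categorical feature with 1,000 levels, written in the same TF 1.x style as the code below. It is not the author's code; the names and sizes are only illustrative.

import tensorflow as tf

# 1,000 categories, each mapped to a trainable 10-dimensional vector
category_embed = tf.get_variable('category_embed', shape=[1000, 10])
category_id = tf.placeholder('int64', [None])                        # integer-encoded category per row
category_vec = tf.nn.embedding_lookup(category_embed, category_id)   # shape [None, 10]
# category_vec can now feed a linear layer / logistic regression; the embedding
# table is trained by backpropagation like any other weight.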

Experiments -

1: Data. The data comes from the Kaggle Red Hat project; those interested can look it up themselves.

2: Method. The title says logistic regression, but in essence this is neural-network classification. The problem is traditionally solved with logistic regression because the data contains a lot of categorical features and the label is 0 or 1, i.e. binary classification. A plain logistic regression would be very simple, but the data also contains a group variable and a people variable: group has about 3K+ categories and people about 180K+ categories, so converting them into dummy variables for logistic regression is clearly not appropriate. Here I mainly borrow the embedding idea and build two lookup tables in TensorFlow, one for people and one for group. During training each id is looked up and returns a 10-dimensional real vector, which serves as the features of the person and the group respectively; on top of that I add a fully connected layer and some activation functions. The effect is good: accuracy quickly converges to more than 90%.

3: Effect and notes. With this data I originally just wanted to experiment with how to batch-read TFRecords inside tf.Session(), because with TFRecords the whole dataset does not need to be loaded into memory. Previously I read TFRecords through the Estimator API, but with a plain Session there does not seem to be an equally clean solution (a possible sketch is given right below). Overall the effect is good, and the main takeaway is that the embedding approach can be reused for many similar problems later.
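As mentioned above, one way to batch-read TFRecords inside a plain tf.Session() in TF 1.x is the queue-runner API. This is my own sketch, not the code used in this post; the file path and feature names are made up for illustration.

import tensorflow as tf

# enqueue the record file(s) and decode one example at a time
filename_queue = tf.train.string_input_producer(['/path/to/train.tfrecords'])
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
features = tf.parse_single_example(serialized, features={
    'people_id': tf.FixedLenFeature([], tf.int64),
    'outcome': tf.FixedLenFeature([], tf.int64)})

# assemble shuffled mini-batches without loading the whole file into memory
batch = tf.train.shuffle_batch(features, batch_size=2048,
                               capacity=10000, min_after_dequeue=5000)

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    for _ in range(10):
        print sess.run(batch['outcome'])   # one 2048-element batch per run call
    coord.request_stop()
    coord.join(threads)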

#encoding=utf-8
import numpy as np 
import tensorflow as tf 
import pickle
import random 
model_dir = '/home/yanjianfeng/kaggle/data/model_dir/'


# presumably people_dic / group_dic map raw ids to integer indices; dic holds the
# already-encoded training columns that are fed to the placeholders below
people_dic, group_dic, dic = pickle.load(open('/home/yanjianfeng/kaggle/data/data.dump', 'r'))

def create_train_op(loss):
    train_op = tf.contrib.layers.optimize_loss(loss = loss, 
        global_step = tf.contrib.framework.get_global_step(), 
        learning_rate = 0.1, 
        clip_gradients = 10.0, 
        optimizer = "Adam")
    return train_op 

def create_input():
    random_id = random.randint(0, len(dic['outcome'])-2049)
    keys = dic.keys() 
    data = {}
    for k in keys:
        data[k] = dic[k][random_id: random_id+2048]
    return data
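
A note on this batching scheme: create_input draws one contiguous 2048-row window starting at a random offset, so neighbouring rows are always batched together. A possible alternative (my own sketch, assuming dic's columns can be turned into numpy arrays) is to sample the 2048 row indices independently:

def create_input_random(batch_size=2048):
    # sample batch_size row indices uniformly at random (with replacement)
    idx = np.random.randint(0, len(dic['outcome']), size=batch_size)
    return {k: np.asarray(v)[idx] for k, v in dic.items()}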


# It is better not to wrap the graph-building code in a function, otherwise it is
# awkward to pull out a specific tensor later. Keeping the body at module level and
# running it directly under tf.Session() is more convenient. The model is roughly as follows.


global_step = tf.Variable(0, name = 'global_step', trainable=False)

people_id = tf.placeholder("int64", [None])       # integer index of the person
group = tf.placeholder('int64', [None])           # integer index of the group
time = tf.placeholder('int64', [None])            # time feature
peofea = tf.placeholder('int64', [None, 262])     # 262 people-level features
rowfea = tf.placeholder('int64', [None, 174])     # 174 row-level features
outcome = tf.placeholder("int64", [None])         # binary label (0/1)

name_embed = tf.get_variable('names', shape = [189120, 10])   # one 10-d embedding per people id
group_embed = tf.get_variable('groups', shape = [35000, 10])  # one 10-d embedding per group id
name_ = tf.nn.embedding_lookup(name_embed, people_id)
group_ = tf.nn.embedding_lookup(group_embed, group)

name_w = tf.get_variable('name_w', shape = [10, 2])
group_w = tf.get_variable('group_w', shape = [10, 5])

name_outcome = tf.matmul(name_, name_w)
group_outcome = tf.matmul(group_, group_w)

w_1 = tf.get_variable('w_1', shape = [262, 10])
w_2 = tf.get_variable('w_2', shape = [174, 10])
w_3 = tf.get_variable('w_3', shape = [1])

peofea_outcome = tf.matmul(tf.to_float(peofea), w_1)
rowfea_outcome = tf.matmul(tf.to_float(rowfea), w_2)

time_outcome = tf.mul(tf.to_float(time), w_3)
time_outcome = tf.expand_dims(time_outcome, -1)

name_outcome = tf.sigmoid(name_outcome)
group_outcome = tf.sigmoid(group_outcome)
peofea_outcome = tf.sigmoid(peofea_outcome)
rowfea_outcome = tf.sigmoid(rowfea_outcome)
time_outcome = tf.sigmoid(time_outcome)

x = tf.concat(1, [name_outcome, group_outcome, peofea_outcome, rowfea_outcome, time_outcome])

w_f = tf.get_variable('w_f', shape = [28, 28])
b = tf.get_variable('b', shape = [1])
w_f_2 = tf.get_variable('w_f_2', shape = [28, 1])

# first layer: sigmoid of x·w_f, with the bias b added after the sigmoid;
# the second layer then produces the final logit
pred = tf.sigmoid(tf.matmul(x, w_f)) + b
pred = tf.matmul(pred, w_f_2)

y = tf.expand_dims(tf.to_float(outcome), -1)

prob = tf.sigmoid(pred)
prob = tf.to_float(tf.greater(prob, 0.5))
c = tf.reduce_mean(tf.to_float(tf.equal(prob, y)))

loss = tf.nn.sigmoid_cross_entropy_with_logits(pred, y)
loss = tf.reduce_mean(loss)
train_op = create_train_op(loss)



# The order here is important: the Saver must be created after all variables are
# defined; if it were created at the very beginning, would it miss them?
saver = tf.train.Saver()
with tf.Session() as sess:

    sess.run(tf.initialize_all_variables())
    ckpt = tf.train.get_checkpoint_state(model_dir)
    if ckpt and ckpt.model_checkpoint_path:
        print 'the model being restored is '
        print ckpt.model_checkpoint_path 
        saver.restore(sess, ckpt.model_checkpoint_path)
        print 'successfully restored the session'

    count = global_step.eval()

    for i in range(0, 10000):
        data = create_input()
        l, _ , c_ = sess.run([loss, train_op, c], feed_dict = {people_id: data['people_id'],
            group: data['group'],
            time: data['time'],
            peofea: data['people_features'],
            rowfea: data['row_features'],
            outcome: data['outcome']})
        print 'the loss\t' + str(l) + '\t\tthe count\t' + str(c_)
        global_step.assign(count).eval()   # keep the saved global_step in sync with the loop counter
        saver.save(sess, model_dir + 'model.ckpt', global_step = global_step)
        count += 1 
